PR #287
Record (closed): 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
by jfprincz
val_bpb: 1.1271
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.5 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 layers: from each token's attention output, subtract the component aligned with that token's own value vector.
parameters: {"layers":4}
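A minimal sketch of the XSA post-processing step described above, assuming per-token projection removal (shapes and naming are illustrative, not from the record):

```python
import numpy as np

def xsa_output(attn_out, values):
    """Exclusive Self Attention (sketch): remove from each token's
    attention output the component aligned with that token's own
    value vector, leaving only 'exclusive' information from others."""
    # unit vector along each token's own value vector
    v_hat = values / (np.linalg.norm(values, axis=-1, keepdims=True) + 1e-8)
    # scalar projection of the attention output onto v_hat, per token
    coeff = np.sum(attn_out * v_hat, axis=-1, keepdims=True)
    return attn_out - coeff * v_hat
```

After this step each token's output is orthogonal to its own value vector, so the residual stream receives only cross-token information from these layers.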
MLP3x
Three-times wider MLP blocks with hidden size 1536 and relu² activation.
parameters: {"hidden_size":1536}
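A sketch of the 3x-wide MLP block, assuming d_model=512 (consistent with the 512-d BigramHash projection below, since 3 × 512 = 1536); weight shapes are illustrative:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """3x-wide MLP block with relu^2 activation (sketch).
    x: (..., d_model), w_in: (d_model, 3*d_model), w_out: (3*d_model, d_model)."""
    h = x @ w_in                   # expand to the 3x hidden size
    h = np.maximum(h, 0.0) ** 2    # relu squared activation
    return h @ w_out               # project back to d_model
```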
SmearGate
Learned gate for blending token representations (exact formulation not specified in the record).
parameters: null
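The record does not give the SmearGate formulation; one plausible reading (an assumption) is a learned scalar gate that smears each token toward its predecessor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logit):
    """Hypothetical SmearGate: blend each token with the previous token
    via a learned scalar gate. The exact formulation is not given in the
    record; this sketch is an assumption."""
    g = sigmoid(gate_logit)        # learned parameter -> blend weight in (0, 1)
    prev = np.roll(x, 1, axis=0)   # shift the sequence right by one position
    prev[0] = x[0]                 # first token has no predecessor; keep it unchanged
    return (1.0 - g) * x + g * prev
```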
BigramHash
Bigram hash embedding with 2048 buckets, dimension 128, projected to 512.
parameters: {"vocab_size":2048,"dimension":128,"projection_dim":512}
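A sketch of the bigram hash embedding path, matching the parameters above (2048 buckets, 128-d table, 512-d projection); the hash mixing constant and first-token handling are arbitrary choices, not from the record:

```python
import numpy as np

def bigram_hash_embed(tokens, table, proj, num_buckets=2048):
    """Bigram hash embedding (sketch): hash each (prev, cur) token pair
    into one of `num_buckets` buckets, look up a 128-d embedding, and
    project it to 512 dims."""
    prev = np.roll(tokens, 1)
    prev[0] = 0  # no predecessor for the first token (assumption)
    bucket = (prev * 1000003 + tokens) % num_buckets  # simple mixing hash
    return table[bucket] @ proj  # (seq, 128) @ (128, 512) -> (seq, 512)
```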
RoPE
NTK-aware rotary positional embeddings.
parameters: null
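A sketch of the NTK-aware rotary frequency schedule plus the rotation itself; base=10000 and the base-rescaling rule are common defaults, not values stated in the record:

```python
import numpy as np

def ntk_rope_freqs(dim, base=10000.0, scale=1.0):
    """NTK-aware RoPE frequencies (sketch): rescale the base so that
    context extension stretches low frequencies rather than all of them."""
    base = base * scale ** (dim / (dim - 2))
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def apply_rope(x, pos, inv_freq):
    """Rotate pairs of channels by position-dependent angles."""
    angles = pos[:, None] * inv_freq[None, :]      # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Since the transform is a pure rotation, it preserves each token's norm.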
Weight Averaging
EMA
parameters: {"decay":0.997}
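The EMA update with decay 0.997 (from the record) can be sketched as a per-step shadow-weight update; applying it after each optimizer step is an assumption:

```python
def ema_update(ema_params, params, decay=0.997):
    """EMA weight averaging (sketch): move the shadow weights toward the
    live weights by a factor of (1 - decay) after each optimizer step."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Evaluation then uses the shadow (EMA) weights rather than the live ones.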
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6, embeddings int8
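A sketch of the mixed-precision quantization: int6 (bits=6) for MLP/attention weights and int8 for embeddings, per the scope above. Per-tensor symmetric scaling is an assumption; the record does not give the grouping:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization (sketch): map weights to signed integers
    with 2**(bits-1)-1 positive levels and a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

int6 values fit in int8 storage; a packed format would bit-pack them before the zstd pass.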
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
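A sketch of the sliding-window evaluation geometry with stride 64: windows advance by the stride so that, after the first window, each token is scored with near-maximal left context. Window length and leftover-token handling are not given in the record:

```python
def sliding_windows(n_tokens, window, stride=64):
    """Sliding-window eval positions (sketch): return (start, end) spans
    advancing by `stride`; typically only the last `stride` tokens of
    each window (after the first) contribute new predictions."""
    starts = range(0, max(n_tokens - window, 0) + 1, stride)
    return [(s, s + window) for s in starts]
```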
Initialization
OrthoInit
Orthogonal initialization with muP scaling on large matrices.
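A sketch of orthogonal initialization with a muP-style fan-in rescaling; the exact muP rule and base width used in the record are not given, so `mup_base` here is an assumption:

```python
import numpy as np

def ortho_init(shape, rng, mup_base=512):
    """Orthogonal init with muP-style scaling (sketch): orthogonalize a
    Gaussian matrix via QR, then rescale by sqrt(mup_base / fan_in) so
    wider layers get proportionally smaller weights."""
    fan_out, fan_in = shape
    if fan_out >= fan_in:
        q, _ = np.linalg.qr(rng.normal(size=shape))   # orthonormal columns
        w = q
    else:
        q, _ = np.linalg.qr(rng.normal(size=(fan_in, fan_out)))
        w = q.T                                       # orthonormal rows
    return w * np.sqrt(mup_base / fan_in)
```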
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500,"warmdown_iters":3000,"grad_clip":0.3}
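Muon applies momentum to the gradient and then approximately orthogonalizes the result before the weight-decayed update. A sketch of that orthogonalization, assuming the quintic Newton-Schulz iteration from the public Muon implementation (the coefficients come from that implementation, not this record):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a matrix (sketch of Muon's update
    transform): normalize, then run a quintic Newton-Schulz iteration
    that drives all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients (assumed)
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-norm normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the short side
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```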
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
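The warmup + warmdown schedule above can be sketched as a trapezoidal LR multiplier; the total step count is not given in the record, so it is a parameter here:

```python
def lr_multiplier(step, total_steps, warmup_steps=1500, warmdown_steps=3000):
    """Trapezoidal LR schedule (sketch): linear warmup over 1500 steps,
    constant plateau, linear warmdown over the final 3000 steps."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_steps:
        return max((total_steps - step) / warmdown_steps, 0.0)
    return 1.0
```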
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Exclusive Self Attention (XSA) on the last 4 layers
- EMA replacing SWA for weight averaging
- Mixed int6/int8 quantization with zstd-22 compression
- 11-layer Transformer stack with U-Net skip connections and 3x MLP blocks
- OrthoInit with muP scaling and tuned Muon optimizer settings