PR #287
Record (closed): 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
by jfprincz
val_bpb: 1.1271
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.5 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 layers: from each token's attention output, subtract the component aligned with that token's own value vector.
parameters: {"layers":4}
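A minimal sketch of the XSA post-processing step described above, assuming per-token projection removal (shapes and naming are illustrative, not from the record):

```python
import numpy as np

def xsa_output(attn_out, values):
    """Exclusive Self Attention (sketch): remove from each token's
    attention output the component aligned with that token's own
    value vector, leaving only 'exclusive' information from others."""
    # unit vector along each token's own value vector
    v_hat = values / (np.linalg.norm(values, axis=-1, keepdims=True) + 1e-8)
    # scalar projection of the attention output onto v_hat, per token
    coeff = np.sum(attn_out * v_hat, axis=-1, keepdims=True)
    return attn_out - coeff * v_hat
```

After this step each token's output is orthogonal to its own value vector, so the residual stream receives only cross-token information from these layers.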
MLP3x
Three-times wider MLP blocks with hidden size 1536 and relu² activation.
parameters: {"hidden_size":1536}
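A sketch of the 3x-wide MLP block, assuming d_model=512 (consistent with the 512-d BigramHash projection below, since 3 × 512 = 1536); weight shapes are illustrative:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """3x-wide MLP block with relu^2 activation (sketch).
    x: (..., d_model), w_in: (d_model, 3*d_model), w_out: (3*d_model, d_model)."""
    h = x @ w_in                   # expand to the 3x hidden size
    h = np.maximum(h, 0.0) ** 2    # relu squared activation
    return h @ w_out               # project back to d_model
```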
SmearGate
Learned gate for blending token representations (exact formulation not specified in the record).
parameters: null
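The record does not give the SmearGate formulation; one plausible reading (an assumption) is a learned scalar gate that smears each token toward its predecessor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logit):
    """Hypothetical SmearGate: blend each token with the previous token
    via a learned scalar gate. The exact formulation is not given in the
    record; this sketch is an assumption."""
    g = sigmoid(gate_logit)        # learned parameter -> blend weight in (0, 1)
    prev = np.roll(x, 1, axis=0)   # shift the sequence right by one position
    prev[0] = x[0]                 # first token has no predecessor; keep it unchanged
    return (1.0 - g) * x + g * prev
```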
BigramHash
Bigram hash embedding with 2048 buckets, dimension 128, projected to 512.
parameters: {"vocab_size":2048,"dimension":128,"projection_dim":512}
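A sketch of the bigram hash embedding path, matching the parameters above (2048 buckets, 128-d table, 512-d projection); the hash mixing constant and first-token handling are arbitrary choices, not from the record:

```python
import numpy as np

def bigram_hash_embed(tokens, table, proj, num_buckets=2048):
    """Bigram hash embedding (sketch): hash each (prev, cur) token pair
    into one of `num_buckets` buckets, look up a 128-d embedding, and
    project it to 512 dims."""
    prev = np.roll(tokens, 1)
    prev[0] = 0  # no predecessor for the first token (assumption)
    bucket = (prev * 1000003 + tokens) % num_buckets  # simple mixing hash
    return table[bucket] @ proj  # (seq, 128) @ (128, 512) -> (seq, 512)
```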
RoPE
NTK-aware rotary positional embeddings.
parameters: null
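A sketch of the NTK-aware rotary frequency schedule plus the rotation itself; base=10000 and the base-rescaling rule are common defaults, not values stated in the record:

```python
import numpy as np

def ntk_rope_freqs(dim, base=10000.0, scale=1.0):
    """NTK-aware RoPE frequencies (sketch): rescale the base so that
    context extension stretches low frequencies rather than all of them."""
    base = base * scale ** (dim / (dim - 2))
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def apply_rope(x, pos, inv_freq):
    """Rotate pairs of channels by position-dependent angles."""
    angles = pos[:, None] * inv_freq[None, :]      # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Since the transform is a pure rotation, it preserves each token's norm.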
Weight Averaging
EMA
parameters: {"decay":0.997}
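The EMA update with decay 0.997 (from the record) can be sketched as a per-step shadow-weight update; applying it after each optimizer step is an assumption:

```python
def ema_update(ema_params, params, decay=0.997):
    """EMA weight averaging (sketch): move the shadow weights toward the
    live weights by a factor of (1 - decay) after each optimizer step."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Evaluation then uses the shadow (EMA) weights rather than the live ones.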
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6, embeddings int8
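A sketch of the mixed-precision quantization: int6 (bits=6) for MLP/attention weights and int8 for embeddings, per the scope above. Per-tensor symmetric scaling is an assumption; the record does not give the grouping:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization (sketch): map weights to signed integers
    with 2**(bits-1)-1 positive levels and a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

int6 values fit in int8 storage; a packed format would bit-pack them before the zstd pass.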
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
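A sketch of the sliding-window evaluation geometry with stride 64: windows advance by the stride so that, after the first window, each token is scored with near-maximal left context. Window length and leftover-token handling are not given in the record:

```python
def sliding_windows(n_tokens, window, stride=64):
    """Sliding-window eval positions (sketch): return (start, end) spans
    advancing by `stride`; typically only the last `stride` tokens of
    each window (after the first) contribute new predictions."""
    starts = range(0, max(n_tokens - window, 0) + 1, stride)
    return [(s, s + window) for s in starts]
```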
Initialization
OrthoInit
Orthogonal initialization with muP scaling on large matrices.
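A sketch of orthogonal initialization with a muP-style fan-in rescaling; the exact muP rule and base width used in the record are not given, so `mup_base` here is an assumption:

```python
import numpy as np

def ortho_init(shape, rng, mup_base=512):
    """Orthogonal init with muP-style scaling (sketch): orthogonalize a
    Gaussian matrix via QR, then rescale by sqrt(mup_base / fan_in) so
    wider layers get proportionally smaller weights."""
    fan_out, fan_in = shape
    if fan_out >= fan_in:
        q, _ = np.linalg.qr(rng.normal(size=shape))   # orthonormal columns
        w = q
    else:
        q, _ = np.linalg.qr(rng.normal(size=(fan_in, fan_out)))
        w = q.T                                       # orthonormal rows
    return w * np.sqrt(mup_base / fan_in)
```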
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500,"warmdown_iters":3000,"grad_clip":0.3}
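Muon applies momentum to the gradient and then approximately orthogonalizes the result before the weight-decayed update. A sketch of that orthogonalization, assuming the quintic Newton-Schulz iteration from the public Muon implementation (the coefficients come from that implementation, not this record):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a matrix (sketch of Muon's update
    transform): normalize, then run a quintic Newton-Schulz iteration
    that drives all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients (assumed)
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-norm normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the short side
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```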
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
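The warmup + warmdown schedule above can be sketched as a trapezoidal LR multiplier; the total step count is not given in the record, so it is a parameter here:

```python
def lr_multiplier(step, total_steps, warmup_steps=1500, warmdown_steps=3000):
    """Trapezoidal LR schedule (sketch): linear warmup over 1500 steps,
    constant plateau, linear warmdown over the final 3000 steps."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_steps:
        return max((total_steps - step) / warmdown_steps, 0.0)
    return 1.0
```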
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Exclusive Self Attention (XSA) on the last 4 layers
- EMA replacing SWA for weight averaging
- Mixed int6/int8 quantization with zstd-22 compression
- 11-layer Transformer stack with U-Net skip connections and 3x MLP blocks
- OrthoInit with muP scaling and tuned Muon optimizer settings