val_bpb: 1.0854
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,438,323 B
Training Techniques
Architecture
GQA
11-layer Transformer with 8 attention heads and 4 KV heads
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
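Grouped-query attention shares each KV head across a group of query heads; with the reported 8 query heads and 4 KV heads, every KV head serves 2 query heads. A minimal sketch of the head-to-group mapping (the mapping convention, query heads grouped contiguously, is the common one but is an assumption here):

```python
# Grouped-query attention (GQA): map each query head to its shared KV head.
# Head counts follow the reported config: 8 query heads, 4 KV heads.
N_Q_HEADS = 8
N_KV_HEADS = 4
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 2 query heads share one KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that query head `q_head` attends with."""
    return q_head // GROUP_SIZE

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = {q: kv_head_for(q) for q in range(N_Q_HEADS)}
```

Halving the KV heads halves the KV-cache size at matched quality cost, which matters for a small-artifact setting like this one.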
weight tying
Tied input and output embeddings
parameters: null
RoPE
Rotary positional embeddings with 16 dimensions and base 10000
parameters: {"dimensions":16,"base":10000}
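With head dim 64 (512/8) and only 16 rotary dimensions, this is partial RoPE: the first 16 entries of each head are rotated and the rest pass through (applying it to the *first* dims is an assumption; the pairing below is the standard interleaved form). A sketch on a single head vector:

```python
import math

def rope_rotate(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries of
    vector `x` at sequence position `pos`; remaining dims pass through.
    Pair (x[2i], x[2i+1]) is rotated by angle pos * base**(-2i/rot_dims).
    rot_dims=16 and base=10000 follow the reported parameters."""
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Since each step is a 2-D rotation, the transform preserves vector norm and encodes position purely in phase.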
LeakyReLU-squared
LeakyReLU-squared custom activation
parameters: null
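The source only names this activation. By analogy with the ReLU² activation used in fast-training Transformer recipes, one plausible form is a sign-preserving square of LeakyReLU; both that form and the 0.01 slope below are assumptions:

```python
def leaky_relu_squared(x: float, slope: float = 0.01) -> float:
    """Sign-preserving square of LeakyReLU: y = l * |l| with l = LeakyReLU(x).
    The exact functional form and the 0.01 slope are assumptions; the
    source names the activation without a formula."""
    l = x if x > 0 else slope * x
    return l * abs(l)
```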
XSA
XSA attention used on all 11 layers with Flash Attention 3
parameters: {"layers":11}
BigramHash
Bigram hash table side channel
parameters: {"dimensions":128,"vocab":2048}
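The side channel hashes each (previous token, current token) bigram into a 2048-row table of 128-dim vectors that can be added to the residual stream alongside the token embedding. The source specifies only the table size and width; the hash below is an illustrative multiplicative mix, not the actual scheme:

```python
def bigram_slot(prev_tok: int, tok: int, table_size: int = 2048) -> int:
    """Hash a (prev_tok, tok) bigram to a row of the side-channel table.
    The mixing constant is illustrative; the source specifies only the
    table size (2048) and embedding width (128)."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # simple multiplicative mix
    return h % table_size
```

The looked-up 128-dim row gives the model cheap bigram statistics without spending transformer capacity on them.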
VE128
Value embeddings on layers 9-10
parameters: {"layers":[9,10],"dimensions":128}
Regularization
logit softcap
parameters: {"value":30}
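Given the reported value 30, the conventional softcap form is cap * tanh(z / cap): near-identity for small logits, smoothly bounded to (-30, 30) for large ones. Assuming that standard form:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).
    Near-identity for |logit| << cap; cap=30 is the reported parameter."""
    return cap * math.tanh(logit / cap)
```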
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_iterations":7,"muoneq_r":true}
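Muon orthogonalizes each momentum matrix with Newton-Schulz iterations before applying it as an update; this run uses 7 iterations rather than the usual 5. The sketch below uses the classical cubic iteration X <- 1.5X - 0.5·XXᵀX on a Frobenius-normalized input; production Muon uses a tuned quintic polynomial, so this is illustrative of the principle only:

```python
def matmul(A, B):
    """Naive list-of-lists matrix product (illustration only)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz_orthogonalize(G, iterations=7):
    """Approximate the orthogonal polar factor of G via Newton-Schulz
    (cubic variant). Muon proper uses a tuned quintic polynomial and
    runs on GPU tensors; iterations=7 matches the reported setting."""
    # Normalize by the Frobenius norm so all singular values start in (0, 1].
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / fro for x in row] for row in G]
    for _ in range(iterations):
        XXt = matmul(X, transpose(X))
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, matmul(XXt, X))]
    return X
```

Each iteration pushes every singular value toward 1, so the returned matrix is close to orthogonal while keeping G's singular vectors.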
Weight Averaging
SWA
parameters: null
EMA
parameters: null
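Both averaging schemes are listed without parameters. A minimal sketch of the two update rules, on a flat weight vector (the 0.999 EMA decay is an assumed placeholder; SWA's running mean needs no hyperparameter):

```python
def ema_update(avg, weights, decay=0.999):
    """EMA step: avg <- decay * avg + (1 - decay) * weights.
    The 0.999 decay is an assumption; the source reports no value."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_models):
    """SWA step: running mean over the n_models checkpoints averaged so far."""
    return [(a * n_models + w) / (n_models + 1) for a, w in zip(avg, weights)]
```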
Quantization
late QAT
bits: 6
scope: weights
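"Late" QAT enables fake quantization near the end of training: the forward pass sees weights snapped to the int6 grid while full-precision master weights keep receiving gradients. A sketch of the quantize-dequantize step; per-tensor absmax scaling is an assumption (the source states only bits=6 and scope=weights):

```python
def fake_quant_int6(w, levels=31):
    """Fake int6 quantization: snap each weight to a symmetric 6-bit grid
    (integers -31..31) and dequantize. Per-tensor absmax scaling is an
    assumption; the source specifies only bits=6 and weights-only scope."""
    scale = (max(abs(x) for x in w) / levels) or 1.0  # guard all-zero tensors
    q = [max(-levels, min(levels, round(x / scale))) for x in w]
    return [qi * scale for qi in q]
```

Training through the quantization grid lets the final artifact store 6-bit weights with little post-quantization loss spike.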
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
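With stride 64, consecutive windows overlap heavily and only the final 64 tokens of each window are newly scored, so every token gets long left context but is scored exactly once. A sketch of the window bookkeeping (the 512 context length is an assumption; the stride is reported):

```python
def sliding_windows(n_tokens, context=512, stride=64):
    """Yield (start, end, score_from) spans for sliding-window evaluation:
    each window covers tokens [start, end) and only positions
    [score_from, end) are newly scored. context=512 is an assumption;
    stride=64 is the reported parameter."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        spans.append((start, end, pos))
        pos = end
    return spans
```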
Test-Time Training
score-first TTT
parameters: {"steps":32,"learning_rate":0.04}
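"Score-first" here plausibly means each window is scored with the current weights *before* the 32 adaptation steps on that window, so no token's score benefits from having been trained on. A toy sketch of that ordering; the one-parameter squared-error "model" is a stand-in, while steps=32 and lr=0.04 are the reported settings:

```python
def ttt_eval(windows, w0=0.0, steps=32, lr=0.04):
    """Score-first test-time training: score each window with the current
    weight, then adapt on it before the next window. The scalar
    squared-error model is a stand-in; steps=32 and lr=0.04 are reported."""
    w, scores = w0, []
    for target in windows:
        scores.append((w - target) ** 2)   # score BEFORE adapting
        for _ in range(steps):             # then adapt on this window
            w -= lr * 2.0 * (w - target)   # gradient of (w - target)^2
    return scores, w
```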
Novel Contributions
- MuonEq-R row-normalized momentum before Newton-Schulz orthogonalization
- Per-head QK gain initialized to 5.0
- More Newton-Schulz backend iterations (7 instead of 5)
- SLOT32 test-time adaptation with lr=0.04
- Sliding-window evaluation with SLOT adaptation
- Late QAT with fake int6 quantization
- Bigram hash side channel and value embeddings
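The source does not define MuonEq-R beyond "row-normalized momentum before Newton-Schulz orthogonalization". One plausible reading, sketched here, scales each row of the momentum matrix to unit L2 norm before the orthogonalization step, so no row dominates the polar factor:

```python
def row_normalize(M, eps=1e-8):
    """Scale each row of momentum matrix M to unit L2 norm. This is one
    plausible reading of 'row-normalized momentum' (MuonEq-R); the source
    gives no formula, and eps is an assumed numerical guard."""
    out = []
    for row in M:
        norm = sum(x * x for x in row) ** 0.5
        out.append([x / (norm + eps) for x in row])
    return out
```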