val_bpb: 1.8698
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB
Training Techniques
Architecture
depth recurrence
Three unique transformer blocks are repeated three times for an effective depth of 9; the same blocks are reused on each repeat, with no U-Net-style skip connections.
parameters: {"unique_blocks":3,"repeats":3,"effective_depth":9,"dim":1024}
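For concreteness, a minimal PyTorch sketch of the depth-recurrence scheme: 3 unique blocks looped 3 times for effective depth 9. The Block internals are placeholders, not the submission's actual layers.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block (a real block would hold attention + MLP)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class RecurrentDepthModel(nn.Module):
    """unique_blocks blocks applied `repeats` times: effective depth 9,
    but parameters only for 3 blocks."""
    def __init__(self, dim=1024, unique_blocks=3, repeats=3):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(unique_blocks))
        self.repeats = repeats

    def forward(self, x):
        for _ in range(self.repeats):      # outer loop over repeats
            for block in self.blocks:      # same weights reused each pass
                x = block(x)
        return x

model = RecurrentDepthModel()
y = model(torch.randn(2, 8, 1024))
```

The parameter count is that of 3 blocks even though 9 block applications run in the forward pass.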
tied embeddings
Input and output embeddings are tied to reduce parameters.
parameters: null
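A minimal sketch of weight tying in PyTorch; the vocab size here is illustrative, while dim matches the reported model dimension.

```python
import torch
import torch.nn as nn

vocab_size, dim = 50304, 1024  # vocab_size is illustrative; dim is the reported model dim

embed = nn.Embedding(vocab_size, dim)             # input embedding, shape (vocab, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)  # output projection, weight shape (vocab, dim)
lm_head.weight = embed.weight                     # tie: one shared parameter tensor

tokens = torch.randint(0, vocab_size, (2, 8))
logits = lm_head(embed(tokens))
```

Tying removes one vocab_size x dim matrix from the parameter count, which matters under a tight artifact budget.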
KV head count
Uses grouped-query attention with fewer KV heads (12) than query heads (24), halving KV-cache and KV-projection size.
parameters: {"heads":24,"kv_heads":12}
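A shape-level sketch of grouped-query attention with the reported 24 query heads and 12 KV heads; head_dim=64 is an assumption (the actual head dimension is not reported), and causal masking is omitted for brevity.

```python
import torch

batch, seq = 2, 16
heads, kv_heads = 24, 12      # query heads vs. shared KV heads, from the config
head_dim = 64                 # illustrative; actual head_dim not reported
group = heads // kv_heads     # each KV head serves `group` query heads

q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, kv_heads, seq, head_dim)
v = torch.randn(batch, kv_heads, seq, head_dim)

# Expand KV heads so each group of query heads attends to the same K/V
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v
```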
Quantization
QAT (quantization-aware training)
bits: 6
scope: all
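A minimal sketch of 6-bit fake quantization with a straight-through estimator, assuming per-tensor symmetric scales; the submission's exact QAT scheme is not specified.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Forward: round weights onto 2**bits signed levels. Backward: identity (STE)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = w.detach().abs().max() / qmax           # per-tensor scale (an assumption)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = q * scale
    return w + (w_q - w).detach()                   # straight-through estimator

w = torch.randn(64, 64, requires_grad=True)
w_q = fake_quant(w)
w_q.sum().backward()                                # gradients flow as if identity
```

Training against the quantized forward pass lets the 6-bit artifact hold more parameters per byte than float weights would.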
Optimizer
Muon
weight_decay: null
momentum: 0.85
other_params: {"matrix_lr":0.02,"muon_backend_steps":7,"qk_gain_init":2,"qk_gain":2}
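The muon_backend_steps value likely refers to the Newton-Schulz iteration count at Muon's core. A sketch of one Muon-style update follows, with quintic-iteration coefficients taken from the public Muon reference implementation; the submission's exact variant (including any NorMuon changes) is not shown.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 7) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration,
    as in Muon. Coefficients follow the public reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):          # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

# One Muon-style step: momentum buffer, then orthogonalized update
lr, momentum = 0.02, 0.85              # matrix_lr and momentum from the config
W = torch.randn(32, 64)
grad = torch.randn_like(W)
buf = torch.zeros_like(W)
buf = momentum * buf + grad
W = W - lr * newton_schulz(buf, steps=7)   # 7 matching muon_backend_steps
```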
Compression
zlib
level: null
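A sketch of zlib-compressing the serialized checkpoint, assuming the artifact is a torch state dict (the actual packing format is not described); the compression level shown is illustrative, since none is reported (level: null).

```python
import io
import zlib
import torch

# Serialize a (toy) state dict, then compress the bytes before writing the artifact.
state = {"w": torch.randn(256, 256)}
buf = io.BytesIO()
torch.save(state, buf)
raw = buf.getvalue()

compressed = zlib.compress(raw, level=9)   # level not reported; 9 = max compression
restored = torch.load(io.BytesIO(zlib.decompress(compressed)))
```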
Evaluation
sliding window eval
parameters: {"stride":64}
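A sketch of the sliding-window indexing implied by stride 64, assuming an illustrative context window of 256 tokens (the actual window length is not reported): each token is scored exactly once, and every token after the first window gets at least window - stride tokens of context.

```python
def sliding_windows(n_tokens: int, window: int = 256, stride: int = 64):
    """Return (start, end, score_from) triples for sliding-window evaluation.

    Each window covers [start, end); only tokens from score_from onward are
    scored. window=256 is illustrative; stride=64 is the reported value.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(512)
```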
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
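A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps. A sketch, where total_steps and base_lr are illustrative (base_lr borrows the reported matrix_lr for concreteness) and warmdown_steps=3000 is the reported value:

```python
def lr_at(step: int, total_steps: int = 10000,
          warmdown_steps: int = 3000, base_lr: float = 0.02) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```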
Other
other
A NorMuon training variant was used alongside Int6 QAT.
parameters: null
Novel Contributions
- Depth recurrence with 3 unique transformer blocks repeated 3 times
- Trading architectural diversity for width, fitting a larger model dimension (1024) within the parameter budget
- Int6 QAT to increase parameter capacity within the 16MB artifact budget
- Use of NorMuon, which reportedly improved BPB
- Sliding window evaluation with stride 64
- Systematic search over multiple architectural strategies and hyperparameters