| Metric | Value |
|---|---|
| val_bpb | 1.1502 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact size | 15.4 MB |
## Training Techniques

### Architecture
- **MLP3x**: widened MLP expansion to 3x for more capacity per layer.
  Parameters: `{"mlp_mult": 3, "hidden": 1536}`
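To make the cost of the 3x expansion concrete, here is a hypothetical helper (not from the training code) that counts the weights in one MLP block, assuming the standard two-matrix MLP with an up-projection to `mlp_mult * hidden` and a down-projection back, with no biases:

```python
def mlp_params(hidden: int, mlp_mult: int) -> int:
    """Weight count of a two-layer MLP block (no biases):
    up-projection hidden -> mlp_mult*hidden, then down-projection back."""
    inner = mlp_mult * hidden
    return hidden * inner + inner * hidden

# With the config above (hidden=1536, mlp_mult=3):
print(mlp_params(1536, 3))  # 14155776 weights per MLP block
```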
- **Tied embeddings**: tied input/output embeddings, with the embedding/head matrix exported in FP16.
  Parameters: `{"vocab_size": 1024}`
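Weight tying means a single matrix serves both roles: the input embedding is a row lookup, and the output head takes a dot product against every row of the same matrix. A minimal sketch with a toy 2-token vocabulary (the names and values are illustrative, not from the repo):

```python
def embed(W, token_id):
    # Input embedding: row lookup in the shared (vocab x d) matrix W.
    return W[token_id]

def logits(W, h):
    # Output head: score hidden state h against every row of the SAME
    # matrix W -- this sharing is what "tied embeddings" means.
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in W]

W = [[1.0, 0.0], [0.0, 2.0]]  # toy vocab of 2, d=2
h = embed(W, 1)               # [0.0, 2.0]
print(logits(W, h))           # [0.0, 4.0]
```

Tying halves the embedding/head parameter cost, which matters under a hard artifact-size budget.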
- **KV head count**: grouped-query attention with 8 query heads and 4 KV heads.
  Parameters: `{"attention_heads": 8, "kv_heads": 4}`
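Grouped-query attention can be summarized as a mapping from query heads to shared KV heads; with 8 query heads and 4 KV heads, each KV head serves 2 consecutive query heads. A sketch of that indexing (the helper name is hypothetical):

```python
def kv_head_for(q_head: int, n_q: int = 8, n_kv: int = 4) -> int:
    """Grouped-query attention: each group of n_q // n_kv consecutive
    query heads shares one KV head."""
    group_size = n_q // n_kv
    return q_head // group_size

print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the K/V projection weights relative to full multi-head attention.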
- **RoPE**: rotary positional embeddings in attention.
  Parameters: none
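In RoPE, each consecutive pair of query/key components is rotated by an angle proportional to the token position, with a pair-dependent frequency. A minimal sketch under the standard base-10000 formulation, not the repo's implementation:

```python
import math

def rope_rotate(x, pos, dim, base=10000.0):
    """Apply rotary position embedding to a vector x of even length:
    pair (x[i], x[i+1]) is rotated by angle pos * base**(-i/dim)."""
    out = []
    for i in range(0, len(x), 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

# Position 0 is the identity rotation:
print(rope_rotate([1.0, 0.0], pos=0, dim=2))  # [1.0, 0.0]
```

Because each rotation is orthogonal, RoPE adds position information without any learned positional parameters, consistent with `parameters: none` above.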
### Quantization

- **STE QAT int6**: quantization-aware training with straight-through-estimator fake quantization.
  Bits: 6; scope: all block weights
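The forward pass of int6 fake quantization can be sketched per weight: scale, round, clamp to the signed 6-bit range [-32, 31], then dequantize. The straight-through estimator means the backward pass treats this whole function as the identity, so gradients flow to the underlying FP weights. Per-tensor scale handling here is a simplification of whatever the repo actually does:

```python
def fake_quant_int6(w: float, scale: float) -> float:
    """Fake-quantize one weight to int6 and dequantize.
    Signed 6-bit integers span [-32, 31]. In QAT, the backward pass
    uses the straight-through estimator (gradient of identity)."""
    q = round(w / scale)
    q = max(-32, min(31, q))  # clamp to the int6 range
    return q * scale
```

Usage: values beyond `31 * scale` saturate, e.g. `fake_quant_int6(10.0, 0.1)` clamps to the top code 31.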
### Optimizer

- **Muon**: weight_decay 0.04, momentum 0.99
  Other parameters: `{"momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500}`
- **AdamW**: weight_decay 0.04
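The warmup parameters above imply Muon's momentum ramps from 0.92 to 0.99 over the first 1500 steps. The linear shape and the function name are assumptions; only the endpoints and step count come from the config:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum warmup sketch: linear ramp from `start` to `end`
    over `warmup_steps`, then held constant at `end`."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```

Starting with lower momentum keeps early updates less correlated while gradient statistics are still noisy, then settles at the higher steady-state value.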
### Compression

- **zstd**: level 22
### Evaluation

- **Sliding window eval**
  Parameters: `{"stride": 64}`
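Sliding-window evaluation advances a fixed context window by `stride` tokens at a time and scores only the tokens not covered by the previous window, so every token is evaluated exactly once with more left context than disjoint chunking would give. The window length of 256 below is an illustrative assumption; only the stride of 64 comes from the config:

```python
def sliding_window_spans(n_tokens: int, window: int = 256, stride: int = 64):
    """Return (begin, end, n_scored) spans for sliding-window eval.
    Each window advances by `stride`; only tokens past the previous
    window's end are scored, so every token is scored exactly once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

print(sliding_window_spans(400))
# [(0, 256, 256), (64, 320, 64), (128, 384, 64), (192, 400, 16)]
```

A smaller stride gives each scored token more context (hence the val_bpb gain) at the cost of proportionally more forward passes.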
### Regularization

- **Weight decay**
  Parameters: `{"muon_weight_decay": 0.04, "adam_weight_decay": 0.04}`
### LR Schedule

- **Warmdown**
  Parameters: `{"warmdown_iters": 3000, "warmup_steps": 1500}`
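One common schedule matching these parameters is a trapezoid: linear warmup over `warmup_steps`, a constant plateau, then a linear warmdown to zero over the final `warmdown_iters`. The trapezoid shape and total step count are assumptions; only the two parameter values come from the config:

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 1500, warmdown_iters: int = 3000) -> float:
    """Trapezoidal LR multiplier sketch: linear warmup, flat plateau,
    then linear warmdown to 0 over the last `warmdown_iters` steps."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```

For example, with 10000 total steps the multiplier is 1.0 from step 1500 through 7000, then decays linearly, reaching 0.5 at step 8500 and 0.0 at the end.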
## Novel Contributions
- 11-layer transformer with 3x MLP expansion
- Int6 quantization-aware training with STE fake quantization
- Decoupled weight decay of 0.04 on both Muon and AdamW
- FP16 tied embedding export to preserve embedding/head quality
- zstd-22 compression to fit the larger model under the 16MB limit
- Sliding window evaluation with stride 64 for improved val_bpb
- Higher Muon momentum with warmup from 0.92 to 0.99