val_bpb: 1.3043
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
- Wider-shallower Transformer: a 4-layer, 768-dimensional model with grouped-query attention, improving performance at matched wall-clock time. Parameters: {"layers": 4, "dimensions": 768, "heads": 12, "kv_heads": 4}
- KV head count: grouped-query attention with fewer KV heads than attention heads. Parameters: {"heads": 12, "kv_heads": 4}
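The grouped-query attention above (12 query heads sharing 4 KV heads) can be sketched as follows. This is a minimal illustration, not the run's actual code: shapes, the group-assignment convention (consecutive query heads share one KV head), and the toy dimensions are assumptions.

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=12, n_kv_heads=4):
    """Grouped-query attention sketch: n_heads query heads share n_kv_heads
    K/V heads. q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads          # query heads per KV head (3 here)
    k = np.repeat(k, group, axis=0)        # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                           # (n_heads, T, d)

# Toy shapes: 12 query heads, 4 KV heads, sequence length 5, head dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((12, 5, 8))
k = rng.standard_normal((4, 5, 8))
v = rng.standard_normal((4, 5, 8))
out = gqa_attention(q, k, v)
print(out.shape)  # (12, 5, 8)
```

Sharing KV heads shrinks the KV cache by a factor of heads/kv_heads (3x here) at a small quality cost.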
Quantization
- QAT: 8-bit; scope: model weights
- STE QAT (straight-through estimator): 8-bit; scope: model weights
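A minimal sketch of int8 fake quantization as used in STE quantization-aware training. The symmetric per-tensor scaling scheme here is an assumption; the run's exact quantizer is not specified in this entry.

```python
import numpy as np

def fake_quant_int8(w):
    """Symmetric per-tensor int8 fake quantization (quantize then dequantize).
    Assumed scheme for illustration only."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Straight-through estimator: the forward pass uses the quantized weights,
# while the backward pass treats quantization as the identity so gradients
# flow to the underlying fp weights. In PyTorch this is commonly written as:
#   w_q = w + (fake_quant_int8(w) - w).detach()

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
w_q = fake_quant_int8(w)
err = np.abs(w - w_q).max()
# Rounding error is at most half a quantization step (scale / 2).
print(err <= np.abs(w).max() / 254 + 1e-6)
```

Training against the quantized forward pass is what closes the int8 evaluation gap cited in the contributions below.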
Optimizer
- Muon: lr 0.06, grad_clip 0.5, beta2 0.99; weight_decay and momentum unspecified
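For context, one Muon step can be sketched as below: clip the gradient, accumulate momentum, then approximately orthogonalize the update with the quintic Newton-Schulz iteration from the public Muon reference implementation. The lr and grad_clip match this entry; the momentum value is an assumption (the entry leaves it unset), and the role of beta2 0.99 is not shown here.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.06, momentum=0.95, grad_clip=0.5):
    """One hypothetical Muon update; momentum=0.95 is assumed."""
    norm = np.linalg.norm(grad)
    if norm > grad_clip:                   # global-norm gradient clipping
        grad = grad * (grad_clip / norm)
    buf = momentum * buf + grad            # momentum accumulation
    w = w - lr * newton_schulz_orth(buf)   # orthogonalized update direction
    return w, buf

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
g = rng.standard_normal((64, 64))
w2, buf = muon_step(w, g, np.zeros_like(w))
print(w2.shape)  # (64, 64)
```

Orthogonalizing the momentum matrix equalizes the scale of update directions, which is why Muon tolerates the relatively large learning rate used here.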
Novel Contributions
- Wider-shallower 4x768 architecture with grouped-query attention
- Increased QK gain to sharpen attention
- Muon optimizer tuning with gradient clipping and beta2 adjustment
- Straight-through estimator quantization-aware training after warmup
- Reduced int8 quantization gap from about 0.03 to 0.0016 bpb
- Batch-size sweep on an H100, identifying 262K tokens as optimal for single-GPU training
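The "increased QK gain" contribution above can be illustrated with a toy example: scaling the query/key projection gain multiplies the attention logits, which concentrates the softmax onto the strongest keys. The 2x gain and the logit values below are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])   # toy attention logits for one query
base = softmax(logits)
sharp = softmax(2.0 * logits)             # hypothetical 2x QK gain scales logits
print(sharp.max() > base.max())           # True: higher gain sharpens attention
```

The effect is the same as lowering the softmax temperature: attention mass shifts toward the highest-scoring positions.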