- val_bpb: 1.0577
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: ~15.06 MB
## Training Techniques

### Regularization

- logit softcap (value: 15)
- weight decay (Muon: 0.012, Adam: 0.012)
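A minimal sketch of logit softcapping with the cap value of 15 from above; the helper name `softcap` is illustrative, not from the submission:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 15.0) -> np.ndarray:
    """Smoothly bound logits to (-cap, cap) via tanh.

    Near zero this is approximately the identity; large logits
    saturate at +/-cap, which regularizes the loss surface.
    """
    return cap * np.tanh(logits / cap)
```

Small logits pass through almost unchanged, while extreme logits are squashed toward the cap.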
### Quantization

- int6 (bits: 6, scope: all)
- late QAT (bits: 6, scope: all)
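One common way to implement the int6 fake-quantization step behind QAT (a sketch, not the submission's actual code): symmetric per-tensor quantization with round-to-nearest. During training the rounding would typically be bypassed in the backward pass with a straight-through estimator.

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Simulate int6 storage: quantize to 6-bit signed ints, then dequantize.

    Symmetric per-tensor scheme: integer levels in [-31, 31]
    (dropping -32 keeps the grid symmetric around zero).
    """
    scale = np.max(np.abs(w)) / 31.0
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale
```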
### Compression

- zstd (level: null)
## Optimizer

- Muon: weight_decay 0.012, momentum unspecified (matrix parameters)
- Adam: weight_decay 0.012, momentum unspecified (scalar/embed parameters)
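The Muon/Adam split is usually implemented by partitioning parameters by shape: 2-D weight matrices go to Muon, while embeddings, scalars, and 1-D gains/biases go to Adam. A minimal grouping sketch; the name-based `is_embedding` check is an illustrative convention, not the submission's code:

```python
import numpy as np

def split_param_groups(named_params: dict) -> tuple[list, list]:
    """Partition parameters: 2-D non-embedding matrices -> Muon,
    everything else (embeddings, scalars, vectors) -> Adam."""
    muon, adam = [], []
    for name, p in named_params.items():
        is_embedding = "embed" in name  # illustrative naming convention
        if p.ndim == 2 and not is_embedding:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam
```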
## Architecture

### Weight tying

Tied input and output embeddings.
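Weight tying reuses the token-embedding matrix as the output projection, a large fraction of the parameter savings in a small model. A sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50, 16
embed = rng.standard_normal((vocab, d_model)) * 0.02  # the one shared matrix

def embed_tokens(ids: np.ndarray) -> np.ndarray:
    return embed[ids]        # input side: row lookup

def logits(hidden: np.ndarray) -> np.ndarray:
    return hidden @ embed.T  # output side: project with the same matrix

h = embed_tokens(np.array([3, 7]))  # (2, d_model)
out = logits(h)                     # (2, vocab)
```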
### GQA

Grouped query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4).
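With 8 query heads and 4 KV heads, each KV head is shared by two query heads, and the K/V projections are half-width. A sketch of the forward pass (causal masking omitted for brevity; this is a generic GQA illustration, not the submission's code):

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves
    n_heads // n_kv_heads query heads (here 2).

    x: (T, D); wq: (D, D); wk, wv: (D, D // 2) since
    n_kv_heads is half of n_heads. No causal mask, for brevity.
    """
    T, D = x.shape
    hd = D // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # Broadcast each KV head across its group of query heads.
    rep = n_heads // n_kv_heads
    k = np.repeat(k, rep, axis=1)
    v = np.repeat(v, rep, axis=1)
    out = np.empty_like(q)
    for h in range(n_heads):
        att = q[:, h] @ k[:, h].T / np.sqrt(hd)
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att /= att.sum(axis=-1, keepdims=True)
        out[:, h] = att @ v[:, h]
    return out.reshape(T, D)
```

The savings come from the KV cache and the K/V projections, both halved relative to standard multi-head attention.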
### ReLU²

Squared ReLU activation in the MLP.
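Squared ReLU is a drop-in replacement for the usual MLP nonlinearity:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    """Squared ReLU: zero for negatives, x**2 for positives."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    """Two-layer MLP block with ReLU² activation."""
    return relu2(x @ w_in) @ w_out
```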
### Depthwise Conv1D

Local token mixing before transformer blocks.
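A depthwise Conv1D mixes each channel independently over a short window of recent tokens. A sketch assuming a causal filter (the kernel size of 3 is illustrative; the submission does not state one):

```python
import numpy as np

def depthwise_conv1d(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Causal depthwise convolution over the sequence axis.

    x: (T, C) token activations; kernels: (K, C), one length-K
    filter per channel. Position t mixes tokens t-K+1 .. t,
    with zero padding on the left so no future tokens leak in.
    """
    K, C = kernels.shape
    xp = np.concatenate([np.zeros((K - 1, C)), x])  # left-pad: causal
    return sum(kernels[i] * xp[i : i + len(x)] for i in range(K))
```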
### Residual mixing

Learned mixing between the current state and the initial embedding, with per-channel scaling.
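Residual mixing re-injects the original token embedding into the running hidden state with learned per-channel scales. A sketch under that reading; the parameter names `alpha` and `beta` are illustrative:

```python
import numpy as np

def residual_mix(x, x0, alpha, beta):
    """Blend current hidden state x with the initial embedding x0.

    alpha, beta: learned per-channel scales of shape (C,), so each
    channel chooses how much of the original embedding to re-inject.
    """
    return alpha * x + beta * x0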
LR Schedule
warmdown
parameters: {"warmdown_frac":0.35,"wallclock_aligned":true}
Novel Contributions
- P2 loss ((1-p)^2) for difficulty-aware training
- Wallclock-aware LR warmdown aligned to the 10-minute cap
- Residual mixing plus convolutional token mixing
- Muon optimizer for matrix parameters with Adam for scalar/embed parameters
- Compression-aware training with int6 quantization and late QAT