val_bpb: 1.1388
Architecture: Transformer
Optimizer: Muon
Artifact size: 15.85 MB
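For reference, bits per byte (val_bpb) is the validation loss converted from nats to bits and normalized by the number of bytes evaluated. A minimal sketch of the standard conversion (not the record's actual evaluation code):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) over a
    validation set into bits per byte: divide by ln(2) to get bits,
    then normalize by the byte count."""
    return total_nll_nats / (math.log(2) * total_bytes)
```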
Training Techniques
Quantization
- int6 QAT (bits: 6, scope: all)
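The core of int6 QAT is a fake-quantization op in the forward pass: weights are rounded to one of 64 levels, while the straight-through estimator (STE) treats the op as the identity in the backward pass so gradients still reach the float master weights. A minimal numpy sketch of the symmetric per-tensor forward step (illustrative; not the record's implementation):

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake quantization (forward pass only).
    int6 levels span [-32, 31]; weights are scaled, rounded, and
    dequantized so downstream layers see quantized values.  In a
    framework, the STE makes the backward pass treat this as identity."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized weights
```

In an autograd framework the STE is typically expressed as `w + (fake_quant(w) - w).detach()`, so the rounding affects the forward value but not the gradient.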
Architecture
- MLP3x: expanded MLP capacity to 3x size using the space saved by int6 quantization.
- SmearGate: adds a complementary bigram-context signal at the embedding layer.
- BigramHash: adds a bigram-context hashing signal at the embedding layer.
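The BigramHash idea can be sketched as hashing each (previous, current) token pair into a small learned table and adding the looked-up vector to the token embedding. The table size, hash constant, and start-of-sequence convention below are illustrative assumptions, not the record's implementation (SmearGate's gating is likewise not shown):

```python
import numpy as np

def bigram_hash_signal(tokens, table, mult=0x9E3779B1):
    """Return one embedding-sized vector per position by hashing the
    (previous, current) token pair into `table`.  The result would be
    added to the ordinary token embedding as an extra context signal."""
    n_bins, dim = table.shape
    out = np.empty((len(tokens), dim), dtype=table.dtype)
    prev = 0  # assumed placeholder id for "no previous token"
    for i, cur in enumerate(tokens):
        h = (prev * mult + cur) % n_bins
        out[i] = table[h]
        prev = cur
    return out
```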
Initialization
- Orthogonal init: orthogonal weight initialization to accelerate early convergence.
Regularization
- weight decay (weight_decay: 0.04)
Optimizer
- Muon (weight_decay: 0.04, momentum: null, decoupled_weight_decay: true)
Weight Averaging
- SWA (interval_steps: 50, start_fraction: 0.5)
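SWA with these parameters means snapshotting every 50 steps once the run is past 50% of training, and keeping a running mean of the snapshots. A minimal sketch of the schedule check and the incremental averaging rule (surrounding training-loop plumbing assumed):

```python
import numpy as np

def swa_should_update(step, total_steps, interval_steps=50, start_fraction=0.5):
    """Schedule from the record: average every `interval_steps` steps
    once training is past `start_fraction` of `total_steps`."""
    return step >= total_steps * start_fraction and step % interval_steps == 0

def swa_update(avg, w, n_averaged):
    """Incremental running mean over the `n_averaged` snapshots
    collected so far, updated with the new weights `w`."""
    return avg + (w - avg) / (n_averaged + 1)
```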
Evaluation
- sliding window eval (stride: 64)
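Sliding-window evaluation slides a fixed context window over the validation stream in steps of the stride, scoring only the final stride's worth of tokens in each window so every token is predicted once with substantial left context. A sketch of the span bookkeeping (the window size is an illustrative assumption; the record only fixes stride = 64):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Spans for sliding-window evaluation.  Each (ctx_start,
    score_start, end) tuple means: feed tokens[ctx_start:end] to the
    model, but count loss only for positions [score_start, end).  The
    scored spans tile [0, n_tokens) exactly once."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```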
Novel Contributions
- Int6 QAT with STE enabled from 30% of training onward to reduce post-training quantization penalty
- 3x MLP expansion funded by the byte savings from int6 quantization
- SmearGate and BigramHash as complementary bigram-context signals at the embedding layer
- Orthogonal initialization and output-projection scaling for faster early convergence
- Muon optimizer with decoupled weight decay of 0.04 to improve quantization quality
- SWA applied at 50-step intervals over the last 50% of training
- Sliding-window evaluation with stride 64