val_bpb: 1.1200
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques

Quantization
- QAT (bits: 6, scope: all)
- mixed int4/int6 (bits: 4 for MLP weights, 6 for attention weights; scope: MLP and attention)
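A minimal sketch of how the mixed int4/int6 quantization-aware training above could be wired in PyTorch, assuming symmetric per-tensor fake quantization with a straight-through estimator; the module names and dimensions are illustrative, not taken from this submission.

```python
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for int4, 31 for int6
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (q - w).detach()                      # forward uses q, backward treats it as identity

class QATLinear(nn.Linear):
    """Linear layer whose weight is fake-quantized to `bits` in the forward pass."""
    def __init__(self, in_features, out_features, bits, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.bits = bits

    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight, self.bits), self.bias)

# Illustrative bit assignment mirroring the mixed int4/int6 entry above.
d_model = 512
attn_out_proj = QATLinear(d_model, d_model, bits=6)      # attention weights: int6
mlp_up_proj = QATLinear(d_model, 4 * d_model, bits=4)    # MLP weights: int4
```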
Regularization
- weight decay (parameters: null)
Other
- Frobenius-norm penalty applied every 50 steps to encourage low-rank structure for better downstream compression (parameters: {"interval_steps": 50, "type": "Frobenius-norm penalty"})
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"NuMuon-lite"}
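Muon, as commonly described, updates each 2-D weight matrix with a momentum-smoothed gradient that is approximately orthogonalized by a few Newton-Schulz iterations. The card does not say what the NuMuon-lite variant changes, so the sketch below covers only a base Muon-style step; the learning rate, momentum, and iteration count are assumptions.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately push g's singular values toward 1 with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                                      # keep rows <= cols for the iteration
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One Muon-style update on a single 2-D parameter, applied in place."""
    momentum_buf.mul_(momentum).add_(grad)           # momentum-smoothed gradient
    weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)

# Example: one step on one weight matrix with a random gradient.
w = torch.randn(256, 128)
buf = torch.zeros_like(w)
muon_step(w, torch.randn_like(w), buf)
```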
Novel Contributions
- QAT fused into the cooldown phase instead of applying GPTQ only after training
- Mixed-precision weight quantization: INT4 MLP weights and INT6 attention weights
- NuMuon-lite Frobenius-norm regularization to encourage low-rank structure and improve GPTQ+Brotli compression
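On the first contribution, a sketch of how QAT could be fused into the cooldown phase: fake quantization is switched on only once the learning-rate cooldown begins, rather than running a separate post-training pass. The trapezoidal schedule, the 20% cooldown fraction, and the set_fake_quant toggle are assumptions, not details from this submission.

```python
# Hypothetical schedule helper: full-precision training in the stable phase,
# fake quantization enabled for the cooldown phase while the LR decays to zero.
def lr_and_qat(step: int, total_steps: int, base_lr: float = 0.02,
               cooldown_frac: float = 0.2) -> tuple[float, bool]:
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < cooldown_start:
        return base_lr, False                        # stable phase: QAT off
    remaining = (total_steps - step) / max(total_steps - cooldown_start, 1)
    return base_lr * remaining, True                 # cooldown phase: LR decays, QAT on

# Usage inside a training loop (set_fake_quant is a hypothetical toggle):
# lr, qat_active = lr_and_qat(step, total_steps)
# for group in optimizer.param_groups:
#     group["lr"] = lr
# model.set_fake_quant(qat_active)
```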