val_bpb: 0.9982
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.8 MB
Training Techniques

Architecture: Transformer
  10-layer, 576-dimensional Transformer, sized to use as much capacity as the artifact size limit allows.
  parameters: {"layers":10,"dimensions":576}
Quantization: mixed int5/int6
  bits: null
  scope: weights

QAT
  bits: null
  scope: weights
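As a sketch of what mixed int5/int6 weight quantization might look like: the snippet below fake-quantizes a weight matrix onto a symmetric signed b-bit grid. Everything here (per-tensor scaling, the shapes, the function name) is illustrative and not taken from the submission.

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric per-tensor quantization: map the largest-magnitude
    # weight to the top of the signed b-bit grid, round, and return
    # the dequantized (float) weights plus the scale.
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(576, 576)).astype(np.float32)
w5, s5 = fake_quant(w, bits=5)            # int5 grid
w6, s6 = fake_quant(w, bits=6)            # int6 grid
```

During QAT the forward pass would use the fake-quantized weights while the backward pass treats rounding as identity (a straight-through estimator), so the float master weights keep receiving gradients; per-channel scales usually quantize better than the per-tensor scale shown here.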
Test-Time Training: full TTT
  parameters: null
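The card gives no TTT details, but a common reading of "full TTT" for a language model is taking a few gradient steps on the test sequence's own next-token loss before scoring it. A toy bigram-model sketch under that assumption (all names and settings are illustrative):

```python
import numpy as np

V = 8                                    # toy vocabulary size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))   # "pretrained" bigram logits

def nll(W, seq):
    # Mean next-token negative log-likelihood under softmax(W[prev]).
    logits = W[seq[:-1]]
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(seq) - 1), seq[1:]].mean()

def ttt_adapt(W, seq, lr=0.5, steps=20):
    # A few gradient steps on the test sequence's own next-token loss,
    # using the closed-form softmax cross-entropy gradient.
    W = W.copy()
    n = len(seq) - 1
    for _ in range(steps):
        p = np.exp(W[seq[:-1]])
        p /= p.sum(axis=1, keepdims=True)
        g = np.zeros_like(W)
        for t in range(n):
            row = p[t].copy()
            row[seq[t + 1]] -= 1.0
            g[seq[t]] += row / n
        W -= lr * g
    return W

seq = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
W_adapted = ttt_adapt(W, seq)
```

Because the loss at each position is computed before the model predicts the next token, this form of adaptation is typically considered legal under evaluation rules.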
Optimizer: Muon
  weight_decay: null
  momentum: null
  other_params: {"schedule":"WSD"}
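Muon's defining step is orthogonalizing the momentum buffer with a Newton-Schulz iteration before applying it. A minimal NumPy sketch, with coefficients from the public Muon reference implementation; the hyperparameters are illustrative, not the submission's:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Push the singular values of G toward 1 with the quintic
    # Newton-Schulz iteration used by Muon. Assumes G has
    # rows <= cols (Muon transposes tall matrices first).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)     # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    # One Muon update: accumulate momentum, orthogonalize the
    # buffer, then take the step. Hyperparameters are illustrative.
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orth(buf), buf
```

The iteration does not converge singular values exactly to 1; the coefficients are tuned so that a few steps land them in a band around 1, which is cheap and sufficient in practice.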
LR Schedule: warmup-stable-decay
  parameters: null
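A warmup-stable-decay (WSD) schedule ramps the learning rate up, holds it flat for most of training, then decays it at the end. A minimal sketch; the phase fractions are illustrative defaults, not the submission's settings:

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.2):
    # Warmup-stable-decay: linear warmup to peak_lr, a long flat
    # "stable" phase, then linear decay to zero.
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)
```

One appeal of WSD over cosine decay is that the flat phase makes checkpoints reusable: training can be extended from the stable phase without committing to a total step count in advance.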
Novel Contributions

- 10-layer, 576-dimensional Transformer at the edge of the 16 MB artifact constraint
- Mixed-precision int5/int6 weight quantization with QAT
- Legal test-time training adaptation
- Muon optimizer with warmup-stable-decay (WSD) scheduling