PR #1331
Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 — val_bpb 1.0900 (3-seed mean)
Status: open
by dexhunter
val_bpb: 1.0900
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.96 MB
Training Techniques
Architecture
depth recurrence
Runs layers 3, 4, and 5 more than once, reusing their weights to create virtual layers and improve the compression/quality tradeoff.
parameters: {"layers":[3,4,5]}
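A minimal sketch of how repeating layers 3, 4, and 5 could produce virtual layers. The function name `virtual_layer_order`, the total of 8 physical layers, and the choice to run the repeated block immediately after its first pass are assumptions for illustration; the summary above only states which layers recur.

```python
def virtual_layer_order(num_layers, recurrent, extra_passes=1):
    """Return the physical layer index executed at each virtual depth.

    Hypothetical helper: assumes `recurrent` is a contiguous block of
    layer indices that is re-run right after its first pass.
    """
    block = sorted(recurrent)
    if block != list(range(block[0], block[-1] + 1)):
        raise ValueError("recurrent layers must be contiguous")
    if block[0] < 0 or block[-1] >= num_layers:
        raise ValueError("recurrent layers must exist in the model")

    order = list(range(block[-1] + 1))              # layers 0 .. end of block
    for _ in range(extra_passes):
        order.extend(block)                         # same weights, run again
    order.extend(range(block[-1] + 1, num_layers))  # remaining layers
    return order


# 8 physical layers is an assumed value, not taken from the PR.
order = virtual_layer_order(num_layers=8, recurrent=[3, 4, 5])
print(order)            # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
print(len(order))       # 11 virtual layers
print(len(set(order)))  # 8 physical layers stored in the artifact
```

In this sketch the virtual depth grows by three while the number of stored layers is unchanged, which is the sense in which recurrence trades extra compute for artifact size.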
Quantization
int6
bits: 6
scope: all
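A rough sketch of what 6-bit quantization of a weight row can look like. The summary does not describe the scheme used in the PR, so the symmetric range [-31, 31], the per-row scale, and the names `quantize_int6` and `dequantize_int6` are all assumptions.

```python
def quantize_int6(row):
    """Quantize a list of floats to signed 6-bit integers with one scale.

    Assumed scheme: symmetric, per-row scale, values clamped to [-31, 31]
    (a symmetric subset of the signed 6-bit range [-32, 31]).
    """
    qmax = 31
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale


def dequantize_int6(q, scale):
    """Map 6-bit integers back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int6(row)
restored = dequantize_int6(q, scale)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

Each weight then needs 6 bits plus a share of the per-row scale, compared with 8 bits for int8 or 16 for half precision.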
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022}
Regularization
weight decay
parameters: {"weight_decay":0.095}
Novel Contributions
- 3-layer depth recurrence using layers 3, 4, and 5
- Joint tuning of weight decay (0.095) and matrix learning rate (0.022) to recover quality while fitting within the artifact budget
- Int6 quantization applied to all layers, keeping the artifact (~15.96 MB) under the 16 MB limit
- Record-setting 3-seed mean validation BPB of 1.0900
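The arithmetic behind the int6 and 16 MB points above can be sketched as follows. This is only a back-of-the-envelope estimate: it reads 16 MB as 16 × 10^6 bytes, assumes tightly packed weights, and ignores quantization scales, code, and any compression. The helper names and the 20M parameter count are hypothetical and not taken from the PR.

```python
def packed_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to store num_params values at `bits` bits each, tightly packed."""
    return (num_params * bits + 7) // 8  # integer ceiling of bits / 8


def max_params_within(budget_bytes: int, bits: int) -> int:
    """Largest parameter count whose packed weights fit in budget_bytes."""
    return (budget_bytes * 8) // bits


BUDGET_BYTES = 16_000_000  # assumption: 16 MB = 16 * 10**6 bytes

print(max_params_within(BUDGET_BYTES, 6))  # 21333333 parameters at 6 bits
print(max_params_within(BUDGET_BYTES, 8))  # 16000000 parameters at 8 bits

# A hypothetical 20M-parameter model: fits at int6, not at int8.
print(packed_bytes(20_000_000, 6))  # 15000000
print(packed_bytes(20_000_000, 8))  # 20000000
```

Under these assumptions, dropping from 8 to 6 bits per weight leaves room for about a third more parameters in the same budget.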