PR #1344
open
Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed)
by Omrigotlieb
val_bpb
1.0923
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.69 MB
Training Techniques
Optimizer
Muon
weight_decay: 0.105
momentum: null
other_params: {"backend_steps":4}
Architecture
depth recurrence
Shares MLP weights across the recurrent depth layers (3, 4, 5), creating 14 virtual layers from 11 physical layers.
parameters: {"layers":[3,4,5],"virtual_layers":14,"physical_layers":11}
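One way to read these parameters: physical layers 3, 4, and 5 are each executed twice, so 11 physical layers unroll into 14 virtual layers. The sketch below builds such an execution schedule. The function name `virtual_schedule`, the `repeats` argument, and the choice to replay the block immediately after its first pass are assumptions; this summary does not say how the repeated layers are ordered.

```python
def virtual_schedule(physical_layers, recurrent, repeats=2):
    """Return the physical layer index executed at each virtual depth.

    Assumption: the recurrent block is contiguous and is run `repeats`
    times back to back before the remaining layers continue.
    """
    first, last = recurrent[0], recurrent[-1]
    schedule = []
    for i in range(physical_layers):
        schedule.append(i)
        if i == last:
            # Replay the whole recurrent block (repeats - 1) more times.
            for _ in range(repeats - 1):
                schedule.extend(range(first, last + 1))
    return schedule


schedule = virtual_schedule(11, [3, 4, 5])
print(len(schedule))  # number of virtual layers
print(schedule)
```

Because the schedule only reuses indices, the parameter count stays that of 11 physical layers while the forward pass has the depth of 14.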
Regularization
weight decay
parameters: {"embed_wd":0.105,"matrix_wd":0.105}
Other
other
Polar Express Newton-Schulz orthogonalization with 4 minimax-optimal steps.
parameters: {"steps":4}
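For context, a Newton-Schulz iteration approximately orthogonalizes a matrix using only matrix products. The sketch below uses the classic cubic coefficients (1.5, -0.5) as a stand-in; the minimax-optimal Polar Express coefficients used in this PR are not listed in this summary and are not reproduced here. All helper names are illustrative.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=4):
    """Approximately orthogonalize g with the classic cubic iteration.

    Assumption: coefficients (1.5, -0.5) stand in for the Polar Express
    schedule, which this summary does not specify.
    """
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Cubic step: X <- 1.5 X - 0.5 (X X^T) X
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x


g = [[3.0, 1.0], [1.0, 2.0]]
x = newton_schulz(g, steps=4)
gram = matmul(x, transpose(x))  # close to the identity if x is near-orthogonal
print(gram)
```

With only 4 steps the cubic iteration is still visibly short of exact orthogonality on this input, which is the motivation for step-specific tuned coefficients: they aim to get more out of a fixed, small step budget.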
other
MuonEq-R: row-normalize gradient before Newton-Schulz orthogonalization.
parameters: null
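A minimal sketch of the row-normalization step, assuming each gradient row is divided by its L2 norm before the matrix is passed to Newton-Schulz. The choice of norm, the `eps` value, and the function name `row_normalize` are assumptions; the summary only says the gradient is row-normalized.

```python
import math


def row_normalize(grad, eps=1e-7):
    """Scale every row of grad to (approximately) unit L2 norm."""
    out = []
    for row in grad:
        norm = math.sqrt(sum(v * v for v in row))
        # eps guards against division by zero for all-zero rows.
        out.append([v / (norm + eps) for v in row])
    return out


grad = [[3.0, 4.0], [0.0, 0.5]]
normalized = row_normalize(grad)
print(normalized)
```

After this step a row with a large gradient and a row with a small one contribute at the same scale to the orthogonalized update.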
Novel Contributions
- Polar Express Newton-Schulz with 4 minimax-optimal steps
- MuonEq-R row-normalized gradient before Newton-Schulz orthogonalization
- Depth recurrence with shared MLP weights
- Higher weight decay for quantization-friendly compression
- Tuned matrix learning rate to recover quality