PR #1344

open

Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed)

by Omrigotlieb
val_bpb: 1.0923
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.69 MB

Training Techniques

Optimizer
Muon
weight_decay: 0.105
momentum: null
other_params: {"backend_steps":4}
Architecture
depth recurrence
Shared MLP weights across recurrent depth layers to create virtual layers from fewer physical layers.
parameters: {"layers":[3,4,5],"virtual_layers":14,"physical_layers":11}
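
The depth-recurrence parameters above can be read as a mapping from virtual layer positions to physical layers. The sketch below is one plausible reading, not the PR's code: it assumes layers 3, 4, and 5 form a contiguous block that is run twice in sequence, which turns 11 physical layers into 14 virtual ones. The function name `virtual_layer_schedule` and the `passes` argument are hypothetical, and the PR does not state the actual ordering.

```python
def virtual_layer_schedule(physical_layers, recurrent_layers, passes=2):
    """Map each virtual layer position to the physical layer whose weights it reuses.

    Assumption (not stated in the PR): the recurrent layers form one contiguous
    block that is executed `passes` times back to back.
    """
    first, last = recurrent_layers[0], recurrent_layers[-1]
    assert recurrent_layers == list(range(first, last + 1)), "block must be contiguous"

    schedule = list(range(first))                        # layers before the looped block
    schedule += recurrent_layers * passes                # looped block, repeated
    schedule += list(range(last + 1, physical_layers))   # layers after the looped block
    return schedule


schedule = virtual_layer_schedule(physical_layers=11, recurrent_layers=[3, 4, 5])
print(schedule)       # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(schedule))  # 14
```

Under this reading, the repeated positions add depth without adding parameters for whatever weights are shared (the MLP weights, according to the description above).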
Regularization
weight decay
parameters: {"embed_wd":0.105,"matrix_wd":0.105}
Other
other
Polar Express Newton-Schulz orthogonalization with 4 minimax-optimal steps.
parameters: {"steps":4}
other
MuonEq-R: row-normalize gradient before Newton-Schulz orthogonalization.
parameters: null
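
The two entries above describe a pipeline: row-normalize the gradient, then orthogonalize it with a 4-step Newton-Schulz iteration. The NumPy sketch below shows that order of operations under stated assumptions. The Polar Express minimax-optimal coefficients are not given in this summary, so the classical cubic Newton-Schulz update (`1.5 * X - 0.5 * X @ X.T @ X`) is used as a stand-in. The function names and the `eps` values are hypothetical.

```python
import numpy as np


def row_normalize(grad, eps=1e-8):
    """Scale each row of the gradient to unit L2 norm (the MuonEq-R step as described)."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)


def newton_schulz(x, steps=4, eps=1e-8):
    """Push the singular values of x toward 1 with a fixed number of iterations.

    Stand-in: classical cubic Newton-Schulz coefficients, NOT the Polar Express
    minimax-optimal coefficients, which this summary does not list.
    """
    # The Frobenius norm bounds the spectral norm, so all singular values end up <= 1.
    x = x / (np.linalg.norm(x) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # work on the wide orientation so x @ x.T is the smaller product
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 8))

normalized = row_normalize(grad)
update = newton_schulz(normalized, steps=4)

before = np.linalg.svd(normalized / np.linalg.norm(normalized), compute_uv=False)
after = np.linalg.svd(update, compute_uv=False)
print(before.min(), after.min())  # the smallest singular value moves toward 1
```

With only 4 cubic steps the result is far from exactly orthogonal. Coefficients tuned per step, as the Polar Express entry describes, are presumably meant to close that gap within the same step budget.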

Novel Contributions

  • Polar Express Newton-Schulz with 4 minimax-optimal steps
  • MuonEq-R row-normalized gradient before Newton-Schulz orthogonalization
  • Depth recurrence with shared MLP weights
  • Higher weight decay (0.105 for both embedding and matrix parameters) for quantization-friendly compression
  • Tuned matrix learning rate to recover quality