PR #1332

open

Record: SP4096 + Polar Express NS + MuonEq-R + WD=0.090 — 1.0959 BPB (3-seed mean)

by Omrigotlieb
val_bpb: 1.0959
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.97 MB

Training Techniques

Architecture
XSA
Applied XSA across all 11 layers with no new parameters.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"MuonEq-R":true,"row_normalize_gradient_before_ns":true}
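The `row_normalize_gradient_before_ns` flag suggests that each row of a 2-D gradient is rescaled to unit L2 norm before it is passed to Newton-Schulz. A minimal sketch of that step follows, assuming a plain list-of-lists gradient; the function name `row_normalize` and the `eps` guard are illustrative and are not taken from the PR.

```python
import math


def row_normalize(grad, eps=1e-7):
    """Rescale every row of a 2-D gradient to (approximately) unit L2 norm.

    `grad` is a list of rows; `eps` is a hypothetical guard against
    division by zero for all-zero rows.
    """
    normalized = []
    for row in grad:
        norm = math.sqrt(sum(g * g for g in row))
        normalized.append([g / (norm + eps) for g in row])
    return normalized


# Rows with very different magnitudes end up on the same scale.
grad = [[3.0, 4.0], [0.0, 0.5]]
print(row_normalize(grad))
```

After this step every nonzero row contributes with the same norm, so rows with large raw gradients no longer dominate the matrix handed to the orthogonalization.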
Regularization
weight decay
parameters: {"weight_decay":0.09}
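For orientation, weight decay is often applied in decoupled form, shrinking each weight by a factor of `1 - lr * weight_decay` per step before the optimizer update is subtracted. The sketch below assumes that form; the summary does not say whether the PR uses decoupled decay, and the function name `apply_update` and the learning rate are hypothetical.

```python
def apply_update(weights, update, lr, weight_decay=0.09):
    """One step of w <- w * (1 - lr * weight_decay) - lr * update.

    Assumes decoupled weight decay; `lr` and `update` are illustrative inputs.
    """
    decay = 1.0 - lr * weight_decay
    return [w * decay - lr * u for w, u in zip(weights, update)]


# With a zero update, only the decay term acts on the weight.
print(apply_update([1.0], [0.0], lr=0.1))
```

A larger `weight_decay` pulls weights toward zero more strongly, which is consistent with the summary's claim that the higher value helps quantized compression.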

Novel Contributions

  • Polar Express Newton-Schulz with per-iteration minimax polynomials and 4 steps instead of 5
  • MuonEq-R row-normalization of gradients before Newton-Schulz orthogonalization
  • Higher weight decay (0.090) for quantization-friendly compression
  • XSA applied to all 11 layers with zero new parameters
  • 3-seed mean validation result of 1.0959 BPB
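The first contribution can be pictured as a Newton-Schulz loop in which every iteration receives its own polynomial coefficients rather than one fixed triple. The sketch below uses the odd quintic form X ← aX + (bA + cA²)X with A = XXᵀ, as in common Muon implementations. The coefficients shown are the classical cubic Newton-Schulz values (1.5, -0.5, 0), repeated four times purely as placeholders; they are not the Polar Express minimax coefficients, which this summary does not list. The helper names are hypothetical.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(G, coeffs, eps=1e-7):
    """Approximately orthogonalize G, one (a, b, c) triple per iteration."""
    # Normalize by the Frobenius norm so all singular values lie in [0, 1].
    fro = math.sqrt(sum(g * g for row in G for g in row))
    X = [[g / (fro + eps) for g in row] for row in G]
    for a, b, c in coeffs:
        A = matmul(X, transpose(X))  # A = X X^T
        A2 = matmul(A, A)
        n = len(A)
        B = [[b * A[i][j] + c * A2[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, X)
        # X <- a X + (b A + c A^2) X
        X = [[a * x + bx for x, bx in zip(rx, rbx)] for rx, rbx in zip(X, BX)]
    return X


# Placeholder schedule: 4 steps, NOT the Polar Express coefficients.
PLACEHOLDER_COEFFS = [(1.5, -0.5, 0.0)] * 4

G = [[2.0, 0.0], [0.0, 1.0]]
X = newton_schulz(G, PLACEHOLDER_COEFFS)
print(X)
```

Swapping in a different triple for each step only changes the `coeffs` list, which is what lets a per-iteration schedule reach a given accuracy in fewer steps than a single fixed polynomial.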