PR #1332
open
Record: SP4096 + Polar Express NS + MuonEq-R + WD=0.090 — 1.0959 BPB (3-seed mean)
by Omrigotlieb
val_bpb: 1.0959
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.97 MB
Training Techniques
Architecture
XSA
Applied XSA across all 11 layers with no new parameters.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"MuonEq-R":true,"row_normalize_gradient_before_ns":true}
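The `row_normalize_gradient_before_ns` flag suggests that each row of the gradient matrix is rescaled before orthogonalization. A minimal sketch of that step follows, assuming unit L2 norm per row; the choice of norm, the `eps` value, and the helper name `row_normalize` are illustrative assumptions, not taken from the PR.

```python
import math

def row_normalize(grad, eps=1e-7):
    """Scale each row of a 2-D gradient (list of rows) to roughly unit L2 norm.

    Assumption: L2 normalization with a small eps for numerical safety.
    """
    out = []
    for row in grad:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out

# Hypothetical 2x2 gradient whose rows have very different magnitudes.
grad = [[3.0, 4.0], [0.0, 0.5]]
normalized = row_normalize(grad)
```

After this step every row contributes on the same scale to whatever orthogonalization comes next, regardless of its original magnitude.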
Regularization
weight decay
parameters: {"weight_decay":0.09}
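For orientation, one common way to apply a weight decay such as 0.09 is the decoupled form, where the weights are shrunk by a factor of `1 - lr * weight_decay` on every step, independently of the gradient update. The sketch below assumes that form; whether the PR applies decay this way is not stated, and the learning rate of 0.02 and the helper name `apply_update` are hypothetical.

```python
def apply_update(weights, update, lr, weight_decay):
    """One decoupled weight-decay step: shrink the weights, then apply the update."""
    decay = 1.0 - lr * weight_decay
    return [w * decay - lr * u for w, u in zip(weights, update)]

# With a zero update, only the decay term acts on the weights.
weights = [1.0, -2.0]
for _ in range(100):
    weights = apply_update(weights, [0.0, 0.0], lr=0.02, weight_decay=0.09)
```

Under these assumed values the weights shrink by a factor of `0.9982 ** 100` over 100 steps, which shows how a larger decay pulls weight magnitudes down over training.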
Novel Contributions
- Polar Express Newton-Schulz with per-iteration minimax polynomials and 4 steps instead of 5
- MuonEq-R row-normalization of gradients before Newton-Schulz orthogonalization
- Higher weight decay (0.090) for quantization-friendly compression
- XSA applied to all 11 layers with zero new parameters
- 3-seed mean validation result of 1.0959 BPB
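
The first contribution above, Newton-Schulz with a separate polynomial per iteration, can be sketched as a loop over a list of coefficient triples, one per step. The sketch below uses plain Python lists to stay dependency-free. The coefficients are a placeholder: the classic cubic Newton-Schulz step `(1.5, -0.5, 0.0)` repeated 4 times, not the Polar Express minimax coefficients from the PR, which are not given here. The names `newton_schulz`, `ortho_error`, and `PLACEHOLDER_COEFFS` are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def newton_schulz(grad, coeffs, eps=1e-7):
    """Approximately orthogonalize grad using one (a, b, c) triple per step.

    Each step applies X <- a*X + b*(X X^T) X + c*(X X^T)^2 X.
    """
    # Frobenius normalization keeps every singular value at or below 1.
    fro = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / (fro + eps) for v in row] for row in grad]
    for a, b, c in coeffs:
        gram = matmul(x, transpose(x))  # X X^T
        gx = matmul(gram, x)            # (X X^T) X
        ggx = matmul(gram, gx)          # (X X^T)^2 X
        x = [[a * x[i][j] + b * gx[i][j] + c * ggx[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

def ortho_error(x):
    """Largest absolute entry of X X^T - I."""
    gram = matmul(x, transpose(x))
    n = len(gram)
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))

# Placeholder schedule: NOT the Polar Express coefficients from the PR.
PLACEHOLDER_COEFFS = [(1.5, -0.5, 0.0)] * 4

# Hypothetical 2x3 gradient with singular values 2 and 1.
grad = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
update = newton_schulz(grad, PLACEHOLDER_COEFFS)
```

Passing the schedule as a list makes the step count (4 here) and the per-iteration polynomials independent knobs, so a tuned set of coefficients could be swapped in without changing the loop.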