PR #1297

closed

Record: Polar Express NS + SLOT + MuonEq-R + XSA-all — 1.1043 BPB (3-seed mean)

by Omrigotlieb
val_bpb: 1.1043
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.82 MB

Training Techniques

Architecture
XSA
Applied XSA to all transformer layers.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":4,"variant":"MuonEq-R"}
Evaluation
sliding window eval
parameters: {"stride":64}
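As a rough illustration of what a stride-64 sliding-window evaluation could look like, the sketch below lists the windows and the tokens each one scores. The function name `sliding_windows`, the rule that the first window scores all of its tokens, and the rule that each later window scores only tokens not yet scored are assumptions for illustration, not details taken from this PR.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) triples covering n_tokens.

    Tokens in [score_from, end) are scored in that window; tokens in
    [start, score_from) are fed to the model only as context.
    """
    windows = []
    scored_upto = 0  # index of the first token that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


if __name__ == "__main__":
    for start, end, score_from in sliding_windows(1200):
        print(f"window [{start}, {end}) scores tokens [{score_from}, {end})")
```

Under these assumptions every token is scored exactly once, and every token after the first window is predicted with at least `window - stride` tokens of preceding context.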
Test-Time Training
SLOT
parameters: {"steps":8,"learning_rate":0.005}
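A minimal sketch of the idea behind eval-time delta optimization with frozen weights, using the listed 8 steps and learning rate 0.005. Everything else is an assumption made for illustration: the delta is a single vector added to the final hidden states, it is fit by plain gradient descent, and the hidden states, output projection `w_out`, and targets are random toy data rather than anything from this PR.

```python
import numpy as np


def cross_entropy(hidden, delta, w_out, targets):
    """Mean cross-entropy with `delta` added to every hidden state."""
    logits = (hidden + delta) @ w_out                      # shape (T, V)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(targets)), targets].mean()
    return loss, np.exp(log_probs)


def slot_delta(hidden, w_out, targets, steps=8, lr=0.005):
    """Fit one additive delta by gradient descent; w_out is never updated."""
    delta = np.zeros(hidden.shape[1])
    for _ in range(steps):
        _, probs = cross_entropy(hidden, delta, w_out, targets)
        probs[np.arange(len(targets)), targets] -= 1.0     # per-position dCE/dlogits
        grad = (probs @ w_out.T).mean(axis=0)              # dLoss/ddelta, shape (d,)
        delta -= lr * grad
    return delta


rng = np.random.default_rng(0)
hidden = rng.normal(size=(32, 8))         # toy final hidden states (T=32, d=8)
w_out = 0.5 * rng.normal(size=(8, 16))    # toy frozen output projection (V=16)
targets = rng.integers(0, 16, size=32)    # toy target token ids

loss_before, _ = cross_entropy(hidden, np.zeros(8), w_out, targets)
delta = slot_delta(hidden, w_out, targets)
loss_after, _ = cross_entropy(hidden, delta, w_out, targets)
```

Only `delta` changes between `loss_before` and `loss_after`; the model parameters are left as they were.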
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
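The listing gives only the warmdown length, so the following is a sketch of one common shape: a constant learning rate followed by a linear decay to zero over the final 4000 steps. The linear shape, the function name `lr_scale`, and the `total_steps` argument are assumptions, not details confirmed by this PR.

```python
def lr_scale(step, total_steps, warmdown_steps=4000):
    """Multiplier on the base learning rate: 1.0, then linear decay to 0.0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


if __name__ == "__main__":
    total = 10_000  # hypothetical run length, chosen only for this example
    for step in (0, 6_000, 8_000, 10_000):
        print(step, lr_scale(step, total))
```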
Sequence Length
sequence_length
train_length: null
eval_length: 1024

Novel Contributions

  • Polar Express Newton-Schulz with per-iteration minimax-optimal polynomials
  • SLOT eval-time delta optimization with frozen weights
  • MuonEq-R row-normalized gradient before Newton-Schulz orthogonalization
  • XSA applied across all 11 layers
  • Sliding-window evaluation with stride 64
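
The first and third items above can be sketched together: normalize each gradient row, then run a Newton-Schulz iteration that takes one polynomial coefficient triple per step. This is a hedged illustration only. The classical cubic coefficients stand in for the Polar Express per-iteration polynomials, whose values are not reproduced here, and the names `row_normalize` and `newton_schulz` are hypothetical.

```python
import numpy as np

# Classical cubic Newton-Schulz step, X <- 1.5 X - 0.5 (X X^T) X, used as a
# stand-in. A Polar Express schedule would supply a different (a, b, c)
# triple for each iteration.
CUBIC = (1.5, -0.5, 0.0)


def row_normalize(grad, eps=1e-7):
    """Scale each row of the gradient to (approximately) unit L2 norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)


def newton_schulz(grad, coeffs, eps=1e-7):
    """Approximately orthogonalize `grad`, one (a, b, c) triple per iteration."""
    x = grad / (np.linalg.norm(grad) + eps)   # Frobenius norm keeps singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                            # iterate on the wide orientation
        x = x.T
    for a, b, c in coeffs:
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x


rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 8))                                # toy gradient matrix
update = newton_schulz(row_normalize(grad), [CUBIC] * 4)      # 4 steps, matching backend_steps
converged = newton_schulz(row_normalize(grad), [CUBIC] * 30)  # many steps, for comparison
```

With the stand-in cubic, four steps only move the singular values toward 1, while the long run yields nearly orthonormal rows. Per-iteration coefficients are meant to close that gap within a small step budget.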