PR #1297
Status: closed
Record: Polar Express NS + SLOT + MuonEq-R + XSA-all — 1.1043 BPB (3-seed mean)
by Omrigotlieb
val_bpb
1.1043
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.82 MB
Training Techniques
Architecture
XSA
Applied XSA to all 11 transformer layers.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":4,"variant":"MuonEq-R"}
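The MuonEq-R idea (per the contributions list below: row-normalize the gradient, then orthogonalize it with a Newton-Schulz-style polynomial iteration) can be sketched in NumPy. This is a minimal illustration, not the PR's implementation: the helper name and the exact placement of the row normalization are assumptions, and the classical cubic coefficients (1.5, -0.5) are used as a stand-in schedule. The Polar Express approach replaces that fixed cubic with per-iteration minimax-optimal polynomials precisely so that a few steps (the record's backend_steps: 4) suffice; the stand-in here needs more iterations.

```python
import numpy as np

def muon_eq_r_orthogonalize(grad, ns_coeffs, eps=1e-8):
    """Hypothetical sketch: row-normalize a gradient matrix (the 'Eq-R'
    step), then run a Newton-Schulz-style polynomial iteration
    X <- a*X + b*(X @ X.T) @ X to push it toward its nearest
    semi-orthogonal factor. `ns_coeffs` holds one (a, b) pair per step."""
    # MuonEq-R step (as described in the record): equalize row norms
    # before orthogonalization.
    row_norms = np.linalg.norm(grad, axis=1, keepdims=True)
    X = grad / (row_norms + eps)
    # Scale so all singular values lie in (0, 1], a standard
    # precondition for Newton-Schulz convergence.
    X = X / (np.linalg.norm(X) + eps)
    tall = X.shape[0] > X.shape[1]
    if tall:  # iterate on the short side so X @ X.T is the small Gram matrix
        X = X.T
    for a, b in ns_coeffs:
        X = a * X + b * (X @ X.T) @ X
    return X.T if tall else X

# Classical cubic coefficients as a stand-in schedule (NOT the
# minimax-optimal Polar Express polynomials, which differ per step).
coeffs = [(1.5, -0.5)] * 16
G = np.random.default_rng(0).normal(size=(4, 6))
O = muon_eq_r_orthogonalize(G, coeffs)
```

After enough iterations `O @ O.T` is close to the identity, i.e. the update direction has been orthogonalized as in Muon-family optimizers.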
Evaluation
sliding window eval
parameters: {"stride":64}
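The sliding-window eval with stride 64 and eval context 1024 can be planned with pure index arithmetic. A minimal sketch, assuming the common convention that each window after the first scores only its last `stride` positions, so every token is scored exactly once with near-full left context; the function name and the tail handling are illustrative, not taken from the PR:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Plan a sliding-window eval: each triple (start, end, score_start)
    means "run the model on tokens [start, end) but only count the loss
    on positions [score_start, end)". Every position from 1 to
    n_tokens - 1 is scored exactly once, and all but the earliest
    positions see at least window - stride tokens of left context."""
    if n_tokens <= window:
        return [(0, n_tokens, 1)]  # position 0 has no context to predict it
    spans = [(0, window, 1)]       # first window scores everything it covers
    scored_to = window
    while scored_to < n_tokens:
        end = min(scored_to + stride, n_tokens)
        spans.append((end - window, end, scored_to))
        scored_to = end
    return spans

spans = sliding_window_spans(1200, window=1024, stride=64)
```

The smaller the stride, the more forward passes per token but the more context each scored position gets; stride 64 against a 1024 context keeps at least 960 tokens of history for every scored position.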
Test-Time Training
SLOT
parameters: {"steps":8,"learning_rate":0.005}
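SLOT-style test-time training keeps all model weights frozen and fits only a small per-sample delta on the prompt itself before scoring. A toy NumPy sketch under that reading, using steps=8 and learning_rate=0.005 from the record; attaching the delta to the final hidden states, and all array shapes, are illustrative assumptions rather than the PR's setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prompt_nll(W, H, targets, delta=0.0):
    """Mean next-token NLL of the prompt under frozen head W."""
    probs = softmax((H + delta) @ W.T)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

def slot_adapt(W, H, targets, steps=8, lr=5e-3):
    """Toy SLOT sketch: the only trainable parameter is one per-sample
    vector `delta` added to every final hidden state. It is fitted by
    gradient descent on the prompt's own next-token NLL while W and H
    (i.e. the model) stay frozen, then reused for the continuation."""
    delta = np.zeros(H.shape[1])
    for _ in range(steps):
        probs = softmax((H + delta) @ W.T)               # (T, V)
        probs[np.arange(len(targets)), targets] -= 1.0   # dNLL/dlogits
        delta -= lr * (probs.mean(axis=0) @ W)           # chain rule to delta
    return delta

rng = np.random.default_rng(1)
W = rng.normal(size=(50, 16))     # frozen unembedding: vocab 50, hidden 16
H = rng.normal(size=(20, 16))     # frozen hidden states for a 20-token prompt
y = rng.integers(0, 50, size=20)  # next-token targets within the prompt
delta = slot_adapt(W, H, y)       # steps=8, lr=0.005 as in the record
```

Because only one d-dimensional vector is optimized, the per-sample cost is a handful of cheap forward/backward passes through the head, not a full fine-tune.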
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
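With warmdown_steps: 4000 the schedule presumably holds the base LR constant and then decays it linearly over the final 4000 steps. A minimal sketch; the decay-to-zero endpoint is an assumption, since the record specifies only the warmdown length:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Constant LR, then a linear 'warmdown' to zero over the final
    warmdown_steps. (Decay target of zero is assumed, not stated.)"""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

For a 10,000-step run this gives the full base LR through step 6,000 and a straight-line ramp down to zero at step 10,000.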
Sequence Length
sequence_length
train_length: null
eval_length: 1024
Novel Contributions
- Polar Express Newton-Schulz with per-iteration minimax-optimal polynomials
- SLOT eval-time delta optimization with frozen weights
- MuonEq-R row-normalized gradient before Newton-Schulz orthogonalization
- XSA applied across all 11 layers
- Sliding-window evaluation with stride 64