PR #1298
openRecord: Polar Express NS + SLOT + MuonEq-R + XSA-all — 1.1043 BPB (3-seed mean)
by Omrigotlieb
val_bpb
1.1043
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.82 MB
Training Techniques
Architecture
BigramHash
Bigram vocabulary embedding/hash component used in the model.
parameters: {"vocab_size":1536}
XSA
Exclusive self-attention applied across all layers.
parameters: {"layers":11}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
MLP3x
Three-layer MLP stack with LeakyReLU squared activation.
parameters: {"activation":"LeakyReLU²"}
VE128
Value residual enhancement on selected layers.
parameters: {"dimensions":128,"layers":[9,10]}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"muon_backend_steps":4}
Muon
weight_decay: null
momentum: null
other_params: {"polar_express_steps":4,"muon_eq_r":true}
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"every":50}
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
SLOT
parameters: {"steps":8,"learning_rate":0.005}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Quantization
GPTQ-lite
bits: 6
scope: all
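As a deliberately simplified stand-in for the quantizer (real GPTQ adds Hessian-based error compensation, which a "lite" variant presumably trims), per-channel symmetric 6-bit round-to-nearest looks like:

```python
import torch

def quantize_6bit(w: torch.Tensor):
    qmax = 2 ** (6 - 1) - 1                      # symmetric int6 grid: [-31, 31]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale               # int6 codes stored in int8

w = torch.randn(768, 768)
q, scale = quantize_6bit(w)
w_hat = q.float() * scale                        # dequantized weights
```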
Compression
lzma
level: 9
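The final artifact compression as listed, over the serialized checkpoint (file names are placeholders):

```python
import lzma

with open("checkpoint.bin", "rb") as f:
    raw = f.read()
with lzma.open("checkpoint.bin.xz", "wb", preset=9) as f:
    f.write(raw)
```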
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Novel Contributions
- Polar Express Newton-Schulz with per-iteration minimax-optimal polynomials
- SLOT eval-time delta optimization
- MuonEq-R row-normalized gradient reparameterization
- XSA extended to all 11 layers
- 3-seed mean validation result of 1.1043 BPB