PR #1831
open
Non-record: Polar Express NS Coefficient Ablation on #1809 (val_bpb 1.08154)
by Christopher-Lee-McClendon
val_bpb
1.0815
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,974,228 bytes
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz_steps":5}
Architecture
depth recurrence
3-layer recurrence stack architecture used in the baseline submission.
parameters: {"layers":3}
Quantization
QAT
bits: 5
scope: model weights
Evaluation
sliding window eval
parameters: null
Test-Time Training
full TTT
parameters: null
Weight Averaging
EMA
parameters: null
Compression
zstd
level: null
Novel Contributions
- Ablation study comparing Polar Express per-iteration Newton-Schulz coefficients against fixed coefficients on PR #1809's architecture
- Found that Polar Express slightly worsened validation BPB relative to fixed Newton-Schulz coefficients
- Reported consistent degradation across TTT, sliding window, quantized-only, and pre-quant evaluation modes
- Showed that the fixed coefficients (3.4445, -4.775, 2.0315) gave lower validation BPB than the Polar Express per-iteration coefficients in this specific 5-step Newton-Schulz setup
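
The difference between the two variants compared above can be sketched as a single Newton-Schulz routine that takes one (a, b, c) triple per step. This is a minimal illustration, not the PR's code: the names `matmul`, `transpose`, `newton_schulz`, and `eps` are hypothetical, plain Python lists stand in for tensors, and normalizing by the Frobenius norm is an assumption based on common Muon implementations. The fixed variant repeats the triple quoted above for all 5 steps. A Polar Express style variant would pass a different triple per step. Those per-step values are not given in this summary, so they are not reproduced here.

```python
from math import sqrt

# Fixed quintic coefficients quoted in the PR summary.
FIXED_COEFFS = (3.4445, -4.775, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, schedule, eps=1e-7):
    """Run one quintic Newton-Schulz step per (a, b, c) triple in `schedule`.

    Each step computes X <- a*X + (b*A + c*A@A) @ X, with A = X @ X^T.
    The input is first scaled by its Frobenius norm (an assumption here).
    """
    norm = sqrt(sum(v * v for row in g for v in row))
    x = [[v / (norm + eps) for v in row] for row in g]
    for a, b, c in schedule:
        gram = matmul(x, transpose(x))   # A = X X^T
        gram2 = matmul(gram, gram)       # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)             # (b*A + c*A^2) X
        x = [[a * v + w for v, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return x


# Fixed-coefficient variant: the same triple for each of the 5 steps.
fixed_schedule = [FIXED_COEFFS] * 5
update = newton_schulz([[3.0, 0.0], [0.0, 1.0]], fixed_schedule)
```

With this shape, swapping in per-iteration coefficients only changes the `schedule` argument, which keeps the rest of the optimizer step identical across the two arms of an ablation.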