PR #1831

open

Non-record: Polar Express NS Coefficient Ablation on #1809 (val_bpb 1.08154)

by Christopher-Lee-McClendon
val_bpb: 1.0815
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,974,228 bytes

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz_steps":5}
Architecture
depth recurrence
3-layer recurrence stack architecture used in the baseline submission.
parameters: {"layers":3}
Quantization
QAT
bits: 5
scope: model weights
Evaluation
sliding window eval
parameters: null
Test-Time Training
full TTT
parameters: null
Weight Averaging
EMA
parameters: null
Compression
zstd
level: null

Novel Contributions

  • Ablation study comparing Polar Express per-iteration Newton-Schulz coefficients against fixed coefficients on PR #1809's architecture
  • Found that Polar Express slightly worsened validation BPB relative to fixed Newton-Schulz coefficients
  • Reported consistent degradation across TTT, sliding window, quantized-only, and pre-quant evaluation modes
  • Demonstrated that the fixed coefficients (3.4445, -4.775, 2.0315) gave lower validation BPB than the Polar Express schedule in this specific 5-step Newton-Schulz setup
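
The difference between fixed and per-iteration coefficients can be sketched as below. This is a minimal NumPy illustration, not the PR's code: it assumes the quintic update X ← aX + (bA + cA²)X with A = XXᵀ that is commonly used with Muon, and the function name `newton_schulz` and its `schedule` argument are hypothetical. The schedule is a list with one (a, b, c) triple per iteration, so the fixed setup repeats the same triple five times, while a per-iteration scheme such as Polar Express would supply a different triple for each step. The Polar Express values are not reproduced here; the classical cubic Newton-Schulz triple (1.5, -0.5, 0) stands in as a second schedule for comparison only.

```python
import numpy as np

# Fixed quintic coefficients quoted in the summary above.
FIXED_COEFFS = (3.4445, -4.775, 2.0315)


def newton_schulz(G, schedule, eps=1e-7):
    """Approximately orthogonalize G using one (a, b, c) triple per iteration.

    Hypothetical helper for illustration; `schedule` is a list of triples,
    so its length is the number of Newton-Schulz steps.
    """
    X = np.asarray(G, dtype=np.float64)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # work in the wide orientation so X @ X.T is the smaller Gram matrix
    # Dividing by the Frobenius norm guarantees every singular value is at most 1.
    X = X / (np.linalg.norm(X) + eps)
    for a, b, c in schedule:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((16, 32))
    schedules = {
        "fixed quintic": [FIXED_COEFFS] * 5,          # same triple at every step
        "classical cubic": [(1.5, -0.5, 0.0)] * 5,    # stand-in, not Polar Express
    }
    for name, schedule in schedules.items():
        s = np.linalg.svd(newton_schulz(G, schedule), compute_uv=False)
        print(f"{name}: singular values in [{s.min():.3f}, {s.max():.3f}]")
```

Passing the schedule as data keeps the iteration itself identical across variants, which is the kind of controlled comparison an ablation like this one needs: only the coefficient triples change between runs.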