PR #1521

open

Record: SP8192 + Muon 0.97 + 3-Layer Recurrence + Parallel Residuals + TTT — val_bpb 1.0802 (3-seed mean)

by aryanbhosale
val_bpb
1.0802
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: null
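To give a rough feel for what a momentum of 0.97 means compared with 0.99, the sketch below computes how far back the momentum buffer effectively looks. It assumes only that a gradient from k steps ago is weighted in proportion to momentum**k; the helper names are illustrative and are not taken from the PR, and Muon's orthogonalization step is not modelled here.

```python
import math


def momentum_horizon(beta: float) -> float:
    """Approximate number of recent steps that dominate the momentum buffer."""
    return 1.0 / (1.0 - beta)


def momentum_half_life(beta: float) -> float:
    """Number of steps k after which a gradient's weight beta**k has halved."""
    return math.log(0.5) / math.log(beta)


if __name__ == "__main__":
    for beta in (0.99, 0.97):
        print(beta, round(momentum_horizon(beta), 1), round(momentum_half_life(beta), 1))
```

Under this view, momentum 0.97 averages over roughly 33 recent steps instead of roughly 100, so the update direction reacts faster to new gradients.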
Architecture
depth recurrence
3-layer depth recurrence in layers L3-L5
parameters: {"layers":3}
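One way to picture depth recurrence is as an execution order in which a block of layers is run more than once with shared weights. The sketch below builds such an order; the total layer count, the number of loops, and 0-based indexing are assumptions made for illustration, since the PR summary only states that layers L3-L5 are recurrent.

```python
def layer_schedule(num_layers: int, loop_start: int, loop_end: int, num_loops: int) -> list[int]:
    """Return the order in which layer indices are executed.

    Layers loop_start..loop_end (inclusive) are run num_loops times in a row,
    reusing the same weights; all other layers run once.
    """
    if not (0 <= loop_start <= loop_end < num_layers):
        raise ValueError("loop range must lie inside the layer stack")
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1)) * num_loops
    after = list(range(loop_end + 1, num_layers))
    return before + looped + after


if __name__ == "__main__":
    # Hypothetical 8-layer stack with layers 3-5 executed twice.
    print(layer_schedule(num_layers=8, loop_start=3, loop_end=5, num_loops=2))
```

The effective depth grows with each extra loop while the parameter count stays fixed, which is the usual motivation for recurrence under a size budget.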
parallel residuals
Parallel residual connections in later layers
parameters: null
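The difference between a sequential and a parallel residual block can be sketched with toy sublayers. In the parallel form both sublayers read the same input and their outputs are summed into the residual stream. The `attn` and `mlp` stand-ins below are hypothetical placeholders, and normalization is omitted, so this shows only the wiring and not the PR's actual block.

```python
from typing import Callable

Vector = list[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Elementwise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Standard wiring: the MLP sees the output of the attention residual."""
    h = add(x, attn(x))
    return add(h, mlp(h))


def parallel_block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Parallel wiring: attention and MLP both read the same input x."""
    return add(add(x, attn(x)), mlp(x))


if __name__ == "__main__":
    toy_attn = lambda v: [0.5 * a for a in v]  # placeholder for attention
    toy_mlp = lambda v: [a * a for a in v]     # placeholder for the MLP
    print(sequential_block([2.0], toy_attn, toy_mlp))
    print(parallel_block([2.0], toy_attn, toy_mlp))
```

Because the two branches in the parallel form do not depend on each other, they can be computed concurrently.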
Quantization
GPTQ
bits: null
scope: embeddings
Weight Averaging
EMA
parameters: {"decay":0.9965}
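For reference, an exponential moving average of weights with the listed decay can be sketched as below. The dictionary-of-floats parameter layout is a simplification chosen for illustration and is not how the PR stores its weights.

```python
def ema_update(ema: dict[str, float], params: dict[str, float], decay: float = 0.9965) -> None:
    """Update the averaged weights in place: ema = decay * ema + (1 - decay) * param."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value


if __name__ == "__main__":
    ema = {"w": 1.0}
    ema_update(ema, {"w": 0.0})
    print(ema["w"])
```

With decay 0.9965 the average spans roughly 1 / (1 - 0.9965), or about 286, recent steps.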
LR Schedule
warmdown
parameters: {"warmdown":0.72}
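A warmdown schedule is commonly implemented as a learning-rate multiplier that stays at 1.0 and then decays linearly to 0 at the end of training. The sketch below assumes that the 0.72 value is the fraction of total steps spent in that linear decay; the PR summary does not spell this out, so both the interpretation and the function name are assumptions.

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    """Learning-rate multiplier: flat at 1.0, then linear decay to 0.0.

    The final warmdown_frac * total_steps steps form the decay phase
    (assumed interpretation of the 'warmdown' parameter).
    """
    warmdown_steps = warmdown_frac * total_steps
    if warmdown_steps <= 0:
        return 1.0
    remaining = max(total_steps - step, 0)
    return min(1.0, remaining / warmdown_steps)


if __name__ == "__main__":
    for step in (0, 280, 640, 1000):
        print(step, lr_scale(step, total_steps=1000))
```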
Regularization
weight decay
parameters: {"value":0.095}
Test-Time Training
score-first TTT
parameters: {"epochs":3}
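The "score-first" ordering can be sketched with a toy model: each chunk of evaluation data is scored with the current weights before the model takes any gradient step on it, so no chunk is evaluated by a model that has already trained on it. The scalar predictor, squared-error loss, and learning rate below are illustrative stand-ins; only the epoch count of 3 comes from the listing.

```python
def score_first_ttt(chunks: list[list[float]], mu: float = 0.0,
                    lr: float = 0.1, epochs: int = 3) -> tuple[list[float], float]:
    """Score each chunk, then adapt on it. Returns (per-chunk losses, final mu)."""
    losses = []
    for chunk in chunks:
        # 1) Score first: the recorded loss uses weights that never saw this chunk.
        losses.append(sum((x - mu) ** 2 for x in chunk) / len(chunk))
        # 2) Then adapt: a few gradient steps on the chunk that was just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
            mu -= lr * grad
    return losses, mu


if __name__ == "__main__":
    losses, mu = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
    print(losses, mu)
```

Later chunks benefit from adaptation on earlier ones, while the first chunk is always scored by the unadapted model.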
Sequence Length
sequence_length
train_length: 8192
eval_length: null
Other
other
SP8192 training variant
parameters: null
other
brotli compression used for the submission pipeline
parameters: null

Novel Contributions

  • Muon momentum reduced from 0.99 to 0.97 on the merged SOTA stack
  • 3-layer depth recurrence
  • parallel residuals
  • score-first test-time training
  • SP8192 training variant