PR #1521
open
Record: SP8192 + Muon 0.97 + 3-Layer Recurrence + Parallel Residuals + TTT — val_bpb 1.0802 (3-seed mean)
by aryanbhosale
val_bpb
1.0802
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: null
Architecture
depth recurrence
3-layer depth recurrence in layers L3-L5
parameters: {"layers":3}
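As a rough illustration of depth recurrence, the sketch below builds the order in which layers would be applied when layers 3 to 5 are looped with shared weights. The total layer count (`num_layers=8`) and the number of passes through the loop (`num_passes=2`) are hypothetical; the card states neither, and the function name is invented for this example.

```python
def recurrent_layer_order(num_layers, loop_start, loop_end, num_passes):
    """Return the sequence of layer indices applied in one forward pass.

    Layers loop_start..loop_end (inclusive) are run num_passes times in a row,
    reusing the same weights on every pass. All other layers run once.
    """
    if not (0 <= loop_start <= loop_end < num_layers):
        raise ValueError("loop range must lie inside the layer stack")
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    prefix = list(range(loop_start))                             # layers before the loop
    looped = list(range(loop_start, loop_end + 1)) * num_passes  # repeated block
    suffix = list(range(loop_end + 1, num_layers))               # layers after the loop
    return prefix + looped + suffix


# Hypothetical 8-layer stack with layers 3-5 applied twice.
order = recurrent_layer_order(num_layers=8, loop_start=3, loop_end=5, num_passes=2)
print(order)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

The effective depth grows with each extra pass while the parameter count stays fixed, which is the usual motivation for this technique under a size budget.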
parallel residuals
Parallel residual connections in later layers
parameters: null
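For readers unfamiliar with the term, the sketch below contrasts a standard sequential residual block with a parallel one. It uses scalar stand-ins for the attention and MLP sublayers so it runs without any framework. Whether this PR shares one normalization between the two branches is not stated in the card, so the shared `norm` here is an assumption.

```python
def sequential_block(x, attn, mlp, norm):
    """Standard block: the MLP sees the output of the attention residual."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))


def parallel_block(x, attn, mlp, norm):
    """Parallel block: both branches read the same input and are summed."""
    h = norm(x)
    return x + attn(h) + mlp(h)


# Toy scalar sublayers, for illustration only.
identity = lambda v: v
toy_attn = lambda v: 0.5 * v
toy_mlp = lambda v: 2.0 * v

print(sequential_block(1.0, toy_attn, toy_mlp, identity))  # 4.5
print(parallel_block(1.0, toy_attn, toy_mlp, identity))    # 3.5
```

In the parallel form the two branches do not depend on each other, so they can be computed side by side.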
Quantization
GPTQ
bits: null
scope: embeddings
Weight Averaging
EMA
parameters: {"decay":0.9965}
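A minimal sketch of an EMA weight update with the listed decay, using plain Python lists in place of model tensors. The function name and the flat-list representation are illustrative and are not taken from the PR.

```python
def ema_update(ema_params, params, decay=0.9965):
    """Blend the current weights into the running average.

    ema <- decay * ema + (1 - decay) * current
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy example: the average moves a small step toward the current weights.
ema = [1.0, -2.0]
current = [0.0, 0.0]
ema = ema_update(ema, current)
```

With a decay of 0.9965 the average has an effective horizon of about 1 / (1 - 0.9965), or roughly 286 steps.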
LR Schedule
warmdown
parameters: {"warmdown":0.72}
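The card does not say how the `warmdown` value is interpreted. One plausible reading, sketched below as an assumption, is that the learning rate stays at its peak and then decays linearly to zero over the final 72% of training steps. The function name and the linear shape are hypothetical.

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """Multiplier applied to the base learning rate at a given step.

    Assumed shape: constant at 1.0, then a linear ramp to 0.0 over the
    last warmdown_frac of training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown phase at all
    warmdown_steps = total_steps * warmdown_frac
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    remaining = total_steps - step
    return max(0.0, remaining / warmdown_steps)


# With 1000 total steps, the ramp would start at step 280.
for s in (0, 280, 640, 1000):
    print(s, lr_scale(s, 1000))
```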
Regularization
weight decay
parameters: {"value":0.095}
Test-Time Training
score-first TTT
parameters: {"epochs":3}
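The sketch below illustrates the ordering that "score-first" suggests: each chunk of evaluation data is scored before the model adapts on it, so no token is scored by weights that have already trained on it. Reading `epochs` as three adaptation passes per chunk is an assumption, and the toy scalar "model", `score_fn`, and `adapt_fn` are hypothetical stand-ins, not the PR's code.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, state, epochs=3):
    """Evaluate chunks in order, scoring each one before adapting on it."""
    total_loss, total_count = 0.0, 0
    for chunk in chunks:
        # 1) Score with a state that has not yet seen this chunk.
        loss, count = score_fn(state, chunk)
        total_loss += loss
        total_count += count
        # 2) Only then adapt on the chunk that was just scored.
        for _ in range(epochs):
            state = adapt_fn(state, chunk)
    return total_loss / max(total_count, 1), state


# Toy model: a single scalar prediction, scored with squared error.
def toy_score(state, chunk):
    return sum((x - state) ** 2 for x in chunk), len(chunk)


def toy_adapt(state, chunk, lr=0.5):
    mean = sum(chunk) / len(chunk)
    return state + lr * (mean - state)  # move halfway toward the chunk mean


mean_loss, final_state = score_first_ttt(
    [[1.0, 1.0], [1.0, 1.0]], toy_score, toy_adapt, state=0.0
)
print(mean_loss, final_state)  # 0.5078125 0.984375
```

In the toy run the second chunk is scored by a state adapted only on the first chunk, which is why its loss is much lower than the first chunk's.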
Sequence Length
sequence_length
train_length: 8192
eval_length: null
Other
other
SP8192 training variant
parameters: null
other
brotli compression used for the submission pipeline
parameters: null
Novel Contributions
- Muon momentum reduced from 0.99 to 0.97 on the merged SOTA stack
- 3-layer depth recurrence
- parallel residuals
- score-first test-time training
- SP8192 training variant