PR #1533
Status: open
Record: SP8192 + Banking + Triple Recurrence + Parallel Residuals + Muon 0.97 + TTT - val_bpb 1.0790 (5-seed mean)
by aryanbhosale
val_bpb
1.0790
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: null
scope: embeddings
Architecture
depth recurrence
Triple depth recurrence with 17 virtual layers from 11 physical layers, applied at L3-5.
parameters: {"layers":17}
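The layer count above is consistent with running the three physical layers at L3-5 three times each: 11 + 2 x 3 = 17. The sketch below shows one possible execution order. It assumes 0-indexed layers, an inclusive L3-5 range, and back-to-back repeats of the looped block; the function name and the ordering are illustrative and not taken from the PR.

```python
def virtual_layer_schedule(n_physical=11, loop_start=3, loop_end=5, n_passes=3):
    """Return the physical layer index used at each virtual depth.

    Assumption: the block [loop_start, loop_end] is run n_passes times in a row.
    """
    before = list(range(loop_start))                 # layers run once before the loop
    loop = list(range(loop_start, loop_end + 1))     # layers that are repeated
    after = list(range(loop_end + 1, n_physical))    # layers run once after the loop
    return before + loop * n_passes + after


schedule = virtual_layer_schedule()
print(len(schedule), schedule)
```

Because the repeated passes reuse the same weights, the extra six virtual layers add compute but no parameters, which matters under a fixed artifact size.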
weight tying
Hash embedding removed; standard MLP used instead of Triton fused kernel.
parameters: null
U-Net skip connections
Parallel residual connections used from L7+ in a GPT-J style arrangement.
parameters: null
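For reference, a GPT-J style parallel block feeds the same normalized input to both attention and MLP and adds both outputs to the residual stream, while a sequential block feeds the MLP the post-attention stream. The toy sketch below uses scalar stand-ins to show the difference. The callables `attn`, `mlp`, and `norm`, the helper `block_for_layer`, and the 0-indexed reading of "L7+" are all assumptions for illustration.

```python
def sequential_block(x, attn, mlp, norm):
    # Standard pre-norm block: the MLP sees the stream after attention was added.
    h = x + attn(norm(x))
    return h + mlp(norm(h))


def parallel_block(x, attn, mlp, norm):
    # GPT-J style: attention and MLP read the same normalized input.
    n = norm(x)
    return x + attn(n) + mlp(n)


def block_for_layer(layer_idx, parallel_from=7):
    # Assumption: layers with index >= 7 use the parallel form.
    return parallel_block if layer_idx >= parallel_from else sequential_block


# Toy scalar stand-ins, chosen only to make the two forms give different numbers.
identity = lambda v: v
toy_attn = lambda v: 2.0 * v
toy_mlp = lambda v: v + 1.0

seq_out = sequential_block(1.0, toy_attn, toy_mlp, identity)
par_out = parallel_block(1.0, toy_attn, toy_mlp, identity)
print(seq_out, par_out)
```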
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"qk_gain":5.25}
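As a rough picture of what these settings control, a Muon step accumulates gradients in a momentum buffer and then approximately orthogonalizes the buffer with a Newton-Schulz iteration before applying it. The sketch below is pure Python on nested lists and rests on several assumptions: the quintic coefficients are those of the public reference Muon implementation, the learning rate `lr=0.02` is a placeholder, weight decay is applied in decoupled form, and Nesterov momentum is omitted. None of this is confirmed to match the PR's code.

```python
import math

# Quintic coefficients from the public reference Muon implementation (assumed here).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g, pushing its singular values toward 1."""
    a, b, c = NS_COEFFS
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]   # Frobenius-normalize first
    tall = len(x) > len(x[0])
    if tall:                                     # iterate on the wide orientation
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def muon_step(weight, grad, buf, lr=0.02, momentum=0.97, weight_decay=0.095):
    # Momentum accumulation, then orthogonalized update with decoupled weight decay.
    buf = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    update = newton_schulz(buf)
    weight = [[w * (1.0 - lr * weight_decay) - lr * u for w, u in zip(rw, ru)]
              for rw, ru in zip(weight, update)]
    return weight, buf


w0 = [[1.0, 0.0], [0.0, 1.0]]
g0 = [[2.0, 0.0], [0.0, 1.0]]
w1, buf1 = muon_step(w0, g0, [[0.0, 0.0], [0.0, 0.0]])
```

The iteration does not converge to an exactly orthogonal matrix; with these coefficients the singular values settle into a band around 1, which is the intended trade-off for needing only a few steps.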
Weight Averaging
EMA
parameters: {"decay":0.9965}
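With a decay of 0.9965, each EMA update keeps 99.65% of the running average and mixes in 0.35% of the current weights, giving an effective averaging horizon of roughly 1 / (1 - 0.9965), about 286 steps. A minimal sketch, assuming the average is updated once per optimizer step and stored as a plain dict (the function and key names are hypothetical):

```python
def ema_update(ema, weights, decay=0.9965):
    """Return the new exponential moving average of the weights."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in weights}


ema = {"w": 0.0}
ema = ema_update(ema, {"w": 1.0})
horizon = 1.0 / (1.0 - 0.9965)   # effective number of steps being averaged
```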
LR Schedule
warmdown
parameters: {"warmdown":0.72}
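One common reading of a warmdown value of 0.72 is that the final 72% of training decays the learning rate linearly to zero after a constant phase. The sketch below implements that reading only; the linear shape, the absence of warmup, and the function name are assumptions, since the card records just the fraction.

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """Multiplier applied to the base learning rate at a given step."""
    progress = step / total_steps
    if progress < 1.0 - warmdown_frac:
        return 1.0                                # constant phase
    return (1.0 - progress) / warmdown_frac       # linear decay to zero


scales = [lr_scale(s, 100) for s in (0, 64, 100)]
```

Under this reading, a 100-step run holds the base rate for the first 28 steps and is at half the base rate by step 64.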
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.005}
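"Score-first" is usually taken to mean that each chunk of evaluation data is scored with the current weights before the model is allowed to train on it, so no reported loss comes from weights that have already seen that chunk. The toy sketch below shows only that ordering, using a single scalar parameter and squared error. The model, the chunking, and the function name are hypothetical; only `epochs=3` and `lr=0.005` come from the card.

```python
def score_first_ttt(chunks, theta=0.0, epochs=3, lr=0.005):
    """Score each chunk, then adapt on it. Returns per-chunk scores and final theta."""
    scores = []
    for chunk in chunks:
        # 1) Score with weights that have never been trained on this chunk.
        loss = sum((x - theta) ** 2 for x in chunk) / len(chunk)
        scores.append(loss)
        # 2) Only afterwards, take gradient steps on the chunk that was just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - x) for x in chunk) / len(chunk)
            theta -= lr * grad
    return scores, theta


scores, theta = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

The first chunk is scored by the untouched model, and later chunks benefit only from adaptation on data that was already scored.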
Regularization
weight decay
parameters: {"value":0.095}
Novel Contributions
- SP8192 vocabulary with GPTQ embeddings and SDClip quantization
- Parameter Banking using a batched Newton-Schulz optimizer step
- Triple depth recurrence with 17 virtual layers
- Parallel residuals in later layers
- Muon 0.97 optimizer configuration
- Score-first test-time training framework
- Record 5-seed mean validation score of 1.0790 bpb
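The card does not describe how Parameter Banking is implemented. One plausible reading is that weight matrices with identical shapes are grouped into "banks" so that Newton-Schulz can run once per bank as a batched operation instead of once per matrix. The sketch below shows only that grouping step; the parameter names, shapes, and the `build_banks` helper are hypothetical.

```python
from collections import defaultdict


def build_banks(param_shapes):
    """Group parameter names by shape so each group can be processed as one batch."""
    banks = defaultdict(list)
    for name, shape in param_shapes.items():
        banks[tuple(shape)].append(name)
    return dict(banks)


# Hypothetical parameter names and shapes, for illustration only.
example_shapes = {
    "blocks.0.attn.qkv": (1536, 512),
    "blocks.1.attn.qkv": (1536, 512),
    "blocks.0.mlp.up": (2048, 512),
    "blocks.1.mlp.up": (2048, 512),
    "blocks.0.mlp.down": (512, 2048),
}
banks = build_banks(example_shapes)
```

In this example, five matrices collapse into three banks, so a batched orthogonalization step would be launched three times rather than five.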