PR #1533

open

Record: SP8192 + Banking + Triple Recurrence + Parallel Residuals + Muon 0.97 + TTT — val_bpb 1.0790 (5-seed mean)

by aryanbhosale
val_bpb: 1.0790
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Quantization
GPTQ
bits: null
scope: embeddings
Architecture
depth recurrence
Triple depth recurrence applied at layers 3-5, giving 17 virtual layers from 11 physical layers.
parameters: {"layers":17}
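As a rough sketch of how the layer count works out: if layers 3 to 5 are each executed three times, the 11 physical layers expand to 11 + 2 × 3 = 17 virtual layers. The function name `virtual_schedule`, the zero-based indexing, and the choice to loop the three layers as one block (rather than repeating each layer in place) are assumptions for illustration, not details confirmed by the PR.

```python
def virtual_schedule(n_physical=11, loop_start=3, loop_end=5, repeats=3):
    """Return the order in which physical layers are executed.

    Assumption: layers loop_start..loop_end (inclusive, zero-based) are
    looped as one block `repeats` times; all other layers run once.
    """
    prefix = list(range(loop_start))
    loop = list(range(loop_start, loop_end + 1))
    suffix = list(range(loop_end + 1, n_physical))
    return prefix + loop * repeats + suffix


schedule = virtual_schedule()
print(len(schedule))  # 17 virtual layers
print(schedule)       # [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

The recurrence reuses the same weights on every pass, so the parameter count stays at 11 layers while the compute depth grows to 17.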
weight tying
Hash embedding removed; standard MLP used instead of Triton fused kernel.
parameters: null
U-Net skip connections
Parallel residual connections are used from layer 7 onward, in a GPT-J-style arrangement.
parameters: null
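For readers unfamiliar with the GPT-J arrangement, the sketch below contrasts a sequential residual block with a parallel one, using scalar stand-ins for the sublayers. The helper names, the single shared normalization, and the zero-based reading of "layer 7 onward" are assumptions; the PR's actual block may differ.

```python
def sequential_block(x, attn, mlp, norm):
    # Standard pre-norm block: the MLP sees the output of attention.
    x = x + attn(norm(x))
    return x + mlp(norm(x))


def parallel_block(x, attn, mlp, norm):
    # GPT-J style: attention and MLP read the same normalized input,
    # and both outputs are added to the residual stream together.
    h = norm(x)
    return x + attn(h) + mlp(h)


def block_for_layer(layer_idx, parallel_from=7):
    # Hypothetical selector: parallel residuals only in later layers.
    return parallel_block if layer_idx >= parallel_from else sequential_block


# Toy scalar sublayers, purely to show the difference in data flow.
identity = lambda v: v
toy_attn = lambda h: 2.0 * h
toy_mlp = lambda h: 3.0 * h

print(sequential_block(1.0, toy_attn, toy_mlp, identity))  # 12.0
print(parallel_block(1.0, toy_attn, toy_mlp, identity))    # 6.0
```

Because the two sublayers in the parallel form do not depend on each other, they can be computed concurrently or fused.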
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"qk_gain":5.25}
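The following is a minimal pure-Python sketch of a Muon-style update using the momentum (0.97) and weight decay (0.095) listed above. It is not the PR's implementation: the learning rate `lr=0.02` is a hypothetical value, `qk_gain` is not modeled, and the classic cubic Newton-Schulz iteration is swapped in for the tuned higher-order polynomial that Muon implementations commonly use. Applying weight decay in decoupled form is also an assumption.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=15):
    """Approximately orthogonalize g with the cubic Newton-Schulz iteration.

    Dividing by the Frobenius norm keeps every singular value at or below 1,
    which is inside the iteration's convergence region.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def muon_step(w, grad, buf, lr=0.02, momentum=0.97, weight_decay=0.095):
    # Accumulate momentum, orthogonalize the buffer, then apply
    # decoupled weight decay together with the orthogonalized update.
    buf = [[momentum * b + g for b, g in zip(rb, rg)] for rb, rg in zip(buf, grad)]
    update = newton_schulz(buf)
    w = [
        [wv * (1.0 - lr * weight_decay) - lr * u for wv, u in zip(rw, ru)]
        for rw, ru in zip(w, update)
    ]
    return w, buf


w0 = [[1.0, 0.0], [0.0, 1.0]]
g0 = [[3.0, 1.0], [1.0, 2.0]]
w1, buf1 = muon_step(w0, g0, [[0.0, 0.0], [0.0, 0.0]])
```

The orthogonalization step replaces the raw momentum buffer with a matrix whose singular values are all close to 1, so the update magnitude is set by the learning rate rather than by the gradient scale.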
Weight Averaging
EMA
parameters: {"decay":0.9965}
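As an illustration of the listed decay, here is a minimal sketch of an exponential moving average over a flat list of parameters. The function name `ema_update` and the flat-list representation are assumptions made for brevity; when and how often the PR applies the average is not stated.

```python
def ema_update(ema, params, decay=0.9965):
    # Each averaged weight moves a fraction (1 - decay) toward the live weight.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


ema = [1.0, -2.0]
live = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, live)

# Rule of thumb: the average spans roughly 1 / (1 - decay) recent steps.
horizon = 1.0 / (1.0 - 0.9965)
```

With a decay of 0.9965 the rule-of-thumb horizon is a little under 300 steps.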
LR Schedule
warmdown
parameters: {"warmdown":0.72}
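One plausible reading of `warmdown: 0.72` is that the learning rate is held at its peak and then decays linearly over the final 72% of training. The sketch below encodes that reading; the function name `lr_multiplier`, the linear shape, and the decay to exactly zero are assumptions rather than confirmed details.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.72):
    """Scale factor applied to the peak learning rate at a given step."""
    warmdown_steps = warmdown_frac * total_steps
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / warmdown_steps)


for step in (0, 280, 640, 1000):
    print(step, lr_multiplier(step, total_steps=1000))
```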
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.005}
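The ordering implied by "score-first" can be sketched with a toy one-parameter model: each chunk is scored with the current weights before the model adapts to it, so no value is evaluated by a model that has already trained on it. This interpretation, the chunking, the squared-error loss, and the scalar model `theta` are all assumptions for illustration; only `epochs=3` and `learning_rate=0.005` come from the listing above.

```python
def score_first_ttt(chunks, theta=0.0, epochs=3, learning_rate=0.005):
    """Return (mean loss, final theta) for a toy scalar model.

    The model predicts the constant theta; loss is squared error.
    """
    total_loss, count = 0.0, 0
    for chunk in chunks:
        # 1) Score the chunk with the weights as they are right now.
        total_loss += sum((theta - x) ** 2 for x in chunk)
        count += len(chunk)
        # 2) Only then adapt on the chunk that was just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - x) for x in chunk) / len(chunk)
            theta -= learning_rate * grad
    return total_loss / count, theta


first_only, _ = score_first_ttt([[1.0, 1.0]])
mean_loss, adapted_theta = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

In this toy run the first chunk is scored at the unadapted loss, and only later chunks benefit from the adaptation.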
Regularization
weight decay
parameters: {"value":0.095}

Novel Contributions

  • SP8192 vocabulary with GPTQ embeddings and SDClip quantization
  • Parameter Banking using a batched Newton-Schulz optimizer step
  • Triple depth recurrence with 17 virtual layers
  • Parallel residuals in later layers
  • Muon optimizer configured with momentum 0.97
  • Score-first test-time training framework
  • Record 5-seed mean validation score of 1.0790 bpb