PR #999

open

Record: 11L Muon TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean)

by aamodbhatt
val_bpb
1.1179
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding component in the base stack.
parameters: {"size":1536}
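A minimal sketch of how a bigram hash lookup might work, assuming "size" is the number of hash buckets. The hash function and the names `bigram_bucket` and `bigram_ids` are hypothetical; the PR does not state which hash it uses.

```python
BIGRAM_BUCKETS = 1536  # "size" from the PR; assumed here to be the bucket count


def bigram_bucket(prev_token: int, token: int, buckets: int = BIGRAM_BUCKETS) -> int:
    """Map an ordered token pair to a bucket index (illustrative hash, not the PR's)."""
    # The large odd multiplier makes (a, b) and (b, a) usually land in different buckets.
    return (prev_token * 1_000_003 + token) % buckets


def bigram_ids(tokens: list[int]) -> list[int]:
    """One bucket index per position; position 0 has no predecessor and uses bucket 0."""
    return [0] + [bigram_bucket(p, t) for p, t in zip(tokens, tokens[1:])]
```

The resulting indices would select rows of a small embedding table whose output is added to the token embedding.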
XSA
Uses XSA on the last layers of the model.
parameters: {"last_n":4}
MLP3x
Three-times expanded MLP block.
parameters: null
LeakyReLU
LeakyReLU^2 activation in the MLP.
parameters: {"slope":0.5}
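As a scalar sketch, assuming "^2" means squaring the LeakyReLU output (so the sign is not preserved):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the PR's slope of 0.5, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y
```

Under this reading the output is non-negative on both sides, and the negative branch is scaled by slope^2 = 0.25 relative to the positive one.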
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16}
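A sketch of partial rotary embeddings on a single head vector, assuming "dimensions" is the number of rotated entries per head. The base of 10000 and the half-split pairing are assumptions; the PR does not state either.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; pass the rest through."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # half-split pairing (assumed convention)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Entries beyond the first 16 carry no positional signal, so those channels act as position-independent features.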
VE128
Value residual enhancement on selected layers.
parameters: {"layers":[9,10]}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
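The scale can be tabulated per layer. This sketch assumes `layer` is a zero-based index and takes the layer count from the "11L" in the title:

```python
import math

NUM_LAYERS = 11  # from "11L" in the PR title


def ln_scale(layer: int) -> float:
    """Scale factor 1/sqrt(layer+1) for a zero-based layer index (assumed)."""
    return 1.0 / math.sqrt(layer + 1)


scales = [ln_scale(i) for i in range(NUM_LAYERS)]
```

Under that assumption the first layer is unscaled (1.0) and the last is scaled by 1/sqrt(11), roughly 0.30.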
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
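A sketch of the two averaging rules on a flat list of weights. How the PR combines the EMA and SWA copies is not stated, so they are kept separate here; `update_ema` and `SWA` are illustrative names.

```python
EMA_DECAY = 0.997
SWA_EVERY = 50


def update_ema(ema, weights, decay=EMA_DECAY):
    """Exponential moving average, applied after every optimizer step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWA:
    """Uniform running mean over snapshots taken every `every` steps."""

    def __init__(self, n_params):
        self.mean = [0.0] * n_params
        self.count = 0

    def maybe_update(self, step, weights, every=SWA_EVERY):
        if step % every != 0:
            return False
        self.count += 1
        self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]
        return True
```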
Quantization
late QAT
bits: 6
scope: model
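A sketch of the forward rounding that 6-bit QAT would simulate, assuming symmetric per-tensor quantization. The PR does not specify the scheme, and the straight-through gradient and the step at which "late" QAT switches on are not shown.

```python
def fake_quant_int6(values, bits=6):
    """Round to a symmetric signed grid and map back to floats (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values), 1.0
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes], scale
```

Each weight moves by at most half a quantization step, which is the error the model learns to tolerate during QAT.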
Compression
lzma
level: 7
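Assuming "level" corresponds to the `preset` argument of Python's standard `lzma` module, packing the artifact could look like this (`pack_artifact` and `unpack_artifact` are illustrative names):

```python
import lzma


def pack_artifact(raw: bytes, level: int = 7) -> bytes:
    """Compress serialized weights with LZMA at the given preset."""
    return lzma.compress(raw, preset=level)


def unpack_artifact(blob: bytes) -> bytes:
    """Inverse of pack_artifact."""
    return lzma.decompress(blob)
```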
Evaluation
sliding window eval
parameters: {"stride":64}
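A sketch of how stride-64 windows could be laid out so that every token is scored exactly once. The `context` length is a hypothetical parameter; the PR only gives the stride.

```python
def sliding_windows(n_tokens, context, stride=64):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored; earlier tokens in the
    window serve only as context.
    """
    windows = []
    scored_upto = 0
    while scored_upto < n_tokens:
        # The first window scores everything it holds; later ones score `stride` new tokens.
        step = context if scored_upto == 0 else stride
        end = min(n_tokens, scored_upto + step)
        start = max(0, end - context)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows
```

With a context of 128, the second window is `(64, 192, 128)`: tokens 128 to 191 are scored with at least 64 tokens of left context.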
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":"2/3/4 adaptive","chunk_tokens":32768}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ttt_muon":true,"newton_schulz_steps":3,"parallel":true}
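A dependency-free sketch of the Newton-Schulz orthogonalization step with 3 iterations. The quintic coefficients are those of the public Muon reference implementation; that this PR uses the same ones is an assumption. Plain lists are used so the sketch runs without dependencies, whereas a real implementation would operate on GPU tensors.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(grad, steps=3, eps=1e-7):
    """Push the singular values of `grad` toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed: Muon reference coefficients
    tall = len(grad) > len(grad[0])
    x = transpose(grad) if tall else [row[:] for row in grad]
    # Normalize by the Frobenius norm so all singular values start at or below 1.
    norm = sum(v * v for row in x for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * g + c * g2 for g, g2 in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * v + p for v, p in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x
```

The iteration does not converge to exact orthogonality; it only brings the singular values into a band around 1, which is what the Muon update relies on.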
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
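One possible reading of "cosine decay" with a 3500-step warmdown is a constant learning rate followed by a cosine ramp to zero over the final 3500 steps. This shape, and the `total_steps` and `peak_lr` arguments, are assumptions; the PR gives only the schedule name and `warmdown_steps`.

```python
import math


def lr_at(step, total_steps, peak_lr, warmdown_steps=3500):
    """Hold peak_lr, then cosine-decay to zero over the last `warmdown_steps` (assumed shape)."""
    warmdown_start = max(0, total_steps - warmdown_steps)
    if step < warmdown_start:
        return peak_lr
    span = max(1, total_steps - warmdown_start)
    progress = min(1.0, (step - warmdown_start) / span)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```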
Other
Entropy-adaptive epochs
Entropy-adaptive TTT epoch selection based on chunk uncertainty, assigning 2/3/4 epochs per chunk.
parameters: {"high_threshold":2.1,"low_threshold":1.75}
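The threshold rule can be sketched as follows, assuming chunk uncertainty is measured by the chunk's mean NLL from the scoring pass and that more uncertain chunks receive more epochs. The direction of the mapping, the units, and the handling of values exactly at a threshold are all assumptions.

```python
HIGH_THRESHOLD = 2.1
LOW_THRESHOLD = 1.75


def ttt_epochs(chunk_nll: float) -> int:
    """Pick 2, 3, or 4 TTT epochs from a chunk's uncertainty (assumed mapping)."""
    if chunk_nll >= HIGH_THRESHOLD:
        return 4  # hard chunk: spend more adaptation compute
    if chunk_nll < LOW_THRESHOLD:
        return 2  # easy chunk: fewer epochs
    return 3
```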

Novel Contributions

  • Muon-style Newton-Schulz orthogonalized updates in the test-time training loop
  • Entropy-adaptive epoch selection that allocates 2/3/4 epochs per chunk based on chunk uncertainty
  • Score-first TTT with global NLL synchronization across DDP ranks to avoid collective mismatch
  • Improved 3-seed mean val_bpb to 1.1179, beating the prior SOTA of 1.1194
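
The score-first ordering described above can be sketched as a loop over chunks. The callbacks `score_fn`, `adapt_fn`, and `epochs_fn` are hypothetical stand-ins for the model's evaluation pass, one TTT epoch, and the epoch selection rule.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs_fn):
    """Return per-chunk scores; each chunk is scored before the model trains on it."""
    scores = []
    for chunk in chunks:
        # 1) Score with weights that have not yet been adapted on this chunk.
        nll = score_fn(chunk)
        scores.append(nll)
        # 2) Only then adapt on the same chunk, for a score-dependent number of epochs.
        for _ in range(epochs_fn(nll)):
            adapt_fn(chunk)
    return scores
```

Under DDP, `nll` would presumably be reduced across ranks before `epochs_fn` is called, so that every rank runs the same number of epochs and issues the same collectives; that synchronization is not shown here.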