PR #1523

Status: open

Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean)

by EthanYangTW
val_bpb: 1.0778
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Triple depth recurrence: layers 3, 4, and 5 are each run three times once looping is enabled partway through training, expanding 11 physical layers into 17 virtual layers (11 + 3×2 extra passes).
parameters: {"physical_layers":11,"virtual_layers":17,"loop_layers":[3,4,5],"activation_start":35}
BigramHash
Eval-time hash embedding using a bigram hash over prefix tokens, with a zero-initialized learned embedding trained during TTT.
parameters: {"vocab_size":16384,"embedding_dim":512}
LeakyReLU
The fused MLP applies LeakyReLU followed by squaring as the activation in the MLP path.
parameters: {"negative_slope":0.5}
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
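
GPTQ itself is more involved; as a sketch of the embedding side, here is plain symmetric per-row int8 quantization (the exact scheme is an assumption, the record only states that embeddings are stored at 8 bits):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-row int8 quantization of a 2-D embedding matrix."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale
```
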
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: null
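
A sketch of sliding-window evaluation: each window scores only the targets not covered by the previous window, so every token keeps long left context (window and stride sizes here are illustrative, not from the record):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, window=2048, stride=1024):
    """Score a long token sequence with overlapping windows, counting loss
    only on each window's unscored tail, then normalize to bits per byte."""
    N = tokens.size(0)
    nll, done = 0.0, 0  # `done` = number of target positions already scored
    while done < N - 1:
        end = min(done + stride, N - 1)
        start = max(0, end - window)       # left edge supplies context
        chunk = tokens[start : end + 1]
        logits = model(chunk[:-1].unsqueeze(0)).squeeze(0)
        n_new = end - done                 # only the unscored tail positions
        nll += F.cross_entropy(logits[-n_new:], chunk[-n_new:], reduction="sum").item()
        done = end
    return nll / math.log(2) / n_bytes     # nats -> bits, normalized per byte
```
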
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.01}
LR Schedule
warmdown
parameters: {"warmdown_steps":66.7}
Regularization
logit softcap
parameters: {"value":30}

Novel Contributions

  • Parameter banking with parallel Muon across 4 contiguous banks
  • Batched Newton-Schulz optimizer step for faster training (see the sketch after this list)
  • Fused MLP Triton TMA kernel combining fc, LeakyReLU, and square
  • Muon momentum reduced to 0.97
  • Triple depth recurrence with 17 virtual layers
  • Eval-time BigramHash embedding trained during TTT
  • TTT learning rate tuned to 0.01
  • Score-first TTT compliance
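
For the batched Newton-Schulz step referenced above, a sketch that orthogonalizes a stack of same-shaped gradient matrices with batched matmuls instead of one matrix at a time; the quintic coefficients follow public Muon implementations, and the 4-bank parameter layout is not reproduced here:

```python
import torch

@torch.no_grad()
def batched_newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a (B, m, n) stack of gradient matrices
    via the quintic Newton-Schulz iteration, run in bfloat16 with batched
    matmuls so all B matrices are processed in one optimizer step."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)  # bound spectral norm
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```
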