PR #1037 (closed)

Record: Muon TTT + Entropy-Adaptive Epochs — val_bpb 1.1179 (3-seed mean)

by TimPietruskyRunPod
val_bpb: 1.1179
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.95 MB

Training Techniques

Test-Time Training
  • score-first TTT
    parameters: {"chunk_size":32000,"stride":64,"all_blocks_unfrozen":true}
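A minimal sketch of the score-first discipline, assuming a caller-supplied `score`/`train` interface (hypothetical names; the record's actual TTT loop, chunking, and stride handling are not reproduced here): each chunk is evaluated before the model adapts on it, so no tokens are ever scored by weights that have already trained on them.

```python
def score_first_ttt(score, train, chunks):
    """Score each chunk with the current weights, then adapt on it.

    `score` and `train` are caller-supplied callables (an assumed
    interface, not the record's API). Scoring strictly precedes
    training, which is what makes the evaluation leak-free.
    """
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # evaluate before adapting
        train(chunk)                 # then update on the same tokens
    return losses
```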
Architecture
  • GQA: grouped-query attention with 8 attention heads and 4 KV heads
    parameters: {"heads":8,"kv_heads":4}
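The listed head counts imply each KV head is shared by two query heads. A minimal NumPy sketch of that sharing pattern (shapes and layout are illustrative assumptions, and the causal mask is omitted for brevity; this is not the record's actual implementation):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads.

    q: (seq, n_heads * d); k, v: (seq, n_kv_heads * d).
    No causal mask, for brevity.
    """
    seq, d = q.shape[0], q.shape[1] // n_heads
    group = n_heads // n_kv_heads  # query heads per shared KV head
    q = q.reshape(seq, n_heads, d)
    k = k.reshape(seq, n_kv_heads, d)
    v = v.reshape(seq, n_kv_heads, d)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # map query head -> its shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, n_heads * d)
```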
  • XSA: applied to the last 4 layers
    parameters: {"layers":4}
  • Partial RoPE: partial rotary positional embeddings
    parameters: {"dimensions":16}
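With {"dimensions":16}, the rotation plausibly touches only the first 16 feature dimensions and passes the rest through unrotated. A sketch under that assumption, using a conventional rotary base and half-split pairing (neither is confirmed by the record):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotary embeddings on the first `rot_dims` dimensions only.

    x: (seq, d). The remaining d - rot_dims dimensions pass through
    untouched. Base and pairing scheme are conventional assumptions.
    """
    seq, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.arange(seq)[:, None] * freqs[None]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```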
  • BigramHash: bigram hash embeddings
    parameters: null
  • SmearGate: SmearGate gating mechanism
    parameters: null
  • VE128: value embeddings on layers 9-10
    parameters: {"layers":[9,10],"dimensions":128}
  • LeakyReLU: MLP uses a squared LeakyReLU activation, LeakyReLU(x, 0.5)^2
    parameters: {"negative_slope":0.5}
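One common reading of LeakyReLU(0.5)^2 (in the style of squared-ReLU MLPs) is: apply LeakyReLU with negative slope 0.5, then square elementwise, so the output is nonnegative. A sketch under that assumption; the record may use a different sign convention:

```python
def sq_leaky_relu(x, negative_slope=0.5):
    """Squared LeakyReLU, matching the listed negative_slope of 0.5.

    Negative inputs are scaled by 0.5 before squaring, so they still
    contribute (positive) output, unlike squared ReLU.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```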
Weight Averaging
  • EMA + SWA
    parameters: {"ema_decay":0.997,"swa_interval_steps":50}
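The listed parameters suggest a per-step EMA with decay 0.997 plus an SWA running average refreshed every 50 steps. A sketch of both updates on flat weight lists; how the two averages are combined in the record is not specified:

```python
def ema_update(ema, w, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema, w)]

class SWA:
    """Stochastic weight averaging: fold in a snapshot every `interval` steps."""

    def __init__(self, interval=50):
        self.interval, self.n, self.avg = interval, 0, None

    def step(self, step_idx, w):
        if step_idx % self.interval == 0:
            self.n += 1
            if self.avg is None:
                self.avg = list(w)
            else:  # incremental running mean of the snapshots
                self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, w)]
```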
Quantization
  • GPTQ: bits: 6, scope: all
  • QAT: bits: null, scope: all
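For orientation, a plain round-to-nearest symmetric int6 quantizer for one weight group. GPTQ additionally propagates Hessian-weighted rounding error into not-yet-quantized columns (the "Hessian error compensation" named below); that step is deliberately omitted from this sketch:

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest int6 quantization of one group.

    Returns (int codes, dequantized values, scale). This is the naive
    baseline GPTQ improves on, not GPTQ itself.
    """
    qmax = 2 ** (6 - 1) - 1  # int6 range is [-32, 31]
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    deq = [v * scale for v in q]
    return q, deq, scale
```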
Compression
  • lzma: level: null
Regularization
  • magnitude pruning
    parameters: {"sparsity":0.04}
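A sketch of magnitude pruning at the listed 4% sparsity, using a single global threshold (whether the record prunes globally or per-layer is not stated, and ties at the threshold may prune slightly more than the target):

```python
def magnitude_prune(weights, sparsity=0.04):
    """Zero the smallest-magnitude `sparsity` fraction of weights."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    thresh = flat[k - 1]  # k-th smallest magnitude
    return [0.0 if abs(w) <= thresh else w for w in weights]
```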
Sequence Length
  • train_length: 2048, eval_length: null
LR Schedule
  • warmdown
    parameters: null

Novel Contributions

  • Muon-style score-first test-time training
  • Entropy-adaptive epoch selection that allots more training passes to harder (higher-entropy) chunks and fewer to easier ones
  • Combined Muon TTT with a compact Transformer architecture stack
  • Int6 GPTQ compression with Hessian error compensation and LZMA
  • 3-seed validated record submission with sub-1.118 mean val_bpb
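The entropy-adaptive idea can be sketched as: estimate each chunk's unigram entropy and grant higher-entropy (harder) chunks extra passes. The threshold and epoch cap below are illustrative knobs, not the record's actual schedule:

```python
import math

def chunk_entropy(tokens):
    """Shannon entropy of the chunk's unigram distribution (bits/token)."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def epochs_for_chunk(tokens, base_epochs=1, max_extra=2, threshold=3.0):
    """More epochs for higher-entropy chunks.

    `base_epochs`, `max_extra`, and `threshold` are hypothetical
    parameters chosen for illustration only.
    """
    h = chunk_entropy(tokens)
    return base_epochs + min(max_extra, int(h // threshold))
```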