PR #967 (open)

Record: 1.0450 BPB — SGD TTT + HedgeMixer with Per-Layer LR Groups

by dexhunter
val_bpb: 1.0450
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.67 MB

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"optimizer":"SGD","momentum":0.9}
full TTT
parameters: {"epochs":4,"zero_frozen_blocks":true,"skip_sliding_eval":true}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
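As a sketch of the TTT optimizer step reported above (SGD, lr 0.002, momentum 0.9), assuming a standard heavy-ball momentum rule; the parameter values, gradients, and loop structure are illustrative, not the author's code:

```python
def sgd_momentum_step(params, grads, velocity, lr=0.002, momentum=0.9):
    """One heavy-ball SGD step: v <- momentum*v + g; p <- p - lr*v."""
    for i, (p, g) in enumerate(zip(params, grads)):
        velocity[i] = momentum * velocity[i] + g
        params[i] = p - lr * velocity[i]
    return params, velocity

# Toy test-time-training loop minimizing sum(p^2) (illustrative only).
params = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(4):  # the entry reports 4 TTT epochs
    grads = [2 * p for p in params]  # gradient of sum(p^2)
    params, velocity = sgd_momentum_step(params, grads, velocity)
```

Relative to AdamW, this update has no per-parameter scaling, which is one plausible reason per-layer LR groups (below) matter more under SGD.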
Architecture
BigramHash
Bigram hash feature module used in the base architecture.
parameters: {"size":6144,"dim":128}
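A minimal sketch of a bigram-hash feature lookup with the sizes listed above (6144 buckets, dim 128); the hash mixing constant and table initialization are assumptions, since the entry does not specify them:

```python
import random

SIZE, DIM = 6144, 128  # table size and feature dim from the entry

# Illustrative random-normal init; the real module's init is unspecified.
random.seed(0)
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(SIZE)]

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    """Hash the (previous, current) token pair into one of SIZE buckets.
    The multiplier 1000003 is an arbitrary illustrative mixing prime."""
    return (prev_tok * 1000003 + cur_tok) % SIZE

def bigram_feature(prev_tok: int, cur_tok: int):
    """Return the dense feature vector for a token bigram."""
    return table[bigram_hash(prev_tok, cur_tok)]

vec = bigram_feature(17, 42)
```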
SmearGate
Gating component paired with BigramHash in the base architecture.
parameters: null
XSA
XSA applied across all layers in the inherited architecture.
parameters: {"layers":11}
Partial RoPE
Rotary positional embeddings applied to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
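The "16/64" split above can be sketched as rotating only the first 16 of 64 per-head dimensions and passing the rest through unchanged. The adjacent-pair rotation convention and frequency base are assumptions (RoPE implementations also commonly pair split halves):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rot_dims entries of a
    per-head vector x at position pos; remaining dims are untouched."""
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s      # rotate each adjacent pair
        out[2 * i + 1] = a * s + b * c  # by a position-dependent angle
    return out

head = [1.0] * 64
rotated = partial_rope(head, pos=3)
```

Since rotation is norm-preserving, each rotated pair keeps its length while encoding the position in its angle.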
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
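A one-function sketch of the activation described above, with the listed negative slope of 0.5. Whether the square preserves sign is not stated; this sketch squares the magnitude (the common ReLU² convention):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring: y = leaky_relu(x); return y*y.
    Sign handling of the square is an assumption, not confirmed by the entry."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```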
KV head count
Uses 8 KV heads with full multi-head attention.
parameters: {"kv_heads":8}
MLP3x
MLP hidden-layer expansion in the base architecture; note the listed expansion factor is 3.5 despite the inherited "MLP3x" name.
parameters: {"expansion":3.5}
weight tying
Whether input and output embeddings are tied is not explicitly stated in this entry.
parameters: null
Quantization
GPTQ-lite
bits: 5
scope: base model
Compression
zstd
level: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Regularization
LN scale
parameters: null
LR Schedule
cosine decay
parameters: {"within_ttt":true}
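The cosine decay applied within TTT can be sketched as the usual half-cosine from the base LR down to a floor; the base LR of 0.002 comes from the optimizer section above, while the step count and zero floor are assumptions:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps TTT steps."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, 100) for s in range(101)]
```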
Other
other
Per-layer learning-rate groups for TTT, with higher LR for output projections and lower LR for input projections.
parameters: {"output_projections_lr_multiplier":3,"input_projections_lr_multiplier":0.5}
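The per-layer LR groups above (3x for output projections, 0.5x for input projections) might be wired up as below. The parameter names and substring matching rules are hypothetical; the entry only gives the multipliers:

```python
def build_lr_groups(param_names, base_lr=0.002, out_mult=3.0, in_mult=0.5):
    """Map each parameter name to its TTT learning rate:
    output projections get base_lr*3, input projections base_lr*0.5,
    everything else base_lr. Name matching is illustrative."""
    lrs = {}
    for name in param_names:
        if "out_proj" in name:
            lrs[name] = base_lr * out_mult
        elif "in_proj" in name:
            lrs[name] = base_lr * in_mult
        else:
            lrs[name] = base_lr
    return lrs

lrs = build_lr_groups(["blocks.0.attn.in_proj",
                       "blocks.0.attn.out_proj",
                       "embed"])
```

In a framework like PyTorch the same mapping would typically be expressed as optimizer parameter groups rather than a dict.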
other
HedgeMixer with backward-looking experts over scored tokens.
parameters: {"experts":["Neural","Unigram","Bigram","Trigram","Entropy"]}
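One plausible reading of "HedgeMixer with backward-looking experts" is a multiplicative-weights (Hedge) mixture over the five listed experts, reweighted by log-loss on tokens that have already been scored. The learning rate eta, the log-loss choice, and the update form are all assumptions; the entry only names the experts:

```python
import math

EXPERTS = ["Neural", "Unigram", "Bigram", "Trigram", "Entropy"]

def hedge_mix(expert_probs, weights):
    """Mix expert next-token probabilities with normalized weights."""
    total = sum(weights.values())
    return sum(weights[e] * expert_probs[e] for e in EXPERTS) / total

def hedge_update(weights, probs_on_scored, eta=1.0):
    """Hedge update: multiply each expert's weight by exp(-eta * log-loss)
    on a token that has already been scored (the backward-looking part)."""
    new = {}
    for e in EXPERTS:
        loss = -math.log(max(probs_on_scored[e], 1e-12))
        new[e] = weights[e] * math.exp(-eta * loss)
    return new

weights = {e: 1.0 for e in EXPERTS}
# One expert predicted the scored token well, the others poorly.
scored = {"Neural": 0.9, "Unigram": 0.1, "Bigram": 0.1,
          "Trigram": 0.1, "Entropy": 0.1}
weights = hedge_update(weights, scored)
mixed = hedge_mix({e: 0.5 for e in EXPERTS}, weights)
```

With eta = 1 and log-loss, the update reduces to multiplying each weight by the probability the expert assigned to the observed token, i.e. Bayesian mixture weights.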

Novel Contributions

  • Switched TTT from AdamW to SGD with momentum for a large BPB improvement
  • Added per-layer TTT learning-rate groups
  • Used cosine LR decay within TTT
  • Combined SGD TTT with HedgeMixer for the best reported score
  • Verified the method with a 3-seed evaluation and ablations