PR #953 (open)
Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups
by dexhunter
val_bpb
1.0722
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.66 MB
Training Techniques
Architecture
XSA
XSA applied across all 11 layers in the base architecture.
parameters: {"layers":11}
BigramHash
Bigram hash embedding with SmearGate in the context mixer.
parameters: {"size":6144,"dim":128}
SmearGate
Gating component paired with BigramHash.
parameters: null
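A minimal sketch of how a bigram-hash embedding with a smear gate could work. The table size (6144) and embedding dim (128) come from the parameters above; the hash function, the sigmoid gate, and all names below are illustrative assumptions, not the PR's actual code.

```python
import math
import random

# Table size and dim from the record; everything else is a hypothetical sketch.
TABLE_SIZE, DIM = 6144, 128

random.seed(0)
# Randomly initialized bigram embedding table (learned in a real model).
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def bigram_index(prev_tok: int, cur_tok: int) -> int:
    """Hash the (previous, current) token pair into the embedding table."""
    return (prev_tok * 1000003 + cur_tok) % TABLE_SIZE

def smear_gate(gate_logit: float) -> float:
    """Sigmoid gate deciding how much bigram signal to mix in."""
    return 1.0 / (1.0 + math.exp(-gate_logit))

def mix(token_emb, prev_tok, cur_tok, gate_logit):
    """Blend the ordinary token embedding with the hashed bigram embedding."""
    g = smear_gate(gate_logit)
    big = table[bigram_index(prev_tok, cur_tok)]
    return [(1.0 - g) * t + g * b for t, b in zip(token_emb, big)]
```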
Partial RoPE
Rotary positional encoding applied to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
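The "16/64" parameter means only the first 16 of each head's 64 dimensions are rotated; the rest pass through unrotated. A minimal per-vector sketch, using the standard RoPE frequency convention (the PR may choose the split or base differently):

```python
import math

HEAD_DIM, ROT_DIM = 64, 16  # "16/64": rotate only the first 16 dimensions

def partial_rope(q, pos, base=10000.0):
    """Apply rotary embedding to the first ROT_DIM dims of one head vector;
    leave dims ROT_DIM..HEAD_DIM untouched."""
    out = list(q)
    for i in range(0, ROT_DIM, 2):
        theta = pos / (base ** (i / ROT_DIM))
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 the rotation is the identity, and the unrotated tail is always passed through unchanged.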
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true,"alpha":0.5}
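One plausible reading of "LeakyReLU squared" with the listed alpha = 0.5 is LeakyReLU followed by squaring; note the square makes the negative branch positive as well. This is an assumption about the exact composition, not confirmed by the record:

```python
def leaky_relu_sq(x: float, alpha: float = 0.5) -> float:
    """LeakyReLU(x) then square: one interpretation of the listed
    {"squared": true, "alpha": 0.5} activation."""
    y = x if x >= 0.0 else alpha * x
    return y * y
```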
KV head count
Full multi-head attention with equal query and KV head counts.
parameters: {"heads":8,"kv_heads":8}
MLP3x
MLP expansion used in the base model.
parameters: {"expansion":3.5}
Quantization
GPTQ-lite
bits: 5
scope: all
Compression
zstd
level: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
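The EMA update is the standard one; the decay 0.997 comes from the parameters above. A minimal per-tensor sketch over flat weight lists:

```python
DECAY = 0.997  # from the record above

def ema_update(ema, w, decay=DECAY):
    """Standard EMA weight-averaging step: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * x for e, x in zip(ema, w)]
```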
Regularization
LN scale
parameters: null
Test-Time Training
score-first TTT
parameters: {"epochs":4,"freeze_blocks":1,"learning_rate":0.0005,"chunk_tokens":32768}
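A hypothetical skeleton of the score-first TTT loop using the listed hyperparameters (4 epochs, 1 frozen block, lr 5e-4, 32768-token chunks) and the 11-layer depth from the architecture section. The event log stands in for real scoring and SGD steps; none of the names below are the PR's actual functions.

```python
# Hyperparameters from the record above.
EPOCHS, FREEZE_BLOCKS, LR, CHUNK = 4, 1, 5e-4, 32768

def ttt(tokens, n_blocks=11):
    """Score-first TTT sketch: each chunk is scored BEFORE the model adapts
    on it, so reported bits-per-byte never sees weights trained on that chunk."""
    trainable = list(range(FREEZE_BLOCKS, n_blocks))  # earliest block frozen
    chunks = [tokens[i:i + CHUNK] for i in range(0, len(tokens), CHUNK)]
    log = []
    for chunk in chunks:
        log.append(("score", len(chunk)))          # evaluate first
        for _ in range(EPOCHS):                    # then adapt for 4 epochs
            for b in trainable:
                log.append(("update", b))          # sgd_step(block=b, lr=LR)
    return log
```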
LR Schedule
cosine decay
parameters: {"within_ttt":true}
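The within-TTT cosine schedule can be sketched as the usual half-cosine from a peak rate to a floor. Here lr_max reuses the TTT learning rate from the record; decaying all the way to lr_min = 0 is an assumption:

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Cosine decay over one TTT pass: starts at lr_max (adapt aggressively
    early), anneals to lr_min by the final step."""
    t = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```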
Evaluation
sliding window eval
parameters: {"skipped":true}
Novel Contributions
- Per-layer learning-rate groups for TTT, with higher LR on output projections and lower LR on input projections
- Cosine learning-rate schedule within TTT to adapt aggressively early and anneal later
- Increased TTT to 4 epochs while freezing only 1 block
- Skipped standalone sliding window evaluation to reclaim eval budget for the extra TTT epoch
- Improved HedgeMixer + legal TTT stack over PR #720
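The per-layer LR groups above could be built as optimizer parameter groups keyed on parameter names. Only the direction (higher LR on output projections, lower on input projections) comes from the PR; the 2.0x / 0.5x multipliers and the name-matching rules below are illustrative assumptions.

```python
BASE_LR = 5e-4  # TTT learning rate from the record

def lr_groups(param_names):
    """Split parameters into three LR groups for TTT: output projections get
    a boosted rate, input projections a reduced rate, the rest the base rate.
    Multipliers are hypothetical."""
    groups = {"out_proj": [], "in_proj": [], "other": []}
    for name in param_names:
        if "out_proj" in name or "down_proj" in name:
            groups["out_proj"].append(name)
        elif "in_proj" in name or "up_proj" in name:
            groups["in_proj"].append(name)
        else:
            groups["other"].append(name)
    return [
        {"params": groups["out_proj"], "lr": BASE_LR * 2.0},
        {"params": groups["in_proj"], "lr": BASE_LR * 0.5},
        {"params": groups["other"], "lr": BASE_LR},
    ]
```

In a torch-style optimizer the returned list would be passed directly as the parameter-group argument, with names replaced by the actual tensors.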