PR #995 (open)
Record: 1.0362 BPB — SGD Momentum 0.95 TTT + HedgeMixer + Per-Layer LR
by dexhunter
val_bpb
1.0362
Architecture
Transformer
Optimizer
SGD
Artifact Size
15.67MB
Training Techniques
Optimizer
SGD
weight_decay: null
momentum: 0.95
other_params: {"learning_rate":0.002}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.95,"epochs":4,"freeze_depth":0}
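The recorded TTT settings (lr 0.002, momentum 0.95, 4 epochs, freeze_depth 0, i.e. no frozen layers) can be sketched as a plain SGD-with-momentum adaptation loop. This is a minimal illustration, not the actual implementation; the gradient function and parameter layout are placeholders:

```python
import numpy as np

def sgd_momentum_step(params, grads, vel, lr=0.002, momentum=0.95):
    # Classic SGD with momentum: v <- m*v + g; p <- p - lr*v.
    for k in params:
        vel[k] = momentum * vel[k] + grads[k]
        params[k] = params[k] - lr * vel[k]
    return params, vel

def ttt_adapt(params, grad_fn, epochs=4):
    # Score-first TTT: the model is adapted on the evaluation text
    # itself before scoring; freeze_depth=0 means every layer updates.
    vel = {k: np.zeros_like(v) for k, v in params.items()}
    for _ in range(epochs):
        params, vel = sgd_momentum_step(params, grad_fn(params), vel)
    return params
```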
Architecture
BigramHash
Hash-based token embedding component in the base architecture
parameters: {"size":6144}
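One way a hash-based bigram embedding with 6144 buckets could work (the hash constant and pair encoding here are illustrative assumptions, not the PR's actual scheme):

```python
def bigram_hash_ids(tokens, size=6144):
    # Map each (previous, current) token pair to one of `size` buckets;
    # each bucket indexes a row of a learned embedding table.
    ids, prev = [], 0
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % size)
        prev = t
    return ids
```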
Partial RoPE
Partial rotary positional encoding
parameters: null
XSA
XSA-all attention variant used in the base model
parameters: {"layers":11}
LeakyReLU
LeakyReLU activation in the MLP
parameters: {"slope":0.5}
MLP3x
Expanded MLP width
parameters: {"multiplier":3.5}
KV head count
KV head count relative to attention heads (equal here, so no reduction)
parameters: {"heads":8,"kv_heads":8}
LogisticContextMixer
Backward-looking HedgeMixer with multiple experts
parameters: {"experts":5}
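The name suggests PAQ-style logistic context mixing over the five experts; a generic sketch under that assumption (initial weights, learning rate, and update rule are illustrative):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def mix(probs, weights):
    # Combine expert probabilities in the logit domain, then squash.
    z = sum(w * logit(p) for w, p in zip(weights, probs))
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, probs, bit, lr=0.1):
    # Online logistic-regression step toward the observed bit (0 or 1).
    err = bit - mix(probs, weights)
    return [w + lr * err * logit(p) for w, p in zip(weights, probs)]
```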
Quantization
GPTQ-lite
bits: 5
scope: model
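The internals of "GPTQ-lite" are not specified here; as a rough illustration of 5-bit weight quantization, a symmetric round-to-nearest scheme is sketched below (full GPTQ additionally uses Hessian-aware error correction, which this omits):

```python
import numpy as np

def quant5(w):
    # Symmetric 5-bit quantization: signed levels -15..15 plus one scale.
    m = np.abs(w).max()
    scale = (m / 15.0) if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale
```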
Compression
zstd
level: null
LR Schedule
cosine decay
parameters: null
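Since the schedule's parameters are null, only the shape is known. A standard cosine decay from the base LR (0.002 per the optimizer settings) looks like this; the floor of 0.0 is an assumption:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    # Decay from base_lr at step 0 to min_lr at total_steps.
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```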
Sequence Length
sequence_length
train_length: 32000
eval_length: null
Regularization
weight decay
parameters: null
Novel Contributions
- Switched TTT optimization from AdamW to SGD with momentum 0.95
- Introduced per-layer learning-rate groups with higher LR for output projections and lower LR for input layers
- Validated the best configuration via multi-seed sweeps and ablations
- Combined score-first legal TTT with backward-looking HedgeMixer
- Achieved a new record mean validation BPB of 1.0362
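The per-layer learning-rate grouping could be built as below; the multipliers and name-matching rules are hypothetical, since the PR states only the direction (higher LR for output projections, lower for input layers):

```python
def per_layer_lrs(param_names, base_lr=0.002, out_mult=2.0, in_mult=0.5):
    # Assign each parameter a learning rate by layer role
    # (multipliers here are illustrative placeholders).
    groups = {}
    for name in param_names:
        if "out_proj" in name or "lm_head" in name:
            groups[name] = base_lr * out_mult   # output projections: boosted
        elif "embed" in name:
            groups[name] = base_lr * in_mult    # input layers: damped
        else:
            groups[name] = base_lr
    return groups
```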