PR #442 (closed)

Record: 11L EMA + AdamW TTT 10ep (mean val_bpb=1.1027)

val_bpb: 1.1027
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.75 MB

Training Techniques

Architecture
  • SmearGate: adds SmearGate to the model. parameters: null
  • BigramHash: uses BigramHash for additional token interaction features. parameters: {"vocab_size":2048,"dim":128}
  • RoPE: uses partial rotary positional embeddings. parameters: {"dimensions":16,"base_dimensions":64}
  • MLP3x: uses a 3x-width MLP block. parameters: {"hidden":1536}
  • KV head count: uses grouped-query attention with fewer KV heads than attention heads. parameters: {"heads":8,"kv_heads":4}
  • weight tying: uses tied embeddings. parameters: null
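
The partial-RoPE entry above rotates only 16 of the 64 head dimensions. A minimal NumPy sketch of that idea follows; the function name, the half/half pairing convention, and the base frequency are assumptions, not the PR's implementation.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head dimension; the remaining dims pass through unchanged (partial RoPE).

    x: (seq_len, head_dim) activations for one attention head.
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,) rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]        # paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

With `rot_dims=16` and `head_dim=64` (matching the recorded parameters), three quarters of each head stays position-agnostic.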
Initialization
  • OrthoInit: orthogonal initialization.
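
Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix. A minimal sketch, assuming the standard recipe; the function name, `gain`, and `seed` are illustrative, not the PR's code:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, seed=0):
    """Orthogonal init (rows >= cols): QR-decompose a Gaussian matrix,
    keep Q, and sign-correct so columns are uniformly distributed."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)          # reduced QR: q is (rows, cols)
    q *= np.sign(np.diag(r))        # column-wise sign fix
    return gain * q
```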
Weight Averaging
  • EMA: parameters: {"decay":0.997}
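
An EMA of model weights with decay 0.997 amounts to the update below after each training step. This is a minimal dict-of-scalars sketch of the standard rule, not the PR's implementation:

```python
def ema_update(ema_weights, model_weights, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_weights[k] + (1.0 - decay) * model_weights[k]
            for k in ema_weights}
```

At decay 0.997, each step mixes in 0.3% of the current weights, giving an effective averaging horizon of roughly 1/(1-0.997) ≈ 333 steps.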
Optimizer
  • AdamW: weight_decay: 0, momentum: null, other_params: {"learning_rate":0.0005}
Test-Time Training
  • full TTT: parameters: {"learning_rate":0.0005,"epochs":10,"optimizer":"AdamW"}
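
Since the PR's main change is swapping SGD for AdamW in the TTT loop, here is a single-parameter AdamW step for reference. It is a NumPy sketch of the standard decoupled-weight-decay update with lr=5e-4 and weight_decay=0 from the record; the beta/eps defaults are conventional assumptions:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW step. m/v are the first/second moment buffers, t is the
    1-based step count; weight decay is decoupled (0 here, per the record)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

Unlike SGD, the per-parameter step size is normalized by the gradient's running second moment, which plausibly matters over only 10 TTT epochs.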
Evaluation
  • sliding window eval: parameters: {"stride":64}
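
Sliding-window evaluation advances the context by the stride and scores only the newest tokens of each window, so every token is scored with long left context. A sketch of the span arithmetic, assuming a window length of 256 (only stride=64 is from the record):

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, score_from) spans: each window scores only its
    last `stride` tokens; earlier tokens in the window serve as context."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # left edge of the context window
        end = min(pos + stride, n_tokens)       # right edge (clipped at sequence end)
        score_from = pos                        # tokens [score_from, end) are scored
        yield start, end, score_from
        pos += stride
```

Together the scored spans partition the sequence exactly once, while each token still sees up to `window - stride` tokens of context.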
Quantization
  • int6: bits: 6, scope: mixed
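
A minimal symmetric int6 quantizer, to make the "bits: 6" entry concrete. The record's scope is "mixed", so which tensors get int6 versus other precisions is unspecified; this sketch shows only per-tensor symmetric rounding and is not the PR's scheme:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: 6 bits give the range [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # fits in 6 bits
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```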
Compression
  • zstd: level: 22
Regularization
  • layerwise LN scale: parameters: null

Novel Contributions

  • Replaced SGD with AdamW for test-time training
  • Reduced TTT epochs from 20 to 10 while improving validation BPB
  • Achieved a new record mean val_bpb of 1.1027
  • Reduced TTT runtime from about 260s to about 157s
  • Used the same 11-layer EMA-based setup as PR #398 with only a small optimizer change