PR #390

closed

Record: Sponge Bath — TTT 8ep eval-only improvement (val_bpb: 1.1295)

by newjordan
val_bpb: 1.1295
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.74 MB

Training Techniques

Quantization
int6 QAT
bits: 6
scope: all
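The PR ships no code, but int6 QAT with `scope: all` is typically implemented as fake quantization in the forward pass, with a straight-through estimator for gradients. A minimal sketch of the quantizer, assuming symmetric per-tensor scaling:

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Fake quantization for QAT: snap weights to a 6-bit grid in the
    forward pass while keeping them as floats (a straight-through
    estimator would pass gradients through unchanged in backward)."""
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    scale = max(np.max(np.abs(w)) / qmax, 1e-12)  # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # dequantized weights
```

With `scope: all`, every weight matrix would pass through this quantizer during training, so the final artifact can be stored at 6 bits per weight.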
Architecture
SmearGate
Uses SmearGate in the MLP stack as part of the base architecture.
parameters: null
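SmearGate is not defined anywhere in the PR; one plausible reading, purely an assumption, is a learned gate that "smears" the previous position's activation into the current one before the MLP:

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Hypothetical SmearGate (assumption; the PR does not define it):
    blend each position with the previous position's activation via a
    learned scalar gate."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                          # no previous token at position 0
    return x + g * prev
```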
BigramHash
Uses BigramHash with 2048 buckets as part of the base architecture.
parameters: {"buckets":2048}
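A bigram-hash feature usually hashes each (previous token, current token) pair into a fixed number of buckets that index an extra embedding table; the 2048 bucket count is from the PR, while the hash function and lookup below are assumptions:

```python
import numpy as np

N_BUCKETS = 2048  # from the PR's parameters

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    """Hash a (previous, current) token pair into a bucket index.
    The multiplier is an arbitrary prime, not the record's actual hash."""
    return (prev_tok * 1000003 + tok) % n_buckets

def bigram_features(tokens, table):
    """Look up one extra embedding per position from its bigram bucket."""
    out = np.zeros((len(tokens), table.shape[1]))
    for i in range(1, len(tokens)):
        out[i] = table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out
```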
MLP3x
3x MLP expansion.
parameters: {"expansion":3}
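The 3x expansion means the MLP hidden width is 3 × d_model rather than the conventional 4x. A sketch, with ReLU standing in for whatever activation the record actually uses:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x expansion: hidden width is 3 * d_model.
    ReLU is a placeholder activation (assumption)."""
    h = np.maximum(x @ w_in, 0.0)  # (T, 3 * d_model)
    return h @ w_out               # back to (T, d_model)
```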
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
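With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal dense sketch (no causal mask, for brevity):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q has n_heads heads, k/v have n_kv_heads;
    each KV head is shared by n_heads // n_kv_heads query heads."""
    group = n_heads // n_kv_heads
    head_dim = q.shape[2]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # which KV head serves this query head
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = w @ v[kv]
    return out
```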
tied embeddings
Input and output embeddings are tied.
parameters: null
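Tying embeddings reuses the input embedding matrix (transposed) as the output head, which matters for a size-constrained artifact since it removes an entire vocab_size × d_model matrix. A sketch:

```python
import numpy as np

class TiedLM:
    """Tied embeddings: the output projection reuses the input embedding
    matrix transposed, so the two layers share one parameter matrix."""
    def __init__(self, vocab_size, d_model, rng):
        self.emb = 0.02 * rng.standard_normal((vocab_size, d_model))

    def embed(self, tokens):
        return self.emb[tokens]     # (T, d_model) input vectors

    def logits(self, hidden):
        return hidden @ self.emb.T  # (T, vocab_size), same weights
```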
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
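Muon applies momentum SGD to each weight matrix and then approximately orthogonalizes the update with a quintic Newton-Schulz iteration before applying it. A sketch using the record's hyperparameters; the Nesterov-style blend and decoupled weight decay follow the common Muon reference implementation and are assumptions about this particular run:

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration: pushes the singular values of G
    toward 1, approximately orthogonalizing the update matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # tuned quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(param, grad, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon step with this record's hyperparameters: momentum on the
    gradient, orthogonalize the update, apply with decoupled decay."""
    buf = momentum * buf + grad
    update = newton_schulz5(grad + momentum * buf)  # Nesterov-style blend
    param = param * (1.0 - lr * weight_decay) - lr * update
    return param, buf
```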
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":32}
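Strided sliding-window evaluation scores only the last `stride` tokens of each window, so every scored token sees nearly a full window of context; shrinking the stride from 64 to 32 buys more context per token at more compute. A sketch where stride=32 is the PR's value and the window length is an assumption:

```python
def sliding_window_nll(nll_fn, tokens, window=512, stride=32):
    """Strided evaluation: slide a fixed window over the sequence and
    score only the last `stride` tokens of each step. `nll_fn(chunk, n)`
    stands in for a forward pass returning the summed NLL (nats) of the
    last n tokens of chunk. Dividing total NLL in bits by the byte count
    of the text would give bpb; mean NLL per token is shown here."""
    total, scored = 0.0, 0
    for pos in range(0, len(tokens), stride):
        n_new = min(stride, len(tokens) - pos)  # tokens scored this step
        start = max(0, pos + n_new - window)    # left edge of the context
        total += nll_fn(tokens[start:pos + n_new], n_new)
        scored += n_new
    return total / scored
```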
Test-Time Training
full TTT
parameters: {"epochs":8,"learning_rate":0.002,"momentum":0.9}
Initialization
OrthoInit
Orthogonal initialization.
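Orthogonal initialization is conventionally done by QR-decomposing a Gaussian matrix and keeping the orthonormal factor. A sketch of the standard recipe (the record's exact gain is unknown):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization: QR-decompose a Gaussian matrix and keep
    Q, with signs corrected so the result is uniformly distributed over
    orthogonal matrices."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # sign correction for uniformity
    return gain * (q if rows >= cols else q.T)
```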

Novel Contributions

  • Increased test-time training from 3 to 8 epochs
  • Reduced evaluation stride from 64 to 32
  • Pure eval-time improvement with no architecture or training changes
  • Achieved a new record validation bpb of 1.1295