PR #1605

closed

Score-First TTT + Causal N-gram (order=82) — val_bpb 0.29882 (3-seed mean)

by renqianluo
val_bpb: 0.2988
Architecture: Transformer
Optimizer: SGD
Artifact Size: ≤16MB

Training Techniques

Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.005,"optimizer":"SGD"}
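A minimal sketch of the score-first ordering, assuming a toy bias-only categorical model in place of the Transformer. The function name `score_first_ttt` and the chunking are hypothetical; only `learning_rate=0.005`, `epochs=1`, and plain SGD come from the parameters listed above. The point it illustrates is that each chunk contributes to the reported loss before any gradient step sees it.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score_first_ttt(chunks, vocab_size, lr=0.005, epochs=1):
    """Score each chunk with the current weights, then adapt on it."""
    logits = [0.0] * vocab_size  # toy stand-in for model weights (assumption)
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        if not chunk:
            continue
        # 1) Score first: the weights have not been trained on this chunk yet.
        probs = softmax(logits)
        total_nll += -sum(math.log(probs[t]) for t in chunk)
        total_tokens += len(chunk)
        # 2) Then update: plain SGD on the mean cross-entropy of the same chunk.
        for _ in range(epochs):
            probs = softmax(logits)
            grad = list(probs)
            for t in chunk:
                grad[t] -= 1.0 / len(chunk)
            logits = [w - lr * g for w, g in zip(logits, grad)]
    bits_per_token = total_nll / (total_tokens * math.log(2)) if total_tokens else 0.0
    return bits_per_token, logits
```

Because the update happens strictly after scoring, the first chunk is always evaluated with the unadapted weights, and later chunks benefit only from earlier ones.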
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"learning_rate":0.005}
Architecture
BigramHash
Causal backoff n-gram mixer built during evaluation with high-order context memory and full_c_fix gating.
parameters: {"order":82,"buckets":4194304,"full_c_fix":1}
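The following is a rough sketch of a causal backoff n-gram table with hashed buckets, not the PR's implementation. The class `CausalBackoffNgram`, its methods, and the reading of `order` as "context length is at most order minus one" are assumptions. Returning `None` when no context has been seen is one possible interpretation of `full_c_fix`; the PR summary does not spell out the exact rule. The default bucket count `1 << 22` equals the listed 4194304, while the small default order is chosen only to keep the example readable.

```python
class CausalBackoffNgram:
    """Hashed n-gram counts, queried longest context first (illustrative only)."""

    def __init__(self, order=4, buckets=1 << 22):
        self.order = order
        self.buckets = buckets
        self.ctx_counts = {}   # bucket -> times a context was seen
        self.pair_counts = {}  # bucket -> times (context, next token) was seen

    def _bucket(self, key):
        return hash(key) % self.buckets

    def predict(self, history, token):
        """P(token | longest previously seen context), or None if none was seen."""
        max_len = min(self.order - 1, len(history))
        for n in range(max_len, 0, -1):  # back off from long to short contexts
            ctx = tuple(history[-n:])
            seen = self.ctx_counts.get(self._bucket((n, ctx)), 0)
            if seen > 0:
                hits = self.pair_counts.get(self._bucket((n, ctx, token)), 0)
                return hits / seen
        return None  # assumed full_c_fix behaviour: no prediction for unseen contexts

    def update(self, history, token):
        max_len = min(self.order - 1, len(history))
        for n in range(1, max_len + 1):
            ctx = tuple(history[-n:])
            cb = self._bucket((n, ctx))
            pb = self._bucket((n, ctx, token))
            self.ctx_counts[cb] = self.ctx_counts.get(cb, 0) + 1
            self.pair_counts[pb] = self.pair_counts.get(pb, 0) + 1

def score_sequence(model, tokens):
    """Causal use: query the table for each token before adding that token to it."""
    preds = []
    for i, tok in enumerate(tokens):
        preds.append(model.predict(tokens[:i], tok))
        model.update(tokens[:i], tok)
    return preds
```

Calling `score_sequence(CausalBackoffNgram(), tokens)` keeps the table causal: a token never influences its own prediction, since the counts are built during evaluation from already-scored tokens only.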
Evaluation
stride-based eval
parameters: {"stride":96}
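As an illustration of stride-based evaluation, the sketch below generates sliding windows that advance by `stride` tokens and score only the tokens not covered by an earlier window. The helper name `stride_windows` and the `window` length are hypothetical; only `stride=96` comes from the parameters above, and the PR may slice its windows differently.

```python
def stride_windows(n_tokens, window, stride=96):
    """Yield (ctx_start, end, score_start) triples.

    The model would read tokens[ctx_start:end] and only the losses for
    tokens[score_start:end] would be counted, so every token is scored once.
    """
    if stride <= 0 or stride > window:
        raise ValueError("stride must be in 1..window")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
```

A smaller stride gives each scored token more left context at the cost of more forward passes; the first window is the only one that scores tokens with little or no context.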

Novel Contributions

  • Score-first test-time training that scores each chunk before updating weights
  • Causal backoff n-gram mixer with order 82
  • Entropy-adaptive blending between neural and n-gram predictions
  • full_c_fix gating to avoid n-gram predictions for unseen contexts
  • Aggressive n-gram blending centered at entropy 1.0 bits
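One way the entropy-adaptive blending listed above could look is sketched here, under stated assumptions: the mixing weight is a sigmoid of the neural distribution's entropy, centered at 1.0 bits as in the last bullet. The function names, the `slope` value, and the sigmoid form itself are hypothetical, since the PR summary gives only the center. When the n-gram side has no prediction (an unseen context), the neural distribution is returned unchanged.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def blend(p_neural, p_ngram, center=1.0, slope=4.0):
    """Mix neural and n-gram distributions; trust the n-gram more when
    the neural model is uncertain (entropy above `center` bits)."""
    if p_ngram is None:  # no n-gram prediction for this context
        return list(p_neural)
    h = entropy_bits(p_neural)
    # slope is an illustrative value, not taken from the PR
    alpha = 1.0 / (1.0 + math.exp(-slope * (h - center)))
    return [(1.0 - alpha) * pn + alpha * pg for pn, pg in zip(p_neural, p_ngram)]
```

With this form, a neural distribution at exactly 1.0 bit of entropy is mixed half and half, while a confident neural prediction is left nearly untouched. A lower center shifts more positions toward the n-gram side.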