PR #511

open

Non Record: Add PPM heuristic for test-time learning

by AnirudhRahul
val_bpb: 1.1417
Architecture: Transformer
Optimizer:
Artifact Size: 15.6 MB to 15.85 MB

Training Techniques

Quantization: int5/int6
  bits: null
  scope: null
Architecture: BigramHash
  BigramHash(10240) used as part of the model architecture
  parameters: {"hash_size": 10240}
Evaluation: sliding window eval
  parameters: {"stride": 32, "context_length": 2048}
Test-Time Training: delayed outside-context-only PPM
  parameters: {"delay": 2048, "K": 15, "k_values": [16, 12, 8, 6], "min_confidence": [1, 1, 1, 0.95], "min_count": [1, 1, 1, 1], "bos_id": 1}
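
The sliding-window evaluation listed above (stride 32 inside a 2048-token context) can be sketched as follows. This is a minimal illustration, not code from the submission: `window_logprob_fn` is a hypothetical stand-in for one model forward pass that returns natural-log probabilities for the last `n_targets` tokens of a window.

```python
import math

def sliding_window_bpb(window_logprob_fn, tokens, context_length=2048, stride=32):
    """Score tokens in chunks of `stride`, giving each chunk the longest left
    context that fits in `context_length`, and only counting the tokens that
    are new to each window. Returns mean bits per token."""
    total_nll, n_scored = 0.0, 0
    pos = 1  # token 0 has no left context, so scoring starts at position 1
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        start = max(0, end - context_length)   # window never exceeds context_length
        window = tokens[start:end]
        n_targets = end - pos                  # score only the newly exposed tokens
        for lp in window_logprob_fn(window, n_targets):
            total_nll -= lp                    # natural-log prob -> negative log-lik
        n_scored += n_targets
        pos = end
    return total_nll / n_scored / math.log(2)  # convert nats to bits
```

With stride much smaller than the context length, almost every scored token sees close to the full 2048 tokens of context, at the cost of roughly `context_length / stride` times more forward passes than a non-overlapping evaluation.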

Novel Contributions

  • Introduced a delayed, outside-context-only PPM heuristic applied at evaluation time to improve test-time performance without changing the training architecture
  • The PPM bank only uses targets from positions at least 2048 tokens behind the current position, ensuring no overlap with the transformer's local context window
  • Combines the transformer's local context with the delayed PPM's longer-range repeated-sequence signal
  • Demonstrated a consistent val_bpb improvement across 3 seeds, with small but statistically significant gains
  • Self-contained snapshot submission, including train_gpt.py and trie_bench.c for the delayed PPM implementation
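
The delayed outside-context-only PPM described above can be sketched as follows. The actual submission implements this in C (trie_bench.c); this dict-based Python version is only illustrative, with the delay, context orders, and per-order thresholds mirroring the listed parameters (the `bos_id` handling is omitted).

```python
from collections import defaultdict

class DelayedPPM:
    """Sketch of a delayed, outside-context-only PPM bank.

    Counts are updated only with targets at least `delay` tokens behind the
    current position, so the bank's statistics never overlap the transformer's
    local context window. At prediction time, the longest context order in
    `k_values` whose best continuation meets that order's count and confidence
    thresholds yields a prediction; otherwise the bank abstains."""

    def __init__(self, delay=2048, k_values=(16, 12, 8, 6),
                 min_confidence=(1, 1, 1, 0.95), min_count=(1, 1, 1, 1)):
        self.delay = delay
        self.k_values = k_values
        self.min_confidence = min_confidence
        self.min_count = min_count
        # counts[k][context_tuple][next_token] -> occurrence count
        self.counts = {k: defaultdict(lambda: defaultdict(int)) for k in k_values}

    def update(self, tokens, pos):
        """Ingest the n-grams whose target just fell `delay` tokens behind `pos`."""
        t = pos - self.delay  # target position now safe to learn from
        for k in self.k_values:
            if t >= k:
                ctx = tuple(tokens[t - k:t])
                self.counts[k][ctx][tokens[t]] += 1

    def predict(self, tokens, pos):
        """Return (token, confidence) from the longest qualifying context, else None."""
        for k, conf_min, cnt_min in zip(self.k_values, self.min_confidence, self.min_count):
            if pos < k:
                continue
            dist = self.counts[k].get(tuple(tokens[pos - k:pos]))
            if not dist:
                continue  # back off to the next shorter context order
            best, best_cnt = max(dist.items(), key=lambda kv: kv[1])
            if best_cnt >= cnt_min and best_cnt / sum(dist.values()) >= conf_min:
                return best, best_cnt / sum(dist.values())
        return None
```

When `predict` fires (e.g. inside a long repeated span), its output can override or be blended with the transformer's next-token distribution; when it abstains, the transformer's prediction is used unchanged.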