val_bpb: 1.1417
Architecture: Transformer
Optimizer: —
Artifact Size: 15.6–15.85 MB
Training Techniques
- Quantization: int5/int6 (bits: null, scope: null)
- Architecture: BigramHash(10240) used as part of the model architecture; parameters: {"hash_size":10240}
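The exact role of the BigramHash component is not specified beyond its bucket count, but a minimal sketch of the usual pattern — hashing each token bigram into a fixed table of hash_size=10240 buckets, e.g. to index an auxiliary embedding or count table — might look like this (the multiplier constant and padding convention are illustrative assumptions):

```python
def bigram_hash(prev_id: int, cur_id: int, hash_size: int = 10240) -> int:
    """Map a token bigram to a bucket in [0, hash_size)."""
    # 1_000_003 is an arbitrary large prime chosen for illustration.
    return (prev_id * 1_000_003 + cur_id) % hash_size

def bigram_bucket_sequence(token_ids, hash_size: int = 10240):
    """Bucket index for each position, keyed on the preceding bigram."""
    buckets = []
    for i, cur in enumerate(token_ids):
        prev = token_ids[i - 1] if i > 0 else 0  # assume id 0 pads position 0
        buckets.append(bigram_hash(prev, cur, hash_size))
    return buckets
```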
- Evaluation: sliding window eval; parameters: {"stride":32,"context_length":2048}
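Sliding-window evaluation with these parameters means sliding a 2048-token window over the sequence in steps of 32 and scoring only the final stride positions of each window, so every token is predicted with near-maximal context. A sketch of the window arithmetic (the helper name is ours):

```python
def sliding_window_spans(seq_len: int, context_length: int = 2048, stride: int = 32):
    """Yield (window_start, window_end, score_start) triples.

    Only positions in [score_start, window_end) contribute to the loss;
    positions in [window_start, score_start) serve as context.
    """
    spans = []
    pos = 0
    while pos < seq_len:
        end = min(pos + stride, seq_len)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans
```

The scored regions tile the sequence exactly once, so the summed loss is comparable to a single full-context pass.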
- Test-Time Training: delayed outside-context-only PPM; parameters: {"delay":2048,"K":15,"k_values":[16,12,8,6],"min_confidence":[1,1,1,0.95],"min_count":[1,1,1,1],"bos_id":1}
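A minimal sketch of the delayed outside-context-only PPM heuristic, based on the listed parameters (delay=2048, context orders k_values=[16,12,8,6], with per-order min_confidence and min_count gates). Class and method names are assumptions; the actual implementation is in trie_bench.c:

```python
from collections import defaultdict

class DelayedPPM:
    def __init__(self, delay=2048, k_values=(16, 12, 8, 6),
                 min_confidence=(1, 1, 1, 0.95), min_count=(1, 1, 1, 1)):
        self.delay = delay
        self.k_values = k_values
        self.min_confidence = min_confidence
        self.min_count = min_count
        # One context -> target-count table per order.
        self.tables = [defaultdict(lambda: defaultdict(int)) for _ in k_values]
        self._inserted = 0  # number of positions already indexed

    def insert_up_to(self, tokens, pos):
        """Index only targets at least `delay` tokens behind `pos`."""
        limit = max(pos - self.delay, 0)
        for t in range(self._inserted, limit):
            for i, k in enumerate(self.k_values):
                if t >= k:
                    ctx = tuple(tokens[t - k:t])
                    self.tables[i][ctx][tokens[t]] += 1
        self._inserted = max(self._inserted, limit)

    def predict(self, tokens, pos):
        """Longest-order match whose top continuation passes both gates."""
        self.insert_up_to(tokens, pos)
        for i, k in enumerate(self.k_values):
            if pos < k:
                continue
            counts = self.tables[i].get(tuple(tokens[pos - k:pos]))
            if not counts:
                continue
            tok, c = max(counts.items(), key=lambda kv: kv[1])
            if c >= self.min_count[i] and c / sum(counts.values()) >= self.min_confidence[i]:
                return tok
        return None
```

Because `insert_up_to` never indexes positions closer than `delay` tokens, the bank can only fire on repeats that fall outside the transformer's local window.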
Novel Contributions
- Introduced a delayed, outside-context-only PPM heuristic at evaluation time to improve test-time performance without changing the training architecture
- The PPM bank uses only targets from positions at least 2048 tokens behind the current position, ensuring no overlap with the transformer's local context window
- Combines the transformer's local context with the delayed PPM's longer-range repeated-sequence signal
- Demonstrated a consistent val_bpb improvement across 3 seeds, with small but statistically significant gains
- Self-contained snapshot submission, including train_gpt.py and trie_bench.c for the delayed PPM implementation
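The source does not specify how the transformer and PPM signals are combined. One common rule, shown purely as an illustrative assumption, is to interpolate a confident PPM prediction into the transformer's distribution with a fixed mixing weight:

```python
def combine(transformer_probs, ppm_token, alpha=0.5):
    """Hypothetical mixing rule: blend a one-hot PPM prediction into the
    transformer's next-token distribution. `alpha` is illustrative; the
    submission does not document the actual combination."""
    if ppm_token is None:  # PPM gates did not fire: use transformer as-is
        return transformer_probs
    mixed = [(1 - alpha) * p for p in transformer_probs]
    mixed[ppm_token] += alpha
    return mixed
```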