val_bpb: 1.1417
Architecture: Transformer
Optimizer: —
Artifact Size: 15.6–15.85 MB
Training Techniques
- Quantization: int5/int6 (bits: null, scope: null)
- Architecture: BigramHash(10240) used as part of the model architecture; parameters: {"hash_size":10240}
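The exact role of the BigramHash component is not specified beyond its bucket count, but a minimal sketch of the usual pattern — hashing each token bigram into a fixed table of hash_size=10240 buckets, e.g. to index an auxiliary embedding or count table — might look like this (the multiplier constant and padding convention are illustrative assumptions):

```python
def bigram_hash(prev_id: int, cur_id: int, hash_size: int = 10240) -> int:
    """Map a token bigram to a bucket in [0, hash_size)."""
    # 1_000_003 is an arbitrary large prime chosen for illustration.
    return (prev_id * 1_000_003 + cur_id) % hash_size

def bigram_bucket_sequence(token_ids, hash_size: int = 10240):
    """Bucket index for each position, keyed on the preceding bigram."""
    buckets = []
    for i, cur in enumerate(token_ids):
        prev = token_ids[i - 1] if i > 0 else 0  # assume id 0 pads position 0
        buckets.append(bigram_hash(prev, cur, hash_size))
    return buckets
```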
- Evaluation: sliding window eval; parameters: {"stride":32,"context_length":2048}
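Sliding-window evaluation with these parameters means sliding a 2048-token window over the sequence in steps of 32 and scoring only the final stride positions of each window, so every token is predicted with near-maximal context. A sketch of the window arithmetic (the helper name is ours):

```python
def sliding_window_spans(seq_len: int, context_length: int = 2048, stride: int = 32):
    """Yield (window_start, window_end, score_start) triples.

    Only positions in [score_start, window_end) contribute to the loss;
    positions in [window_start, score_start) serve as context.
    """
    spans = []
    pos = 0
    while pos < seq_len:
        end = min(pos + stride, seq_len)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans
```

The scored regions tile the sequence exactly once, so the summed loss is comparable to a single full-context pass.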
- Test-Time Training: delayed outside-context-only PPM; parameters: {"delay":2048,"K":15,"k_values":[16,12,8,6],"min_confidence":[1,1,1,0.95],"min_count":[1,1,1,1],"bos_id":1}
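A minimal sketch of the delayed outside-context-only PPM heuristic, based on the listed parameters (delay=2048, context orders k_values=[16,12,8,6], with per-order min_confidence and min_count gates). Class and method names are assumptions; the actual implementation is in trie_bench.c:

```python
from collections import defaultdict

class DelayedPPM:
    def __init__(self, delay=2048, k_values=(16, 12, 8, 6),
                 min_confidence=(1, 1, 1, 0.95), min_count=(1, 1, 1, 1)):
        self.delay = delay
        self.k_values = k_values
        self.min_confidence = min_confidence
        self.min_count = min_count
        # One context -> target-count table per order.
        self.tables = [defaultdict(lambda: defaultdict(int)) for _ in k_values]
        self._inserted = 0  # number of positions already indexed

    def insert_up_to(self, tokens, pos):
        """Index only targets at least `delay` tokens behind `pos`."""
        limit = max(pos - self.delay, 0)
        for t in range(self._inserted, limit):
            for i, k in enumerate(self.k_values):
                if t >= k:
                    ctx = tuple(tokens[t - k:t])
                    self.tables[i][ctx][tokens[t]] += 1
        self._inserted = max(self._inserted, limit)

    def predict(self, tokens, pos):
        """Longest-order match whose top continuation passes both gates."""
        self.insert_up_to(tokens, pos)
        for i, k in enumerate(self.k_values):
            if pos < k:
                continue
            counts = self.tables[i].get(tuple(tokens[pos - k:pos]))
            if not counts:
                continue
            tok, c = max(counts.items(), key=lambda kv: kv[1])
            if c >= self.min_count[i] and c / sum(counts.values()) >= self.min_confidence[i]:
                return tok
        return None
```

Because `insert_up_to` never indexes positions closer than `delay` tokens, the bank can only fire on repeats that fall outside the transformer's local window.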
Novel Contributions
- Introduced a delayed, outside-context-only PPM heuristic at evaluation time to improve test-time performance without changing the training architecture
- The PPM bank uses only targets from positions at least 2048 tokens behind the current position, ensuring no overlap with the transformer's local context window
- Combines the transformer's local context with the delayed PPM's longer-range repeated-sequence signal
- Demonstrated a consistent val_bpb improvement across 3 seeds, with small but statistically significant gains
- Self-contained snapshot submission, including train_gpt.py and trie_bench.c for the delayed PPM implementation
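The source does not specify how the transformer and PPM signals are combined. One common rule, shown purely as an illustrative assumption, is to interpolate a confident PPM prediction into the transformer's distribution with a fixed mixing weight:

```python
def combine(transformer_probs, ppm_token, alpha=0.5):
    """Hypothetical mixing rule: blend a one-hot PPM prediction into the
    transformer's next-token distribution. `alpha` is illustrative; the
    submission does not document the actual combination."""
    if ppm_token is None:  # PPM gates did not fire: use transformer as-is
        return transformer_probs
    mixed = [(1 - alpha) * p for p in transformer_probs]
    mixed[ppm_token] += alpha
    return mixed
```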