val_bpb: 0.3779
Architecture: Transformer
Optimizer: —
Artifact Size: 16MB
Training Techniques

- Evaluation: sliding window eval
  - parameters: {"stride": 64}
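A sliding window eval with a short stride scores each token exactly once while guaranteeing it at least `ctx_len - stride` tokens of left context. A minimal sketch of that scheme; the `logprob_fn` callback, the context length, and the return convention are assumptions, not the submission's actual harness:

```python
import math

def sliding_window_nll(tokens, logprob_fn, ctx_len=128, stride=64):
    """Average negative log-likelihood under sliding-window evaluation.

    Windows start every `stride` tokens and span up to `ctx_len` tokens.
    Each window scores only the tokens no earlier window has scored, so
    every token is scored exactly once and (past the first window) sees
    at least ctx_len - stride tokens of context.
    """
    total_nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + ctx_len, len(tokens))
        for i in range(prev_end, end):          # only not-yet-scored tokens
            context = tokens[begin:i]           # context confined to this window
            total_nll += -logprob_fn(context, tokens[i])
            scored += 1
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / scored
```

Dividing the total NLL by `math.log(2)` times the byte count of the text, rather than by the token count, would give bits per byte (val_bpb).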
- Architecture: BigramHash, a hash-table-based n-gram cache/mixer built only from already-scored tokens during evaluation
  - parameters: {"orders": "2-9", "buckets": 4000000}
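As described, BigramHash stays strictly causal: n-grams are inserted into the hash table only after their final token has been scored, and lookups at prediction time therefore never see the current token. A minimal sketch under stated assumptions: the class and method names are hypothetical, a dict-of-dicts stands in for the flat bucket arrays a real implementation would use, and the bucket count is a toy value rather than 4,000,000:

```python
from collections import defaultdict

class BigramHashCache:
    """Hash-bucketed n-gram counts built only from already-scored tokens.

    After each token is scored, every n-gram (orders min_order..max_order)
    ending at that token is hashed into a fixed table of `buckets` cells.
    At prediction time the cache returns a count-based probability for the
    candidate token under the longest matching context, which a mixer can
    blend with the base model.  Hash collisions are tolerated by design.
    """
    def __init__(self, min_order=2, max_order=9, buckets=4_000_000, min_count=2):
        self.min_order, self.max_order = min_order, max_order
        self.buckets, self.min_count = buckets, min_count
        self.table = defaultdict(lambda: defaultdict(int))  # bucket -> {token: count}

    def _bucket(self, key):
        return hash(key) % self.buckets

    def update(self, history):
        """Call only after history[-1] has been scored (strict causality)."""
        for order in range(self.min_order, self.max_order + 1):
            if len(history) < order:
                break
            context = tuple(history[-order:-1])             # order-1 context tokens
            self.table[self._bucket((order,) + context)][history[-1]] += 1

    def predict(self, history, token):
        """Count-based probability of `token` under the longest known context."""
        for order in range(self.max_order, self.min_order - 1, -1):
            if len(history) < order - 1:
                continue
            context = tuple(history[-(order - 1):])
            counts = self.table.get(self._bucket((order,) + context))
            if counts and sum(counts.values()) >= self.min_count:
                return counts.get(token, 0) / sum(counts.values())
        return None                                         # back off to base model
```

With this layout, shrinking `buckets` folds rare contexts together, which is how collisions can push pooled counts over the `min_count` threshold, matching the observation in the contributions list.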
- Test-Time Training: score-first TTT
  - parameters: null
- Test-Time Training: LoRA TTT
  - parameters: {"epochs": 8}
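LoRA TTT adapts the model on the eval text itself by training only a low-rank update to each frozen weight matrix, here for 8 epochs. A sketch of the LoRA forward pass using plain nested lists; the shapes, scaling convention, and function name are illustrative assumptions, not the submission's code:

```python
def lora_linear(x, W, A, B, alpha=16):
    """y = x @ (W + (alpha / r) * A @ B): frozen weight plus low-rank update.

    W is d_in x d_out and stays frozen; only A (d_in x r) and B (r x d_out)
    are trained at test time, so the adapted-parameter count stays tiny.
    Matrices are lists of rows; rank r is the number of rows of B.
    """
    r = len(B)
    d_in, d_out = len(W), len(W[0])
    y = [0.0] * d_out
    for i in range(d_in):                   # x @ W (frozen path)
        for j in range(d_out):
            y[j] += x[i] * W[i][j]
    scale = alpha / r
    # low-rank path through the rank-r bottleneck: (x @ A) @ B
    xa = [sum(x[i] * A[i][k] for i in range(d_in)) for k in range(r)]
    for k in range(r):
        for j in range(d_out):
            y[j] += scale * xa[k] * B[k][j]
    return y
```

Because the frozen path dominates the compute and only `A` and `B` receive gradients, eight epochs over a short eval document remain cheap relative to full fine-tuning.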
- Quantization: int6
  - bits: 6
  - scope: all
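int6 with scope "all" maps every weight to a 6-bit integer in [-32, 31]. A sketch of symmetric round-to-nearest int6 quantization; the submission's exact scheme (scale granularity, packing layout) isn't specified, and a real int6 artifact would also bit-pack four values into three bytes, which is what makes the 16MB size plausible:

```python
def quantize_int6(weights):
    """Symmetric 6-bit quantization: q = clamp(round(w / scale), -32, 31).

    The scale maps the largest-magnitude weight to +/-31 so the full
    int6 range is used.  Sketch only: values are returned as Python ints
    here rather than bit-packed four-per-three-bytes.
    """
    peak = max(abs(w) for w in weights) or 1.0   # guard against all-zero input
    scale = peak / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate weights; error is bounded by scale / 2 in-range."""
    return [v * scale for v in q]
```

The round trip loses at most half a quantization step per in-range weight, which is the usual accuracy/size trade-off a 6-bit scope-all scheme accepts.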
Novel Contributions
- Eval-time n-gram caching/mixing that preserves strict causality by using only already-scored tokens
- Demonstration that a pure n-gram cache can outperform the neural base model on FineWeb validation
- Finding that smaller hash tables with more collisions can improve BPB because collisions help counts cross the min_count threshold
- Global all-reduce synchronization of n-gram hash table deltas across GPUs to avoid cache fragmentation
- Proposal to cap eval-time memory or per-token latency as a competition rule clarification
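The all-reduce contribution above amounts to summing per-GPU count deltas and applying the same sum on every rank, so all hash-table replicas stay identical. A single-process stand-in for that collective; the real implementation presumably flattens deltas into a tensor for `torch.distributed.all_reduce`, and the function name here is hypothetical:

```python
def allreduce_ngram_deltas(per_rank_deltas):
    """Sum per-rank n-gram count deltas, as an all-reduce would.

    Each rank accumulates {bucket: count_delta} since the last sync.
    Applying the summed deltas everywhere keeps every replica's hash
    table identical, preventing cache fragmentation across
    data-parallel shards.  (Single-process simulation of the GPU
    collective.)
    """
    merged = {}
    for delta in per_rank_deltas:
        for bucket, count in delta.items():
            merged[bucket] = merged.get(bucket, 0) + count
    return merged
```

Each rank would then add `merged` into its local table and clear its own delta buffer before scoring resumes.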