PR #886 (open)

RFC: A framework for deciding the n-gram question

by abaybektursun
val_bpb: 0.3779
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 16MB

Training Techniques

  • Evaluation: sliding window eval; parameters: {"stride":64}
  • Architecture: BigramHash, a hash-table based n-gram cache/mix built from already-scored tokens during evaluation; parameters: {"orders":"2-9","buckets":4000000}
  • Test-Time Training: score-first TTT (parameters: null) and LoRA TTT (parameters: {"epochs":8})
  • Quantization: int6 (bits: 6, scope: all)
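
The sliding window eval with stride 64 can be sketched as below. The function name, the `score_fn` callback, and the window size are illustrative assumptions, not the PR's actual interface; only the stride semantics (each step re-scores just the newest `stride` tokens, reusing the rest of the window as context) come from the listing.

```python
def sliding_window_bits(tokens, score_fn, window=256, stride=64):
    """Sliding-window evaluation sketch.

    Each step scores only the last `stride` tokens of a `window`-sized
    chunk; the earlier tokens serve purely as left context. In steady
    state every scored token sees `window - stride` tokens of context.

    `score_fn(context, token)` returns -log2 p(token | context).
    Returns the total negative log-likelihood in bits.
    """
    total_bits = 0.0
    scored = 0  # number of tokens already scored
    while scored < len(tokens):
        ctx_start = max(0, scored + stride - window)
        chunk = tokens[ctx_start:scored + stride]
        new = min(stride, len(tokens) - scored)  # tokens scored this step
        for i in range(len(chunk) - new, len(chunk)):
            total_bits += score_fn(chunk[:i], chunk[i])
        scored += new
    return total_bits
```

Dividing the returned bits by the number of UTF-8 bytes in the underlying text would give bits per byte, the val_bpb metric reported above.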

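The BigramHash idea (orders 2-9, hashed into a fixed bucket count, built only from already-scored tokens) can be sketched as follows. The class name, the `min_count` gate, the longest-match backoff, and the add-one smoothing are assumptions for illustration; the strict-causality contract (update only after scoring) is the part taken from the PR description.

```python
from collections import defaultdict

class BigramHashCache:
    """Sketch of an eval-time n-gram cache (details assumed, not the PR's code).

    Counts n-grams over tokens that have already been scored, hashing each
    context into a fixed number of buckets. Strict causality: `update` is
    called only after the current token's score is recorded, so no token
    ever influences its own prediction.
    """

    def __init__(self, orders=range(2, 10), buckets=4_000_000, min_count=2):
        self.orders = list(orders)
        self.buckets = buckets
        self.min_count = min_count
        # bucket id -> {next_token: count}
        self.table = defaultdict(lambda: defaultdict(int))

    def _bucket(self, context):
        return hash(context) % self.buckets

    def predict(self, history, token, vocab_size):
        """Cache probability of `token`, backing off from the longest
        matching order; None if no order's bucket passes min_count."""
        for n in sorted(self.orders, reverse=True):
            ctx = tuple(history[-(n - 1):])
            if len(ctx) < n - 1:
                continue  # not enough history for this order
            counts = self.table.get(self._bucket(ctx))
            if counts:
                total = sum(counts.values())
                if total >= self.min_count:
                    # add-one smoothing over the vocabulary (an assumption)
                    return (counts.get(token, 0) + 1) / (total + vocab_size)
        return None

    def update(self, history, token):
        """Insert the n-grams ending in `token`. Call AFTER scoring it."""
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                self.table[self._bucket(ctx)][token] += 1
```

At eval time one would mix `p = (1 - lam) * p_model + lam * p_cache` whenever `predict` returns a probability, and fall back to the base model otherwise; calling `update` only after the token's score is recorded is what keeps the cache strictly causal.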
Novel Contributions

  • Eval-time n-gram caching/mixing that preserves strict causality by using only already-scored tokens
  • Demonstration that a pure n-gram cache can outperform the neural base model on FineWeb validation
  • Finding that smaller hash tables with more collisions can improve BPB because collisions help counts cross the min_count threshold
  • Global all-reduce synchronization of n-gram hash table deltas across GPUs to avoid cache fragmentation
  • Proposal to cap eval-time memory or per-token latency as a competition rule clarification
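
The collision finding above can be made concrete with a toy example (the contexts, counts, and threshold here are made up for illustration): two rare contexts that individually fall below min_count can, in a smaller table, hash into the same bucket, and their merged count crosses the threshold so the cache fires where it previously abstained.

```python
from collections import Counter

MIN_COUNT = 3

# Collision-free (large) table: two similar contexts, each seen twice.
exact = {("the", "cat"): Counter({"sat": 2}),
         ("a", "cat"):   Counter({"sat": 2})}

# Each bucket total is 2 < MIN_COUNT, so the cache abstains on both.
assert all(sum(c.values()) < MIN_COUNT for c in exact.values())

# Small table: the two contexts collide into one bucket; the merged
# count (2 + 2 = 4) now crosses MIN_COUNT and the cache can fire.
merged = Counter()
for c in exact.values():
    merged.update(c)
assert sum(merged.values()) >= MIN_COUNT
```

The multi-GPU point follows the same logic in reverse: if each replica kept only its own shard of counts, every table would look artificially sparse, so the PR's all-reduce of per-step count deltas would sum each bucket across GPUs and give every replica the merged table rather than a fragmented one.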