PR #1076

closed

Record: Packed Causal N-gram + Dirichlet Backoff — val_bpb 0.0109 (3-seed mean, NEW SOTA)

by sofiabod
val_bpb: 0.0109
Architecture: Transformer
Optimizer:
Artifact Size: ~1.5 MB

Training Techniques

Architecture
GPT
2-layer 128d GPT used as a vestigial neural backbone alongside n-gram cache scoring.
parameters: {"layers":2,"dimensions":128}
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
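The stride-64 sliding-window evaluation above can be sketched with the standard overlapping-window recipe; `nll_fn` is a hypothetical scoring hook (not part of the submission), and the exact windowing used in the PR may differ.

```python
def sliding_window_eval(tokens, nll_fn, seq_len=2048, stride=64):
    """Mean per-token NLL via overlapping windows (standard recipe).

    nll_fn(window, first) is a hypothetical hook returning per-token
    negative log-likelihoods for window[first:], conditioned on
    window[:first]. With stride << seq_len, every token after the
    first window is scored once with nearly seq_len of left context.
    """
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + seq_len, len(tokens))
        window = tokens[begin:end]
        first = prev_end - begin  # tokens before this index already scored
        total += sum(nll_fn(window, first))
        count += len(window) - first
        prev_end = end
        if end == len(tokens):
            break
    return total / count  # nats per token; divide by ln 2 for bits
```

Each token is scored exactly once, so the reported number is a true per-token average rather than double-counting overlapped positions.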
Other
Packed causal n-gram cache with precomputed hash tables for orders 2-12, each order stored in 32,768 buckets and zstd-compressed in the artifact.
parameters: {"orders":"2-12","buckets_per_order":32768}
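A minimal sketch of a multi-order hashed n-gram count cache under the parameters above (orders 2-12, 32,768 buckets per order); the hash function and per-bucket `Counter` storage are illustrative assumptions, not the submission's packed, zstd-compressed layout.

```python
from collections import Counter

class PackedNgramCache:
    """Sketch of a multi-order hashed n-gram count cache.

    Orders and buckets_per_order follow the PR parameters; the hash
    scheme and Counter-per-bucket storage are assumptions made for
    illustration, not the actual packed artifact format.
    """
    def __init__(self, orders=range(2, 13), buckets=32768):
        self.buckets = buckets
        self.tables = {n: {} for n in orders}  # order -> {bucket: Counter}

    def _bucket(self, context):
        # assumed bucketing: hash the context tuple into a fixed table
        return hash(context) % self.buckets

    def update(self, history, token):
        # record one observation of `token` under every order's context
        for n, table in self.tables.items():
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                table.setdefault(self._bucket(ctx), Counter())[token] += 1

    def counts(self, history, n):
        # next-token counts after the length-(n-1) context (empty if unseen)
        ctx = tuple(history[-(n - 1):])
        return self.tables[n].get(self._bucket(ctx), Counter())
```

Hash bucketing trades exactness for a fixed memory footprint: distinct contexts can collide into one bucket, which is part of what keeps the artifact small.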
Dirichlet posterior mixing with greedy highest-order-first backoff and count-confidence gating.
parameters: {"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6],"confidence_scale":12}
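One way to realize the backoff mixing described above, as a hedged sketch: each order's Dirichlet posterior predictive is blended with the lower-order estimate through a count-confidence gate. The gate form g = N / (N + confidence_scale) is an assumption about the PR's gating rule, not a confirmed formula; `concentrations` and `confidence_scale` mirror the listed parameters.

```python
def dirichlet_backoff(counts_by_order, token, vocab_size, concentrations,
                      confidence_scale=12):
    """Dirichlet posterior backoff mixing (illustrative sketch).

    counts_by_order: per-order {token: count} maps, LOWEST order first,
    so the highest order is blended in last and dominates whenever its
    counts are confident (the highest-order-first precedence).
    """
    p = 1.0 / vocab_size  # uniform base distribution
    for counts, alpha in zip(counts_by_order, concentrations):
        total = sum(counts.values())
        # Dirichlet(alpha) posterior predictive for this order
        post = (counts.get(token, 0) + alpha / vocab_size) / (total + alpha)
        gate = total / (total + confidence_scale)  # trust grows with counts
        p = gate * post + (1.0 - gate) * p
    return p
```

Because each per-order predictive is a proper distribution and the gates are convex weights, the blended output remains a valid probability distribution over the vocabulary.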
Score-first, single-pass evaluation with cache lookup before update, plus distributed prefill for a warm cache start.
parameters: null
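The score-first loop can be sketched as follows; `score` and `update` are hypothetical hooks onto the n-gram cache, and the real implementation presumably works over distributed ranks rather than a single list.

```python
def score_first_pass(tokens, score, update):
    """Single-pass, score-then-update evaluation loop (sketch).

    The cache is queried *before* the current token is inserted, so
    every prediction is strictly causal even though scoring and cache
    building share one pass over the data.
    """
    total = 0.0
    history = []
    for tok in tokens:
        total += score(history, tok)  # lookup first: token not yet cached
        update(history, tok)          # then admit it for later predictions
        history.append(tok)
    return total / len(tokens)
```

Updating before scoring would leak the current token into its own prediction; the lookup-before-update ordering is what keeps the evaluation honest.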
Sequence Length
train_length: null
eval_length: 2048

Novel Contributions

  • Packed causal n-gram cache with precomputed multi-order hash tables
  • Dirichlet posterior backoff mixing for n-gram/neural blending
  • Count-confidence gating for adaptive blending
  • Score-first single-pass evaluation with cache update after lookup
  • Distributed prefill to warm caches across ranks
  • Very small artifact (~1.5 MB) while achieving record validation performance (val_bpb 0.0109)