PR #931 (open)

Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate

by AnirudhRahul
val_bpb: 0.0498
Architecture: Transformer
Optimizer:
Artifact Size: 15,857,871 bytes

Training Techniques

Architecture
BigramHash
Removed the bigram hash path to make room for the packed training n-gram cache while retaining warm low-order n-gram signal through the artifact.
parameters: {"vocab_size":0}
SmearGate
Learned multi-expert weighting gate over the neural model and n-gram experts for orders 2 through 9.
parameters: {"experts":"neural + n-gram order 2..9"}
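The gate described above can be sketched as a softmax-weighted mixture over the experts' next-token distributions. This is a minimal illustration, not the submission's implementation: the function name and fixed gate logits are hypothetical, and in the actual model the gate weights would presumably be learned and context-dependent.

```python
import math

def gated_mixture(expert_probs, gate_logits):
    """Mix per-expert next-token distributions with softmax gate weights.

    expert_probs: list of rows, one probability distribution per expert
    gate_logits:  one learned score per expert (fixed here for illustration)
    """
    # Softmax over the gate logits (subtract max for numerical stability).
    m = max(gate_logits)
    w = [math.exp(g - m) for g in gate_logits]
    z = sum(w)
    w = [x / z for x in w]
    # Weighted sum of the experts' distributions -> one mixture distribution.
    vocab = len(expert_probs[0])
    return [sum(w[e] * expert_probs[e][t] for e in range(len(w)))
            for t in range(vocab)]

# Toy usage: a confident "neural" expert and a flat order-2 n-gram expert.
mix = gated_mixture(
    [[0.7, 0.1, 0.1, 0.1],          # neural model
     [0.25, 0.25, 0.25, 0.25]],     # order-2 n-gram expert
    [3.0, 0.0],                     # gate strongly favors the neural expert
)
```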
Other
Packed a 32K-bucket order-2..9 training n-gram cache into the artifact as 32-bit count tables so evaluation starts with a pre-warmed cache.
parameters: {"buckets":32768,"orders":"2..9","count_table_bits":32}
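A plausible shape for this packed artifact, given the stated parameters (32,768 buckets, orders 2..9, 32-bit counts): hash each n-gram into a bucket, accumulate uint32 counts per order, and serialize the tables back-to-back. The bucket hash and byte layout below are assumptions for illustration; at 8 orders x 32,768 buckets x 4 bytes the tables come to exactly 1 MiB, which fits comfortably inside the 15,857,871-byte artifact.

```python
import struct

BUCKETS = 32768
ORDERS = range(2, 10)  # orders 2..9, per the PR

def bucket(ngram, order):
    # Hypothetical FNV-1a-style fold of the order and token ids into a bucket.
    h = 2166136261
    for t in (order, *ngram):
        h = ((h ^ t) * 16777619) & 0xFFFFFFFF
    return h % BUCKETS

def build_tables(tokens):
    # One uint32 count table per order, every bucket starting at zero.
    tables = {o: [0] * BUCKETS for o in ORDERS}
    for o in ORDERS:
        for i in range(len(tokens) - o + 1):
            tables[o][bucket(tuple(tokens[i:i + o]), o)] += 1
    return tables

def serialize(tables):
    # Pack counts as little-endian uint32: 8 * 32768 * 4 bytes = 1 MiB total.
    return b"".join(
        struct.pack("<%dI" % BUCKETS, *tables[o]) for o in sorted(tables)
    )
```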
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":131072,"temperature":0.85,"freeze_blocks":2,"epochs":2,"learning_rate":0.0001}
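The "score-first" ordering matters for causality: each chunk is scored with the current weights before the model adapts on it, so no token is ever evaluated by a model that has already trained on it. The skeleton below shows only that control flow; `score_fn` and `train_step` are hypothetical placeholders, and the temperature and block-freezing from the parameters would live inside the real training step rather than in this loop.

```python
# Parameters from the PR; TEMPERATURE and FREEZE_BLOCKS would be applied
# inside the actual loss / optimizer setup, not in this outer loop.
CHUNK_TOKENS = 131072
TEMPERATURE = 0.85
FREEZE_BLOCKS = 2
EPOCHS = 2
LR = 1e-4

def score_first_ttt(chunks, score_fn, train_step):
    """Score each chunk with the current weights, then adapt on it."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        bits = score_fn(chunk)        # 1) score BEFORE any update (causal)
        total_bits += bits
        total_tokens += len(chunk)
        for _ in range(EPOCHS):       # 2) then do TTT on the same chunk
            train_step(chunk, lr=LR)
    return total_bits / total_tokens  # average bits per token
```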
Evaluation
stride-based eval
parameters: {"stride":64}
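Stride-based evaluation in the usual sense: slide a full-context window forward by `stride` tokens at a time and score only the tokens not covered by the previous window, so every token is evaluated exactly once with plenty of left context. A sketch of the window bookkeeping, with a hypothetical function name and an assumed context length:

```python
def stride_eval_windows(n_tokens, max_len, stride=64):
    """Yield (begin, end, n_scored) windows for strided evaluation.

    Each window spans up to `max_len` tokens; only the tokens past the
    previous window's end are scored, so coverage is exact and disjoint.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        n_scored = end - prev_end   # only the newly exposed tokens
        yield begin, end, n_scored
        prev_end = end
        if end == n_tokens:
            break
```

With stride 64 and a long context window, almost every scored token sees near-full left context, at the cost of re-running the model over heavily overlapping windows.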
Regularization
magnitude pruning
parameters: {"prune_pct":0.05}
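Magnitude pruning at 5% means zeroing the 5% of weights with the smallest absolute value. A minimal flat-list sketch (real implementations typically prune tensors per-layer or globally with masks; the function name here is hypothetical):

```python
def magnitude_prune(weights, prune_pct=0.05):
    """Zero out the smallest-|w| fraction of a flat weight list."""
    k = int(len(weights) * prune_pct)
    if k == 0:
        return list(weights)
    # Threshold at the k-th smallest magnitude.
    thresh = sorted(abs(w) for w in weights)[k - 1]
    pruned, out = 0, []
    for w in weights:
        if abs(w) <= thresh and pruned < k:
            out.append(0.0)           # prune: smallest-magnitude weight
            pruned += 1
        else:
            out.append(w)             # keep: survivors are unchanged
    return out
```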

Novel Contributions

  • Learned weighting gate over neural and order-2..9 n-gram experts
  • Packed 32K-bucket training n-gram artifact serialized into the submission
  • Single-pass causal evaluation with pre-warmed cache and online updates
  • Removed bigram hash path to fit the packed cache under the 16MB artifact limit
  • Simplified evaluation by removing maturity decay and heuristic switching
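The third bullet's combination of a pre-warmed cache with online updates can be illustrated with a toy bigram counter: each token is scored from the counts accumulated so far (seeded from the artifact, or empty), and only then are the counts updated, keeping the pass strictly causal. Everything here is a simplified stand-in; the real system hashes orders 2..9 into buckets and mixes with the neural model through the gate.

```python
import math
from collections import defaultdict

def online_bigram_eval(tokens, vocab=256, alpha=1.0, warm=None):
    """Single causal pass: score each token from counts so far, then update.

    `warm` optionally pre-seeds the counts, standing in for the packed
    training n-gram artifact; `alpha` is Laplace smoothing.
    """
    pair = defaultdict(int)   # (prev, tok) -> count
    ctx = defaultdict(int)    # prev -> total count
    if warm:
        for (p, t), c in warm.items():
            pair[(p, t)] += c
            ctx[p] += c
    bits = 0.0
    for i in range(1, len(tokens)):
        p, t = tokens[i - 1], tokens[i]
        # 1) Score first, using only counts observed (or pre-warmed) so far.
        prob = (pair[(p, t)] + alpha) / (ctx[p] + alpha * vocab)
        bits += -math.log2(prob)
        # 2) Then update the cache online.
        pair[(p, t)] += 1
        ctx[p] += 1
    return bits / (len(tokens) - 1)   # average bits per token
```

On a predictable stream, a pre-warmed cache scores strictly better than a cold one, which is the point of shipping the counts inside the artifact.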