PR #843 (open)

Record: Order-12 N-gram Backoff + 256K Chunks — 0.2834 BPB

by quietsmile
val_bpb: 0.2834
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~13.4 MB

Training Techniques

Quantization
  • GPTQ (bits: 5, scope: model)
Architecture
  • XSA: uses XSA-4 as part of the model architecture; parameters: {"variant":4}
  • BigramHash: includes a BigramHash component with 4096 buckets; parameters: {"buckets":4096}
  • MLP3x: uses an expanded MLP with 3.0x width; parameters: {"multiplier":3}
Weight Averaging
  • EMA: parameters: {"decay":0.997}
  • SWA: parameters: null
Optimizer
  • Parallel Muon: weight_decay: null, momentum: null, other_params: null
Compression
  • LZMA: level: null
Evaluation
  • n-gram backoff cache: parameters: {"order":12,"chunk_tokens":256000,"alpha_max":0.7,"hash_primes_added":6}
Other
  • Entropy-adaptive n-gram mixing with per-order multipliers and score-first cache updates after each chunk; parameters: {"score_first":true,"cache_update_timing":"after scoring each chunk","low_order_multiplier":0.3,"high_order_multiplier":2}
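The entropy-adaptive mixing above can be sketched as follows. This is a hypothetical reconstruction, not the PR's code: the function names, the linear per-order schedule, the `entropy_scale` constant, and the linear interpolation form are all assumptions; only `alpha_max` (0.7), the 0.3/2.0 low- and high-order multipliers, and the maximum order (12) come from the listed parameters.

```python
def ngram_mix_weight(model_entropy, order, alpha_max=0.7,
                     low_order_multiplier=0.3, high_order_multiplier=2.0,
                     max_order=12, entropy_scale=4.0):
    """Sketch: scale the n-gram mixing weight by the model's per-token
    entropy and by a per-order multiplier (assumed linear schedule)."""
    # Higher model entropy -> lean more on the n-gram cache, capped at alpha_max.
    base = alpha_max * min(1.0, model_entropy / entropy_scale)
    # Low orders (generic contexts) get damped, high orders (specific
    # matches) get boosted, interpolating between the two multipliers.
    t = (order - 1) / (max_order - 1)  # 0.0 at order 1, 1.0 at order 12
    mult = low_order_multiplier + t * (high_order_multiplier - low_order_multiplier)
    return min(alpha_max, base * mult)

def mix(p_model, p_ngram, alpha):
    """Linear interpolation of model and n-gram probabilities (assumed form)."""
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

The cap at `alpha_max` means that even a fully confident high-order match never contributes more than 70% of the mixed probability.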

Novel Contributions

  • Extended eval-time n-gram backoff from order 9 to order 12
  • Added 6 additional hash primes for the n-gram cache
  • Reduced eval chunk size from 1M to 256K tokens for faster cache refresh
  • Increased alpha_max from 0.60 to 0.70 for stronger high-entropy n-gram mixing
  • Purely eval-time changes with no training modifications
  • Score-first compliant: the cache is updated only after each chunk has been scored
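The contributions above can be sketched as a score-first chunked evaluation loop. This is an illustrative reconstruction, not the PR's implementation: the `BackoffCache`, `ctx_hash`, and `evaluate` names, the count-based cache, and the specific prime values are assumptions (the PR adds 6 hash primes but does not list them); only the order (12), chunk size (256K tokens), and score-then-update timing come from the PR.

```python
from collections import defaultdict

MAX_ORDER = 12          # backoff considers contexts of up to 11 tokens
CHUNK_TOKENS = 256_000  # cache is refreshed after scoring each 256K-token chunk

# Illustrative primes for context hashing; the PR's actual values are not given.
HASH_PRIMES = [1000003, 10000019, 100000007, 998244353, 104729, 67867979]

def ctx_hash(context):
    """Fold a variable-length context tuple into a single cache key."""
    h = 0
    for i, tok in enumerate(context):
        h = (h * HASH_PRIMES[i % len(HASH_PRIMES)] + tok) & ((1 << 61) - 1)
    return h

class BackoffCache:
    """Count-based n-gram cache queried by longest-match backoff."""
    def __init__(self, max_order=MAX_ORDER):
        self.max_order = max_order
        self.counts = {}  # (order, ctx_hash) -> {next_token: count}

    def update(self, tokens):
        """Add every n-gram of order 1..max_order in `tokens` to the cache."""
        for i, tok in enumerate(tokens):
            for order in range(1, self.max_order + 1):
                start = i - (order - 1)
                if start < 0:
                    break
                key = (order, ctx_hash(tuple(tokens[start:i])))
                self.counts.setdefault(key, defaultdict(int))[tok] += 1

    def prob(self, history, tok):
        """Cache probability of `tok` after `history`, backing off from the
        longest context with counts; None if no order has any counts."""
        for order in range(self.max_order, 0, -1):
            if len(history) < order - 1:
                continue
            ctx = tuple(history[len(history) - (order - 1):])
            dist = self.counts.get((order, ctx_hash(ctx)))
            if dist:
                return dist.get(tok, 0) / sum(dist.values())
        return None

def evaluate(tokens, score_chunk, chunk_tokens=CHUNK_TOKENS):
    """Score-first loop: each chunk is scored against a cache built only
    from earlier chunks, and only then folded into the cache."""
    cache = BackoffCache()
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        score_chunk(chunk, cache)  # score first: no leakage from this chunk
        cache.update(chunk)        # then refresh the cache
    return cache
```

Shrinking the chunk from 1M to 256K tokens makes `cache.update` run four times as often over the same stream, so the cache tracks the recent token distribution more closely while the score-first ordering still keeps each chunk out of its own cache.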