PR #840 (open)

Record: 0.2873 BPB — Fine-Grained N-gram Cache (65K chunks)

by quietsmile

val_bpb: 0.2873
Architecture: 11L 512d GQA 8/4 Transformer
Optimizer: Parallel Muon
Artifact Size: ~13.4 MB

Training Techniques

Quantization
  • GPTQ (bits: 5, scope: all)
Architecture
  • GQA: grouped-query attention with 8/4 head configuration (heads: 8/4)
  • MLP3x: MLP width expanded to 3.0x (multiplier: 3)
  • XSA-4: XSA-4 architectural component (variant: 4)
  • LeakyReLU: squared LeakyReLU activation (slope: 0.9, power: 2)
  • BigramHash: bigram hash component for n-gram-related modeling (buckets: 4096)
Optimizer
  • Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Weight Averaging
  • EMA (decay: 0.997)
  • SWA (parameters: null)
Compression
  • lzma (level: null)
Evaluation
  • Fine-grained n-gram cache chunked evaluation (chunk_tokens: 65536, backoff_order: 2-9, hash_buckets: 4000000, score_first: true, cache_update_after_chunk: true)
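The chunked, score-first evaluation listed above can be sketched as follows. This is a minimal illustration, not the PR's code: `model_logprob`, the dict-based hashed cache, and the fixed blend weight `alpha` are all assumptions; only the chunk size, backoff orders 2-9, bucket count, and the score-then-update ordering come from the PR.

```python
import math
from collections import defaultdict

CHUNK_TOKENS = 65_536          # reduced from 1,000,000 in this PR
BACKOFF_ORDERS = range(2, 10)  # backoff_order: 2-9
HASH_BUCKETS = 4_000_000

def ngram_key(ctx, order):
    # hash the last (order - 1) context tokens into a fixed bucket space
    return hash((order, tuple(ctx[-(order - 1):]))) % HASH_BUCKETS

def cache_prob(counts, ctx, tok):
    # back off from the highest matching order to the lowest
    for order in reversed(BACKOFF_ORDERS):
        bucket = counts.get(ngram_key(ctx, order))
        if bucket:
            return bucket[tok] / sum(bucket.values())
    return 0.0

def evaluate(tokens, model_logprob, alpha=0.3, chunk_tokens=CHUNK_TOKENS):
    """Chunked, backward-looking evaluation: score each chunk with the
    current cache, then fold that chunk into the cache (score-first)."""
    counts = defaultdict(lambda: defaultdict(int))  # key -> next-token counts
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) score-first: the cache seen here excludes the current chunk
        for i, tok in enumerate(chunk):
            ctx = tokens[:start + i]
            p_model = math.exp(model_logprob(ctx, tok))
            p_cache = cache_prob(counts, ctx, tok)
            p = (1 - alpha) * p_model + alpha * p_cache if p_cache else p_model
            total_nll += -math.log(p)
        # 2) only after scoring, update the cache with the chunk's n-grams
        for i, tok in enumerate(chunk):
            ctx = tokens[:start + i]
            for order in BACKOFF_ORDERS:
                if len(ctx) >= order - 1:
                    counts[ngram_key(ctx, order)][tok] += 1
    # bits per token (the leaderboard's per-byte normalization is omitted)
    return total_nll / (len(tokens) * math.log(2))
```

Smaller chunks mean the cache lags the scored position by at most 65,536 tokens instead of 1M, which is the frequency effect the contributions below describe.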
Other
  • Entropy-adaptive alpha for the n-gram backoff cache, varying with model confidence and n-gram order
  • Per-order multipliers for the n-gram cache, suppressing low orders (multiplier: 0.3) and boosting high orders (multiplier: 2)
  • Perplexity-sorted shard ordering during training
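The first two "Other" entries can be read as a single blending rule: lean on the cache more when the model is uncertain and when the matched n-gram order is high. A hypothetical sketch follows; the entropy normalization, the order-5 split between "low" and "high", and `base_alpha` are assumptions, since the PR only states the 0.3/2 multipliers.

```python
import math

LOW_ORDER_MULT = 0.3   # suppress low orders (split at order 5 is an assumption)
HIGH_ORDER_MULT = 2.0  # boost high orders

def adaptive_alpha(model_probs, matched_order, base_alpha=0.3, max_alpha=0.9):
    """Blend weight for the n-gram cache, grown when the model is uncertain
    and scaled by the order of the matched n-gram."""
    # model confidence via entropy of its predictive distribution (nats)
    entropy = -sum(p * math.log(p) for p in model_probs if p > 0)
    max_entropy = math.log(len(model_probs))
    uncertainty = entropy / max_entropy      # 0 = confident, 1 = uniform
    mult = HIGH_ORDER_MULT if matched_order >= 5 else LOW_ORDER_MULT
    return min(max_alpha, base_alpha * uncertainty * mult)
```

A confident model (low entropy) drives alpha toward zero, so the cache only intervenes where the model is unsure and a long n-gram match exists.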

Novel Contributions

  • Reducing NGRAM_EVAL_CHUNK_TOKENS from 1,000,000 to 65,536 for much more frequent n-gram cache updates
  • Demonstrating that cache update frequency is the dominant factor in n-gram BPB performance
  • Score-first evaluation where the cache is updated only after each chunk is fully scored
  • Fine-grained backward-looking n-gram cache evaluation without test-time training (TTT) or additional compute
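The perplexity-sorted shard ordering listed under Training Techniques can be sketched as below. The PR gives no parameters, so scoring each shard with a reference model and visiting easiest (lowest-perplexity) shards first are both assumptions.

```python
def order_shards(shards, shard_perplexity):
    """Return training shards sorted by a precomputed reference-model
    perplexity score, ascending (easiest shards first)."""
    scored = [(shard_perplexity(s), s) for s in shards]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored]
```

Sorting once before training fixes a curriculum without adding any per-step cost.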