PR #869

open

Record: N-gram Two-Pass Score-First Evaluation (0.1290 BPB)

by THUQiXuan
val_bpb: 0.1290
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.5 MB

Training Techniques

Quantization
GPTQ
bits: 5
scope: all
Architecture
weight tying
Tied input and output embeddings.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
SWA
parameters: null
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Other
other
Two-pass score-first N-gram evaluation with full-cache rescoring over all chunks.
parameters: {"chunks":63,"max_order":9,"buckets":4194304}
other
Order-Adaptive Entropy Gating (OAEG) for mixing neural and N-gram predictions.
parameters: {"alpha_max":0.7,"order_mults":[0.3,0.3,0.97,2,2,2,2,2]}
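
The OAEG entry above gives only its parameters, not its formula. A minimal sketch of one plausible reading follows. The mixing weight grows with the normalized entropy of the neural distribution, is scaled by a per-order multiplier, and is clipped to [0, 1]. The function names, the clipping, the linear interpolation, and the mapping of the eight multipliers to orders 2 through 9 are all assumptions made for illustration, not the PR's confirmed method.

```python
import math

ALPHA_MAX = 0.7                                 # "alpha_max" from the PR metadata
ORDER_MULTS = [0.3, 0.3, 0.97, 2, 2, 2, 2, 2]   # "order_mults" from the PR metadata
MIN_ORDER = 2  # assumption: the 8 multipliers cover n-gram orders 2..9


def normalized_entropy(probs):
    """Shannon entropy of probs divided by its maximum, so the result is in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def mixing_weight(neural_probs, matched_order):
    """Hypothetical gate: more weight on the n-gram when the neural model is unsure."""
    mult = ORDER_MULTS[matched_order - MIN_ORDER]
    alpha = ALPHA_MAX * mult * normalized_entropy(neural_probs)
    return min(max(alpha, 0.0), 1.0)  # assumption: clip to a valid mixture weight


def mix(neural_probs, ngram_probs, matched_order):
    """Linear interpolation between the neural and n-gram distributions."""
    a = mixing_weight(neural_probs, matched_order)
    return [(1.0 - a) * pn + a * pg for pn, pg in zip(neural_probs, ngram_probs)]
```

Under this reading, a confident neural prediction (entropy near zero) is left almost untouched, while a long matched context with a multiplier of 2 can dominate the mixture when the neural model is uncertain.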

Novel Contributions

  • Score-first two-pass N-gram evaluation with full-cache rescoring
  • Legal use of validation-data N-gram cache built sequentially before rescoring
  • Order-Adaptive Entropy Gating to mix neural and N-gram probabilities
  • Evaluation stride increased to 64 for faster inference with unchanged BPB
  • 9-gram cache over all 63 validation chunks
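
The score-first, two-pass idea in the list above can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the PR's code. The class `HashedNgramCache`, the function `two_pass_scores`, the backoff to the longest matching context, and the choice to keep contexts within a chunk are all hypothetical. Only `max_order` and `buckets` come from the PR metadata. In pass one each token is scored against the cache before it is added to it. In pass two every chunk is rescored against the completed cache.

```python
NUM_BUCKETS = 4_194_304  # "buckets" value from the PR metadata
MAX_ORDER = 9            # "max_order" value from the PR metadata


class HashedNgramCache:
    """Hypothetical count cache; n-grams are hashed into a fixed bucket range."""

    def __init__(self, max_order=MAX_ORDER, num_buckets=NUM_BUCKETS):
        self.max_order = max_order
        self.num_buckets = num_buckets
        self.ctx_counts = {}   # bucket -> times a context was seen
        self.full_counts = {}  # bucket -> times context + next token was seen

    def _bucket(self, order, tokens):
        # hash() of a tuple of ints is not salted in CPython, so this is repeatable.
        return hash((order, tokens)) % self.num_buckets

    def _orders(self, history):
        # Usable orders, longest first; order n needs n - 1 context tokens.
        return range(min(self.max_order, len(history) + 1), 1, -1)

    def score(self, history, token):
        """Return (matched_order, probability), or (0, None) if no context matches."""
        history = tuple(history)
        for order in self._orders(history):
            ctx = history[len(history) - (order - 1):]
            ctx_n = self.ctx_counts.get(self._bucket(order, ctx), 0)
            if ctx_n > 0:
                full_n = self.full_counts.get(self._bucket(order, ctx + (token,)), 0)
                return order, min(1.0, full_n / ctx_n)  # guard against collisions
        return 0, None

    def update(self, history, token):
        """Add the token and its contexts to the cache for every usable order."""
        history = tuple(history)
        for order in self._orders(history):
            ctx = history[len(history) - (order - 1):]
            for table, key in ((self.ctx_counts, ctx),
                               (self.full_counts, ctx + (token,))):
                b = self._bucket(order, key)
                table[b] = table.get(b, 0) + 1


def two_pass_scores(chunks, max_order=MAX_ORDER):
    cache = HashedNgramCache(max_order=max_order)
    first, second = [], []
    # Pass 1 (score-first): score each token, then add it to the cache.
    for chunk in chunks:
        for i, tok in enumerate(chunk):
            history = chunk[max(0, i - (max_order - 1)):i]
            first.append(cache.score(history, tok))
            cache.update(history, tok)
    # Pass 2: rescore every chunk against the completed cache.
    for chunk in chunks:
        for i, tok in enumerate(chunk):
            history = chunk[max(0, i - (max_order - 1)):i]
            second.append(cache.score(history, tok))
    return first, second
```

For example, `two_pass_scores([[1, 2, 3], [1, 2, 3]])` finds no match anywhere in the first chunk during pass one, while pass two matches the later tokens of both chunks because the cache is then complete.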