PR #883

open

Order-13 N-gram Oracle + Score-First TTT (0.0308 BPB)

by THUQiXuan
val_bpb: 0.0308
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 3.66 MB

Training Techniques

Architecture
BackoffNgramMixer
GPU-vectorized logistic context mixer combining neural logits with order-2 through order-13 n-gram backoff probabilities.
parameters: {"max_order":13,"experts":13}
Test-Time Training
score-first TTT
parameters: {"phases":2}
Evaluation
sliding window eval
parameters: null
Quantization
int6
bits: 6
scope: final artifact
Compression
custom
level: null
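
The BackoffNgramMixer entry above describes combining neural predictions with n-gram probabilities, but this summary does not give the mixing formula. The following is a minimal sketch of one common approach, log-linear mixing (a multi-token generalization of logistic mixing), and is not taken from the PR. The function name `mix_log_linear`, the fixed weights, and the toy three-token vocabulary are assumptions for illustration; the PR's mixer is described as GPU-vectorized, whereas this version is plain scalar Python.

```python
import math

def mix_log_linear(expert_probs, weights, eps=1e-9):
    """Mix expert distributions in the log domain, then renormalize.

    expert_probs: list of distributions, each a list of per-token probabilities.
    weights: one mixing weight per expert (hypothetical, fixed values here;
             a real mixer would typically learn them).
    """
    vocab_size = len(expert_probs[0])
    # Weighted sum of log-probabilities for every candidate token.
    scores = [
        sum(w * math.log(max(p[t], eps)) for w, p in zip(weights, expert_probs))
        for t in range(vocab_size)
    ]
    # Numerically stable softmax over the mixed scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a neural distribution and one n-gram distribution.
neural = [0.5, 0.3, 0.2]
ngram = [0.9, 0.05, 0.05]
mixed = mix_log_linear([neural, ngram], weights=[0.5, 0.5])
```

With equal weights, the mixed distribution leans toward token 0, which both experts favor; setting the n-gram weight to zero recovers the neural distribution unchanged.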

Novel Contributions

  • Pre-filling order-2 through order-13 n-gram tables from the full training set before the training loop
  • Score-first test-time training, in which each validation chunk is fully scored before the model's weights are updated on it
  • A pretrained n-gram oracle passed into evaluation to eliminate cold-start behavior
  • GPU-vectorized backoff n-gram mixer combining neural and n-gram predictions
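
The contributions listed above can be sketched as follows. This is an illustrative toy under stated assumptions, not the PR's implementation: `prefill_ngram_tables`, `backoff_prob`, and `score_first_eval` are hypothetical names, the add-one smoothing and "longest matching context wins" backoff rule are assumptions, and the update step is a placeholder for the gradient steps that real test-time training would perform.

```python
import math
from collections import defaultdict

def prefill_ngram_tables(tokens, min_order=2, max_order=13):
    """Count next-token occurrences for every order in one pass over the data.

    An order-n table maps the previous n-1 tokens to next-token counts.
    """
    tables = {n: defaultdict(lambda: defaultdict(int))
              for n in range(min_order, max_order + 1)}
    for i, tok in enumerate(tokens):
        for n in range(min_order, max_order + 1):
            ctx_len = n - 1
            if i >= ctx_len:
                tables[n][tuple(tokens[i - ctx_len:i])][tok] += 1
    return tables

def backoff_prob(tables, history, token, vocab_size, min_order=2, max_order=13):
    """P(token | history) from the longest context already seen, else uniform."""
    for n in range(max_order, min_order - 1, -1):
        ctx_len = n - 1
        if len(history) < ctx_len:
            continue
        counts = tables[n].get(tuple(history[-ctx_len:]))
        if counts:
            # Add-one smoothing (an assumption) keeps unseen tokens non-zero.
            return (counts.get(token, 0) + 1) / (sum(counts.values()) + vocab_size)
    return 1.0 / vocab_size

def score_first_eval(chunks, score_chunk, update_on_chunk):
    """Score each chunk with the current state, and only then adapt on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += score_chunk(chunk)   # 1. score first
        total_tokens += len(chunk)
        update_on_chunk(chunk)            # 2. update afterwards
    return total_nll / total_tokens

# Pre-fill the tables from toy "training" tokens before any evaluation.
train = [0, 1, 2] * 3
tables = prefill_ngram_tables(train, max_order=4)
p = backoff_prob(tables, [0, 1], 2, vocab_size=3, max_order=4)

log = []

def score_chunk(chunk):
    log.append("score")
    hist, nll = [], 0.0
    for tok in chunk:
        nll -= math.log(backoff_prob(tables, hist, tok, 3, max_order=4))
        hist.append(tok)
    return nll

def update_on_chunk(chunk):
    log.append("update")  # placeholder for real TTT gradient steps

avg_nll = score_first_eval([[0, 1, 2], [0, 1, 2]], score_chunk, update_on_chunk)
```

Because the tables are filled before evaluation starts, the first validation tokens already have non-trivial n-gram predictions, and the `log` list records that every chunk is scored before the corresponding update runs.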