PR #925

open

Record: Frozen N-gram Oracle (Order-16) + Score-First TTT (0.02807 BPB)

by THUQiXuan
val_bpb: 0.0281
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.9 MB

Training Techniques

Architecture
BigramHash
GPU-native multi-order n-gram backoff oracle with hashed count tables for context-based prediction.
parameters: {"buckets":4194304,"max_order":16,"orders":"2-16"}
LeakyReLU
Uses LeakyReLU squared activation in the MLP stack.
parameters: {"squared":true}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
XSA
XSA applied across all layers.
parameters: {"layers":11}
VE128
Value residual enhancement used in later layers.
parameters: {"layers":[9,10]}
MLP3x
MLP stack with a 3x multiplier.
parameters: {"multiplier":3}
BigramHash
Bigram hash component used in the base architecture.
parameters: {"size":6144}
Gated Attention
Multi-expert alpha head mixes neural and n-gram experts via learned softmax gating.
parameters: {"experts":16,"hidden_size":512}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw":true}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
Compression
zlib
level: null
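The 12.9 MB artifact size presumably refers to a zlib-compressed checkpoint. The compression level is not given (null above), so the level used below is an assumption; a minimal sketch:

```python
import io
import zlib
import torch

def save_compressed(model, path, level=9):
    """Serialize the state dict and zlib-compress it (sketch; level assumed)."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue(), level))

def load_compressed(model, path):
    """Inverse of save_compressed: decompress and load the state dict."""
    with open(path, "rb") as f:
        blob = zlib.decompress(f.read())
    model.load_state_dict(torch.load(io.BytesIO(blob)))
    return model
```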
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.001}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
weight decay
parameters: null

Novel Contributions

  • Order-16 frozen n-gram oracle prefilled from all training tokens
  • 4M-bucket GPU-native backoff n-gram tables
  • Learned multi-expert alpha head to mix neural and n-gram experts
  • Complementary training that downweights already well-predicted tokens (see the sketch after this list)
  • Score-first test-time training evaluation protocol
  • Order-16 scaling with budget-aware evaluation
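The complementary-training bullet above can be illustrated with a focal-style loss weighting: tokens to which the frozen n-gram oracle already assigns high probability receive smaller weight in the neural model's cross-entropy. The weighting function and the `gamma` exponent are assumptions, not the PR's formula.

```python
import torch
import torch.nn.functional as F

def complementary_loss(logits, targets, oracle_probs, gamma=1.0):
    """Cross-entropy reweighted against a frozen oracle's confidence (sketch).

    logits:       (batch, seq, vocab) neural model outputs
    targets:      (batch, seq) true next tokens
    oracle_probs: (batch, seq, vocab) frozen n-gram oracle distributions
    """
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="none")
    # Oracle probability assigned to each true next token.
    p_oracle = oracle_probs.reshape(-1, oracle_probs.size(-1)) \
                           .gather(1, targets.reshape(-1, 1)).squeeze(1)
    # Downweight tokens the oracle already predicts well (focal-style weighting).
    weights = (1.0 - p_oracle).clamp(min=0.0) ** gamma
    return (weights * nll).sum() / weights.sum().clamp(min=1e-8)
```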