PR #924

open

Order-16 Frozen N-gram Oracle + Score-First TTT (0.02801 BPB)

by THUQiXuan
val_bpb
0.0280
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.8 MB

Training Techniques

Architecture
BigramHash
GPU-native multi-order backoff n-gram hashing tables for oracle predictions
parameters: {"orders":"2-16","buckets":4194304}
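A minimal sketch of a multi-order backoff n-gram table, assuming per-order hash tables over a fixed bucket space (bucket count from the PR's parameters; the FNV-style hash, the count-based buckets, and the longest-match-first backoff policy are illustrative, and a GPU implementation would use flat arrays rather than dicts):

```python
# Sketch of a multi-order backoff n-gram table: hash each context suffix of
# order 2..N into a fixed bucket space; at lookup, back off from the longest
# order to shorter ones until a populated bucket is found.
BUCKETS = 4_194_304  # 2**22, matching the PR's "buckets" parameter

def ngram_hash(context, order):
    """FNV-1a style hash of the last `order` tokens (hash choice illustrative)."""
    h = 0xCBF29CE484222325
    for tok in context[-order:]:
        h = ((h ^ tok) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h % BUCKETS

class BackoffNgramTable:
    def __init__(self, max_order=16):
        self.max_order = max_order
        # one mapping of bucket -> {next_token: count} per order
        self.tables = {k: {} for k in range(2, max_order + 1)}

    def update(self, context, next_token):
        """Prefill pass: count next_token under every context suffix order."""
        for k in range(2, min(self.max_order, len(context)) + 1):
            bucket = self.tables[k].setdefault(ngram_hash(context, k), {})
            bucket[next_token] = bucket.get(next_token, 0) + 1

    def predict(self, context):
        """Back off from the longest matching order down to order 2."""
        for k in range(min(self.max_order, len(context)), 1, -1):
            bucket = self.tables[k].get(ngram_hash(context, k))
            if bucket:
                total = sum(bucket.values())
                return {t: c / total for t, c in bucket.items()}
        return None  # no order matched this context
```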
LeakyReLU
LeakyReLU squared activation in the MLP
parameters: {"squared":true}
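The activation can be read as squaring the LeakyReLU output; a tiny sketch, noting that the slope value is an assumption and the PR does not state whether the sign is preserved after squaring:

```python
def leaky_relu_squared(x, negative_slope=0.01):
    # LeakyReLU then square: positive inputs become x**2, negative inputs
    # become (slope * x)**2 and stay near zero. Slope value is illustrative.
    y = x if x > 0 else negative_slope * x
    return y * y
```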
GQA
Grouped query attention
parameters: {"query_heads":8,"kv_heads":4}
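With 8 query heads over 4 KV heads, consecutive query heads share one key/value head, halving the KV cache. A sketch of the head grouping (function names are illustrative):

```python
def kv_head_index(q_head, n_q_heads=8, n_kv_heads=4):
    """Map a query head to its shared KV head under grouped query attention:
    each group of n_q_heads // n_kv_heads consecutive query heads shares one."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def repeat_kv(kv_heads_data, n_q_heads):
    """Expand per-KV-head data so every query head sees its group's K/V."""
    group_size = n_q_heads // len(kv_heads_data)
    return [h for h in kv_heads_data for _ in range(group_size)]
```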
XSA
XSA used across all layers
parameters: {"layers":11}
VE128
Value residual enhancement in later layers
parameters: {"layers":[9,10]}
MLP3x
Three-times MLP stack
parameters: null
RoPE
Partial rotary positional embeddings
parameters: {"dimensions":"16/64"}
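Partial RoPE at 16/64 rotates only the first 16 dimensions of each 64-dim head and passes the rest through unrotated. A sketch, assuming the standard RoPE frequency schedule (the base of 10000 is an assumption):

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dimensions
    (16 of 64 per the PR); the remaining dimensions are left unchanged."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # standard RoPE frequencies
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```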
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
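A sketch of the weight averaging, using the PR's decay and interval. How the EMA and the tight SWA are combined at eval time is not stated, so this keeps them as two separate averaged copies:

```python
class EmaSwaAverager:
    """Maintains an exponential moving average (decay 0.997 per the PR) and a
    stochastic weight average built from snapshots every `swa_interval` steps."""
    def __init__(self, params, decay=0.997, swa_interval=50):
        self.decay = decay
        self.interval = swa_interval
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0
        self.step = 0

    def update(self, params):
        self.step += 1
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if self.step % self.interval == 0:  # "tight" snapshot cadence
            self.swa_sum = [s + p for s, p in zip(self.swa_sum, params)]
            self.swa_count += 1

    def swa(self):
        return [s / self.swa_count for s in self.swa_sum]
```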
Quantization
GPTQ-lite
bits: 6
scope: base model
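For 6-bit quantization of the base model, a minimal round-to-grid sketch (symmetric integer levels -31..31, per-tensor scale). GPTQ proper additionally does error-compensated, column-by-column rounding against the Hessian; that step is omitted here:

```python
def quantize_6bit(weights):
    """Symmetric 6-bit quantization: map floats to integer levels -31..31
    with one per-tensor scale. Round-to-grid only; no GPTQ error feedback."""
    amax = max(abs(w) for w in weights)
    scale = amax / 31 if amax > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale
```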
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
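With stride 64, a sliding-window evaluation scores only the last `stride` tokens of each window, so every scored token gets close to a full window of left context. A sketch of the index bookkeeping (the exact window policy is an assumption; the PR gives only the stride):

```python
def sliding_windows(n_tokens, window=32000, stride=64):
    """Yield (start, end, score_from) triples: the model reads tokens
    [start:end] but only [score_from:end] contribute to BPB, so scored
    spans tile the sequence exactly once."""
    out = []
    end = min(window, n_tokens)
    out.append((0, end, 0))  # first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - window)
        out.append((start, new_end, end))
        end = new_end
    return out
```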
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.001}
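The score-first discipline means each eval chunk's loss is computed with the current weights before any test-time update on that chunk, so a chunk never benefits from training on its own tokens. A sketch with hypothetical `score_fn`/`update_fn` hooks:

```python
def score_first_ttt(chunks, score_fn, update_fn):
    """Score-first test-time training: score each chunk FIRST, then run one
    TTT update on it (one epoch at lr 1e-3 in the PR), so only *later*
    chunks see the adapted weights. Both hooks are hypothetical."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))  # scored before training on this chunk
        update_fn(chunk)                # adapt for subsequent chunks only
    return losses
```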
Sequence Length
sequence_length
train_length: null
eval_length: 32000
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam":true}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
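The LN scale rule initializes layer k's LayerNorm gain at 1/sqrt(k+1), damping deeper layers' residual contributions. A one-liner sketch (0-indexed layers assumed):

```python
import math

def ln_scale_init(n_layers=12):
    """Depth-dependent LayerNorm scale: layer k (0-indexed) starts at
    1/sqrt(k+1), per the PR's "scale" parameter."""
    return [1.0 / math.sqrt(k + 1) for k in range(n_layers)]
```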

Novel Contributions

  • Order-16 frozen n-gram oracle prefilled from all 8B training tokens
  • Score-first TTT where each eval chunk is fully scored before any updates
  • BackoffNgramMixer with GPU-native order-2 through order-16 hashing
  • Complementary training that downweights tokens already well predicted by the oracle
  • Order-16 scaling chosen as the best BPB/eval-time tradeoff under budget
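The complementary-training idea above can be sketched as a per-token loss weight that shrinks when the frozen oracle already assigns high probability to the correct token. The linear form and the floor value are illustrative; the PR states only the downweighting principle:

```python
def complementary_weights(oracle_probs, floor=0.1):
    """Loss weights for complementary training: weight = max(floor,
    1 - p_oracle(correct token)), so the model's capacity is spent on
    tokens the n-gram oracle gets wrong. Floor and form are illustrative."""
    return [max(floor, 1.0 - p) for p in oracle_probs]
```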