PR #834
Record (open): 0.1663 BPB - N-gram-Aware Training + Frozen N-gram Oracle + Backoff TTT
by AnirudhRahul
val_bpb
0.1663
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.76 MB
Training Techniques
Architecture
Linear gate head
Adds a learned multi-expert routing head (Linear 512->7) on top of the transformer to mix neural and n-gram experts.
parameters: {"input_dim":512,"output_dim":7}
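A minimal sketch of how a Linear 512->7 gate head could combine seven expert distributions. Only the 512->7 shape and the expert count come from the record; the softmax mixing rule and the 1-neural-plus-6-n-gram split are assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a flat list of logits
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mix_experts(gate_logits, expert_probs):
    # gate_logits: 7 values from the Linear 512->7 head, one per expert
    # expert_probs: 7 next-token distributions over the vocab, e.g. one
    # from the neural model and six from n-gram orders 2-7 (assumed split)
    w = softmax(gate_logits)
    vocab = len(expert_probs[0])
    return [sum(w[e] * expert_probs[e][t] for e in range(len(w)))
            for t in range(vocab)]
```

With equal gate logits this reduces to a uniform average of the experts.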
BigramHash
Uses a backoff n-gram mixer with hashed count tables for n-gram experts.
parameters: {"orders":[2,3,4,5,6,7]}
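A sketch of a backoff n-gram predictor over hashed count tables, one table per order 2-7 as listed above. The hash function, table size, and backoff-to-uniform fallback are illustrative assumptions, not the record's implementation.

```python
def ngram_hash(context, table_size):
    # simple polynomial rolling hash of a token context (illustrative)
    h = 0
    for t in context:
        h = (h * 1000003 + t) % table_size
    return h

class BackoffNgram:
    def __init__(self, orders=(2, 3, 4, 5, 6, 7), table_size=1 << 20, vocab=256):
        self.orders = orders
        self.table_size = table_size
        self.vocab = vocab
        # one hashed (context -> token counts) table per n-gram order
        self.counts = {n: {} for n in orders}

    def update(self, tokens):
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                key = ngram_hash(tuple(tokens[i - n + 1:i]), self.table_size)
                row = self.counts[n].setdefault(key, [0] * self.vocab)
                row[tokens[i]] += 1

    def predict(self, context):
        # back off from the highest order with counts down to order 2
        for n in sorted(self.orders, reverse=True):
            if len(context) < n - 1:
                continue
            key = ngram_hash(tuple(context[-(n - 1):]), self.table_size)
            row = self.counts[n].get(key)
            if row and sum(row) > 0:
                total = sum(row)
                return [c / total for c in row]
        return [1.0 / self.vocab] * self.vocab  # uniform fallback
```

The record's GPU-native version would batch these lookups; this scalar form only shows the backoff logic.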
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"combined_with":"Adam","ema":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
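The EMA update with the record's decay of 0.997, sketched over flat parameter lists (the real implementation would operate on model tensors):

```python
def ema_update(avg_params, params, decay=0.997):
    # one EMA step per training update: avg <- decay * avg + (1 - decay) * current
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, params)]
```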
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"chunk_tokens":1048576}
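One plausible reading of sliding-window eval with stride 64 over a 1,048,576-token chunk: overlapping windows tile the chunk, and each window scores only its last `stride` tokens so every token is predicted with long context. The exact windowing is an assumption.

```python
def sliding_eval_spans(n_tokens, window, stride):
    # Returns (ctx_start, score_start, score_end) triples: each window
    # reads context from ctx_start but only scores [score_start, score_end),
    # so the scored spans tile the chunk exactly once.
    spans = []
    begin = 0
    while begin < n_tokens:
        ctx_start = max(0, begin - (window - stride))
        end = min(begin + stride, n_tokens)
        spans.append((ctx_start, begin, end))
        begin = end
    return spans
```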
Test-Time Training
score-first TTT
parameters: {"epochs":1,"freeze_blocks":1,"learning_rate":0.00003}
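"Score-first" TTT plausibly means each chunk is scored with the current weights before the model adapts on it, keeping the evaluation causal. A schematic loop under that assumption, with `score_fn` and `update_fn` as hypothetical stand-ins:

```python
def score_first_ttt(chunks, score_fn, update_fn):
    # Score each chunk BEFORE training on it, so no chunk is ever
    # evaluated by weights that have already seen it (causal TTT).
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(chunk)  # evaluate first
        update_fn(chunk)               # then adapt on the same chunk
    return total_loss
```

In the record, `update_fn` would correspond to 1 epoch at lr 3e-5 with the first block frozen.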
Sequence Length
sequence_length
train_length: null
eval_length: 1048576
LR Schedule
cosine decay
parameters: {"across_chunks":true}
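A standard cosine-decay schedule; `across_chunks: true` suggests one continuous schedule spanning all TTT chunks rather than restarting per chunk (that interpretation is an assumption).

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # single cosine decay from lr_max to lr_min over the whole run,
    # where total_steps counts steps across every chunk combined
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```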
Regularization
layerwise LN scale
parameters: null
Other
other
Frozen n-gram oracle precomputed from training data and kept read-only during training to enable efficient gate learning.
parameters: {"prefill_counted_in_wallclock":true}
other
Learned multi-expert gate trained directly on next-token likelihood using a mixed probability objective over neural and n-gram experts.
parameters: {"experts":7,"mixer_loss_weight":0.1,"neural_floor":0.05}
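A sketch of the mixed probability objective: negative log-likelihood of the target token under the gated expert mixture, with the neural expert's weight floored at 0.05. Expert 0 being the neural model, the renormalization mechanics of the floor, and how the term enters the total loss are all assumptions; only `experts: 7`, `neural_floor: 0.05`, and `mixer_loss_weight: 0.1` come from the record.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mixture_nll(gate_logits, target_probs, neural_floor=0.05):
    # target_probs[e] = probability expert e assigns to the true next
    # token; expert 0 is taken to be the neural model (an assumption)
    w = softmax(gate_logits)
    if w[0] < neural_floor:
        # floor the neural weight, renormalizing the n-gram experts
        # over the remaining mass (assumed mechanics)
        rest = sum(w[1:])
        w = [neural_floor] + [wi * (1 - neural_floor) / rest for wi in w[1:]]
    return -math.log(sum(wi * pi for wi, pi in zip(w, target_probs)))
```

With `mixer_loss_weight: 0.1` this term would plausibly be added as `total = neural_loss + 0.1 * mixture_nll(...)`, though the record does not spell out the combination.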
Novel Contributions
- Learned multi-expert gate that replaces a hand-crafted entropy heuristic for routing between neural and n-gram experts
- Frozen n-gram oracle precomputed from training data to make gate training efficient within the wallclock budget
- Direct optimization of the gate using next-token likelihood over a mixture of experts
- Backoff TTT with score-first causal evaluation using a fresh validation cache
- GPU-native backoff n-gram mixer implementation