PR #834
Record (open): 0.1663 BPB - N-gram-Aware Training + Frozen N-gram Oracle + Backoff TTT
by AnirudhRahul
val_bpb
0.1663
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.76 MB
Training Techniques
Architecture
Linear gate head
Adds a learned multi-expert routing head (Linear 512->7) on top of the transformer to mix neural and n-gram experts.
parameters: {"input_dim":512,"output_dim":7}
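A minimal sketch of how a Linear 512->7 gate head could combine seven expert distributions. Only the 512->7 shape and the expert count come from the record; the softmax mixing rule and the 1-neural-plus-6-n-gram split are assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a flat list of logits
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mix_experts(gate_logits, expert_probs):
    # gate_logits: 7 values from the Linear 512->7 head, one per expert
    # expert_probs: 7 next-token distributions over the vocab, e.g. one
    # from the neural model and six from n-gram orders 2-7 (assumed split)
    w = softmax(gate_logits)
    vocab = len(expert_probs[0])
    return [sum(w[e] * expert_probs[e][t] for e in range(len(w)))
            for t in range(vocab)]
```

With equal gate logits this reduces to a uniform average of the experts.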
BigramHash
Uses a backoff n-gram mixer with hashed count tables for n-gram experts.
parameters: {"orders":[2,3,4,5,6,7]}
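A sketch of a backoff n-gram predictor over hashed count tables, one table per order 2-7 as listed above. The hash function, table size, and backoff-to-uniform fallback are illustrative assumptions, not the record's implementation.

```python
def ngram_hash(context, table_size):
    # simple polynomial rolling hash of a token context (illustrative)
    h = 0
    for t in context:
        h = (h * 1000003 + t) % table_size
    return h

class BackoffNgram:
    def __init__(self, orders=(2, 3, 4, 5, 6, 7), table_size=1 << 20, vocab=256):
        self.orders = orders
        self.table_size = table_size
        self.vocab = vocab
        # one hashed (context -> token counts) table per n-gram order
        self.counts = {n: {} for n in orders}

    def update(self, tokens):
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                key = ngram_hash(tuple(tokens[i - n + 1:i]), self.table_size)
                row = self.counts[n].setdefault(key, [0] * self.vocab)
                row[tokens[i]] += 1

    def predict(self, context):
        # back off from the highest order with counts down to order 2
        for n in sorted(self.orders, reverse=True):
            if len(context) < n - 1:
                continue
            key = ngram_hash(tuple(context[-(n - 1):]), self.table_size)
            row = self.counts[n].get(key)
            if row and sum(row) > 0:
                total = sum(row)
                return [c / total for c in row]
        return [1.0 / self.vocab] * self.vocab  # uniform fallback
```

The record's GPU-native version would batch these lookups; this scalar form only shows the backoff logic.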
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"combined_with":"Adam","ema":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
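The EMA update with the record's decay of 0.997, sketched over flat parameter lists (the real implementation would operate on model tensors):

```python
def ema_update(avg_params, params, decay=0.997):
    # one EMA step per training update: avg <- decay * avg + (1 - decay) * current
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, params)]
```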
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"chunk_tokens":1048576}
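One plausible reading of sliding-window eval with stride 64 over a 1,048,576-token chunk: overlapping windows tile the chunk, and each window scores only its last `stride` tokens so every token is predicted with long context. The exact windowing is an assumption.

```python
def sliding_eval_spans(n_tokens, window, stride):
    # Returns (ctx_start, score_start, score_end) triples: each window
    # reads context from ctx_start but only scores [score_start, score_end),
    # so the scored spans tile the chunk exactly once.
    spans = []
    begin = 0
    while begin < n_tokens:
        ctx_start = max(0, begin - (window - stride))
        end = min(begin + stride, n_tokens)
        spans.append((ctx_start, begin, end))
        begin = end
    return spans
```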
Test-Time Training
score-first TTT
parameters: {"epochs":1,"freeze_blocks":1,"learning_rate":0.00003}
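"Score-first" TTT plausibly means each chunk is scored with the current weights before the model adapts on it, keeping the evaluation causal. A schematic loop under that assumption, with `score_fn` and `update_fn` as hypothetical stand-ins:

```python
def score_first_ttt(chunks, score_fn, update_fn):
    # Score each chunk BEFORE training on it, so no chunk is ever
    # evaluated by weights that have already seen it (causal TTT).
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(chunk)  # evaluate first
        update_fn(chunk)               # then adapt on the same chunk
    return total_loss
```

In the record, `update_fn` would correspond to 1 epoch at lr 3e-5 with the first block frozen.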
Sequence Length
sequence_length
train_length: null
eval_length: 1048576
LR Schedule
cosine decay
parameters: {"across_chunks":true}
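A standard cosine-decay schedule; `across_chunks: true` suggests one continuous schedule spanning all TTT chunks rather than restarting per chunk (that interpretation is an assumption).

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # single cosine decay from lr_max to lr_min over the whole run,
    # where total_steps counts steps across every chunk combined
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```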
Regularization
layerwise LN scale
parameters: null
Other
other
Frozen n-gram oracle precomputed from training data and kept read-only during training to enable efficient gate learning.
parameters: {"prefill_counted_in_wallclock":true}
other
Learned multi-expert gate trained directly on next-token likelihood using a mixed probability objective over neural and n-gram experts.
parameters: {"experts":7,"mixer_loss_weight":0.1,"neural_floor":0.05}
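A sketch of the mixed probability objective: negative log-likelihood of the target token under the gated expert mixture, with the neural expert's weight floored at 0.05. Expert 0 being the neural model, the renormalization mechanics of the floor, and how the term enters the total loss are all assumptions; only `experts: 7`, `neural_floor: 0.05`, and `mixer_loss_weight: 0.1` come from the record.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mixture_nll(gate_logits, target_probs, neural_floor=0.05):
    # target_probs[e] = probability expert e assigns to the true next
    # token; expert 0 is taken to be the neural model (an assumption)
    w = softmax(gate_logits)
    if w[0] < neural_floor:
        # floor the neural weight, renormalizing the n-gram experts
        # over the remaining mass (assumed mechanics)
        rest = sum(w[1:])
        w = [neural_floor] + [wi * (1 - neural_floor) / rest for wi in w[1:]]
    return -math.log(sum(wi * pi for wi, pi in zip(w, target_probs)))
```

With `mixer_loss_weight: 0.1` this term would plausibly be added as `total = neural_loss + 0.1 * mixture_nll(...)`, though the record does not spell out the combination.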
Novel Contributions
- Learned multi-expert gate that replaces a hand-crafted entropy heuristic for routing between neural and n-gram experts
- Frozen n-gram oracle precomputed from training data to make gate training efficient within the wallclock budget
- Direct optimization of the gate using next-token likelihood over a mixture of experts
- Backoff TTT with score-first causal evaluation using a fresh validation cache
- GPU-native backoff n-gram mixer implementation