PR #945
openRecord: Order-16 Frozen N-gram Oracle + Learned Gate + TTT — val_bpb 0.0274 (3-seed mean)
by TimPietrusky
val_bpb
0.0274
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Architecture
BigramHash
Adds a hash-based n-gram embedding/cache component to support token prediction.
parameters: {"vocab":6144,"dim":128}
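A hashed bigram embedding along these lines could look as follows. This is a minimal sketch: `vocab` 6144 and `dim` 128 follow the listed parameters, while the class name, hash constant, and bucket count are illustrative assumptions.

```python
import torch

class HashedBigramEmbedding(torch.nn.Module):
    """Sketch: hash (prev, cur) token pairs into a shared embedding table.

    Token ids come from a vocab of 6144 and embeddings are 128-dim (per the
    listed parameters); num_buckets and the hash constant are illustrative.
    """

    def __init__(self, dim=128, num_buckets=65536):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = torch.nn.Embedding(num_buckets, dim)

    def forward(self, tokens):
        # tokens: (batch, seq); pair each token with its predecessor.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # Cheap multiplicative hash of the (prev, cur) pair into buckets.
        h = (prev * 1000003 + tokens) % self.num_buckets
        return self.table(h)
```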
XSA
Uses XSA-all attention variant.
parameters: null
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16}
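Partial RoPE can be sketched as rotating only the first 16 dimensions of each head (per the listed `dimensions` parameter) and passing the rest through unchanged; the function name and the RoPE base are assumptions.

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first rot_dims of the
    head dimension; the remaining dimensions pass through unrotated.

    x: (batch, heads, seq, head_dim). rot_dims=16 follows the listed
    parameters; base=10000 is the conventional RoPE default.
    """
    seq = x.shape[-2]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()  # (seq, half) each
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```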
VE128
Uses VE128 on later layers.
parameters: {"layers":[9,10]}
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
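One reading of "LeakyReLU squared" in the MLP is to square the activation output; a sketch under that assumption, with `negative_slope=0.5` and the 3.5x width taken from the listed parameters and all other names illustrative:

```python
import torch

class SquaredLeakyMLP(torch.nn.Module):
    """Sketch of an MLP with a squared LeakyReLU activation.

    negative_slope=0.5 and the ~3.5x hidden width follow the listed
    parameters; plain squaring of the activation output is one
    interpretation of "squared", and dim=128 is illustrative.
    """

    def __init__(self, dim=128, multiplier=3.5, negative_slope=0.5):
        super().__init__()
        hidden = int(dim * multiplier)
        self.up = torch.nn.Linear(dim, hidden, bias=False)
        self.down = torch.nn.Linear(hidden, dim, bias=False)
        self.act = torch.nn.LeakyReLU(negative_slope=negative_slope)

    def forward(self, x):
        h = self.act(self.up(x))
        return self.down(h * h)  # square the activation output
```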
KV head count
Configures the number of KV heads relative to the number of attention heads.
parameters: {"heads":8,"kv_heads":8}
MLP3x
Expands the MLP width to about 3.5x the model dimension.
parameters: {"multiplier":3.5}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50}
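The EMA half of the weight averaging can be sketched as a shadow copy of the parameters updated each step with `decay=0.997` (per the listed parameters); SWA would instead keep a running mean of checkpoints taken every `interval` steps. Class and method names here are illustrative.

```python
import torch

class EMAWeights:
    """Sketch: exponential moving average of model weights (decay=0.997)."""

    def __init__(self, model, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.detach().clone().float()
                       for k, v in model.state_dict().items()}

    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(self.decay).add_(v.detach().float(),
                                                 alpha=1 - self.decay)
```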
Quantization
int5
bits: 5
scope: all
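A symmetric per-tensor int5 scheme (levels in [-15, 15]) is one plausible shape for the listed quantization; the record does not specify the exact scheme, so the functions below are an illustrative sketch.

```python
import torch

def quantize_int5(w):
    """Sketch: symmetric per-tensor 5-bit quantization, codes in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1  # 15
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from int5 codes and scale."""
    return q.float() * scale
```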
Compression
zstd
level: null
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.001}
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.001,"adaptive_temperature":[0.9,1.05],"byte_weighted_loss":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
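A warmdown schedule of this kind typically holds the base LR and then decays it linearly to zero over the final `warmdown_steps` (3500 per the listed parameters). A minimal sketch, with the linear shape and function name as assumptions:

```python
def warmdown_lr(step, total_steps, base_lr=1e-3, warmdown_steps=3500):
    """Sketch: constant LR, then linear decay to 0 over the last
    warmdown_steps (3500 per the listed parameters)."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```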
Regularization
magnitude pruning
parameters: {"pruning":"3%"}
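Magnitude pruning at 3% amounts to zeroing the smallest-magnitude 3% of weights in a tensor. A per-tensor sketch (whether the record prunes per tensor or globally is not stated):

```python
import torch

def magnitude_prune(w, fraction=0.03):
    """Sketch: zero the smallest-magnitude `fraction` of weights
    (3% per the listed parameters)."""
    k = int(w.numel() * fraction)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= threshold, torch.zeros_like(w), w)
```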
Other
other
Frozen order-16 n-gram oracle prefilled from training shards and blended with neural predictions via a learned multi-expert gate.
parameters: {"orders":[2,16],"buckets":4000000,"experts":17,"mixer_loss_weight":0.15,"neural_floor":0.05}
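The oracle-plus-gate idea can be sketched as hashed per-order count tables prefilled from training data and frozen, with a learned gate mixing the neural distribution and one expert per n-gram order, and `neural_floor=0.05` guaranteeing the neural expert a minimum share (per the listed parameters). The real record uses orders up to 16, 4M buckets, and 17 experts; everything below is scaled down, and the class, hash, and smoothing choices are illustrative assumptions.

```python
import torch

class NGramOracleGate(torch.nn.Module):
    """Sketch: frozen hashed n-gram oracle + learned multi-expert gate."""

    def __init__(self, vocab=64, dim=32, orders=(2, 3, 4), buckets=4096,
                 neural_floor=0.05):
        super().__init__()
        self.orders, self.buckets = orders, buckets
        self.neural_floor = neural_floor
        # Frozen oracle: per-bucket next-token counts, filled offline.
        self.register_buffer('counts', torch.zeros(buckets, vocab))
        # Gate over [neural expert] + one expert per n-gram order.
        self.gate = torch.nn.Linear(dim, 1 + len(orders))

    def _hash(self, context, order):
        # Hash the last `order` tokens of the context into a bucket id.
        h = torch.zeros_like(context[..., 0])
        for i in range(order):
            h = (h * 1000003 + context[..., -(i + 1)]) % self.buckets
        return (h + order) % self.buckets

    def prefill(self, stream):
        # Build the oracle from a 1-D training token stream, then freeze it.
        for order in self.orders:
            for t in range(order, stream.size(0)):
                self.counts[self._hash(stream[:t], order), stream[t]] += 1

    def forward(self, context, hidden, neural_logits):
        # context: (batch, seq) ints; hidden: (batch, dim);
        # neural_logits: (batch, vocab).
        experts = [torch.softmax(neural_logits, dim=-1)]
        for order in self.orders:
            c = self.counts[self._hash(context, order)]
            experts.append((c + 1) / (c.sum(-1, keepdim=True) + c.size(-1)))
        stacked = torch.stack(experts, dim=-2)              # (batch, E, vocab)
        weights = torch.softmax(self.gate(hidden), dim=-1)  # (batch, E)
        mixed = (weights.unsqueeze(-1) * stacked).sum(-2)
        # Guarantee the neural expert a minimum share of the mixture.
        return self.neural_floor * experts[0] + (1 - self.neural_floor) * mixed
```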
other
Complementary training downweights loss on tokens already well predicted by the oracle.
parameters: {"complement_alpha":0.5,"complement_threshold":0.3}
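The complementary loss can be sketched as scaling per-token loss by `complement_alpha=0.5` wherever the oracle's probability of the true token exceeds `complement_threshold=0.3` (both per the listed parameters), so gradient signal concentrates on oracle-hard tokens; the function name and hard thresholding are illustrative assumptions.

```python
import torch

def complementary_loss(neural_logits, targets, oracle_probs,
                       alpha=0.5, threshold=0.3):
    """Sketch: downweight loss on tokens the frozen oracle already
    predicts well (alpha=0.5, threshold=0.3 per the listed parameters).

    neural_logits: (N, vocab); targets: (N,); oracle_probs: (N, vocab).
    """
    per_token = torch.nn.functional.cross_entropy(
        neural_logits, targets, reduction='none')
    oracle_p_target = oracle_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    weights = torch.where(oracle_p_target > threshold,
                          torch.full_like(per_token, alpha),
                          torch.ones_like(per_token))
    return (weights * per_token).mean()
```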
Novel Contributions
- Order-16 frozen n-gram oracle prefilled from training shards
- Learned multi-expert gate blending neural and per-order n-gram experts
- Complementary training that focuses the neural model on oracle-hard tokens
- Score-first test-time training with adaptive temperature
- Combination of EMA, SWA, and int5 quantization for a compact high-performing submission