PR #945
openRecord: Order-16 Frozen N-gram Oracle + Learned Gate + TTT — val_bpb 0.0274 (3-seed mean)
by TimPietrusky
val_bpb
0.0274
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Architecture
BigramHash
Adds a hash-based n-gram embedding/cache component to support token prediction.
parameters: {"vocab":6144,"dim":128}
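A hashed bigram embedding along these lines could look as follows. This is a minimal sketch: `vocab` 6144 and `dim` 128 follow the listed parameters, while the class name, hash constant, and bucket count are illustrative assumptions.

```python
import torch

class HashedBigramEmbedding(torch.nn.Module):
    """Sketch: hash (prev, cur) token pairs into a shared embedding table.

    Token ids come from a vocab of 6144 and embeddings are 128-dim (per the
    listed parameters); num_buckets and the hash constant are illustrative.
    """

    def __init__(self, dim=128, num_buckets=65536):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = torch.nn.Embedding(num_buckets, dim)

    def forward(self, tokens):
        # tokens: (batch, seq); pair each token with its predecessor.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # Cheap multiplicative hash of the (prev, cur) pair into buckets.
        h = (prev * 1000003 + tokens) % self.num_buckets
        return self.table(h)
```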
XSA
Uses XSA-all attention variant.
parameters: null
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16}
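Partial RoPE can be sketched as rotating only the first 16 dimensions of each head (per the listed `dimensions` parameter) and passing the rest through unchanged; the function name and the RoPE base are assumptions.

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first rot_dims of the
    head dimension; the remaining dimensions pass through unrotated.

    x: (batch, heads, seq, head_dim). rot_dims=16 follows the listed
    parameters; base=10000 is the conventional RoPE default.
    """
    seq = x.shape[-2]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()  # (seq, half) each
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```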
VE128
Uses VE128 on later layers.
parameters: {"layers":[9,10]}
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
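One reading of "LeakyReLU squared" in the MLP is to square the activation output; a sketch under that assumption, with `negative_slope=0.5` and the 3.5x width taken from the listed parameters and all other names illustrative:

```python
import torch

class SquaredLeakyMLP(torch.nn.Module):
    """Sketch of an MLP with a squared LeakyReLU activation.

    negative_slope=0.5 and the ~3.5x hidden width follow the listed
    parameters; plain squaring of the activation output is one
    interpretation of "squared", and dim=128 is illustrative.
    """

    def __init__(self, dim=128, multiplier=3.5, negative_slope=0.5):
        super().__init__()
        hidden = int(dim * multiplier)
        self.up = torch.nn.Linear(dim, hidden, bias=False)
        self.down = torch.nn.Linear(hidden, dim, bias=False)
        self.act = torch.nn.LeakyReLU(negative_slope=negative_slope)

    def forward(self, x):
        h = self.act(self.up(x))
        return self.down(h * h)  # square the activation output
```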
KV head count
Configures the number of KV heads relative to the number of attention heads.
parameters: {"heads":8,"kv_heads":8}
MLP3x
Expands the MLP width to about 3.5x the model dimension.
parameters: {"multiplier":3.5}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50}
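The EMA half of the weight averaging can be sketched as a shadow copy of the parameters updated each step with `decay=0.997` (per the listed parameters); SWA would instead keep a running mean of checkpoints taken every `interval` steps. Class and method names here are illustrative.

```python
import torch

class EMAWeights:
    """Sketch: exponential moving average of model weights (decay=0.997)."""

    def __init__(self, model, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.detach().clone().float()
                       for k, v in model.state_dict().items()}

    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(self.decay).add_(v.detach().float(),
                                                 alpha=1 - self.decay)
```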
Quantization
int5
bits: 5
scope: all
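A symmetric per-tensor int5 scheme (levels in [-15, 15]) is one plausible shape for the listed quantization; the record does not specify the exact scheme, so the functions below are an illustrative sketch.

```python
import torch

def quantize_int5(w):
    """Sketch: symmetric per-tensor 5-bit quantization, codes in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1  # 15
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from int5 codes and scale."""
    return q.float() * scale
```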
Compression
zstd
level: null
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.001}
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.001,"adaptive_temperature":[0.9,1.05],"byte_weighted_loss":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
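A warmdown schedule of this kind typically holds the base LR and then decays it linearly to zero over the final `warmdown_steps` (3500 per the listed parameters). A minimal sketch, with the linear shape and function name as assumptions:

```python
def warmdown_lr(step, total_steps, base_lr=1e-3, warmdown_steps=3500):
    """Sketch: constant LR, then linear decay to 0 over the last
    warmdown_steps (3500 per the listed parameters)."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```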
Regularization
magnitude pruning
parameters: {"pruning":"3%"}
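Magnitude pruning at 3% amounts to zeroing the smallest-magnitude 3% of weights in a tensor. A per-tensor sketch (whether the record prunes per tensor or globally is not stated):

```python
import torch

def magnitude_prune(w, fraction=0.03):
    """Sketch: zero the smallest-magnitude `fraction` of weights
    (3% per the listed parameters)."""
    k = int(w.numel() * fraction)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= threshold, torch.zeros_like(w), w)
```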
Other
other
Frozen order-16 n-gram oracle prefilled from training shards and blended with neural predictions via a learned multi-expert gate.
parameters: {"orders":[2,16],"buckets":4000000,"experts":17,"mixer_loss_weight":0.15,"neural_floor":0.05}
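The oracle-plus-gate idea can be sketched as hashed per-order count tables prefilled from training data and frozen, with a learned gate mixing the neural distribution and one expert per n-gram order, and `neural_floor=0.05` guaranteeing the neural expert a minimum share (per the listed parameters). The real record uses orders up to 16, 4M buckets, and 17 experts; everything below is scaled down, and the class, hash, and smoothing choices are illustrative assumptions.

```python
import torch

class NGramOracleGate(torch.nn.Module):
    """Sketch: frozen hashed n-gram oracle + learned multi-expert gate."""

    def __init__(self, vocab=64, dim=32, orders=(2, 3, 4), buckets=4096,
                 neural_floor=0.05):
        super().__init__()
        self.orders, self.buckets = orders, buckets
        self.neural_floor = neural_floor
        # Frozen oracle: per-bucket next-token counts, filled offline.
        self.register_buffer('counts', torch.zeros(buckets, vocab))
        # Gate over [neural expert] + one expert per n-gram order.
        self.gate = torch.nn.Linear(dim, 1 + len(orders))

    def _hash(self, context, order):
        # Hash the last `order` tokens of the context into a bucket id.
        h = torch.zeros_like(context[..., 0])
        for i in range(order):
            h = (h * 1000003 + context[..., -(i + 1)]) % self.buckets
        return (h + order) % self.buckets

    def prefill(self, stream):
        # Build the oracle from a 1-D training token stream, then freeze it.
        for order in self.orders:
            for t in range(order, stream.size(0)):
                self.counts[self._hash(stream[:t], order), stream[t]] += 1

    def forward(self, context, hidden, neural_logits):
        # context: (batch, seq) ints; hidden: (batch, dim);
        # neural_logits: (batch, vocab).
        experts = [torch.softmax(neural_logits, dim=-1)]
        for order in self.orders:
            c = self.counts[self._hash(context, order)]
            experts.append((c + 1) / (c.sum(-1, keepdim=True) + c.size(-1)))
        stacked = torch.stack(experts, dim=-2)              # (batch, E, vocab)
        weights = torch.softmax(self.gate(hidden), dim=-1)  # (batch, E)
        mixed = (weights.unsqueeze(-1) * stacked).sum(-2)
        # Guarantee the neural expert a minimum share of the mixture.
        return self.neural_floor * experts[0] + (1 - self.neural_floor) * mixed
```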
other
Complementary training downweights loss on tokens already well predicted by the oracle.
parameters: {"complement_alpha":0.5,"complement_threshold":0.3}
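The complementary loss can be sketched as scaling per-token loss by `complement_alpha=0.5` wherever the oracle's probability of the true token exceeds `complement_threshold=0.3` (both per the listed parameters), so gradient signal concentrates on oracle-hard tokens; the function name and hard thresholding are illustrative assumptions.

```python
import torch

def complementary_loss(neural_logits, targets, oracle_probs,
                       alpha=0.5, threshold=0.3):
    """Sketch: downweight loss on tokens the frozen oracle already
    predicts well (alpha=0.5, threshold=0.3 per the listed parameters).

    neural_logits: (N, vocab); targets: (N,); oracle_probs: (N, vocab).
    """
    per_token = torch.nn.functional.cross_entropy(
        neural_logits, targets, reduction='none')
    oracle_p_target = oracle_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    weights = torch.where(oracle_p_target > threshold,
                          torch.full_like(per_token, alpha),
                          torch.ones_like(per_token))
    return (weights * per_token).mean()
```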
Novel Contributions
- Order-16 frozen n-gram oracle prefilled from training shards
- Learned multi-expert gate blending neural and per-order n-gram experts
- Complementary training that focuses the neural model on oracle-hard tokens
- Score-first test-time training with adaptive temperature
- Combination of EMA, SWA, and int5 quantization for a compact high-performing submission