PR #859

open

Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03

val_bpb: 0.1582
Architecture: Transformer
Optimizer:
Artifact Size: 15.59 MB

Training Techniques

Architecture
learned mixer head
A Linear(512 → 7) head predicts per-token expert mixing weights over the neural model and n-gram orders 2-7.
parameters: {"input_dim":512,"output_dim":7}
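A minimal sketch of such a head, assuming the seven experts' per-token next-token distributions are stacked along their own axis before mixing (the class and argument names are illustrative, not from this PR):

```python
import torch
import torch.nn as nn

class LearnedMixerHead(nn.Module):
    """Sketch: a Linear(512 -> 7) over the final hidden state yields
    per-token mixing weights over 7 experts (the neural model plus
    n-gram orders 2-7), which then blend the expert distributions."""

    def __init__(self, input_dim: int = 512, num_experts: int = 7):
        super().__init__()
        self.proj = nn.Linear(input_dim, num_experts)

    def forward(self, hidden: torch.Tensor, expert_probs: torch.Tensor) -> torch.Tensor:
        # hidden:       (batch, seq, input_dim) final hidden states
        # expert_probs: (batch, seq, num_experts, vocab) per-expert distributions
        weights = self.proj(hidden).softmax(dim=-1)            # (B, T, E)
        # Convex combination of expert distributions per token.
        return (weights.unsqueeze(-1) * expert_probs).sum(dim=-2)  # (B, T, vocab)
```

Because the weights are a softmax and each expert emits a normalized distribution, the mixed output is itself a valid distribution per token.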
frozen n-gram oracle
Precomputed n-gram tables from training data are used as a frozen lookup oracle during training.
parameters: {"orders":"2-7"}
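A plain-Python sketch of precomputing such tables and querying them as a frozen oracle (the function names and uniform fallback are assumptions for illustration):

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, orders=range(2, 8)):
    """Precompute frozen n-gram tables: for each order n in 2-7, map an
    (n-1)-token context to counts of the token that follows it (sketch)."""
    tables = {n: defaultdict(Counter) for n in orders}
    for n in orders:
        for i in range(len(tokens) - n + 1):
            ctx = tuple(tokens[i:i + n - 1])
            tables[n][ctx][tokens[i + n - 1]] += 1
    return tables

def oracle_probs(tables, context, n, vocab_size):
    """Look up the order-n next-token distribution for the trailing
    context; fall back to uniform when the context was never seen."""
    ctx = tuple(context[-(n - 1):])
    counts = tables[n].get(ctx)
    if not counts:
        return [1.0 / vocab_size] * vocab_size
    total = sum(counts.values())
    return [counts.get(t, 0) / total for t in range(vocab_size)]
```

The tables are built once from the training data and never updated, which is what makes the oracle "frozen".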
MLP3.5x
Transformer MLP width is 3.5x the model dimension.
parameters: {"multiplier":3.5}
MHA 8/8
Multi-head attention configuration with 8 attention heads over 8 layers.
parameters: {"layers":8,"heads":8}
Quantization
mixed int5/int6
bits: null
scope: model weights
GPTQ
bits: null
scope: model weights
Weight Averaging
EMA
parameters: {"decay":0.997}
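EMA weight averaging keeps a shadow copy of the parameters that is blended toward the live weights after every optimizer step. A minimal sketch over flat lists of floats, using the decay from this PR:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step (sketch): ema <- decay * ema + (1 - decay) * param.
    The averaged weights, not the live ones, are used for evaluation."""
    for i, (e, p) in enumerate(zip(ema_params, params)):
        ema_params[i] = decay * e + (1.0 - decay) * p
    return ema_params
```

With decay 0.997, the average effectively spans roughly the last ~333 steps (1 / (1 - decay)).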
Compression
zstd
level: null
Evaluation
score-first backward-looking n-gram cache
parameters: {"orders":"2-7"}
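The precise "score-first" mechanics are not described here; the backward-looking part means each position is scored using only tokens before it. A self-contained sketch of that lookup, assuming `tables` maps each order to a dict from context tuples to next-token counts:

```python
def cached_ngram_scores(tokens, pos, tables, orders=range(2, 8)):
    """Backward-looking lookup (sketch): for position `pos`, fetch the
    cached next-token count table for each order n using only the
    n-1 tokens strictly before `pos`; None when the context is too
    short or unseen. The per-order scores would then be combined
    with the learned mixing weights."""
    out = {}
    for n in orders:
        ctx = tuple(tokens[max(0, pos - (n - 1)):pos])
        out[n] = tables[n].get(ctx) if len(ctx) == n - 1 else None
    return out
```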
Test-Time Training
none
parameters: {"ttt_epochs":0}
LR Schedule
matrix learning rate tuning
parameters: {"matrix_lr":0.03}
Other
other
Systematic hyperparameter screening across 79+ experiments to find the improved matrix learning rate.
parameters: {"experiments":79}

Novel Contributions

  • Learned mixer head that predicts per-token expert weights
  • Removing TTT entirely while improving performance
  • Increasing MATRIX_LR from 0.025 to 0.03
  • Systematic screening of 79+ experiments to discover the better learning rate
  • Backward-looking score-first n-gram cache with learned mixing weights