val_bpb: 0.1582
Architecture: Transformer
Optimizer: —
Artifact Size: 15.59 MB
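For reference, val_bpb is validation bits per byte. Assuming the standard definition (summed cross-entropy over the validation tokens, converted to bits, divided by the byte count), it can be computed as:

```python
import math

def bits_per_byte(token_losses_nats, n_bytes):
    """Convert summed per-token cross-entropy (in nats) to bits per byte."""
    total_bits = sum(token_losses_nats) / math.log(2)
    return total_bits / n_bytes
```

The function names and the nats-to-bits bookkeeping here are illustrative, not taken from the artifact itself.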
Training Techniques
Architecture
learned mixer head
A Linear(512 → 7) head predicts per-token mixing weights over seven experts: the neural model plus the six n-gram orders 2-7.
parameters: {"input_dim":512,"output_dim":7}
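A minimal sketch of the mixer head, written in pure Python for illustration. The softmax normalization, expert indexing (slot 0 for the neural model, slots 1-6 for n-gram orders 2-7), and initialization are assumptions; only the Linear(512 → 7) shape comes from the entry above:

```python
import math
import random

class MixerHead:
    """Sketch of a Linear(512 -> 7) head producing per-token mixing
    weights over 7 experts (neural model + n-gram orders 2-7;
    the expert ordering here is assumed, not documented)."""

    def __init__(self, input_dim=512, output_dim=7, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(output_dim)]
                  for _ in range(input_dim)]
        self.b = [0.0] * output_dim

    def __call__(self, h):
        # h: length-input_dim hidden state for one token
        logits = [self.b[j] + sum(h[i] * self.W[i][j] for i in range(len(h)))
                  for j in range(len(self.b))]
        # softmax so the weights form a distribution over the 7 experts
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        s = sum(exps)
        return [e / s for e in exps]
```

For a zero hidden state the logits equal the (zero) bias, so the head returns uniform weights of 1/7 per expert.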
frozen n-gram oracle
Precomputed n-gram tables from training data are used as a frozen lookup oracle during training.
parameters: {"orders":"2-7"}
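The frozen oracle can be sketched as frequency tables built once from the training stream and then used read-only. The function names, the uniform fallback for unseen contexts, and the table layout are assumptions for illustration:

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, orders=range(2, 8)):
    """Precompute frequency tables for n-gram orders 2-7 from training data.

    Returns {order: {context_tuple: Counter(next_token)}}. The tables are
    built once and treated as a frozen lookup oracle afterwards.
    """
    tables = {n: defaultdict(Counter) for n in orders}
    for n in orders:
        for i in range(len(tokens) - n + 1):
            ctx = tuple(tokens[i:i + n - 1])
            tables[n][ctx][tokens[i + n - 1]] += 1
    return tables

def oracle_probs(tables, context, vocab_size, order):
    """Next-token distribution for one order; uniform fallback if unseen."""
    ctx = tuple(context[-(order - 1):])
    counts = tables[order].get(ctx)
    if not counts:
        return [1.0 / vocab_size] * vocab_size
    total = sum(counts.values())
    return [counts.get(t, 0) / total for t in range(vocab_size)]
```

For example, on the stream `[0, 1, 2, 0, 1, 2]` the bigram table maps context `(1,)` to token `2` with probability 1.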
MLP3.5x
Transformer MLP width is 3.5x the model dimension.
parameters: {"multiplier":3.5}
MHA 8/8
Multi-head attention configuration with 8 attention heads over 8 layers.
parameters: {"layers":8,"heads":8}
Quantization
mixed int5/int6
bits: null
scope: model weights
GPTQ
bits: null
scope: model weights
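GPTQ refines rounding decisions with second-order (Hessian) information, which is beyond a short sketch; the snippet below shows only the basic symmetric grid that int5/int6 weights land on, with illustrative names:

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization to signed `bits`-bit ints.

    qmax is 15 for int5 and 31 for int6. Assumes at least one nonzero
    weight. GPTQ would adjust the rounding per-column using second-order
    information; this sketch uses plain nearest rounding.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]
```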
Weight Averaging
EMA
parameters: {"decay":0.997}
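A minimal sketch of the EMA update with the stated decay of 0.997 (parameter-dict layout and names are illustrative):

```python
def ema_update(ema_params, params, decay=0.997):
    """In-place exponential moving average of model weights:
    ema <- decay * ema + (1 - decay) * param, applied per parameter."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```

The EMA copy, not the raw weights, is what would typically be exported for evaluation.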
Compression
zstd
level: null
Evaluation
score-first backward-looking n-gram cache
parameters: {"orders":"2-7"}
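The cache's score-first, backward-looking bookkeeping is not specified here, but once each expert has produced a next-token distribution, the learned mixing step itself reduces to a weighted sum; a sketch under that assumption:

```python
def mix_experts(weights, expert_probs):
    """Combine 7 expert next-token distributions (neural model plus
    n-gram orders 2-7 in this sketch) under per-token mixing weights.

    weights: length-7 list summing to 1 (from the mixer head).
    expert_probs: list of 7 distributions over the vocabulary.
    """
    vocab = len(expert_probs[0])
    return [sum(w * p[t] for w, p in zip(weights, expert_probs))
            for t in range(vocab)]
```

With weights concentrated on one expert, the mixture reproduces that expert's distribution exactly.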
Test-Time Training
none
parameters: {"ttt_epochs":0}
LR Schedule
matrix learning rate tuning
parameters: {"matrix_lr":0.03}
Other
hyperparameter screening
Systematic hyperparameter screening across 79+ experiments to find the improved matrix learning rate.
parameters: {"experiments":79}
Novel Contributions
- Learned mixer head that predicts per-token expert weights
- Removing test-time training (TTT) entirely while improving performance
- Increasing MATRIX_LR from 0.025 to 0.03
- Systematic screening of 79+ experiments to discover the better learning rate
- Backward-looking score-first n-gram cache with learned mixing weights