val_bpb: 1.6200
Architecture: Transformer
Optimizer: SGD
Artifact Size: —
Training Techniques
- Architecture: BigramHash (n-gram logit boost using hashed n-gram tables with normalized softmax-based boosting; parameters: null)
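A minimal sketch of hashed n-gram logit boosting, assuming a fixed-size bucket table of next-token counts and a softmax-normalized boost blended into the base logits; the function name, hashing scheme, and blending rule are illustrative, and the submission's collision-fix normalization is not reproduced here.

```python
import numpy as np

def ngram_boost(logits, context, table, n=2, num_buckets=1 << 16, alpha=0.5):
    """Boost next-token logits with counts from a hashed n-gram table.

    `table` maps hash buckets to count vectors over the vocabulary.
    The counts are turned into a normalized distribution via softmax
    and blended into the base logits (illustrative rule, not the
    submission's exact formulation).
    """
    key = hash(tuple(context[-(n - 1):])) % num_buckets
    counts = table[key]                      # shape: (vocab,)
    z = counts - counts.max()                # stabilized softmax
    boost = np.exp(z) / np.exp(z).sum()
    return logits + alpha * np.log(boost + 1e-9)
```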
- Other: HedgeMixer (online multiplicative-weights mixing between neural and neural+n-gram experts; parameters: null)
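HedgeMixer's multiplicative-weights mixing can be sketched as a standard Hedge update over two experts, where each expert's weight is scaled by its per-token likelihood; the class shape and learning rate `eta` are assumptions, not the submission's code.

```python
import numpy as np

class HedgeMixer:
    """Online multiplicative-weights mixing of expert distributions.

    Weights shrink exponentially in each expert's log-loss on the
    observed token, so the better-predicting expert gains mixing weight.
    """
    def __init__(self, n_experts=2, eta=1.0):
        self.w = np.ones(n_experts) / n_experts
        self.eta = eta

    def mix(self, expert_probs):
        # expert_probs: (n_experts, vocab) predictive distributions
        return self.w @ expert_probs

    def update(self, expert_probs, token):
        loss = -np.log(expert_probs[:, token] + 1e-12)
        self.w *= np.exp(-self.eta * loss)   # multiplicative-weights step
        self.w /= self.w.sum()               # renormalize to a simplex
```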
- Optimizer: SGD (weight_decay: null; momentum: 0.95; other_params: {"per_layer_lr": true})
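The summary records only momentum 0.95 and per-layer learning rates, so the following is a sketch of one heavy-ball SGD step with a separate learning rate per layer; the list-of-arrays API and the example rates are illustrative.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lrs, momentum=0.95):
    """One SGD-with-momentum update with a separate lr per layer.

    params/grads/velocities are parallel lists of arrays; `lrs` gives
    each layer's learning rate. Classic (heavy-ball) momentum buffer:
    v <- momentum * v + g, then p <- p - lr * v, all in place.
    """
    for p, g, v, lr in zip(params, grads, velocities, lrs):
        v *= momentum
        v += g
        p -= lr * v
```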
- Weight Averaging: Polyak averaging (parameters: null)
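Polyak averaging during TTT can be sketched as an incremental equal-weight running mean of the parameters; whether the submission uses an equal-weight or exponential average is not recorded, so the equal-weight form here is an assumption.

```python
import numpy as np

class PolyakAverager:
    """Running (Polyak) average of parameters during test-time training.

    Maintains the mean of all parameter snapshots seen so far via the
    incremental-mean update a <- a + (p - a) / (t + 1).
    """
    def __init__(self, params):
        self.avg = [p.copy() for p in params]  # snapshot 0
        self.t = 0

    def update(self, params):
        self.t += 1
        for a, p in zip(self.avg, params):
            a += (p - a) / (self.t + 1)        # mean over t+1 snapshots
```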
- Test-Time Training: score-first TTT (parameters: null)
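"Score-first" ordering means each token is scored with the current model state before that token is used to update the model, so evaluation never leaks the label being predicted. A sketch with caller-supplied `score` and `update` callbacks (the callback API is an assumption):

```python
def score_first_ttt(tokens, score, update):
    """Score-first test-time training loop.

    For each position i: first score token i given the preceding
    context, then fold token i into the model's online update.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        total += score(tokens[:i], tokens[i])  # predict token i first...
        update(tokens[:i + 1])                 # ...then learn from it
    return total
```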
- Regularization: logit bias (parameters: {"per_document": true})
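One plausible reading of per-document online bias correction: track token frequencies within the current document and add a scaled log-frequency bias to the logits, resetting at document boundaries. The exact bias rule and the smoothing below are assumptions.

```python
import numpy as np

class DocumentBias:
    """Per-document online logit bias (illustrative sketch).

    Counts tokens seen so far in the current document and adds a
    scaled log of the (add-one smoothed) empirical frequency to the
    model's logits; reset() is called at each document boundary.
    """
    def __init__(self, vocab, alpha=0.1):
        self.counts = np.zeros(vocab)
        self.alpha = alpha

    def apply(self, logits):
        freq = (self.counts + 1) / (self.counts.sum() + len(self.counts))
        return logits + self.alpha * np.log(freq)

    def observe(self, token):
        self.counts[token] += 1

    def reset(self):
        self.counts[:] = 0
```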
- Evaluation: sliding window eval (parameters: null)
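Sliding-window evaluation scores each token given at most a fixed window of preceding context, so long sequences can be evaluated with a bounded context size. A sketch with a caller-supplied per-token loss; the window size and callback API are illustrative.

```python
def sliding_window_eval(tokens, score, window=64):
    """Mean per-token loss with a sliding context window.

    Each token i is scored by `score(ctx, tok)` where ctx is at most
    the previous `window` tokens.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        ctx = tokens[max(0, i - window):i]
        total += score(ctx, tokens[i])
    return total / (len(tokens) - 1)
```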
- Quantization: int6 (bits: 6; scope: all)
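A minimal int6 quantization sketch: symmetric per-tensor rounding to signed 6-bit integers in [-31, 31] with a single scale. Whether the submission quantizes per-tensor or per-channel is not recorded, so the per-tensor scale is an assumption.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization.

    Maps weights to integers in [-31, 31]; qmax = 2**(6-1) - 1 = 31.
    Stored in int8 since NumPy has no 6-bit dtype.
    """
    qmax = 2 ** (6 - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```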
Novel Contributions
- Normalized n-gram logit boost with softmax and collision-fix normalization
- HedgeMixer online multiplicative-weights expert mixing
- Test-time training with SGD (momentum 0.95) and per-layer learning rates
- Polyak averaging during TTT
- Per-document online bias correction
- Score-first update ordering for TTT and n-gram/HedgeMixer updates
- Numba JIT acceleration and fallback chain