PR #1014

open

N-gram logit boost + HedgeMixer + score-first TTT

by haimianbaobao007
val_bpb: 1.6200
Architecture: Transformer
Optimizer: SGD
Artifact Size:

Training Techniques

Architecture: BigramHash
  N-gram logit boost using hashed n-gram tables with normalized softmax-based boosting.
  parameters: null
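A minimal sketch of what a hashed n-gram logit boost with softmax normalization could look like. All names (`ngram_hash`, `boosted_logits`, the hash constant, `alpha`) are assumptions for illustration, not the PR's actual implementation:

```python
import numpy as np

VOCAB, TABLE_SIZE = 64, 1 << 12

def ngram_hash(context, table_size=TABLE_SIZE):
    """Hash a tuple of token ids into a table slot (constant is arbitrary)."""
    h = 0
    for tok in context:
        h = (h * 1000003 + tok) % table_size
    return h

# counts[slot, next_token] accumulates how often next_token followed a context
# that hashed to slot; colliding contexts share a slot, hence the normalization.
counts = np.zeros((TABLE_SIZE, VOCAB))

def update(context, next_token):
    counts[ngram_hash(context), next_token] += 1

def boosted_logits(base_logits, context, alpha=0.5):
    """Add a normalized n-gram log-distribution to the neural logits."""
    row = counts[ngram_hash(context)]
    if row.sum() == 0:
        return base_logits          # no evidence for this context yet
    probs = row / row.sum()         # normalize the (possibly shared) counts
    ngram_logits = np.log(probs + 1e-9)
    # softmax-style renormalization so the boost is a proper log-distribution
    ngram_logits -= np.log(np.exp(ngram_logits).sum())
    return base_logits + alpha * ngram_logits
```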
Other: HedgeMixer
  Online multiplicative-weights mixing between the neural expert and the neural+n-gram expert.
  parameters: null
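A sketch of Hedge-style multiplicative-weights mixing over two experts' predictive distributions. The class name, learning rate, and loss choice are assumptions; the PR's HedgeMixer may differ in detail:

```python
import numpy as np

class HedgeMixer:
    def __init__(self, n_experts=2, eta=1.0):
        self.w = np.ones(n_experts)   # one weight per expert
        self.eta = eta                # multiplicative-update learning rate

    def mix(self, expert_probs):
        """Weighted average of expert distributions (rows = experts)."""
        w = self.w / self.w.sum()
        return w @ expert_probs

    def update(self, expert_probs, target):
        """Downweight each expert by its log loss on the observed token."""
        losses = -np.log(expert_probs[:, target] + 1e-9)
        self.w *= np.exp(-self.eta * losses)
        self.w /= self.w.sum()        # renormalize for numerical stability
```

The mixer itself is model-agnostic: it only sees each expert's probability for the realized token, so the "neural" and "neural+ngram" experts can be arbitrary black boxes.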
Optimizer: SGD
  weight_decay: null
  momentum: 0.95
  other_params: {"per_layer_lr": true}
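The listed settings (momentum 0.95, per-layer learning rates) amount to plain momentum SGD where each layer gets its own step size. A minimal sketch, with the class and argument names being assumptions:

```python
import numpy as np

class PerLayerSGD:
    """Momentum SGD with one learning rate per parameter tensor (layer)."""
    def __init__(self, params, lrs, momentum=0.95):
        self.params, self.lrs, self.momentum = params, lrs, momentum
        self.velocity = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for i, g in enumerate(grads):
            # classic momentum buffer: v <- m*v + g, then p <- p - lr_i * v
            self.velocity[i] = self.momentum * self.velocity[i] + g
            self.params[i] -= self.lrs[i] * self.velocity[i]
```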
Weight Averaging: Polyak averaging
  parameters: null
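Polyak averaging keeps a running mean of the weight iterates and evaluates with the mean rather than the latest (noisy) iterate. A sketch, assuming a simple running-mean variant:

```python
import numpy as np

class PolyakAverage:
    """Running mean of all weight iterates seen so far."""
    def __init__(self, params):
        self.avg = [p.copy() for p in params]
        self.n = 1

    def update(self, params):
        # avg <- avg + (p - avg) / n, the incremental form of the mean
        self.n += 1
        for a, p in zip(self.avg, params):
            a += (p - a) / self.n
```

During TTT, `update` would be called after each test-time gradient step, and predictions would be made with `avg` instead of the raw parameters.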
Test-Time Training: score-first TTT
  parameters: null
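"Score-first" ordering means each token is scored before any test-time update, so the reported loss never sees the label it is about to train on. A sketch of the loop; the function names and signatures are placeholders:

```python
import numpy as np

def score_first_ttt(tokens, predict, update):
    """predict(prefix, target) -> model prob of target; update(...) adapts."""
    total_nll = 0.0
    for t in range(1, len(tokens)):
        prefix, target = tokens[:t], tokens[t]
        total_nll += -np.log(predict(prefix, target) + 1e-9)  # score first
        update(prefix, target)                                # then adapt
    return total_nll / (len(tokens) - 1)
```

The same ordering would apply to the n-gram table and HedgeMixer updates: mix and score with the current state, then fold in the new token.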
Regularization: logit bias
  parameters: {"per_document": true}
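One plausible reading of a per-document logit bias: a decaying bias vector over the vocabulary, reset at each document boundary, that nudges logits toward tokens already seen in the current document. The decay and strength values here are assumptions:

```python
import numpy as np

class PerDocBias:
    def __init__(self, vocab, strength=0.1, decay=0.99):
        self.strength, self.decay = strength, decay
        self.bias = np.zeros(vocab)

    def reset(self):                  # call at each document boundary
        self.bias[:] = 0.0

    def apply(self, logits):
        return logits + self.strength * self.bias

    def observe(self, token):
        self.bias *= self.decay       # gradually forget older tokens
        self.bias[token] += 1.0
```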
Evaluation: sliding window eval
  parameters: null
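Sliding-window evaluation scores a long sequence with a fixed-context model by advancing the window in strides and counting loss only on the tokens new to each window, so every token is scored exactly once. A sketch, with window and stride sizes chosen arbitrarily:

```python
import numpy as np

def sliding_window_eval(tokens, nll_fn, window=8, stride=4):
    """nll_fn(ctx, targets) -> per-token NLLs for `targets` given `ctx`."""
    nlls, scored = [], 0
    while scored < len(tokens) - 1:
        start = max(0, scored + 1 - window)
        ctx = tokens[start:scored + 1]            # up to `window` of context
        new = tokens[scored + 1:scored + 1 + stride]
        nlls.extend(nll_fn(ctx, new))             # only new tokens count
        scored += len(new)
    return float(np.mean(nlls))
```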
Quantization: int6
  bits: 6
  scope: all
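With `bits: 6` and `scope: all`, every weight tensor would be stored in a 6-bit integer range. A sketch of symmetric per-tensor int6 quantization; the rounding and scaling scheme is an assumption:

```python
import numpy as np

def quantize_int6(w):
    """Map floats symmetrically into the 6-bit range [-32, 31]."""
    m = np.abs(w).max()
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```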

Novel Contributions

  • Normalized n-gram logit boost with softmax and collision-fix normalization
  • HedgeMixer online multiplicative-weights expert mixing
  • Test-time training with momentum-0.95 SGD and per-layer learning rates
  • Polyak averaging during TTT
  • Per-document online bias correction
  • Score-first update ordering for TTT and n-gram/HedgeMixer updates
  • Numba JIT acceleration and fallback chain