PR #687

closed

Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)

val_bpb: 1.0745
Architecture
Optimizer
Artifact Size: <15.5 MB

Training Techniques

Quantization
GPTQ
bits: null
scope: model weights
Weight Averaging
EMA
parameters: null
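
For readers unfamiliar with the EMA entry above, a minimal sketch of an exponential moving average over weights follows. The PR lists no EMA parameters, so the function name `ema_update`, the dict-of-floats weight layout, and the `decay` default are all hypothetical choices made for illustration.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into the running average, in place.

    `decay` is a hypothetical value; the PR does not state one.
    """
    for name, value in weights.items():
        # Keep `decay` of the old average and add (1 - decay) of the new value.
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights
```

Called once per optimizer step, this would keep a smoothed copy of the weights that can be evaluated or quantized in place of the raw ones.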
Test-Time Training
TTT
parameters: {"learning_rate":0.0001,"chunk_tokens":131072,"use_mixer":true}
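
The chunked evaluation implied by `chunk_tokens` can be sketched as below. This assumes a score-then-adapt order, where each chunk is scored with the current weights before the model trains on it; the PR does not spell this out for TTT. The callables `score_fn` and `adapt_fn` and the function names are hypothetical stand-ins for the actual model code.

```python
def chunked(tokens, chunk_tokens):
    """Yield consecutive slices of at most `chunk_tokens` tokens."""
    for start in range(0, len(tokens), chunk_tokens):
        yield tokens[start:start + chunk_tokens]


def evaluate_with_ttt(tokens, score_fn, adapt_fn, chunk_tokens=131072):
    """Score every chunk before adapting on it, and return the summed loss.

    score_fn(chunk) -> float : hypothetical; returns the chunk's total loss
    adapt_fn(chunk) -> None  : hypothetical; takes gradient steps on the chunk
    """
    total_loss = 0.0
    for chunk in chunked(tokens, chunk_tokens):
        total_loss += score_fn(chunk)  # score with the weights as they are now
        adapt_fn(chunk)                # only then train on the scored chunk
    return total_loss
```

Under this ordering no token contributes to the weights that score it, which is the same constraint the n-gram tables in this PR are described as following.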
Other
other
5-expert logistic context mixer using Hedge algorithm to blend neural, unigram, bigram, trigram, and entropy experts in log-probability space during TTT evaluation
parameters: {"experts":["neural","unigram","bigram","trigram","entropy"],"online_update":"log_w -= eta * loss"}
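
A minimal sketch of the Hedge update over per-expert log-probabilities follows. The weight update is the `log_w -= eta * loss` rule from the parameters above. Everything else is an assumption: the class name `HedgeMixer`, the `eta` default, the use of each expert's negative log-probability of the observed token as its loss, and the choice to form the mixture as a weighted sum of probabilities evaluated with log-sum-exp.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


class HedgeMixer:
    def __init__(self, n_experts, eta=0.1):
        self.eta = eta                   # hypothetical learning rate
        self.log_w = [0.0] * n_experts   # uniform weights in log space

    def mix(self, expert_logps):
        """Log-probability of the token under the weighted mixture."""
        log_z = logsumexp(self.log_w)    # normalizer for the weights
        return logsumexp(
            [lw - log_z + lp for lw, lp in zip(self.log_w, expert_logps)]
        )

    def update(self, expert_logps):
        """Hedge step: log_w -= eta * loss, with loss = -log p(token)."""
        for i, lp in enumerate(expert_logps):
            self.log_w[i] -= self.eta * (-lp)
```

In use, `mix` would be called on the experts' log-probabilities for the true token to score it, and `update` would be called afterwards with the same values, so the weights only ever reflect tokens that have already been scored.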
other
Incremental n-gram table construction from already-scored tokens only
parameters: {"ngram_order":[1,2,3],"trigram_buckets":65536}
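
The incremental construction can be illustrated with the sketch below, which scores a token from the counts seen so far and only then adds it to the tables. The orders and the `trigram_buckets` value come from the parameters above. The class name `IncrementalNgram`, the hash used to bucket trigram contexts, and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class IncrementalNgram:
    def __init__(self, vocab_size, trigram_buckets=65536):
        self.vocab_size = vocab_size
        self.trigram_buckets = trigram_buckets
        # One table per order: context key -> Counter of next-token counts.
        self.tables = {order: defaultdict(Counter) for order in (1, 2, 3)}
        self.history = []  # last two already-scored tokens

    def _key(self, order):
        """Context key for the next token, or None if history is too short."""
        h = self.history
        if order == 1:
            return ()
        if order == 2:
            return h[-1] if len(h) >= 1 else None
        if len(h) < 2:
            return None
        # Hypothetical hash of the two-token context into a fixed bucket count.
        return (h[-2] * 1000003 + h[-1]) % self.trigram_buckets

    def logprob(self, order, token):
        """Add-one smoothed log-probability from already-scored tokens only."""
        key = self._key(order)
        counts = self.tables[order].get(key) if key is not None else None
        count = counts[token] if counts else 0
        total = sum(counts.values()) if counts else 0
        return math.log((count + 1) / (total + self.vocab_size))

    def update(self, token):
        """Add a token to every table; call this after the token is scored."""
        for order in (1, 2, 3):
            key = self._key(order)
            if key is not None:
                self.tables[order][key][token] += 1
        self.history = (self.history + [token])[-2:]
```

Calling `logprob` before `update` for every token keeps the statistics strictly causal. Hashing trigram contexts into a fixed number of buckets bounds memory at the cost of occasional collisions between contexts.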

Novel Contributions

  • 5-expert Hedge-based logistic context mixer
  • Online blending of neural and n-gram experts in log-probability space during TTT evaluation
  • Incremental n-gram statistics built only from already-scored tokens
  • GPTQ calibration performed within the training budget
  • Three-seed mean record validation score