PR #688 (open)

Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)

val_bpb: 1.0745
Architecture: Transformer
Optimizer: AdamW
Artifact Size: <15.5 MB

Training Techniques

Quantization: GPTQ
parameters: {"bits":5,"scope":"all"}
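GPTQ itself performs Hessian-based, column-by-column error correction during calibration; as a minimal sketch of only the 5-bit symmetric quantize/dequantize step it builds on (function names are hypothetical, per-tensor scaling is an assumption):

```python
import numpy as np

def quantize_int5(w: np.ndarray):
    """Symmetric per-tensor 5-bit quantization: integer levels in [-15, 15].

    Illustrative only -- GPTQ additionally reorders and error-corrects
    columns against a calibration Hessian, which is omitted here.
    """
    scale = np.abs(w).max() / 15.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the 5-bit integers back to floats."""
    return q.astype(np.float32) * scale

w = np.array([0.3, -1.5, 0.02, 1.5], dtype=np.float32)
q, s = quantize_int5(w)
w_hat = dequantize_int5(q, s)
```

The int5 values are then packed and handed to zstd, which compresses well because only 31 distinct byte values occur.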
Architecture: BigramHash. Hashed bigram table used as one expert in the 5-expert context mixer.
parameters: {"size":6144,"dim":128}
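A minimal sketch of what a hashed bigram expert can look like, assuming a count-based table with add-one smoothing and the "built only from already-scored tokens" discipline listed under Novel Contributions (the toy vocab size and hash multiplier are assumptions; only the bucket count matches the parameters):

```python
import numpy as np

VOCAB = 128          # toy vocabulary; the real run's vocab is an assumption
TABLE_SIZE = 6144    # hash buckets, matching the stated parameters

class HashedBigram:
    """Incremental hashed bigram counts: bucket = hash(prev_token).

    Counts are updated only after a token has been scored, so the
    expert never conditions on a token before predicting it.
    """
    def __init__(self):
        self.counts = np.ones((TABLE_SIZE, VOCAB))  # add-one smoothing

    def bucket(self, prev_tok: int) -> int:
        return (prev_tok * 2654435761) % TABLE_SIZE  # Knuth-style hash

    def log_probs(self, prev_tok: int) -> np.ndarray:
        row = self.counts[self.bucket(prev_tok)]
        return np.log(row / row.sum())

    def update(self, prev_tok: int, tok: int) -> None:
        self.counts[self.bucket(prev_tok), tok] += 1.0

bg = HashedBigram()
lp = bg.log_probs(prev_tok=5)   # predict first...
bg.update(prev_tok=5, tok=7)    # ...then record the observed token
```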
Architecture: XSA. Applied across all layers.
parameters: {"layers":11,"window_size":8}
Architecture: Partial RoPE. Rotary positional embeddings applied to 16 of the 64 head dimensions.
parameters: {"dimensions":"16/64"}
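A sketch of partial RoPE under the "16/64" reading: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged (the base of 10000 and the rotate-half layout are assumptions):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16) -> np.ndarray:
    """Apply rotary embeddings to the first `rot_dims` of each head dim.

    x: (seq, head_dim) with head_dim=64; the remaining 48 dims pass
    through unrotated -- the "16/64" in the parameters above.
    """
    seq, head_dim = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                       # (seq, 1)
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    ang = pos * inv_freq                                # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.randn(10, 64)
y = partial_rope(x)
```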
Architecture: MLP3x. Three-layer MLP with squared LeakyReLU activation (negative slope 0.5).
parameters: {"activation":"LeakyReLU(0.5)^2"}
Architecture: VE128. Enabled in later layers.
parameters: {"layers":[9,10]}
Weight Averaging: EMA
parameters: {"decay":0.997}
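With decay 0.997, the EMA update amounts to, per parameter tensor (a minimal dict-of-tensors sketch; names are hypothetical):

```python
def ema_update(avg: dict, params: dict, decay: float = 0.997) -> dict:
    """In-place EMA of weights: avg <- decay * avg + (1 - decay) * params."""
    for name in params:
        avg[name] = decay * avg[name] + (1.0 - decay) * params[name]
    return avg

# Call once per optimizer step; evaluate with `avg` instead of `params`.
avg = {"w": 0.0}
ema_update(avg, {"w": 1.0})
```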
Compression: zstd
parameters: {"level":22}
Evaluation: sliding window eval
parameters: {"stride":32,"seq_len":2048}
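With seq_len 2048 and stride 32, each window scores only its last 32 tokens, giving every scored token up to 2016 tokens of left context, and each token is scored exactly once. The index bookkeeping can be sketched as (function name is hypothetical):

```python
def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 32):
    """Return (start, end, score_from) triples covering all tokens once.

    Tokens [score_from, end) are scored in each window; tokens
    [start, score_from) serve only as context.
    """
    windows = []
    scored = 0
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, scored))
        scored = end
    return windows

ws = sliding_windows(100, seq_len=16, stride=4)
```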
Test-Time Training: score-first TTT
parameters: {"learning_rate":0.0001,"chunk_tokens":131072,"epochs":3,"polyak_decay":0.998,"frozen_blocks":9}
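"Score-first" means each chunk contributes to the reported loss before the model trains on it, so evaluation never sees weights already adapted to that chunk. A toy sketch of the loop, with a scalar model standing in for the transformer and Polyak averaging of the adapted weight (all names and the toy lr/decay are hypothetical; the real run uses lr=1e-4, decay=0.998, 131072-token chunks, and freezes the first 9 blocks):

```python
class ToyTTT:
    """Toy score-first TTT: the 'model' is one scalar predicting chunk means."""
    def __init__(self, w=0.0, lr=0.5, polyak_decay=0.5):
        self.w = w          # fast (trained) weight
        self.w_avg = w      # Polyak-averaged weight used for scoring
        self.lr = lr
        self.decay = polyak_decay

    def score(self, chunk):
        """Mean squared error under the averaged weight."""
        return sum((x - self.w_avg) ** 2 for x in chunk) / len(chunk)

    def adapt(self, chunk, epochs=3):
        """SGD on the chunk, tracking a Polyak average of the weight."""
        for _ in range(epochs):
            grad = sum(2.0 * (self.w - x) for x in chunk) / len(chunk)
            self.w -= self.lr * grad
            self.w_avg = self.decay * self.w_avg + (1 - self.decay) * self.w

def run_ttt(chunks, model):
    losses = []
    for chunk in chunks:
        losses.append(model.score(chunk))  # score FIRST...
        model.adapt(chunk)                 # ...then train on the same chunk
    return losses

losses = run_ttt([[1.0, 1.0]] * 3, ToyTTT())
```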
Sequence Length: train_length 131072, eval_length 2048
LR Schedule: cosine decay
parameters: {"adaptive_lr_max_mult":3}
Regularization: layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
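The stated formula damps deeper layers progressively; over the 11 layers it evaluates to:

```python
import math

# Per-layer LN scale 1/sqrt(layer+1): layer 0 -> 1.0, layer 3 -> 0.5, ...
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(11)]
```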
Other: 5-expert Hedge/multiplicative-weights logistic context mixer blending neural, unigram, bigram, trigram, and entropy experts in log-probability space.
parameters: {"eta":0.1}
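A minimal sketch of one Hedge step with eta=0.1, using two toy experts over a three-token vocabulary; the real mixer runs five experts, GPU-vectorized across positions, and the exact mixing rule (weighted probability average vs. logistic mixing) is an assumption here:

```python
import numpy as np

def hedge_mix(expert_logps, weights, target, eta=0.1):
    """One Hedge/multiplicative-weights step over expert log-probs.

    expert_logps: (n_experts, vocab) per-expert log-probabilities for
    the next token; weights: (n_experts,) current mixture weights
    summing to 1. Returns mixed log-probs and updated weights, using
    each expert's log-loss on the realised `target` token.
    """
    # Mixture prediction: weighted average of expert probabilities.
    mixed = np.log(weights @ np.exp(expert_logps))
    # Hedge update: down-weight each expert by exp(-eta * its log-loss).
    losses = -expert_logps[:, target]
    weights = weights * np.exp(-eta * losses)
    weights /= weights.sum()
    return mixed, weights

# Two toy experts: one confident in token 0, one uniform.
e1 = np.log(np.array([0.8, 0.1, 0.1]))
e2 = np.log(np.full(3, 1.0 / 3.0))
w = np.full(2, 0.5)
mixed, w = hedge_mix(np.stack([e1, e2]), w, target=0)
```

After observing token 0, the confident expert's weight rises above the uniform expert's, which is exactly the regret-minimizing behaviour Hedge provides.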

Novel Contributions

  • 5-expert Hedge-based logistic context mixer
  • Online GPU-vectorized context mixing in log-probability space
  • Incremental n-gram tables built only from already-scored tokens
  • Score-first test-time training pipeline
  • GPTQ-calibrated model with int5 quantization and zstd compression