PR #779

open

Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)

by deanbrr
val_bpb
0.6683
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.63 MB

Training Techniques

Quantization
int5
bits: 5
scope: all
Architecture
BigramHash
Uses a hash-based n-gram cache / mixer with multi-order backoff over orders 2-7.
parameters: {"buckets":4096}
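The BigramHash entry can be sketched as one hash-bucketed count table per order, with prediction backing off from the highest order that has counts. Only `buckets=4096` and orders 2-7 come from the record; the class and method names are hypothetical, and a byte-level vocabulary of 256 is assumed:

```python
import hashlib

class BackoffNgramMixer:
    """Hash-bucketed n-gram cache with highest-order-first backoff (sketch).

    Counts are updated online from already-seen tokens only, so no
    target peeking is involved.
    """

    def __init__(self, orders=range(2, 8), buckets=4096, vocab=256):
        self.orders = list(orders)       # n-gram orders 2..7 per the record
        self.buckets = buckets           # 4096 per the record
        self.vocab = vocab               # assumed byte-level vocabulary
        # One sparse count table per order: bucket -> next-token counts.
        self.tables = {n: {} for n in self.orders}

    def _bucket(self, context):
        # Deterministic hash of the context tuple into [0, buckets).
        h = hashlib.blake2b(bytes(context), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.buckets

    def update(self, history, next_token):
        """Record that `next_token` followed `history`, for every order."""
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                counts = self.tables[n].setdefault(
                    self._bucket(ctx), [0] * self.vocab)
                counts[next_token] += 1

    def predict(self, history):
        """Back off from the highest order with any counts; None if unseen."""
        for n in reversed(self.orders):
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                counts = self.tables[n].get(self._bucket(ctx))
                if counts and sum(counts) > 0:
                    total = sum(counts)
                    return [c / total for c in counts]
        return None
```

Hash bucketing keeps the cache at a fixed memory footprint (4096 buckets per order) at the cost of occasional context collisions.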
MLP3x
Three-layer MLP block in the model architecture.
parameters: null
LeakyReLU
Uses a squared LeakyReLU nonlinearity, LeakyReLU(x; slope=0.5)^2.
parameters: {"slope":0.5}
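The nonlinearity itself is a one-liner; this scalar version is a minimal sketch (the model presumably applies it elementwise to tensors):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared, per the record.

    The result is always nonnegative; negative inputs are attenuated by
    the slope before squaring.
    """
    y = x if x >= 0 else slope * x
    return y * y
```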
XSA
XSA applied in all 11 layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
tied embeddings
Input and output embeddings are tied.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
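EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters that trails the live weights. A minimal sketch over a parameter dict (only the decay value comes from the record):

```python
def ema_update(avg, current, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * live.

    `avg` and `current` are dicts of parameter name -> value; in a real
    training loop this would run over model tensors after each step.
    """
    return {k: decay * avg[k] + (1 - decay) * current[k] for k in avg}
```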
Evaluation
stride-based eval
parameters: {"stride":64}
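Stride-based eval typically slides a fixed context window forward by the stride, scoring only the newest tokens of each window so every token is predicted with long left context. Here `context=2048` is an assumed window length; only `stride=64` comes from the record:

```python
def stride_eval_spans(n_tokens, context=2048, stride=64):
    """Return (ctx_start, score_begin, score_end) spans for stride eval.

    Each span scores tokens [score_begin, score_end) while conditioning
    on tokens from ctx_start onward; spans tile the sequence exactly once.
    """
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        ctx_start = max(0, end - context)
        spans.append((ctx_start, begin, end))
    return spans
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.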
Test-Time Training
score-first TTT
parameters: {"qttt":1,"eta":0.02,"learning_rate":0.00003,"chunk_tokens":1048576,"epochs":1,"adaptive_lr":0,"polyak":0,"freeze_blocks":1}
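The "score-first" ordering can be sketched as: score each chunk with the pre-update weights, then adapt on that same chunk, so no token is ever scored by a model that has already trained on it. `model_loss` and `model_update` are hypothetical stand-ins for the real model; only the chunk-wise ordering reflects the record:

```python
def score_first_ttt(model_loss, model_update, chunks):
    """Score-first test-time training loop (sketch).

    For each chunk: evaluate with the current weights first, then take a
    gradient step on the chunk. Returns the token-weighted mean loss.
    """
    total, count = 0.0, 0
    for chunk in chunks:
        total += model_loss(chunk) * len(chunk)  # score before any update
        count += len(chunk)
        model_update(chunk)                      # then adapt on the chunk
    return total / count
```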
LR Schedule
none
parameters: {"adaptive_lr":0}
Regularization
layerwise LN scale
parameters: null
Other
other
Entropy-adaptive n-gram mixing with multi-order backoff (orders 2-7), conditioned only on already-scored tokens; the mixed probability is always applied, with no oracle gating.
parameters: {"alpha_formula":"0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}

Novel Contributions

  • BackoffNgramMixer with multi-order n-gram backoff from orders 2-7
  • Entropy-adaptive mixing coefficient based on model entropy rather than target peeking
  • Drift-free TTT configuration that avoids late-chunk degradation
  • Score-first, backward-looking test-time training compliant with competition rules
  • Pure eval-time n-gram cache improvement requiring no retraining or architectural changes