PR #779

open

Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)

by deanbrr
val_bpb
0.6683
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.63 MB

Training Techniques

Quantization
int5
bits: 5
scope: all
Architecture
BigramHash
Uses a hash-based n-gram cache / mixer with multi-order backoff over orders 2-7.
parameters: {"buckets":4096}
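The BigramHash entry can be sketched as one hash-bucketed count table per order, with prediction backing off from the highest order that has counts. Only `buckets=4096` and orders 2-7 come from the record; the class and method names are hypothetical, and a byte-level vocabulary of 256 is assumed:

```python
import hashlib

class BackoffNgramMixer:
    """Hash-bucketed n-gram cache with highest-order-first backoff (sketch).

    Counts are updated online from already-seen tokens only, so no
    target peeking is involved.
    """

    def __init__(self, orders=range(2, 8), buckets=4096, vocab=256):
        self.orders = list(orders)       # n-gram orders 2..7 per the record
        self.buckets = buckets           # 4096 per the record
        self.vocab = vocab               # assumed byte-level vocabulary
        # One sparse count table per order: bucket -> next-token counts.
        self.tables = {n: {} for n in self.orders}

    def _bucket(self, context):
        # Deterministic hash of the context tuple into [0, buckets).
        h = hashlib.blake2b(bytes(context), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.buckets

    def update(self, history, next_token):
        """Record that `next_token` followed `history`, for every order."""
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                counts = self.tables[n].setdefault(
                    self._bucket(ctx), [0] * self.vocab)
                counts[next_token] += 1

    def predict(self, history):
        """Back off from the highest order with any counts; None if unseen."""
        for n in reversed(self.orders):
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                counts = self.tables[n].get(self._bucket(ctx))
                if counts and sum(counts) > 0:
                    total = sum(counts)
                    return [c / total for c in counts]
        return None
```

Hash bucketing keeps the cache at a fixed memory footprint (4096 buckets per order) at the cost of occasional context collisions.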
MLP3x
Three-layer MLP block in the model architecture.
parameters: null
LeakyReLU
Uses a squared LeakyReLU nonlinearity, LeakyReLU(x; slope=0.5)^2.
parameters: {"slope":0.5}
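The nonlinearity itself is a one-liner; this scalar version is a minimal sketch (the model presumably applies it elementwise to tensors):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared, per the record.

    The result is always nonnegative; negative inputs are attenuated by
    the slope before squaring.
    """
    y = x if x >= 0 else slope * x
    return y * y
```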
XSA
XSA applied in all 11 layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
tied embeddings
Input and output embeddings are tied.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
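EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters that trails the live weights. A minimal sketch over a parameter dict (only the decay value comes from the record):

```python
def ema_update(avg, current, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * live.

    `avg` and `current` are dicts of parameter name -> value; in a real
    training loop this would run over model tensors after each step.
    """
    return {k: decay * avg[k] + (1 - decay) * current[k] for k in avg}
```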
Evaluation
stride-based eval
parameters: {"stride":64}
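Stride-based eval typically slides a fixed context window forward by the stride, scoring only the newest tokens of each window so every token is predicted with long left context. Here `context=2048` is an assumed window length; only `stride=64` comes from the record:

```python
def stride_eval_spans(n_tokens, context=2048, stride=64):
    """Return (ctx_start, score_begin, score_end) spans for stride eval.

    Each span scores tokens [score_begin, score_end) while conditioning
    on tokens from ctx_start onward; spans tile the sequence exactly once.
    """
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        ctx_start = max(0, end - context)
        spans.append((ctx_start, begin, end))
    return spans
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.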
Test-Time Training
score-first TTT
parameters: {"qttt":1,"eta":0.02,"learning_rate":0.00003,"chunk_tokens":1048576,"epochs":1,"adaptive_lr":0,"polyak":0,"freeze_blocks":1}
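The "score-first" ordering can be sketched as: score each chunk with the pre-update weights, then adapt on that same chunk, so no token is ever scored by a model that has already trained on it. `model_loss` and `model_update` are hypothetical stand-ins for the real model; only the chunk-wise ordering reflects the record:

```python
def score_first_ttt(model_loss, model_update, chunks):
    """Score-first test-time training loop (sketch).

    For each chunk: evaluate with the current weights first, then take a
    gradient step on the chunk. Returns the token-weighted mean loss.
    """
    total, count = 0.0, 0
    for chunk in chunks:
        total += model_loss(chunk) * len(chunk)  # score before any update
        count += len(chunk)
        model_update(chunk)                      # then adapt on the chunk
    return total / count
```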
LR Schedule
none
parameters: {"adaptive_lr":0}
Regularization
layerwise LN scale
parameters: null
Other
other
Entropy-adaptive n-gram mixing with multi-order backoff (orders 2-7), conditioned only on already-scored tokens; the mixed probability is always applied, with no oracle gating.
parameters: {"alpha_formula":"0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}

Novel Contributions

  • BackoffNgramMixer with multi-order n-gram backoff from orders 2-7
  • Entropy-adaptive mixing coefficient based on model entropy rather than target peeking
  • Drift-free TTT configuration that avoids late-chunk degradation
  • Score-first, backward-looking test-time training compliant with competition rules
  • Pure eval-time n-gram cache improvement requiring no retraining or architectural changes