PR #999

open

Record: 11L Muon TTT + Entropy-Adaptive Epochs (8×H100) — val_bpb 1.1179 (3-seed mean)

by aamodbhatt

val_bpb

1.1179

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.9 MB

Training Techniques

Architecture

BigramHash

Bigram hash embedding component in the base stack.

parameters: {"size":1536}

XSA

Uses XSA on the last layers of the model.

parameters: {"last_n":4}

MLP3x

Three-times expanded MLP block.

parameters: null

LeakyReLU

LeakyReLU^2 activation in the MLP.

parameters: {"slope":0.5}
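A minimal scalar sketch of what a LeakyReLU^2 activation with slope 0.5 could look like, assuming it is a leaky ReLU followed by a plain square (so negative inputs produce positive outputs). The PR summary does not say whether the sign of the negative branch is preserved, and the function name is hypothetical.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring (assumed form, not confirmed by the PR)."""
    # Leaky ReLU: identity for non-negative inputs, scaled by `slope` otherwise.
    y = x if x >= 0.0 else slope * x
    # Square the result; the sign of the negative branch is discarded here.
    return y * y
```

For example, an input of -2.0 is first scaled to -1.0 and then squared to 1.0.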

RoPE

Partial rotary positional embeddings.

parameters: {"dimensions":16}
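The idea of partial rotary embeddings can be sketched as follows: only the first 16 dimensions of a head vector are rotated by position, and the remaining dimensions pass through unchanged. The pairing convention (first half paired with second half of the rotary slice), the base of 10000, and the function name are assumptions for illustration, not details taken from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of `vec`; leave the rest as-is."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos_t - x2 * sin_t
        out[i + half] = x1 * sin_t + x2 * cos_t
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved.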

VE128

Value residual enhancement on selected layers.

parameters: {"layers":[9,10]}

Regularization

LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
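As a small worked example of the scale rule, assuming `layer` is a zero-based layer index (the summary does not say which indexing is used), the first layer keeps a scale of 1 and deeper layers are damped progressively. The function name is hypothetical.

```python
import math

def ln_scale(layer: int) -> float:
    """Per-layer scale 1/sqrt(layer + 1), with `layer` assumed to be zero-based."""
    return 1.0 / math.sqrt(layer + 1)

# Scales for an 11-layer stack (indices 0..10), as in the "11L" title.
scales = [ln_scale(i) for i in range(11)]
```

Under this assumption the last of 11 layers would be scaled by 1/sqrt(11), roughly 0.30.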

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
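One way the two averaging schemes could be tracked side by side is sketched below, using plain lists of floats in place of model tensors. How the PR actually combines the EMA and SWA weights, and when SWA starts, is not stated; the class and attribute names are hypothetical.

```python
class WeightAverager:
    """Tracks an EMA every step and an SWA running mean every `swa_every` steps."""

    def __init__(self, weights, ema_decay=0.997, swa_every=50):
        self.ema_decay = ema_decay
        self.swa_every = swa_every
        self.ema = list(weights)
        self.swa = [0.0] * len(weights)
        self.swa_count = 0
        self.step = 0

    def update(self, weights):
        self.step += 1
        d = self.ema_decay
        # EMA: exponential blend of the previous average and the new weights.
        self.ema = [d * e + (1.0 - d) * w for e, w in zip(self.ema, weights)]
        if self.step % self.swa_every == 0:
            # SWA: equal-weight running mean over periodic snapshots.
            self.swa_count += 1
            n = self.swa_count
            self.swa = [s + (w - s) / n for s, w in zip(self.swa, weights)]
```

With `swa_every=50`, 100 calls to `update` take two SWA snapshots while the EMA is refreshed on every call.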

Quantization

late QAT

bits: 6

scope: model
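Quantization-aware training usually simulates low-precision weights in the forward pass. The sketch below shows a 6-bit "fake quantize" round trip, assuming symmetric per-tensor scaling with integer levels in [-31, 31]. The PR's actual scaling granularity, rounding mode, and the point at which "late" QAT switches on are not given here, so treat those as assumptions; the function name is hypothetical.

```python
def fake_quantize(values, bits=6):
    """Round values to a symmetric `bits`-bit grid and map them back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax  # one shared scale for the whole tensor (assumed)
    out = []
    for v in values:
        # Quantize to an integer level, clamp to the grid, then dequantize.
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out
```

Each output differs from its input by at most half a quantization step, which is the error the model learns to tolerate during QAT.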

Compression

lzma

level: 7
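For reference, LZMA compression at level 7 can be reproduced with Python's standard `lzma` module, assuming "level" corresponds to the xz preset and the default `.xz` container is acceptable. The serialization of the quantized weights into bytes is outside this sketch, and `compress_artifact` is a hypothetical helper name.

```python
import lzma

def compress_artifact(raw: bytes, level: int = 7) -> bytes:
    """Compress a serialized artifact with LZMA at the given preset level."""
    return lzma.compress(raw, preset=level)

def decompress_artifact(blob: bytes) -> bytes:
    """Inverse of compress_artifact."""
    return lzma.decompress(blob)
```

The compressed length of the real artifact is what would count toward the reported ~15.9 MB size.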

Evaluation

sliding window eval

parameters: {"stride":64}
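Sliding window evaluation with a stride of 64 can be sketched as an index computation: each window advances by 64 tokens, and only the tokens not already scored by an earlier window contribute to the loss, so every token is scored exactly once with as much left context as the window allows. The context length is not given in this summary, so `context_len` is a hypothetical parameter, as is the function name.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (start, end, score_from, score_to) tuples covering all tokens once."""
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_len, num_tokens)
        # The model sees tokens [start, end) but only [scored_upto, end) is scored.
        windows.append((start, end, scored_upto, end))
        scored_upto = end
        start += stride
    return windows
```

For instance, 300 tokens with a 128-token context produce four windows, the first scoring 128 tokens and the later ones scoring at most 64 new tokens each.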

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"epochs":"2/3/4 adaptive","chunk_tokens":32768}
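The "score-first" ordering can be sketched as a loop in which each chunk is scored before the model is adapted on it, so the reported loss never benefits from training on the tokens being scored. The callables `score_fn`, `train_fn`, and `epochs_fn` are hypothetical stand-ins for the model's evaluation pass, one TTT training epoch, and the epoch-selection rule; chunks would hold 32768 tokens in the configuration above.

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs_fn):
    """Score each chunk first, then adapt on it; returns mean NLL per token."""
    total_nll = 0.0
    total_tokens = 0
    for chunk in chunks:
        # 1) Score with the current weights, before any update on this chunk.
        chunk_nll = score_fn(chunk)
        total_nll += chunk_nll
        total_tokens += len(chunk)
        # 2) Adapt on the already-scored chunk for a chosen number of epochs.
        num_epochs = epochs_fn(chunk_nll / len(chunk))
        for _ in range(num_epochs):
            train_fn(chunk)
    return total_nll / total_tokens
```

Under this ordering the updates from one chunk can only influence the scores of later chunks.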

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"ttt_muon":true,"newton_schulz_steps":3,"parallel":true}
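A rough pure-Python sketch of a three-step Newton-Schulz orthogonalization follows. It uses the quintic coefficients (3.4445, -4.7750, 2.0315) from the public Muon reference implementation; whether this PR uses the same coefficients, normalization, or precision is an assumption. Matrices are lists of lists for self-containment, and all helper names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(grad, steps=3, eps=1e-7):
    """Approximately orthogonalize `grad` with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed: Muon reference coefficients
    # Normalize by the Frobenius norm so all singular values are at most 1.
    norm = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / (norm + eps) for v in row] for row in grad]
    for _ in range(steps):
        gram = matmul(x, transpose(x))          # A = X X^T
        gram2 = matmul(gram, gram)              # A^2
        poly = [[b * g1 + c * g2 for g1, g2 in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]  # b*A + c*A^2
        px = matmul(poly, x)
        # X <- a*X + (b*A + c*A^2) X
        x = [[a * xv + pv for xv, pv in zip(xr, pr)] for xr, pr in zip(x, px)]
    return x
```

With only three steps the singular values are pushed toward 1 but not fully equalized, which trades some accuracy for lower cost per update.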

LR Schedule

cosine decay

parameters: {"warmdown_steps":3500}
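One plausible reading of this schedule is a constant learning rate followed by a cosine-shaped decay to zero over the final 3500 steps. The sketch below implements that reading; the exact shape used in the PR, the total step count, and any warmup phase are not given in this summary, so `total_steps` and the function name are hypothetical.

```python
import math

def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Multiplier on the base LR: 1.0 until warmdown, then cosine decay to 0."""
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return 1.0
    span = max(total_steps - warmdown_start, 1)
    progress = min((step - warmdown_start) / span, 1.0)
    # Half-cosine from 1.0 (progress = 0) down to 0.0 (progress = 1).
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under this reading, the multiplier is 0.5 exactly halfway through the warmdown phase.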

Other

other

Entropy-adaptive TTT epoch selection based on chunk uncertainty, assigning 2/3/4 epochs per chunk.

parameters: {"high_threshold":2.1,"low_threshold":1.75}
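The selection rule can be sketched as a simple threshold lookup. This assumes that higher chunk uncertainty maps to more epochs and that values exactly at a threshold fall into the higher bucket; the summary gives only the thresholds and the 2/3/4 epoch counts, not the direction or the boundary handling. The function name and the unit of the uncertainty measure are likewise assumptions.

```python
def select_ttt_epochs(chunk_entropy, low_threshold=1.75, high_threshold=2.1):
    """Pick 2, 3, or 4 TTT epochs from a per-chunk uncertainty estimate."""
    if chunk_entropy >= high_threshold:
        return 4  # most uncertain chunks get the most adaptation (assumed)
    if chunk_entropy >= low_threshold:
        return 3
    return 2      # confident chunks get the minimum number of epochs
```

For example, a chunk with an uncertainty of 1.9 falls between the two thresholds and would receive 3 epochs.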

Novel Contributions

Muon-style Newton-Schulz orthogonalized updates in the test-time training loop
Entropy-adaptive epoch selection that allocates 2/3/4 epochs per chunk based on chunk uncertainty
Score-first TTT with global NLL synchronization across DDP ranks to avoid collective mismatch
Improved 3-seed mean val_bpb to 1.1179, beating the prior SOTA of 1.1194