PR #1802
openRecord: SP8192 + Polar Express NS + Multi-Phase Global TTT — val_bpb 1.0771 (3-seed mean)
by aamodbhatt
val_bpb
1.0771
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
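GPTQ itself does Hessian-aware, error-compensated rounding; as a hedged illustration only, the sketch below shows the uniform 6-bit grid such a scheme rounds weights onto (the `quantize_uniform`/`dequantize` helpers are hypothetical, not the submission's code).

```python
import numpy as np

def quantize_uniform(w, bits=6):
    # Map weights onto a uniform grid of 2**bits levels (63 steps for 6 bits).
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    q = np.round((w - w.min()) / scale)  # integer codes in [0, levels]
    return q.astype(np.int64), w.min(), scale

def dequantize(q, zero, scale):
    # Reconstruct approximate weights from integer codes.
    return q * scale + zero

w = np.random.default_rng(4).standard_normal((4, 8))
q, zero, scale = quantize_uniform(w, bits=6)
w_hat = dequantize(q, zero, scale)
```

Rounding to the grid bounds the per-weight error by half a step (`scale / 2`); GPTQ's contribution is redistributing that error across not-yet-quantized columns.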
Architecture
depth recurrence
Encoder/decoder layer recurrence with repeated layers during generation/adaptation.
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
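The encoder schedule above applies 8 layer passes using only 6 unique layers (indices 3 and 4 repeat). A minimal sketch of that idea, with a placeholder residual block standing in for the real transformer layer:

```python
import numpy as np

ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]  # from the record's parameters

rng = np.random.default_rng(0)
dim = 16
# 6 unique weight matrices; the schedule decides how often each is applied.
layers = [rng.standard_normal((dim, dim)) * 0.05 for _ in range(6)]

def forward(x, schedule):
    # 8 layer applications per pass, but no extra parameters for the repeats.
    for idx in schedule:
        x = x + np.maximum(x @ layers[idx], 0.0)  # residual + ReLU placeholder
    return x

y = forward(rng.standard_normal((2, dim)), ENCODER_SCHEDULE)
```

Depth recurrence trades extra compute per token for model capacity at zero parameter (and artifact-size) cost, which is why it pairs well with a sub-16 MB budget.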
Partial RoPE
Uses rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":"16/64"}
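A hedged sketch of partial RoPE with the record's 16-of-64 split, using one common pairing convention (half/half rotation); the remaining 48 dims pass through untouched:

```python
import numpy as np

HEAD_DIM, ROT = 64, 16  # rotate only the first 16 of 64 head dims

def partial_rope(x, positions, base=10000.0):
    # x: (seq, HEAD_DIM). Rotate the first ROT dims in pairs; keep the rest.
    half = ROT // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT:]], axis=-1)

x = np.random.default_rng(1).standard_normal((8, HEAD_DIM))
y = partial_rope(x, np.arange(8, dtype=np.float64))
```

Because rotation is norm-preserving, only the relative phase of the first 16 dims carries position; the unrotated dims remain position-independent channels.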
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
U-Net skip connections
Skip connections gated in a U-Net-like pattern.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
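With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal sketch (toy shapes, no masking):

```python
import numpy as np

H, KV, D, seq = 8, 4, 16, 5
rng = np.random.default_rng(2)
q = rng.standard_normal((H, seq, D))
k = rng.standard_normal((KV, seq, D))
v = rng.standard_normal((KV, seq, D))

def gqa(q, k, v):
    group = q.shape[0] // k.shape[0]     # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)      # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v

out = gqa(q, k, v)
```

The `np.repeat` is conceptual; real implementations index the shared KV heads without materializing copies.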
Regularization
logit softcap
parameters: {"value":30}
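Logit softcapping with value 30 squashes logits through a scaled tanh so their magnitude never exceeds the cap while leaving small logits almost unchanged:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bound logits to (-cap, cap); near-identity for |logits| << cap.
    return cap * np.tanh(logits / cap)

z = softcap(np.array([-100.0, -5.0, 0.0, 5.0, 100.0]))
```

This bounds the softmax temperature implicitly and keeps extreme logits from dominating gradients.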
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"polar_express_ns_coefficients":true,"backend_steps":5}
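Muon orthogonalizes each gradient update via a quintic Newton-Schulz iteration; this record swaps Muon's fixed coefficients for Polar Express per-step coefficients. The sketch below uses the standard fixed Muon coefficients as placeholders, since the actual Polar Express values are not listed here:

```python
import numpy as np

def newton_schulz(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    # Iteratively push the singular values of G toward 1 (approximate
    # orthogonalization). Polar Express would supply tuned per-step coeffs.
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-norm scaling
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic polynomial in the SVs
    return X

G = np.random.default_rng(3).standard_normal((8, 8))
O = newton_schulz(G, steps=5)  # steps=5 matches the record's backend_steps
```

Each iteration applies the polynomial a·s + b·s³ + c·s⁵ to every singular value s, so after a few steps the update matrix is close to orthogonal.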
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.015,"gradient_clip":1}
Weight Averaging
EMA
parameters: {"decay":0.9965}
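EMA weight averaging with decay 0.9965 maintains a slow-moving copy of the weights that is updated after every optimizer step:

```python
DECAY = 0.9965  # from the record; ~1/(1-decay) ≈ 286-step averaging horizon

def ema_update(ema, w, decay=DECAY):
    # ema <- decay * ema + (1 - decay) * w, elementwise per parameter.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, w)]

weights = [1.0, -2.0]
ema = list(weights)      # typically initialized from the current weights
weights = [0.0, 0.0]     # pretend an optimizer step moved the weights
ema = ema_update(ema, weights)
```

The EMA copy, not the raw weights, is what gets evaluated (and here, quantized and shipped).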
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1984}
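With stride 64 and context 1984, each evaluation window scores only the tokens not covered by the previous window, so every token after the first window is scored with near-maximal left context. A hedged sketch of the window layout (the helper is hypothetical):

```python
STRIDE, CONTEXT = 64, 1984  # from the record

def window_spans(n_tokens, stride=STRIDE, context=CONTEXT):
    # Returns (window_start, window_end, n_scored) triples covering all tokens;
    # only tokens beyond the previous window's end count toward the loss.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

# Small numbers for readability: stride 2, context 4 over 10 tokens.
spans = window_spans(10, stride=2, context=4)
```

Every token is scored exactly once, at the cost of re-running overlapping context for each stride step.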
Test-Time Training
full TTT
parameters: {"phases":3,"learning_rate":0.015,"momentum":0.9,"cosine_decay":true,"score_before_update":true}
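The parameters above describe a loop in which each chunk is scored with the current weights before the model adapts on it, repeated over 3 global phases, with the TTT learning rate following a cosine decay. A toy sketch under those assumptions (the scalar "model" is purely illustrative):

```python
import math

PHASES, BASE_LR = 3, 0.015  # from the record

def multi_phase_ttt(chunks, score, update, phases=PHASES, base_lr=BASE_LR):
    losses, step, total = [], 0, phases * len(chunks)
    for _ in range(phases):
        for chunk in chunks:
            losses.append(score(chunk))  # score BEFORE the update
            lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total))
            update(chunk, lr)            # then adapt on the same chunk
            step += 1
    return losses

# Toy usage: a single scalar weight nudged toward each chunk's value.
state = {"w": 0.0}
scores = multi_phase_ttt(
    chunks=[1.0, 2.0],
    score=lambda c: (state["w"] - c) ** 2,
    update=lambda c, lr: state.__setitem__("w", state["w"] + lr * (c - state["w"])),
)
```

Scoring before updating keeps the reported loss honest (no chunk is evaluated by a model that has already trained on it in the current phase), while later phases still benefit from adaptation on the full stream.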
LR Schedule
warmdown
parameters: {"min_lr_floor":0.1}
cosine decay
parameters: {"applied_to":"training and TTT chunks"}
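The warmdown floor above means the learning rate never decays below 10% of its peak, preserving meaningful updates late in training. A minimal sketch, assuming a linear warmdown shape (the exact decay curve is not stated in the record):

```python
def warmdown_lr(step, total_steps, peak_lr, floor_frac=0.1):
    # Linear decay from peak_lr toward zero, clipped at floor_frac * peak_lr.
    frac = max(0.0, 1.0 - step / total_steps)
    return peak_lr * max(frac, floor_frac)
```

Without the floor, the final steps contribute almost nothing; with it, the last ~10% of the schedule still moves the weights at a tenth of peak speed.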
Compression
brotli
level: 11
Novel Contributions
- Multi-Phase Global TTT that scores all windows globally, trains all chunks, and repeats across phases
- Polar Express Newton-Schulz coefficients replacing fixed Muon coefficients
- MIN_LR warmdown floor at 0.10 to preserve learning updates late in training
- Combined SP8192, GPTQ SDClip quantization, and depth recurrence into a sub-16MB submission