PR #1550

open

Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean)

by translatingthename

val_bpb

1.0587

Architecture

Transformer

Optimizer

AdamW

Artifact Size

~15.5 MB

Training Techniques

Test-Time Training

full TTT

parameters: {"learning_rate":0.0005,"epochs":6,"freeze_blocks":2,"batch_size":32,"sequence_length":2048,"compiled":true}

Quantization

GPTQ

bits: 6

scope: all

int8

bits: 8

scope: embeddings
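The card does not say how the int8 embedding quantization is carried out. As a rough illustration only, the sketch below assumes symmetric per-row round-to-nearest quantization; the function names are hypothetical, and GPTQ (used for the remaining weights) is a different, error-compensating procedure that is not shown here.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one embedding row (assumed scheme)."""
    # One scale per row; fall back to 1.0 for an all-zero row.
    scale = max(abs(v) for v in row) / 127.0 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Map int8 codes back to floats."""
    return [v * scale for v in q]


row = [0.02, -0.5, 0.13, 0.499]
codes, scale = quantize_row_int8(row)
restored = dequantize_row(codes, scale)
```

With this scheme the per-element reconstruction error is at most half a quantization step (scale / 2).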

Architecture

depth recurrence

Repeats layers 3-5 once to create 14 virtual layers from 11 physical layers.

parameters: {"physical_layers":11,"virtual_layers":14,"repeat_layers":[3,4,5]}
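A minimal sketch of how 11 physical layers can yield 14 virtual layers. It assumes 0-indexed layers and that the repeated block runs a second time immediately after its first pass; the card states neither, and `virtual_layer_order` is a hypothetical helper.

```python
def virtual_layer_order(physical_layers, repeat_layers):
    """Return physical-layer indices in execution order.

    Assumes `repeat_layers` is a contiguous, ascending block that is
    replayed once, right after its first pass.
    """
    order = []
    last = max(repeat_layers)
    for i in range(physical_layers):
        order.append(i)
        if i == last:
            order.extend(repeat_layers)  # second pass reuses the same weights
    return order


order = virtual_layer_order(11, [3, 4, 5])
print(len(order), order)
```

The repeated entries add depth without adding parameters, since each index refers to the same physical weights.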

weight tying

Tied input and output embeddings.

parameters: null

Partial RoPE

Uses rotary position embeddings on only part of the head dimensions.

parameters: {"dimensions":"16/64"}
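To illustrate what rotating 16 of 64 head dimensions means, here is a sketch for a single head vector. The frequency base of 10000, the "rotate-half" pairing, and the choice of the first 16 dimensions are all assumptions; the card gives only the 16/64 ratio.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector.

    The remaining entries pass through unchanged and carry no
    positional information.
    """
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]  # pair dimension i with i + half
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=3)
```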

XSA

Applies XSA in all 11 layers.

parameters: {"layers":11}

SmearGate

Uses SmearGate in the architecture.

parameters: null

LeakyReLU

Uses LeakyReLU squared activation in the MLP.

parameters: {"slope":0.5}
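One plausible reading of "LeakyReLU squared" with slope 0.5 is to apply LeakyReLU and then square the result, as sketched below for a scalar. That the square is taken after the leaky step, and that the sign is discarded, are assumptions.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring (assumed ordering)."""
    y = x if x >= 0 else slope * x
    return y * y


values = [leaky_relu_squared(x) for x in (-2.0, 0.0, 3.0)]
```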

Regularization

LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
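The scale formula can be tabulated directly. The sketch assumes `layer` is a 0-based index over the 11 physical layers; whether the run indexes physical or virtual layers, and exactly where the factor is applied, is not stated in the card.

```python
import math


def ln_scale(layer):
    """Per-layer scale 1/sqrt(layer+1), with `layer` assumed 0-based."""
    return 1.0 / math.sqrt(layer + 1)


scales = [ln_scale(i) for i in range(11)]
```

Under this indexing the first layer is unscaled and deeper layers are progressively damped.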

logit softcap

parameters: {"value":30}
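A common form of logit softcapping is cap * tanh(logit / cap), which bounds logits to [-cap, cap] while staying near the identity for small values. The sketch assumes that form; the card gives only the cap value of 30.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] (assumed tanh form)."""
    return cap * math.tanh(logit / cap)


capped = [softcap(x) for x in (-1000.0, 0.0, 1.0, 1000.0)]
```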

Weight Averaging

EMA

parameters: {"decay":0.9965}
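For reference, the standard EMA update with decay 0.9965 is sketched below on plain lists. How often the average is updated, and which parameters it covers, are not given in the card.

```python
def ema_update(avg, weights, decay=0.9965):
    """One EMA step: avg <- decay * avg + (1 - decay) * weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


avg = ema_update([0.0, 1.0], [1.0, 1.0])
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 updates.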

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"variant":"MuonEq-R","newton_schulz_steps":4}

AdamW

weight_decay: 0.095

momentum: null

other_params: {"scope":"embeddings","learning_rate":0.03}

AdamW

weight_decay: 0.02

momentum: null

other_params: {"scope":"scalars","learning_rate":0.02}

LR Schedule

warmdown

parameters: {"final_fraction":0.72,"target_lr":0}

cosine decay

parameters: {"final_multiplier":0.1}
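The card lists two schedules without saying which phase each applies to. The sketch below shows one interpretation of each, expressed as a multiplier on the base learning rate: a linear warmdown to 0 over the final 72% of steps, and a cosine decay ending at 0.1x. The linear shape and the function names are assumptions.

```python
import math


def warmdown_multiplier(step, total_steps, final_fraction=0.72):
    """Hold at 1.0, then decay linearly to 0 over the final fraction."""
    start = (1.0 - final_fraction) * total_steps
    if step <= start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - start))


def cosine_multiplier(step, total_steps, final_multiplier=0.1):
    """Cosine decay from 1.0 down to `final_multiplier`."""
    progress = min(1.0, step / total_steps)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_multiplier + (1.0 - final_multiplier) * cos
```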

Compression

Brotli

level: 11

Sequence Length

sequence_length

train_length: 2048

eval_length: 2048

Novel Contributions

Non-record pre-quant AdamW TTT that violates Condition 3 by training on validation tokens before scoring them
TTT compiled with torch.compile for a roughly 2x speedup
Artifact budget engineering for SP8192, including VE dimension selection to avoid pruning
Depth recurrence combined with parallel residuals and XSA in a compact 11-layer Transformer
Empirical comparison of illegal pre-quant TTT against the legal score-first TTT boundary
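The difference between score-first TTT and training on validation tokens before scoring them can be shown with a toy model. The sketch below uses a Laplace-smoothed unigram byte model as a stand-in for the network; it is not the PR's TTT code, and all names are hypothetical. It only illustrates why adapting before scoring yields an optimistic number.

```python
import math
from collections import Counter


def bits(counts, total, token, vocab=256):
    """Cost in bits of one token under a Laplace-smoothed unigram model."""
    return -math.log2((counts[token] + 1) / (total + vocab))


def score_first(chunks):
    """Score each chunk with the current model, then adapt on it."""
    counts, total, cost = Counter(), 0, 0.0
    for chunk in chunks:
        cost += sum(bits(counts, total, t) for t in chunk)
        counts.update(chunk)  # adaptation happens only after scoring
        total += len(chunk)
    return cost


def train_then_score(chunks):
    """Adapt on every token first, then score the same tokens."""
    counts = Counter(t for chunk in chunks for t in chunk)
    total = sum(counts.values())
    return sum(bits(counts, total, t) for chunk in chunks for t in chunk)


chunks = [b"abab", b"abab", b"abab"]
legal = score_first(chunks)
leaky = train_then_score(chunks)
```

In this toy setting `leaky` is lower than `legal` because every scored token has already influenced the model, which is the kind of leakage that Condition 3 rules out.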