PR #685

closed

Record: Chained TTT — Cosine Recovery + Multi-Pass Scoring (3-seed mean val_bpb=1.0366)

by andrewbaggio1

val_bpb

1.0366

Architecture

Transformer

Optimizer

AdamW

Artifact Size

15.62 MB

Training Techniques

Quantization

int6

bits: 6

scope: all
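
The listing gives only the bit width (6) and the scope (all weights), not the quantization scheme. As a rough illustration, the sketch below assumes symmetric per-row quantization to the range [-31, 31] with one float scale per row; the function names and the sample row are hypothetical and not taken from the PR.

```python
def quantize_int6(row):
    """Symmetric quantization of one row of floats to 6-bit signed codes.

    Assumption: one scale per row, codes clipped to [-31, 31].
    The PR's actual scheme is not described in this listing.
    """
    qmax = 31  # 6-bit signed integers span [-32, 31]; use the symmetric part
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map 6-bit codes back to approximate float values."""
    return [c * scale for c in codes]


# Toy row of weights (illustrative values only).
row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int6(row)
restored = dequantize_int6(codes, scale)
```

The gap between `row` and `restored` is the rounding error that the cosine recovery phase listed under Test-Time Training is described as repairing.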

Architecture

MLP3x

Expands the MLP hidden width to 3x.

parameters: null

GQA

Uses grouped-query attention with 4 KV heads.

parameters: {"kv_heads":4}
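
In grouped-query attention, several query heads share one key/value head. The sketch below shows that mapping. Only `kv_heads = 4` comes from the listing; the query-head count of 8 is a hypothetical value chosen for illustration.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


NUM_KV_HEADS = 4  # from the listing
NUM_Q_HEADS = 8   # hypothetical; the listing does not state the query-head count

mapping = [
    kv_head_for_query_head(h, NUM_Q_HEADS, NUM_KV_HEADS)
    for h in range(NUM_Q_HEADS)
]
```

With these numbers, each KV head serves two query heads, so the key and value projections are half the size they would be under standard multi-head attention.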

LeakyReLU

Uses LeakyReLU activation with slope 0.5.

parameters: {"negative_slope":0.5}
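
For reference, a scalar sketch of the activation with the listed slope: positive inputs pass through unchanged and negative inputs are scaled by 0.5. The function name is illustrative.

```python
def leaky_relu(x, negative_slope=0.5):
    """LeakyReLU on a single float: identity for x >= 0, scaled otherwise."""
    return x if x >= 0 else negative_slope * x


positive = leaky_relu(2.0)   # unchanged
negative = leaky_relu(-2.0)  # scaled by the negative slope
```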

BigramHash

Includes a BigramHash component in the model stack.

parameters: {"size":2048}

SmearGate

Includes a SmearGate component in the model stack.

parameters: null

XSA4

Includes an XSA4 component in the model stack.

parameters: null

Partial RoPE

Uses partial rotary positional embeddings.

parameters: null

Regularization

LN Scale

parameters: null

Weight Averaging

EMA

parameters: null

SWA

parameters: null

Initialization

OrthoInit

Orthogonal initialization.

Compression

zstd

level: 22

Test-Time Training

full TTT

parameters: {"phases":2,"phase_1":"cosine recovery","phase_2":"multi-pass score-first scoring","passes":3}

LR Schedule

cosine decay

parameters: {"epochs":20}
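
A cosine decay over 20 epochs can be sketched as follows. Only the epoch count comes from the listing; `base_lr` and `min_lr` are hypothetical, and whether the PR steps the schedule per epoch or per optimizer step is not stated.

```python
import math


def cosine_lr(epoch, total_epochs=20, base_lr=1e-3, min_lr=0.0):
    """Cosine-decayed learning rate at a given epoch.

    base_lr and min_lr are illustrative defaults, not values from the PR.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at epochs 0 through 20 inclusive.
schedule = [cosine_lr(e) for e in range(21)]
```

The rate starts at `base_lr`, reaches the midpoint between `base_lr` and `min_lr` at epoch 10, and ends at `min_lr` at epoch 20.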

Optimizer

AdamW

weight_decay: null

momentum: null

other_params: {"per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5}}
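
One plausible reading of `per_layer_lr_groups` is a set of multipliers applied to a base learning rate, matched against parameter names. The sketch below assumes that reading; the parameter names and `base_lr` are hypothetical, and the PR may define the groups differently.

```python
# Multipliers from the listing; interpreting them as scale factors on a
# base learning rate is an assumption.
LR_MULTIPLIERS = {"mlp.proj": 3.0, "mlp.fc": 0.5}


def lr_for_param(name, base_lr, multipliers=LR_MULTIPLIERS):
    """Return the learning rate for a parameter, chosen by substring match."""
    for pattern, mult in multipliers.items():
        if pattern in name:
            return base_lr * mult
    return base_lr  # parameters outside the listed groups keep the base rate


# Hypothetical parameter names for illustration.
names = [
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc.weight",
    "blocks.0.mlp.proj.weight",
]
lrs = {n: lr_for_param(n, base_lr=1e-3) for n in names}
```

In a framework such as PyTorch, the same grouping would typically be passed to the optimizer as a list of parameter groups, each with its own `lr`.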

Novel Contributions

Two-phase chained TTT combining cosine recovery with multi-pass scoring
Cosine recovery phase to recover from int6 quantization damage
Multi-pass score-first scoring across three shifted adaptation trajectories
Using min(NLL) across passes to reduce early-token penalty
Synergistic combination of recovery and ensembling-style test-time adaptation
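
The min(NLL) step listed above can be sketched as an elementwise minimum over per-token losses from the three passes, followed by a conversion to bits per byte. This is a minimal sketch, assuming each pass yields one NLL (in nats) per token over the same token sequence; the function names, the toy loss values, and the byte count are hypothetical and are not results from the PR.

```python
import math


def min_nll_across_passes(pass_nlls):
    """Elementwise minimum of per-token NLLs over several scoring passes."""
    if len({len(p) for p in pass_nlls}) != 1:
        raise ValueError("all passes must score the same number of tokens")
    return [min(token_vals) for token_vals in zip(*pass_nlls)]


def bits_per_byte(token_nlls, num_bytes):
    """Convert summed NLL in nats to bits per byte of the underlying text."""
    return sum(token_nlls) / (math.log(2) * num_bytes)


# Toy per-token NLLs from three passes (made-up numbers).
passes = [
    [3.0, 2.0, 1.0, 1.0],
    [2.5, 2.5, 1.5, 0.5],
    [2.0, 3.0, 1.0, 1.5],
]
best = min_nll_across_passes(passes)
bpb = bits_per_byte(best, num_bytes=16)  # 16 bytes is a hypothetical length
```

Because each pass starts adapting at a different offset, a token scored early (with little adaptation) in one pass may be scored later in another, which is how the minimum reduces the early-token penalty. The combined total is never higher than the best single pass.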