PR #1639
Adaptive Test-Time Training (TTT) with continuous LR-scaling.
by kunwar-vikrant
val_bpb
1.0832
Architecture
Transformer
Optimizer
SGD
Artifact Size
15.28 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: matrices
int8
bits: 8
scope: embeddings
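The int8 embedding pass above can be illustrated with a minimal round-trip sketch. This is symmetric per-tensor quantization only; the submission's GPTQ 6-bit pass over matrices is a more involved, error-compensating scheme, and the function names here are illustrative, not from the submission.

```python
# Hedged sketch of symmetric int8 quantization (w ≈ scale * q),
# as might be applied to the embedding table. Per-tensor scaling
# is an assumption; real pipelines often scale per-row or per-channel.

def quantize_int8(weights):
    """Quantize a list of floats to int8 with one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [qi * scale for qi in q]

q, s = quantize_int8([0.5, -1.27, 0.003])
w_hat = dequantize(q, s)
```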
Architecture
depth recurrence
3-layer depth recurrence, with looping activated partway through training
parameters: {"loops":"3-5","activated_at":"35% training"}
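The recurrence scheme above (a shared block looped 3-5 times, enabled from 35% of training) can be sketched as follows. The function and parameter names are illustrative assumptions, not the submission's code.

```python
# Hedged sketch of depth recurrence: a shared block is applied
# `loops` times, but only after a fraction of training has elapsed;
# before that the block runs once.

def forward_recurrent(x, block, step, total_steps,
                      activate_frac=0.35, loops=3):
    n = loops if step / total_steps >= activate_frac else 1
    for _ in range(n):
        x = block(x)
    return x

# Toy block (x -> x + 1) so the loop count is visible in the output.
inc = lambda x: x + 1
```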
Partial RoPE
Partial rotary positional embeddings
parameters: {"layers":"16/64"}
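The per-pair rotation at the heart of RoPE can be sketched as below; per the listing, only a subset of layers (16 of 64) would apply it, e.g. gated by `layer_idx < 16`. That gating rule and the names here are assumptions.

```python
import math

# Hedged sketch of the standard RoPE rotation for one (even, odd)
# feature pair at position `pos`. "Partial" here means only some
# layers apply it, per the listing's 16/64 parameter.

def rope_pair(x0, x1, pos, theta):
    """Rotate the pair (x0, x1) by angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

The rotation is norm-preserving, which is why it can be applied to queries and keys without rescaling attention logits.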
XSA
XSA used in all layers
parameters: null
LeakyReLU
Leaky ReLU activation in the MLP
parameters: {"slope":0.5}
U-Net skip connections
Parallel residual / skip-style connections in later layers
parameters: {"layers":"L7+"}
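The skip-connection entry above can be sketched as saving early-layer activations and adding them back into later layers (L7+ per the listing). The LIFO pairing of early and late layers is an assumption borrowed from U-Net-style transformer stacks.

```python
# Hedged sketch of U-Net-style skips in a layer stack: outputs of
# layers before `skip_start` are pushed onto a stack, and layers from
# `skip_start` onward pop one and add it to their input.

def forward_with_skips(x, layers, skip_start=7):
    saved = []
    for i, layer in enumerate(layers):
        if i < skip_start:
            x = layer(x)
            saved.append(x)      # remember early activation
        else:
            if saved:
                x = x + saved.pop()  # add back a saved activation
            x = layer(x)
    return x
```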
Weight Averaging
EMA
parameters: {"decay":0.9965}
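The EMA weight averaging listed above (decay 0.9965) follows the standard update; a minimal sketch:

```python
# Hedged sketch of EMA weight averaging: the averaged weights track
# the live training weights with exponential smoothing. Treating the
# weights as flat lists is a simplification.

def ema_update(avg, current, decay=0.9965):
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]

avg = [0.0, 0.0]
for _ in range(3):
    avg = ema_update(avg, [1.0, 2.0])
```

After k updates against a constant target c starting from zero, the average equals c * (1 - decay^k).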
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R"}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
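A common reading of a warmdown schedule with `warmdown_frac` 0.72 is: hold the LR flat for the first 28% of steps, then decay linearly to zero over the final 72%. The exact decay shape in the submission may differ; this is a hedged sketch.

```python
# Hedged sketch of a warmdown LR schedule: constant LR, then linear
# decay to zero over the last `warmdown_frac` of training.

def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - start)
```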
Regularization
layerwise LN scale
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"base_epochs":3,"adaptive_epochs":true,"max_epochs":5,"min_epochs":1,"ema_alpha":0.3}
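The adaptive-epochs piece of the parameters above (base_epochs 3, clamped to [1, 5], ema_alpha 0.3) can be sketched as comparing each chunk's NLL to a running EMA of chunk NLLs. The scaling rule (epochs proportional to the NLL ratio) is an illustrative assumption.

```python
# Hedged sketch of difficulty-aware epoch allocation for TTT:
# chunks with above-average NLL (relative to a running EMA) get more
# test-time training epochs, clamped to [min_epochs, max_epochs].

def allocate_epochs(chunk_nll, ema_nll, base_epochs=3,
                    min_epochs=1, max_epochs=5):
    ratio = chunk_nll / ema_nll      # >1 means harder than average
    epochs = round(base_epochs * ratio)
    return max(min_epochs, min(max_epochs, epochs))

def update_ema(ema_nll, chunk_nll, alpha=0.3):
    return alpha * chunk_nll + (1.0 - alpha) * ema_nll
```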
Evaluation
sliding window eval
parameters: {"window":32768}
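Sliding-window evaluation with a 32768-token window can be sketched as scoring a long sequence in fixed windows. Real implementations usually overlap windows and score only the tail of each so every token sees long context; this non-overlapping version is the simplest variant and an assumption about the setup here.

```python
# Hedged sketch of windowed evaluation: split a long token sequence
# into fixed-size windows (32768 in the listing) to be scored
# independently.

def windows(seq, window=32768):
    return [seq[i:i + window] for i in range(0, len(seq), window)]
```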
Compression
brotli
level: 11
Novel Contributions
- Adaptive test-time training with per-chunk difficulty-aware epoch allocation
- Use of chunk NLL relative to a running EMA mean to estimate chunk difficulty
- Discovery that chunk-level NLL variance is too narrow for integer epoch allocation to matter
- Proposal of continuous LR-scaling to avoid rounding dead-zones in adaptive TTT
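The last bullet's proposal can be sketched as scaling the TTT learning rate continuously with relative difficulty, so that small NLL deviations still change the effective update size instead of rounding away to the same integer epoch count. The clamping range and names below are illustrative assumptions.

```python
# Hedged sketch of continuous LR-scaling for adaptive TTT: the
# per-chunk learning rate is the base LR (0.005 in the listing)
# scaled by the chunk's NLL relative to the running EMA, clamped to
# an assumed [0.5x, 2x] range.

def scaled_lr(chunk_nll, ema_nll, base_lr=0.005, lo=0.5, hi=2.0):
    ratio = chunk_nll / ema_nll
    return base_lr * max(lo, min(hi, ratio))
```

Unlike integer epoch allocation, this has no rounding dead-zone: a chunk 5% harder than average gets a 5% larger LR rather than the same epoch count.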