PR #685

closed

Record: Chained TTT — Cosine Recovery + Multi-Pass Scoring (3-seed mean val_bpb=1.0366)

by andrewbaggio1
val_bpb: 1.0366
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
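
The card gives only the bit width and scope, not the quantization scheme. As a minimal sketch, assuming symmetric per-row quantization with one float scale per row (an assumption, not confirmed by the PR), int6 rounding could look like this. The names `quantize_int6` and `dequantize_int6` are hypothetical.

```python
def quantize_int6(row):
    """Symmetric per-row quantization to signed 6-bit codes in [-31, 31].

    Assumption: one scale per row; the PR does not state its actual scheme.
    """
    qmax = 31  # 2**(6 - 1) - 1; code -32 is left unused so the grid stays symmetric
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.62, -0.31, 0.05, -0.62]  # made-up values for illustration
codes, scale = quantize_int6(weights)
restored = dequantize_int6(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Under this scheme the round-trip error per weight is at most half a quantization step (`scale / 2`), which is the kind of damage the recovery phase listed under Test-Time Training is meant to offset.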
Architecture
MLP3x
Expands the MLP width to 3x.
parameters: null
GQA
Uses grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4}
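
In grouped-query attention, several query heads share one key/value head. The sketch below shows the head-to-group mapping for `kv_heads = 4`. The number of query heads is not given on this card, so `NUM_Q_HEADS = 8` is a hypothetical value chosen only for illustration.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads=4):
    """Return the index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per shared KV head
    return q_head // group_size


NUM_Q_HEADS = 8  # hypothetical; the card only states kv_heads = 4
mapping = [kv_head_for_query_head(q, NUM_Q_HEADS) for q in range(NUM_Q_HEADS)]
```

With these assumed sizes, each KV head serves two consecutive query heads, which halves the key/value projection parameters relative to full multi-head attention.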
LeakyReLU
Uses LeakyReLU activation with slope 0.5.
parameters: {"negative_slope":0.5}
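
For reference, a scalar sketch of LeakyReLU with the listed slope. Negative inputs are scaled by 0.5 instead of being zeroed, so half of the signal survives on the negative side.

```python
def leaky_relu(x, negative_slope=0.5):
    """LeakyReLU: identity for x >= 0, negative_slope * x otherwise."""
    return x if x >= 0 else negative_slope * x


values = [leaky_relu(x) for x in (-2.0, -0.5, 0.0, 1.5)]
```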
BigramHash
Includes BigramHash component in the model stack.
parameters: {"size":2048}
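
The card does not describe how BigramHash maps token pairs to its 2048 entries. A minimal sketch, assuming each (previous token, current token) pair is hashed into one of 2048 buckets that would index an embedding table: the mixing function and the placeholder bucket for the first position are both hypothetical.

```python
BIGRAM_BUCKETS = 2048  # matches "size": 2048 on the card


def bigram_bucket(prev_token, cur_token, num_buckets=BIGRAM_BUCKETS):
    """Hash a token pair into a bucket index (hypothetical mixing function)."""
    return (prev_token * 1_000_003 + cur_token) % num_buckets


def bigram_buckets(tokens, num_buckets=BIGRAM_BUCKETS):
    """Bucket index per position; position 0 has no predecessor.

    Assumption: bucket 0 is used as a placeholder for the first token.
    """
    if not tokens:
        return []
    out = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        out.append(bigram_bucket(prev, cur, num_buckets))
    return out


example_ids = [17, 942, 17, 942, 5]  # made-up token ids
example_buckets = bigram_buckets(example_ids)
```

Identical bigrams always land in the same bucket, while distinct bigrams may collide; a small table trades collisions for a smaller artifact.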
SmearGate
Includes SmearGate component in the model stack.
parameters: null
XSA4
Includes XSA4 component in the model stack.
parameters: null
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
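
Partial RoPE rotates only the leading dimensions of each head vector and passes the rest through unchanged. The sketch below assumes interleaved pairing of dimensions and a frequency base of 10000; the rotated fraction is not stated on the card, so `rotary_dim` is a hypothetical parameter.

```python
import math


def partial_rope(vec, position, rotary_dim, base=10000.0):
    """Apply rotary embeddings to the first rotary_dim entries of vec only."""
    if rotary_dim % 2 != 0 or rotary_dim > len(vec):
        raise ValueError("rotary_dim must be even and at most len(vec)")
    out = list(vec)
    for i in range(rotary_dim // 2):
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s      # rotate the (x, y) pair by theta
        out[2 * i + 1] = x * s + y * c
    return out


head_vec = [1.0, 0.0, 0.5, -0.5, 2.0, 3.0, -1.0, 4.0]  # made-up head vector
rotated = partial_rope(head_vec, position=3, rotary_dim=4)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the untouched trailing dimensions carry position-independent content.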
Regularization
LN Scale
parameters: null
Weight Averaging
EMA
parameters: null
SWA
parameters: null
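
The card lists both EMA and SWA without parameters. As a sketch of the two update rules on flat parameter lists: the decay value and the choice of which checkpoints to average are assumptions, since neither is given here.

```python
def ema_update(avg, params, decay=0.999):
    """Exponential moving average step: avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]


def swa_average(checkpoints):
    """Stochastic weight averaging: uniform mean over saved checkpoints."""
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]


ema_state = ema_update([0.0, 0.0], [1.0, 2.0], decay=0.9)  # hypothetical decay
swa_state = swa_average([[1.0, 2.0], [3.0, 4.0]])
```

EMA weights recent steps more heavily, while SWA treats every collected checkpoint equally.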
Initialization
OrthoInit
Orthogonal initialization.
Compression
zstd
level: 22
Test-Time Training
full TTT
parameters: {"phases":2,"phase_1":"cosine recovery","phase_2":"multi-pass score-first scoring","passes":3}
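
"Score-first" is read here as: each chunk is scored with the current model state, and the model adapts on that chunk only afterwards, so no token is scored by a model that has already trained on it. The sketch below illustrates only that ordering. It swaps in a toy add-one-smoothed unigram counter for the Transformer and its gradient updates, and `score_first_pass` is a hypothetical name.

```python
import math
from collections import Counter


def score_first_pass(tokens, vocab_size, chunk_size):
    """Score each chunk before adapting on it (toy unigram stand-in model)."""
    counts = Counter()
    seen = 0
    nlls = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score the chunk with the state built from earlier chunks only.
        for t in chunk:
            p = (counts[t] + 1) / (seen + vocab_size)
            nlls.append(-math.log(p))
        # 2) Only now adapt on the chunk that was just scored.
        counts.update(chunk)
        seen += len(chunk)
    return nlls


stream = [1, 1, 2, 1, 1, 2, 1, 1]  # made-up token stream
nlls = score_first_pass(stream, vocab_size=4, chunk_size=4)
```

In this toy run the first chunk is scored by an unadapted model, so its tokens pay the highest cost; later chunks benefit from the adaptation. That front-loaded cost is the kind of early-token penalty the multi-pass phase targets.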
LR Schedule
cosine decay
parameters: {"epochs":20}
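
A sketch of a cosine decay schedule over the 20 epochs listed above. The base learning rate and the floor are not given on the card, so `BASE_LR` and `min_lr` are hypothetical values.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


EPOCHS = 20      # from the card
BASE_LR = 1e-3   # hypothetical
schedule = [cosine_lr(epoch, EPOCHS, BASE_LR) for epoch in range(EPOCHS + 1)]
```

The rate starts at `BASE_LR`, passes through half of it at epoch 10, and reaches the floor at epoch 20.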
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5}}
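
One plausible reading of `per_layer_lr_groups` is a set of learning-rate multipliers keyed by parameter-name substring: 3x for `mlp.proj` and 0.5x for `mlp.fc`. That interpretation, the base rate, and the parameter names below are assumptions. The returned list of dicts has the same shape as the parameter groups accepted by PyTorch optimizers, with names standing in for tensors.

```python
LR_MULTIPLIERS = {"mlp.proj": 3, "mlp.fc": 0.5}  # values from the card


def build_param_groups(param_names, base_lr, multipliers=LR_MULTIPLIERS):
    """Group parameter names by LR multiplier (assumed substring matching)."""
    groups = {}
    for name in param_names:
        mult = 1.0  # default for parameters matching no key
        for key, value in multipliers.items():
            if key in name:
                mult = value
                break
        groups.setdefault(mult, []).append(name)
    return [{"params": names, "lr": base_lr * mult} for mult, names in groups.items()]


names = [  # hypothetical parameter names
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc.weight",
    "blocks.0.mlp.proj.weight",
]
param_groups = build_param_groups(names, base_lr=1e-3)  # hypothetical base_lr
lr_by_name = {n: g["lr"] for g in param_groups for n in g["params"]}
```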

Novel Contributions

  • Two-phase chained TTT combining cosine recovery with multi-pass scoring
  • Cosine recovery phase to recover from int6 quantization damage
  • Multi-pass score-first scoring across three shifted adaptation trajectories
  • Using min(NLL) across passes to reduce early-token penalty
  • Synergistic combination of recovery and ensembling-style test-time adaptation
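
The min(NLL) idea in the list above can be sketched as follows, assuming the minimum is taken per token across the three passes and the result is converted to bits per byte. The NLL values are made up so that each pass pays its high cost at a different position, mimicking shifted adaptation trajectories; the function names and the byte count are hypothetical.

```python
import math


def min_nll_per_token(pass_nlls):
    """Per-token minimum NLL (in nats) across passes over the same tokens."""
    if len({len(p) for p in pass_nlls}) != 1:
        raise ValueError("all passes must score the same number of tokens")
    return [min(vals) for vals in zip(*pass_nlls)]


def bits_per_byte(token_nlls, num_bytes):
    """Convert summed NLL in nats to bits per byte of the underlying text."""
    return sum(token_nlls) / (math.log(2) * num_bytes)


pass_nlls = [  # made-up per-token NLLs for three passes
    [3.0, 1.0, 0.8, 0.7],
    [1.2, 2.9, 0.9, 0.6],
    [1.1, 1.0, 2.8, 0.7],
]
best = min_nll_per_token(pass_nlls)
bpb_best = bits_per_byte(best, num_bytes=16)          # hypothetical byte count
bpb_single = bits_per_byte(pass_nlls[0], num_bytes=16)
```

Since the minimum can never exceed any single pass at any position, the combined score is at most the best single-pass score, and strictly lower whenever different passes win at different tokens.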