PR #872
Status: closed
E2E TTT: End-to-End Test-Time Training with Meta-Learning (1.0467 BPB)
by gowtham0992
val_bpb: 1.0467
Architecture: Transformer
Optimizer: —
Artifact Size: 13.12 MB
Training Techniques
Test-Time Training
score-first TTT
parameters: {"inner_loop":"gradient descent on MLP weights","meta_learning":true}
Other
other
MAML-style meta-learning: the outer loop backpropagates through the inner gradient steps (create_graph=True) so that the initial weights are optimized for test-time adaptation
parameters: {"final_training_fraction":0.2}
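To make the second-order mechanism concrete, here is a minimal sketch of MAML-style meta-learning on a toy scalar model with analytic gradients. The quadratic losses, targets `a`/`b`, and step sizes are illustrative, not from the PR; the hand-derived `meta_grad` plays the role that `create_graph=True` plays in an autograd framework (differentiating the outer loss through the inner update).

```python
# Inner task loss:  L_in(w)  = (w - a)^2   (adaptation target a, illustrative)
# Outer/meta loss:  L_out(w) = (w - b)^2   (evaluation target b, illustrative)
# One inner gradient step:  w' = w - alpha * 2*(w - a)
# Second-order meta-gradient (the analogue of create_graph=True):
#   dL_out/dw0 = 2*(w' - b) * dw'/dw0 = 2*(w' - b) * (1 - 2*alpha)

def inner_step(w0, a, alpha):
    """Single inner-loop gradient step on the adaptation loss."""
    return w0 - alpha * 2.0 * (w0 - a)

def meta_grad(w0, a, b, alpha):
    """Gradient of the outer loss w.r.t. the INITIAL weights w0."""
    w1 = inner_step(w0, a, alpha)
    return 2.0 * (w1 - b) * (1.0 - 2.0 * alpha)

def meta_train(w0, a, b, alpha=0.1, meta_lr=0.1, steps=500):
    """Outer loop: optimize w0 so that one inner step lands near b."""
    for _ in range(steps):
        w0 -= meta_lr * meta_grad(w0, a, b, alpha)
    return w0

w0 = meta_train(0.0, 1.0, 3.0)
adapted = inner_step(w0, 1.0, 0.1)   # converges toward the outer target b=3
```

The point of the second-order term `(1 - 2*alpha)` is exactly what `create_graph=True` buys in PyTorch: the outer update accounts for how the inner step moves the weights, rather than treating the adapted weights as a constant.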
Evaluation
sliding window eval
parameters: {"cache":["5-gram backoff","kNN-LM"]}
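A minimal sketch of confidence-based mixing of a cache distribution (e.g. the 5-gram backoff) with the model distribution. The linear lambda schedule and `max_lam` value are illustrative assumptions; "pre-committed" here means lambda is computed from the distributions alone, before the target token is revealed.

```python
def mix(p_model, p_cache, max_lam=0.3):
    """Interpolate model and cache next-token distributions.

    max_lam is an illustrative cap on the cache's mixing weight.
    """
    # Confidence = cache's top probability; pre-committed, i.e. computed
    # before seeing the ground-truth next token.
    confidence = max(p_cache.values(), default=0.0)
    lam = max_lam * confidence
    vocab = set(p_model) | set(p_cache)
    return {t: (1 - lam) * p_model.get(t, 0.0) + lam * p_cache.get(t, 0.0)
            for t in vocab}

p_model = {"a": 0.6, "b": 0.4}
p_cache = {"a": 1.0}            # a confident exact 5-gram match
mixed = mix(p_model, p_cache)   # "a" gains mass, distribution still sums to 1
```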
Architecture
LeakyReLU
MLP activation uses LeakyReLU squared
parameters: {"layers":3}
XSA
XSA applied across all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding component
parameters: {"size":2048}
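A sketch of a bigram hash embedding with the listed table size of 2048: each adjacent (prev_token, token) pair is hashed to a row of a fixed embedding table. The hash function and embedding dimension are illustrative assumptions.

```python
import hashlib

TABLE_SIZE = 2048   # from the listed parameter
DIM = 8             # illustrative embedding width

def bigram_index(prev_tok: int, tok: int) -> int:
    """Deterministic hash of a token pair into [0, TABLE_SIZE)."""
    key = f"{prev_tok},{tok}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

# learned table of TABLE_SIZE rows (zero-initialized here)
table = [[0.0] * DIM for _ in range(TABLE_SIZE)]

def bigram_embed(tokens):
    """One embedding row per adjacent token pair in the sequence."""
    return [table[bigram_index(a, b)] for a, b in zip(tokens, tokens[1:])]
```

Hashing trades occasional collisions for a fixed-size table, which keeps the component's contribution to the 13.12 MB artifact small.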
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16,"total_dimensions":64}
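A sketch of partial RoPE with the listed dimension split: rotary position encoding is applied to the first 16 of 64 head dimensions and the remaining dimensions pass through unchanged. The frequency base of 10000 follows the common RoPE convention; the rest is illustrative.

```python
import math

ROT, TOTAL = 16, 64   # rotated dims / total dims, from the listed parameters

def partial_rope(x, pos):
    """Rotate consecutive pairs within the first ROT dims by a
    position-dependent angle; leave dims ROT..TOTAL untouched."""
    out = list(x)
    for i in range(0, ROT, 2):
        theta = pos / (10000.0 ** (i / ROT))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = c * x[i] - s * x[i + 1]
        out[i + 1] = s * x[i] + c * x[i + 1]
    return out
```

Each rotation is norm-preserving, so position information is injected without rescaling activations, and the unrotated dimensions remain position-agnostic.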
VE128
VE128 used in later layers
parameters: {"layers":[9,10]}
Value Residual
Value Residual Learning
parameters: null
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
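A sketch of how EMA and a "tight" SWA (frequent snapshots) could run side by side, using the listed decay 0.997 and interval 50. The stand-in optimizer update and the choice to keep the two averages separate are illustrative assumptions.

```python
def train_with_averaging(steps, decay=0.997, swa_interval=50):
    """Maintain an EMA of the weights every step plus an SWA average
    of snapshots taken every swa_interval steps."""
    w = 0.0
    ema = w
    swa_sum, swa_n = 0.0, 0
    for t in range(1, steps + 1):
        w = float(t)                     # stand-in for an optimizer update
        ema = decay * ema + (1 - decay) * w
        if t % swa_interval == 0:        # "tight" SWA: frequent snapshots
            swa_sum += w
            swa_n += 1
    swa = swa_sum / max(swa_n, 1)
    return ema, swa
```

With an interval of 50 steps the SWA average tracks the trajectory much more closely than the classic once-per-epoch schedule, which is presumably what "tight" refers to.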
Quantization
GPTQ
bits: 6
scope: all
QAT
bits: null
scope: all
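For reference, a minimal sketch of symmetric 6-bit round-to-nearest quantization. GPTQ itself additionally reorders columns and applies second-order error compensation during calibration, all of which is omitted here; this only shows the bit-width arithmetic.

```python
def quantize(weights, bits=6):
    """Symmetric per-tensor round-to-nearest quantization (not full GPTQ)."""
    qmax = 2 ** (bits - 1) - 1            # 31 levels each side for 6-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    dequant = [v * scale for v in q]      # what the model actually computes with
    return dequant, scale
```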
Regularization
magnitude pruning
parameters: {"sparsity":0.03,"timing":"post-quant"}
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
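A sketch of post-quantization magnitude pruning at the listed 3% sparsity: the smallest-magnitude weights are zeroed after quantization. Tie-handling and the flat (non-layerwise) selection are illustrative assumptions.

```python
def magnitude_prune(weights, sparsity=0.03):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)     # number of weights to zero
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, zeroed = [], 0
    for w in weights:
        if abs(w) <= threshold and zeroed < k:
            pruned.append(0.0)
            zeroed += 1
        else:
            pruned.append(w)
    return pruned
```

Running this after quantization (the listed "post-quant" timing) means the zeros are applied to the already-quantized values, so pruning cannot be undone by the rounding step.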
Sequence Length
sequence_length
train_length: null
eval_length: 128
Novel Contributions
- End-to-end test-time training with MAML-style meta-learning and backpropagation through inner adaptation steps
- Hidden-state kNN-LM cache over final-layer hidden states for semantic repetition beyond exact n-grams
- Online 5-gram cache with adaptive, pre-committed confidence-based mixing
- GPTQ calibration performed within the training budget
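The hidden-state kNN-LM cache from the contributions above can be sketched as follows: store (hidden state, next token) pairs during evaluation, then form a next-token distribution from the nearest stored neighbours of the current hidden state. Euclidean distance, uniform neighbour weighting, and `k=3` are illustrative assumptions; the actual PR retrieves over final-layer hidden states.

```python
import math
from collections import Counter

class KNNCache:
    """Toy kNN-LM cache over hidden-state vectors."""

    def __init__(self):
        self.store = []                  # list of (hidden_vector, next_token)

    def add(self, hidden, next_token):
        self.store.append((hidden, next_token))

    def distribution(self, hidden, k=3):
        """Next-token distribution from the k nearest stored states."""
        def dist(h):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, hidden)))
        nearest = sorted(self.store, key=lambda e: dist(e[0]))[:k]
        counts = Counter(tok for _, tok in nearest)
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}
```

Because retrieval keys on hidden states rather than surface tokens, the cache can fire on semantically repeated contexts that an exact 5-gram match would miss.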