PR #872
Status: closed
E2E TTT: End-to-End Test-Time Training with Meta-Learning (1.0467 BPB)
by gowtham0992
val_bpb: 1.0467
Architecture: Transformer
Optimizer: —
Artifact Size: 13.12 MB
Training Techniques
Test-Time Training
score-first TTT
parameters: {"inner_loop":"gradient descent on MLP weights","meta_learning":true}
Other
other
MAML-style meta-learning: the outer loop backpropagates through the inner gradient steps (create_graph=True) so that the initial weights are optimized for test-time adaptation
parameters: {"final_training_fraction":0.2}
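To make the second-order mechanism concrete, here is a minimal sketch of MAML-style meta-learning on a toy scalar model with analytic gradients. The quadratic losses, targets `a`/`b`, and step sizes are illustrative, not from the PR; the hand-derived `meta_grad` plays the role that `create_graph=True` plays in an autograd framework (differentiating the outer loss through the inner update).

```python
# Inner task loss:  L_in(w)  = (w - a)^2   (adaptation target a, illustrative)
# Outer/meta loss:  L_out(w) = (w - b)^2   (evaluation target b, illustrative)
# One inner gradient step:  w' = w - alpha * 2*(w - a)
# Second-order meta-gradient (the analogue of create_graph=True):
#   dL_out/dw0 = 2*(w' - b) * dw'/dw0 = 2*(w' - b) * (1 - 2*alpha)

def inner_step(w0, a, alpha):
    """Single inner-loop gradient step on the adaptation loss."""
    return w0 - alpha * 2.0 * (w0 - a)

def meta_grad(w0, a, b, alpha):
    """Gradient of the outer loss w.r.t. the INITIAL weights w0."""
    w1 = inner_step(w0, a, alpha)
    return 2.0 * (w1 - b) * (1.0 - 2.0 * alpha)

def meta_train(w0, a, b, alpha=0.1, meta_lr=0.1, steps=500):
    """Outer loop: optimize w0 so that one inner step lands near b."""
    for _ in range(steps):
        w0 -= meta_lr * meta_grad(w0, a, b, alpha)
    return w0

w0 = meta_train(0.0, 1.0, 3.0)
adapted = inner_step(w0, 1.0, 0.1)   # converges toward the outer target b=3
```

The point of the second-order term `(1 - 2*alpha)` is exactly what `create_graph=True` buys in PyTorch: the outer update accounts for how the inner step moves the weights, rather than treating the adapted weights as a constant.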
Evaluation
sliding window eval
parameters: {"cache":["5-gram backoff","kNN-LM"]}
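A minimal sketch of confidence-based mixing of a cache distribution (e.g. the 5-gram backoff) with the model distribution. The linear lambda schedule and `max_lam` value are illustrative assumptions; "pre-committed" here means lambda is computed from the distributions alone, before the target token is revealed.

```python
def mix(p_model, p_cache, max_lam=0.3):
    """Interpolate model and cache next-token distributions.

    max_lam is an illustrative cap on the cache's mixing weight.
    """
    # Confidence = cache's top probability; pre-committed, i.e. computed
    # before seeing the ground-truth next token.
    confidence = max(p_cache.values(), default=0.0)
    lam = max_lam * confidence
    vocab = set(p_model) | set(p_cache)
    return {t: (1 - lam) * p_model.get(t, 0.0) + lam * p_cache.get(t, 0.0)
            for t in vocab}

p_model = {"a": 0.6, "b": 0.4}
p_cache = {"a": 1.0}            # a confident exact 5-gram match
mixed = mix(p_model, p_cache)   # "a" gains mass, distribution still sums to 1
```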
Architecture
LeakyReLU
MLP activation uses LeakyReLU squared
parameters: {"layers":3}
XSA
XSA applied across all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding component
parameters: {"size":2048}
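A sketch of a bigram hash embedding with the listed table size of 2048: each adjacent (prev_token, token) pair is hashed to a row of a fixed embedding table. The hash function and embedding dimension are illustrative assumptions.

```python
import hashlib

TABLE_SIZE = 2048   # from the listed parameter
DIM = 8             # illustrative embedding width

def bigram_index(prev_tok: int, tok: int) -> int:
    """Deterministic hash of a token pair into [0, TABLE_SIZE)."""
    key = f"{prev_tok},{tok}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

# learned table of TABLE_SIZE rows (zero-initialized here)
table = [[0.0] * DIM for _ in range(TABLE_SIZE)]

def bigram_embed(tokens):
    """One embedding row per adjacent token pair in the sequence."""
    return [table[bigram_index(a, b)] for a, b in zip(tokens, tokens[1:])]
```

Hashing trades occasional collisions for a fixed-size table, which keeps the component's contribution to the 13.12 MB artifact small.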
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16,"total_dimensions":64}
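A sketch of partial RoPE with the listed dimension split: rotary position encoding is applied to the first 16 of 64 head dimensions and the remaining dimensions pass through unchanged. The frequency base of 10000 follows the common RoPE convention; the rest is illustrative.

```python
import math

ROT, TOTAL = 16, 64   # rotated dims / total dims, from the listed parameters

def partial_rope(x, pos):
    """Rotate consecutive pairs within the first ROT dims by a
    position-dependent angle; leave dims ROT..TOTAL untouched."""
    out = list(x)
    for i in range(0, ROT, 2):
        theta = pos / (10000.0 ** (i / ROT))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = c * x[i] - s * x[i + 1]
        out[i + 1] = s * x[i] + c * x[i + 1]
    return out
```

Each rotation is norm-preserving, so position information is injected without rescaling activations, and the unrotated dimensions remain position-agnostic.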
VE128
VE128 used in later layers
parameters: {"layers":[9,10]}
Value Residual
Value Residual Learning
parameters: null
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
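A sketch of how EMA and a "tight" SWA (frequent snapshots) could run side by side, using the listed decay 0.997 and interval 50. The stand-in optimizer update and the choice to keep the two averages separate are illustrative assumptions.

```python
def train_with_averaging(steps, decay=0.997, swa_interval=50):
    """Maintain an EMA of the weights every step plus an SWA average
    of snapshots taken every swa_interval steps."""
    w = 0.0
    ema = w
    swa_sum, swa_n = 0.0, 0
    for t in range(1, steps + 1):
        w = float(t)                     # stand-in for an optimizer update
        ema = decay * ema + (1 - decay) * w
        if t % swa_interval == 0:        # "tight" SWA: frequent snapshots
            swa_sum += w
            swa_n += 1
    swa = swa_sum / max(swa_n, 1)
    return ema, swa
```

With an interval of 50 steps the SWA average tracks the trajectory much more closely than the classic once-per-epoch schedule, which is presumably what "tight" refers to.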
Quantization
GPTQ
bits: 6
scope: all
QAT
bits: null
scope: all
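For reference, a minimal sketch of symmetric 6-bit round-to-nearest quantization. GPTQ itself additionally reorders columns and applies second-order error compensation during calibration, all of which is omitted here; this only shows the bit-width arithmetic.

```python
def quantize(weights, bits=6):
    """Symmetric per-tensor round-to-nearest quantization (not full GPTQ)."""
    qmax = 2 ** (bits - 1) - 1            # 31 levels each side for 6-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    dequant = [v * scale for v in q]      # what the model actually computes with
    return dequant, scale
```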
Regularization
magnitude pruning
parameters: {"sparsity":0.03,"timing":"post-quant"}
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
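A sketch of post-quantization magnitude pruning at the listed 3% sparsity: the smallest-magnitude weights are zeroed after quantization. Tie-handling and the flat (non-layerwise) selection are illustrative assumptions.

```python
def magnitude_prune(weights, sparsity=0.03):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)     # number of weights to zero
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, zeroed = [], 0
    for w in weights:
        if abs(w) <= threshold and zeroed < k:
            pruned.append(0.0)
            zeroed += 1
        else:
            pruned.append(w)
    return pruned
```

Running this after quantization (the listed "post-quant" timing) means the zeros are applied to the already-quantized values, so pruning cannot be undone by the rounding step.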
Sequence Length
sequence_length
train_length: null
eval_length: 128
Novel Contributions
- End-to-end test-time training with MAML-style meta-learning and backpropagation through inner adaptation steps
- Hidden-state kNN-LM cache over final-layer hidden states for semantic repetition beyond exact n-grams
- Online 5-gram cache with adaptive, pre-committed confidence-based mixing
- GPTQ calibration performed within the training budget
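The hidden-state kNN-LM cache from the contributions above can be sketched as follows: store (hidden state, next token) pairs during evaluation, then form a next-token distribution from the nearest stored neighbours of the current hidden state. Euclidean distance, uniform neighbour weighting, and `k=3` are illustrative assumptions; the actual PR retrieves over final-layer hidden states.

```python
import math
from collections import Counter

class KNNCache:
    """Toy kNN-LM cache over hidden-state vectors."""

    def __init__(self):
        self.store = []                  # list of (hidden_vector, next_token)

    def add(self, hidden, next_token):
        self.store.append((hidden, next_token))

    def distribution(self, hidden, k=3):
        """Next-token distribution from the k nearest stored states."""
        def dist(h):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, hidden)))
        nearest = sorted(self.store, key=lambda e: dist(e[0]))[:k]
        counts = Counter(tok for _, tok in nearest)
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}
```

Because retrieval keys on hidden states rather than surface tokens, the cache can fire on semantically repeated contexts that an exact 5-gram match would miss.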