PR #1222
[Non-record] TTT-E2E: Meta-learned test-time training via FOMAML
by abaybektursun
val_bpb
1.4707
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Architecture
LeakyReLU
Adds rank-256 prime MLPs, each with its own RMSNorm, to the last three transformer blocks, applied before each block's main MLP; the prime MLPs use a squared LeakyReLU activation, LeakyReLU(0.5)^2.
parameters: {"layers":3,"rank":256,"prime_layers":[8,9,10]}
Initialization
zero-init
Down projections of the prime MLPs are zero-initialized so the model starts identical to the baseline.
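The architecture and initialization entries above can be combined into a minimal numpy sketch. The class name `PrimeMLP` and all dimensions except the rank are illustrative assumptions, not taken from the PR; the key property shown is that a zero-initialized down projection makes the adapter a no-op at the start of training.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm (learnable gain omitted for brevity)
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def leaky_relu_sq(x, slope=0.5):
    # Squared LeakyReLU(0.5), as described in the architecture entry
    return np.where(x >= 0, x, slope * x) ** 2

class PrimeMLP:
    """Hypothetical rank-256 prime MLP with its own RMSNorm and a
    zero-initialized down projection (names are illustrative)."""
    def __init__(self, d_model, rank=256, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w_up = rng.normal(0.0, 0.02, (d_model, rank))
        self.w_down = np.zeros((rank, d_model))  # zero-init: residual starts at zero

    def __call__(self, x):
        return x + leaky_relu_sq(rmsnorm(x) @ self.w_up) @ self.w_down

x = np.random.default_rng(1).normal(size=(4, 768))
mlp = PrimeMLP(768)
assert np.allclose(mlp(x), x)  # exact: zero down projection leaves the input unchanged
```

Because the down projection is all zeros, the model is bitwise-equivalent to the baseline until adaptation updates the prime MLPs.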
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"inner_loop":true,"used_for":"test-time adaptation and FOMAML inner loop"}
AdamW
weight_decay: null
momentum: null
other_params: {"outer_loop":true,"used_for":"prime initialization meta-update"}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.01,"chunk_size":1024,"adapted_components":"prime MLPs only"}
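Score-first TTT, sketched on a toy linear-regression stream (the model, data, and learning rate here are illustrative assumptions, not the PR's transformer setup): each chunk is scored with the current weights before any gradient step on that chunk, so evaluation never sees parameters trained on the chunk being scored, which is what keeps the adaptation causal/legal.

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.05):
    """Score-first test-time training on a toy linear model.
    For each chunk: record the loss first, then take one SGD step on it."""
    losses = []
    for X, y in chunks:
        pred = X @ w
        losses.append(np.mean((pred - y) ** 2))   # score first ...
        grad = 2 * X.T @ (pred - y) / len(y)
        w = w - lr * grad                          # ... then adapt on the same chunk
    return losses, w

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)
chunks = [(X := rng.normal(size=(32, 3)), X @ w_true) for _ in range(8)]
losses, w = score_first_ttt(np.zeros(3), chunks)
assert losses[-1] < losses[0]  # later chunks benefit from earlier adaptation
```

In the PR, the adapted parameters are the prime MLPs only, the chunk size is 1024 tokens, and the inner learning rate is 0.01.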
Other
other
Two-phase end-to-end meta-learning with FOMAML: standard pretraining followed by meta-fine-tuning to learn an initialization that adapts well at test time.
parameters: {"phase1_steps":7200,"phase2_steps":1500,"inner_steps":1}
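The Phase 2 meta-update can be sketched as a first-order MAML (FOMAML) step on a toy task: the inner loop takes one SGD step on support data, and the outer loop applies the gradient evaluated at the adapted parameters directly to the initialization, skipping second-order terms. Everything here (the quadratic loss, task data, plain-SGD outer update in place of AdamW) is an illustrative assumption.

```python
import numpy as np

def fomaml_meta_step(theta, support, query, inner_lr=0.01, outer_lr=0.01):
    """One FOMAML meta-update on a toy per-coordinate quadratic loss."""
    def grad(w, data):
        # d/dw of mean((w - data)^2) per coordinate
        return 2 * (w - data.mean(axis=0))
    theta_adapted = theta - inner_lr * grad(theta, support)  # inner SGD step
    meta_grad = grad(theta_adapted, query)   # first-order: reuse grad at adapted point
    return theta - outer_lr * meta_grad      # outer update (AdamW in the PR)

rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical task optimum
theta = np.zeros(4)
for _ in range(200):
    support = c + 0.1 * rng.normal(size=(32, 4))
    query = c + 0.1 * rng.normal(size=(32, 4))
    theta = fomaml_meta_step(theta, support, query)
assert np.linalg.norm(theta - c) < 0.5 * np.linalg.norm(c)  # init moves toward task optimum
```

This mirrors the PR's split: SGD for the inner (test-time) step, AdamW for the outer meta-update of the prime MLP initialization, with `inner_steps: 1`.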
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
Novel Contributions
- Applies TTT-E2E / meta-learned test-time training to Parameter Golf
- Uses FOMAML to meta-learn prime MLP initializations for adaptation
- Introduces rank-256 prime MLPs in the last three transformer blocks
- Implements legal score-first test-time adaptation to preserve causality
- Demonstrates that adaptation recovers part of the Phase 2 degradation