PR #1502 (open)
[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD
by SPTholeView on GitHub
val_bpb: 1.1147
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.14 MB
Training Techniques
Architecture
U-Net skip connections
11-layer U-Net GPT with encoder-decoder skip connections.
parameters: {"layers":11}
GQA
Grouped-query attention with 8 query heads and 4 key-value heads.
parameters: {"q_heads":8,"kv_heads":4}
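A minimal NumPy sketch of grouped-query attention with the listed 8 query / 4 key-value heads. This is illustrative only: function and variable names are invented, and the grouping convention (consecutive query heads sharing each KV head) is an assumption, not confirmed by the PR.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, q_heads=8, kv_heads=4):
    """Grouped-query attention: each KV head serves q_heads // kv_heads query heads."""
    T, d = x.shape
    hd = d // q_heads                       # per-head dimension
    q = (x @ wq).reshape(T, q_heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)   # KV projections are half-width
    v = (x @ wv).reshape(T, kv_heads, hd)
    group = q_heads // kv_heads             # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)         # broadcast KV heads across query groups
    v = np.repeat(v, group, axis=1)
    att = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)       # softmax over key positions
    out = np.einsum('hqk,khd->qhd', att, v)
    return out.reshape(T, d)

rng = np.random.default_rng(0)
T, d = 16, 64
x = rng.standard_normal((T, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, d // 2))       # kv_heads * hd = d / 2 output columns
wv = rng.standard_normal((d, d // 2))
y = gqa_attention(x, wq, wk, wv)
```

The memory saving comes from the half-width K/V projections; the query side is unchanged.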
weight tying
Tied token embedding and output head weights.
parameters: null
BigramHash
Position-conditional bigram hash table with word-start/within-word bucket split.
parameters: {"table_size":"4096x64"}
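One plausible reading of the bucket split, sketched in Python: bigrams are hashed into one half of the 4096-row table when the current token starts a word and the other half otherwise. The hashing scheme, bucket layout, and all names here are assumptions for illustration.

```python
import numpy as np

TABLE, DIM = 4096, 64   # matches the listed 4096x64 table size

def bigram_hash_embed(tokens, word_start, table):
    """Hash each (prev, cur) token pair into the table, with separate
    bucket halves for word-start vs within-word positions (assumed split)."""
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        h = hash((tokens[i - 1], tokens[i])) % (TABLE // 2)
        if not word_start[i]:
            h += TABLE // 2     # within-word bigrams use the second half
        out[i] = table[h]
    return out

rng = np.random.default_rng(0)
table = rng.standard_normal((TABLE, DIM)) * 0.02
toks = [5, 17, 17, 9]
starts = [True, True, False, True]
emb = bigram_hash_embed(toks, starts, table)
```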
RoPE
Partial rotary position embeddings applied to a subset of head dimensions.
parameters: {"dimensions":"16/64"}
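The "16/64" parameter suggests rotation of 16 of 64 head dimensions. A sketch under that assumption, using the half-split (NeoX-style) pairing; the actual pairing convention in the PR is not specified.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of each head dim;
    pass the remaining dims through unrotated."""
    T, hd = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half) position angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(1).standard_normal((32, 64))
y = partial_rope(x)
```

Partial rotation leaves most head dimensions position-agnostic, which is the usual motivation for this variant.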
XSA
Cross-layer shared attention with banked Q/K/V/O and MLP weights adapted during TTT.
parameters: {"layers":11}
VE
Value embeddings added on the last 4 layers.
parameters: {"layers":[7,8,9,10]}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_with":"AdamW for embeddings/scalars"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.998}
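The EMA update with the listed decay of 0.998 is the standard exponential moving average of weights; a minimal sketch (names illustrative):

```python
def ema_update(ema_params, params, decay=0.998):
    """EMA of weights: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy usage: track a single scalar "weight" over a few training steps.
ema = [0.0]
for step in range(1, 4):
    params = [float(step)]      # pretend the weight grows each step
    ema = ema_update(ema, params)
```

With decay this close to 1, the average responds slowly, smoothing over roughly the last 1/(1-0.998) = 500 steps.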
SWA
parameters: {"interval":50,"phase":"warmdown"}
LR Schedule
cosine decay
parameters: {"phase":"warmdown"}
Quantization
GPTQ
bits: 6
scope: attn+MLP
int8
bits: 8
scope: embeddings
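For the int8 embedding quantization, a generic symmetric per-tensor scheme is sketched below. The PR does not state whether scaling is per-tensor or per-row, so this is an assumed variant.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (assumed scheme)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((8, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Symmetric rounding bounds the per-element reconstruction error by half the scale, which keeps embedding lookups close to their float values.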
late QAT
bits: null
scope: all
Compression
lzma
level: null
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","lr_schedule":"cosine decay","momentum":0.9,"epochs":4}
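The listed TTT hyperparameters (SGD, momentum 0.9, cosine-decayed LR, 4 epochs) describe a standard momentum inner loop; a toy sketch on a scalar objective, with all names and the step counts invented for illustration:

```python
import math

def cosine_lr(step, total, lr_max):
    """Cosine decay from lr_max down to ~0 over `total` steps."""
    return 0.5 * lr_max * (1 + math.cos(math.pi * step / total))

def ttt_adapt(w, grad_fn, epochs=4, steps_per_epoch=25, lr_max=0.05, momentum=0.9):
    """SGD-with-momentum inner loop with cosine-decayed LR, as listed for TTT."""
    buf = 0.0
    total = epochs * steps_per_epoch
    for t in range(total):
        g = grad_fn(w)
        buf = momentum * buf + g        # momentum buffer
        w = w - cosine_lr(t, total, lr_max) * buf
    return w

# Toy objective: adapt w toward the minimum of (w - 3)^2.
w = ttt_adapt(0.0, grad_fn=lambda w: 2 * (w - 3.0))
```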
Other
other
Cross-chunk FOMAML meta-learning that splits inner-loop adaptation and outer-loop evaluation across different chunks/documents.
parameters: {"every_n_steps":4}
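The cross-chunk split can be sketched as: adapt fast weights on one chunk, evaluate the adapted weights on a different chunk, and (first-order MAML) apply that outer gradient directly to the base weights. The toy quadratic loss, learning rates, and step counts below are assumptions; only the every-4-steps cadence comes from the listed parameters.

```python
import numpy as np

def loss_and_grad(w, chunk):
    """Toy stand-in for an LM loss: L = mean((w - chunk)^2)."""
    diff = w - chunk
    return np.mean(diff ** 2), 2 * diff / diff.size

def fomaml_step(w, inner_chunk, outer_chunk, inner_lr=0.1, inner_steps=2, outer_lr=0.05):
    """Cross-chunk FOMAML: inner loop on chunk A, outer gradient on chunk B,
    applied first-order (no second derivatives) to the base weights."""
    fast = w.copy()
    for _ in range(inner_steps):                    # inner-loop adaptation on chunk A
        _, g = loss_and_grad(fast, inner_chunk)
        fast -= inner_lr * g
    _, g_outer = loss_and_grad(fast, outer_chunk)   # outer evaluation on chunk B
    return w - outer_lr * g_outer                   # first-order meta-update

rng = np.random.default_rng(0)
w = np.zeros(4)
for step in range(200):
    a, b = rng.standard_normal((2, 4))
    if step % 4 == 0:           # meta-update cadence matching every_n_steps = 4
        w = fomaml_step(w, inner_chunk=a, outer_chunk=b)
```

Using a different chunk for the outer loss is what forces the meta-update to reward generalizable adaptation rather than memorization of the inner chunk.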
other
Delta-loss outer objective that rewards improvement from adaptation by combining post-adaptation loss with a pre/post loss difference term.
parameters: {"loss_weight":0.5,"delta_weight":0.3}
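With the listed weights (0.5 on the post-adaptation loss, 0.3 on the pre/post difference), the outer objective can be written directly; the exact combination in the PR may differ in sign convention, so treat this as a sketch:

```python
def delta_loss(pre_loss, post_loss, loss_weight=0.5, delta_weight=0.3):
    """Outer objective: weighted post-adaptation loss plus a pre/post delta term.
    Minimizing (post - pre) explicitly rewards adaptation that lowers the loss."""
    return loss_weight * post_loss + delta_weight * (post_loss - pre_loss)

# If adaptation helps (post < pre), the delta term lowers the objective.
helped = delta_loss(pre_loss=2.0, post_loss=1.5)   # 0.5*1.5 + 0.3*(-0.5) = 0.6
hurt   = delta_loss(pre_loss=2.0, post_loss=2.5)   # 0.5*2.5 + 0.3*(+0.5) = 1.4
```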
other
MetaSGD with learned per-bank-per-layer learning-rate scales for inner-loop adaptation.
parameters: {"total_parameters":66}
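A sketch of the MetaSGD idea: one learned scalar learning-rate scale per (layer, bank) pair, multiplying a shared base LR in the inner step. The 6-bank split (Q/K/V/O plus two MLP matrices) giving 11 × 6 = 66 scalars is an assumption consistent with the listed total, not confirmed by the PR.

```python
import numpy as np

def metasgd_inner_step(banks, grads, base_lr, lr_scales):
    """MetaSGD inner step: each (layer, bank) weight gets its own learned LR scale."""
    return {key: w - base_lr * lr_scales[key] * grads[key]
            for key, w in banks.items()}

layers, bank_names = 11, ["q", "k", "v", "o", "mlp_in", "mlp_out"]
keys = [(l, b) for l in range(layers) for b in bank_names]
lr_scales = {k: 1.0 for k in keys}        # 11 * 6 = 66 learned scalars (assumed split)
banks = {k: np.ones(3) for k in keys}     # toy weight banks
grads = {k: np.full(3, 0.5) for k in keys}
new = metasgd_inner_step(banks, grads, base_lr=0.1, lr_scales=lr_scales)
```

In the meta-training outer loop the `lr_scales` values would themselves receive gradients, letting each bank learn how aggressively to adapt at test time.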
Novel Contributions
- Cross-chunk FOMAML meta-TTT split using different documents for inner and outer loops
- Delta-loss outer objective that explicitly rewards adaptation improvement
- MetaSGD per-bank-per-layer learned learning-rate scales
- Finding that TTT delta is invariant across no-meta, original FOMAML, and redesigned meta-TTT variants
- Evidence that the TTT ceiling is architecture-limited rather than initialization-limited
- Strict-load hotfix for exported checkpoints with excluded MetaSGD parameters