PR #1502 (open)
[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD
by SPTholeView on GitHub
val_bpb: 1.1147
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.14 MB
Training Techniques
Architecture
U-Net skip connections
11-layer U-Net GPT with encoder-decoder skip connections.
parameters: {"layers":11}
GQA
Grouped-query attention with 8 query heads and 4 key-value heads.
parameters: {"q_heads":8,"kv_heads":4}
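A minimal NumPy sketch of grouped-query attention with the listed 8 query / 4 key-value heads. This is illustrative only: function and variable names are invented, and the grouping convention (consecutive query heads sharing each KV head) is an assumption, not confirmed by the PR.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, q_heads=8, kv_heads=4):
    """Grouped-query attention: each KV head serves q_heads // kv_heads query heads."""
    T, d = x.shape
    hd = d // q_heads                       # per-head dimension
    q = (x @ wq).reshape(T, q_heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)   # KV projections are half-width
    v = (x @ wv).reshape(T, kv_heads, hd)
    group = q_heads // kv_heads             # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)         # broadcast KV heads across query groups
    v = np.repeat(v, group, axis=1)
    att = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)       # softmax over key positions
    out = np.einsum('hqk,khd->qhd', att, v)
    return out.reshape(T, d)

rng = np.random.default_rng(0)
T, d = 16, 64
x = rng.standard_normal((T, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, d // 2))       # kv_heads * hd = d / 2 output columns
wv = rng.standard_normal((d, d // 2))
y = gqa_attention(x, wq, wk, wv)
```

The memory saving comes from the half-width K/V projections; the query side is unchanged.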
weight tying
Tied token embedding and output head weights.
parameters: null
BigramHash
Position-conditional bigram hash table with word-start/within-word bucket split.
parameters: {"table_size":"4096x64"}
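One plausible reading of the bucket split, sketched in Python: bigrams are hashed into one half of the 4096-row table when the current token starts a word and the other half otherwise. The hashing scheme, bucket layout, and all names here are assumptions for illustration.

```python
import numpy as np

TABLE, DIM = 4096, 64   # matches the listed 4096x64 table size

def bigram_hash_embed(tokens, word_start, table):
    """Hash each (prev, cur) token pair into the table, with separate
    bucket halves for word-start vs within-word positions (assumed split)."""
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        h = hash((tokens[i - 1], tokens[i])) % (TABLE // 2)
        if not word_start[i]:
            h += TABLE // 2     # within-word bigrams use the second half
        out[i] = table[h]
    return out

rng = np.random.default_rng(0)
table = rng.standard_normal((TABLE, DIM)) * 0.02
toks = [5, 17, 17, 9]
starts = [True, True, False, True]
emb = bigram_hash_embed(toks, starts, table)
```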
RoPE
Partial rotary position embeddings applied to a subset of head dimensions.
parameters: {"dimensions":"16/64"}
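The "16/64" parameter suggests rotation of 16 of 64 head dimensions. A sketch under that assumption, using the half-split (NeoX-style) pairing; the actual pairing convention in the PR is not specified.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of each head dim;
    pass the remaining dims through unrotated."""
    T, hd = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half) position angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(1).standard_normal((32, 64))
y = partial_rope(x)
```

Partial rotation leaves most head dimensions position-agnostic, which is the usual motivation for this variant.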
XSA
Cross-layer shared attention with banked Q/K/V/O and MLP weights adapted during TTT.
parameters: {"layers":11}
VE
Value embeddings added on the last 4 layers.
parameters: {"layers":[7,8,9,10]}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_with":"AdamW for embeddings/scalars"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.998}
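The EMA update with the listed decay of 0.998 is the standard exponential moving average of weights; a minimal sketch (names illustrative):

```python
def ema_update(ema_params, params, decay=0.998):
    """EMA of weights: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy usage: track a single scalar "weight" over a few training steps.
ema = [0.0]
for step in range(1, 4):
    params = [float(step)]      # pretend the weight grows each step
    ema = ema_update(ema, params)
```

With decay this close to 1, the average responds slowly, smoothing over roughly the last 1/(1-0.998) = 500 steps.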
SWA
parameters: {"interval":50,"phase":"warmdown"}
LR Schedule
cosine decay
parameters: {"phase":"warmdown"}
Quantization
GPTQ
bits: 6
scope: attn+MLP
int8
bits: 8
scope: embeddings
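For the int8 embedding quantization, a generic symmetric per-tensor scheme is sketched below. The PR does not state whether scaling is per-tensor or per-row, so this is an assumed variant.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (assumed scheme)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((8, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Symmetric rounding bounds the per-element reconstruction error by half the scale, which keeps embedding lookups close to their float values.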
late QAT
bits: null
scope: all
Compression
lzma
level: null
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","lr_schedule":"cosine decay","momentum":0.9,"epochs":4}
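The listed TTT hyperparameters (SGD, momentum 0.9, cosine-decayed LR, 4 epochs) describe a standard momentum inner loop; a toy sketch on a scalar objective, with all names and the step counts invented for illustration:

```python
import math

def cosine_lr(step, total, lr_max):
    """Cosine decay from lr_max down to ~0 over `total` steps."""
    return 0.5 * lr_max * (1 + math.cos(math.pi * step / total))

def ttt_adapt(w, grad_fn, epochs=4, steps_per_epoch=25, lr_max=0.05, momentum=0.9):
    """SGD-with-momentum inner loop with cosine-decayed LR, as listed for TTT."""
    buf = 0.0
    total = epochs * steps_per_epoch
    for t in range(total):
        g = grad_fn(w)
        buf = momentum * buf + g        # momentum buffer
        w = w - cosine_lr(t, total, lr_max) * buf
    return w

# Toy objective: adapt w toward the minimum of (w - 3)^2.
w = ttt_adapt(0.0, grad_fn=lambda w: 2 * (w - 3.0))
```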
Other
other
Cross-chunk FOMAML meta-learning that splits inner-loop adaptation and outer-loop evaluation across different chunks/documents.
parameters: {"every_n_steps":4}
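The cross-chunk split can be sketched as: adapt fast weights on one chunk, evaluate the adapted weights on a different chunk, and (first-order MAML) apply that outer gradient directly to the base weights. The toy quadratic loss, learning rates, and step counts below are assumptions; only the every-4-steps cadence comes from the listed parameters.

```python
import numpy as np

def loss_and_grad(w, chunk):
    """Toy stand-in for an LM loss: L = mean((w - chunk)^2)."""
    diff = w - chunk
    return np.mean(diff ** 2), 2 * diff / diff.size

def fomaml_step(w, inner_chunk, outer_chunk, inner_lr=0.1, inner_steps=2, outer_lr=0.05):
    """Cross-chunk FOMAML: inner loop on chunk A, outer gradient on chunk B,
    applied first-order (no second derivatives) to the base weights."""
    fast = w.copy()
    for _ in range(inner_steps):                    # inner-loop adaptation on chunk A
        _, g = loss_and_grad(fast, inner_chunk)
        fast -= inner_lr * g
    _, g_outer = loss_and_grad(fast, outer_chunk)   # outer evaluation on chunk B
    return w - outer_lr * g_outer                   # first-order meta-update

rng = np.random.default_rng(0)
w = np.zeros(4)
for step in range(200):
    a, b = rng.standard_normal((2, 4))
    if step % 4 == 0:           # meta-update cadence matching every_n_steps = 4
        w = fomaml_step(w, inner_chunk=a, outer_chunk=b)
```

Using a different chunk for the outer loss is what forces the meta-update to reward generalizable adaptation rather than memorization of the inner chunk.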
other
Delta-loss outer objective that rewards improvement from adaptation by combining post-adaptation loss with a pre/post loss difference term.
parameters: {"loss_weight":0.5,"delta_weight":0.3}
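With the listed weights (0.5 on the post-adaptation loss, 0.3 on the pre/post difference), the outer objective can be written directly; the exact combination in the PR may differ in sign convention, so treat this as a sketch:

```python
def delta_loss(pre_loss, post_loss, loss_weight=0.5, delta_weight=0.3):
    """Outer objective: weighted post-adaptation loss plus a pre/post delta term.
    Minimizing (post - pre) explicitly rewards adaptation that lowers the loss."""
    return loss_weight * post_loss + delta_weight * (post_loss - pre_loss)

# If adaptation helps (post < pre), the delta term lowers the objective.
helped = delta_loss(pre_loss=2.0, post_loss=1.5)   # 0.5*1.5 + 0.3*(-0.5) = 0.6
hurt   = delta_loss(pre_loss=2.0, post_loss=2.5)   # 0.5*2.5 + 0.3*(+0.5) = 1.4
```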
other
MetaSGD with learned per-bank-per-layer learning-rate scales for inner-loop adaptation.
parameters: {"total_parameters":66}
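A sketch of the MetaSGD idea: one learned scalar learning-rate scale per (layer, bank) pair, multiplying a shared base LR in the inner step. The 6-bank split (Q/K/V/O plus two MLP matrices) giving 11 × 6 = 66 scalars is an assumption consistent with the listed total, not confirmed by the PR.

```python
import numpy as np

def metasgd_inner_step(banks, grads, base_lr, lr_scales):
    """MetaSGD inner step: each (layer, bank) weight gets its own learned LR scale."""
    return {key: w - base_lr * lr_scales[key] * grads[key]
            for key, w in banks.items()}

layers, bank_names = 11, ["q", "k", "v", "o", "mlp_in", "mlp_out"]
keys = [(l, b) for l in range(layers) for b in bank_names]
lr_scales = {k: 1.0 for k in keys}        # 11 * 6 = 66 learned scalars (assumed split)
banks = {k: np.ones(3) for k in keys}     # toy weight banks
grads = {k: np.full(3, 0.5) for k in keys}
new = metasgd_inner_step(banks, grads, base_lr=0.1, lr_scales=lr_scales)
```

In the meta-training outer loop the `lr_scales` values would themselves receive gradients, letting each bank learn how aggressively to adapt at test time.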
Novel Contributions
- Cross-chunk FOMAML meta-TTT split using different documents for inner and outer loops
- Delta-loss outer objective that explicitly rewards adaptation improvement
- MetaSGD per-bank-per-layer learned learning-rate scales
- Finding that TTT delta is invariant across no-meta, original FOMAML, and redesigned meta-TTT variants
- Evidence that the TTT ceiling is architecture-limited rather than initialization-limited
- Strict-load hotfix for exported checkpoints with excluded MetaSGD parameters