PR #1601 (open)

[non-record] Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT

by SPTholeView on GitHub

val_bpb: 1.1190
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.88 MB

Training Techniques

Architecture
  • U-Net skip connections — 11-layer U-Net GPT with encoder-decoder skip connections.
    parameters: {"layers":11,"encoder_layers":5,"decoder_layers":6}
  • GQA — grouped-query attention with 8 query / 4 KV heads.
    parameters: {"q_heads":8,"kv_heads":4}
  • BigramHash — bigram hash embedding with position-conditional logic.
    parameters: {"dimensions":"4096x64"}
  • XSA — enabled across all 11 blocks.
    parameters: {"blocks":11}

Test-Time Training
  • Full TTT.
    parameters: {"every":4,"inner_outer_split":"cross-chunk","delta_loss":true,"optimizer":"SAM","rho":0.05}

Optimizer
  • SGD (weight_decay: null, momentum: null)
    other_params: {"inner_loop":"SAM","meta_training":"FOMAML"}

Compression
  • lzma (level: null)

Quantization
  • GPTQ (bits: 6, scope: all)

Weight Averaging
  • EMA
    parameters: {"post_ema_float_baseline":1.1384}
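The grouped-query attention entry above (8 query heads sharing 4 KV heads, so each KV head serves two query heads) can be sketched as a minimal NumPy attention loop. The head grouping follows the listed parameters; the sequence length, head dimension, and function name are illustrative, not taken from the PR:

```python
import numpy as np

def gqa(q, k, v, q_heads=8, kv_heads=4):
    """Grouped-query attention sketch: q_heads query heads share kv_heads K/V heads.

    q: (q_heads, T, d), k and v: (kv_heads, T, d).
    """
    group = q_heads // kv_heads          # queries per KV head (2 for 8Q / 4KV)
    out = []
    for h in range(q_heads):
        kh = h // group                  # map each query head to its shared KV head
        scores = q[h] @ k[kh].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out.append(w @ v[kh])
    return np.stack(out)                 # (q_heads, T, d)

T, d = 5, 16
rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(8, T, d)),
          rng.normal(size=(4, T, d)),
          rng.normal(size=(4, T, d)))
print(out.shape)  # → (8, 5, 16)
```

Relative to full multi-head attention, this halves the K/V projections and cache, which is why GQA is a common choice when artifact size matters.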

Novel Contributions

  • Replaced the MetaSGD inner loop with Sharpness-Aware Minimization (SAM) for Meta-TTT.
  • Applied SAM perturbation in the FOMAML inner loop to seek flatter minima before test-time adaptation.
  • Reported that SAM did not improve the TTT delta, which remained essentially unchanged at about -0.023 bpb.
  • Performed memory and throughput analysis showing SAM increased peak GPU memory and reduced throughput (fewer completed steps in the same budget).
  • Included weight-space analysis suggesting SAM stayed in the same basin as the MetaSGD variant.
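For reference, the SAM inner-loop update described above can be sketched as a generic first-order SAM step: ascend to the worst-case point within an L2 ball of radius rho around the weights, then descend using the gradient taken there. This is a toy illustration on a quadratic loss, not the PR's actual Meta-TTT code; `grad_fn`, `lr`, and the toy loss are assumptions, while rho=0.05 matches the listed parameter:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step (illustrative sketch).

    grad_fn(w) returns the loss gradient at w.
    """
    g = grad_fn(w)
    # Perturb toward the worst case within an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(w + eps)           # gradient at the perturbed weights
    return w - lr * g_sharp              # descend with the sharpness-aware gradient

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w, lambda w: w)
print(np.linalg.norm(w))                 # converges near the minimum
```

The extra gradient evaluation per step is consistent with the reported memory and throughput overhead: each inner-loop update costs two backward passes instead of one.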