PR #1601 (open)

[non-record] Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT

by SPTholeView on GitHub

val_bpb: 1.1190
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.88 MB

Training Techniques

Architecture
  • U-Net skip connections — 11-layer U-Net GPT with encoder-decoder skip connections.
    parameters: {"layers":11,"encoder_layers":5,"decoder_layers":6}
  • GQA — grouped-query attention with 8 query / 4 KV heads.
    parameters: {"q_heads":8,"kv_heads":4}
  • BigramHash — bigram hash embedding with position-conditional logic.
    parameters: {"dimensions":"4096x64"}
  • XSA — enabled across all 11 blocks.
    parameters: {"blocks":11}

Test-Time Training
  • Full TTT.
    parameters: {"every":4,"inner_outer_split":"cross-chunk","delta_loss":true,"optimizer":"SAM","rho":0.05}

Optimizer
  • SGD (weight_decay: null, momentum: null)
    other_params: {"inner_loop":"SAM","meta_training":"FOMAML"}

Compression
  • lzma (level: null)

Quantization
  • GPTQ (bits: 6, scope: all)

Weight Averaging
  • EMA
    parameters: {"post_ema_float_baseline":1.1384}
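The grouped-query attention entry above (8 query heads sharing 4 KV heads, so each KV head serves two query heads) can be sketched as a minimal NumPy attention loop. The head grouping follows the listed parameters; the sequence length, head dimension, and function name are illustrative, not taken from the PR:

```python
import numpy as np

def gqa(q, k, v, q_heads=8, kv_heads=4):
    """Grouped-query attention sketch: q_heads query heads share kv_heads K/V heads.

    q: (q_heads, T, d), k and v: (kv_heads, T, d).
    """
    group = q_heads // kv_heads          # queries per KV head (2 for 8Q / 4KV)
    out = []
    for h in range(q_heads):
        kh = h // group                  # map each query head to its shared KV head
        scores = q[h] @ k[kh].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out.append(w @ v[kh])
    return np.stack(out)                 # (q_heads, T, d)

T, d = 5, 16
rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(8, T, d)),
          rng.normal(size=(4, T, d)),
          rng.normal(size=(4, T, d)))
print(out.shape)  # → (8, 5, 16)
```

Relative to full multi-head attention, this halves the K/V projections and cache, which is why GQA is a common choice when artifact size matters.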

Novel Contributions

  • Replaced the MetaSGD inner loop with Sharpness-Aware Minimization (SAM) for Meta-TTT.
  • Applied SAM perturbation in the FOMAML inner loop to seek flatter minima before test-time adaptation.
  • Reported that SAM did not improve the TTT delta, which remained essentially unchanged at about -0.023 bpb.
  • Performed memory and throughput analysis showing SAM increased peak GPU memory and reduced throughput (fewer completed steps in the same budget).
  • Included weight-space analysis suggesting SAM stayed in the same basin as the MetaSGD variant.
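For reference, the SAM inner-loop update described above can be sketched as a generic first-order SAM step: ascend to the worst-case point within an L2 ball of radius rho around the weights, then descend using the gradient taken there. This is a toy illustration on a quadratic loss, not the PR's actual Meta-TTT code; `grad_fn`, `lr`, and the toy loss are assumptions, while rho=0.05 matches the listed parameter:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step (illustrative sketch).

    grad_fn(w) returns the loss gradient at w.
    """
    g = grad_fn(w)
    # Perturb toward the worst case within an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(w + eps)           # gradient at the perturbed weights
    return w - lr * g_sharp              # descend with the sharpness-aware gradient

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w, lambda w: w)
print(np.linalg.norm(w))                 # converges near the minimum
```

The extra gradient evaluation per step is consistent with the reported memory and throughput overhead: each inner-loop update costs two backward passes instead of one.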