val_bpb: 1.0832
Architecture: Transformer
Optimizer: SGD
Artifact Size: —
Training Techniques

Test-Time Training: full TTT
parameters: {"adaptive_epochs":true,"max_epochs":null,"min_epochs":null,"ema":null}

Optimizer: SGD
weight_decay: null
momentum: null
other_params: {"used_for_ttt":true,"reverted_from":"AdamW"}
Other: Adaptive TTT allocates more test-time training epochs to harder chunks.
parameters: {"hyperparams":["TTT_ADAPTIVE","TTT_MAX_EPOCHS","TTT_MIN_EPOCHS","TTT_ADAPT_EMA"]}
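A minimal sketch of how adaptive epoch allocation could work, assuming the hyperparameters listed above (TTT_MIN_EPOCHS, TTT_MAX_EPOCHS, TTT_ADAPT_EMA) control a clamp and an EMA baseline; the exact scaling rule here is an assumption, not the author's confirmed method:

```python
def adaptive_ttt_epochs(chunk_loss, ema_loss,
                        min_epochs=1, max_epochs=8, ema_decay=0.9):
    """Allocate more test-time training epochs to harder (higher-loss) chunks.

    Difficulty is measured relative to an exponential moving average of
    recent chunk losses. Returns (epochs, updated_ema).
    NOTE: hypothetical sketch; the scaling rule is an assumption.
    """
    if ema_loss is None:           # first chunk: no baseline yet
        ema_loss = chunk_loss
    ratio = chunk_loss / ema_loss  # > 1 means harder than average
    # Scale epochs linearly with relative difficulty, then clamp to bounds.
    epochs = round(min_epochs + (max_epochs - min_epochs) * max(0.0, ratio - 1.0))
    epochs = max(min_epochs, min(max_epochs, epochs))
    updated_ema = ema_decay * ema_loss + (1 - ema_decay) * chunk_loss
    return epochs, updated_ema
```

With this rule an average-difficulty chunk gets the minimum budget, while a chunk whose loss is double the running average is pushed to the maximum.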
Quantization: post-training quantization
bits: null
scope: all
Novel Contributions
- Adaptive test-time training that allocates more epochs to harder chunks
- Reverted TTT optimizer from AdamW to SGD for better few-shot TTT performance
- Added adaptive TTT support to the SOTA training script
- Included budget-optimized deployment script and experiment logs/results