val_bpb: 1.0832
Architecture: Transformer
Optimizer: SGD
Artifact Size: —
Training Techniques

Test-Time Training: full TTT
parameters: {"adaptive_epochs":true,"max_epochs":null,"min_epochs":null,"ema":null}

Optimizer: SGD
weight_decay: null
momentum: null
other_params: {"used_for_ttt":true,"reverted_from":"AdamW"}
Other: Adaptive TTT allocates more test-time training epochs to harder chunks.
parameters: {"hyperparams":["TTT_ADAPTIVE","TTT_MAX_EPOCHS","TTT_MIN_EPOCHS","TTT_ADAPT_EMA"]}
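A minimal sketch of how adaptive epoch allocation could work, assuming the hyperparameters listed above (TTT_MIN_EPOCHS, TTT_MAX_EPOCHS, TTT_ADAPT_EMA) control a clamp and an EMA baseline; the exact scaling rule here is an assumption, not the author's confirmed method:

```python
def adaptive_ttt_epochs(chunk_loss, ema_loss,
                        min_epochs=1, max_epochs=8, ema_decay=0.9):
    """Allocate more test-time training epochs to harder (higher-loss) chunks.

    Difficulty is measured relative to an exponential moving average of
    recent chunk losses. Returns (epochs, updated_ema).
    NOTE: hypothetical sketch; the scaling rule is an assumption.
    """
    if ema_loss is None:           # first chunk: no baseline yet
        ema_loss = chunk_loss
    ratio = chunk_loss / ema_loss  # > 1 means harder than average
    # Scale epochs linearly with relative difficulty, then clamp to bounds.
    epochs = round(min_epochs + (max_epochs - min_epochs) * max(0.0, ratio - 1.0))
    epochs = max(min_epochs, min(max_epochs, epochs))
    updated_ema = ema_decay * ema_loss + (1 - ema_decay) * chunk_loss
    return epochs, updated_ema
```

With this rule an average-difficulty chunk gets the minimum budget, while a chunk whose loss is double the running average is pushed to the maximum.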
Quantization: post-training quantization
bits: null
scope: all
Novel Contributions
- Adaptive test-time training that allocates more epochs to harder chunks
- Reverted TTT optimizer from AdamW to SGD for better few-shot TTT performance
- Added adaptive TTT support to the SOTA training script
- Included budget-optimized deployment script and experiment logs/results