PR #697

open

feat: depth recurrence + cosine recovery TTT

by Danishlynx
val_bpb: 1.1194
Tags: Architecture · Optimizer · Artifact Size

Training Techniques

Architecture
depth recurrence
Repeats layers 4 and 5 to expand 11 physical layers into 13 virtual layers, with a learnable scale parameter per repetition and U-Net skip connections adapted to the virtual layer count.
parameters: {"repeat_layers":[4,5],"physical_layers":11,"virtual_layers":13}
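The virtual-layer expansion above can be sketched as a simple schedule mapping virtual layer indices to physical layers; `build_layer_schedule` and its argument names are illustrative, not taken from the PR's code:

```python
def build_layer_schedule(physical_layers, repeat_layers):
    """Map each virtual layer index to a physical layer index.

    Layers listed in `repeat_layers` are executed twice in sequence;
    in the real model each repetition carries its own learnable scale.
    """
    schedule = []
    for layer in range(physical_layers):
        schedule.append(layer)
        if layer in repeat_layers:
            schedule.append(layer)  # second pass through the same weights
    return schedule

# 11 physical layers with layers 4 and 5 repeated -> 13 virtual layers
schedule = build_layer_schedule(11, {4, 5})
```

The forward pass would then iterate over `schedule` instead of the physical layer list, so repeated layers share weights while the per-repetition scales stay distinct.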
Test-Time Training
score-first TTT
parameters: {"recovery_epochs":20,"recovery_lr":0.001}
LR Schedule
cosine recovery
parameters: {"epochs":20,"learning_rate":0.001}
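A minimal sketch of the cosine recovery schedule with the parameters above; `cosine_lr` and the `min_lr` floor are assumptions for illustration, not names from the PR:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.001, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at the last step."""
    progress = step / max(total_steps - 1, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rates over the 20 recovery epochs: starts at 1e-3, decays to ~0.
recovery_lrs = [cosine_lr(epoch, 20) for epoch in range(20)]
```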
Evaluation
sliding window eval
parameters: null
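Sliding-window evaluation can be sketched as a window plan where each window sees full context but only newly covered tokens are scored; `sliding_window_spans` and its defaults are illustrative, as the PR does not state its window or stride:

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Plan overlapping evaluation windows over a long token sequence.

    Each span is (ctx_start, ctx_end, n_scored): the window attends over
    [ctx_start, ctx_end) but only the n_scored tokens past the previous
    window's end contribute to the reported loss, so every token is
    scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Scoring each token once while giving it up to a full window of left context is what distinguishes this from naive chunked evaluation.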
Other
other
Runs additional cosine-learning-rate epochs on all scored data after standard score-first TTT to repair int6 quantization damage, then re-scores with sliding window evaluation.
parameters: {"ttt_recovery_epochs":20,"ttt_recovery_lr":0.001}

Novel Contributions

  • Depth recurrence by repeating layers 4-5 to expand 11 physical layers into 13 virtual layers
  • Per-repetition learnable scale parameters for recurrent depth
  • U-Net skip connections adapted for the virtual layer count
  • Enhanced test-time training with a cosine recovery phase after score-first TTT
  • Recovery phase uses additional cosine-LR epochs on all scored data to repair int6 quantization damage
  • Fallback from FlashAttention 3 to SDPA for non-Hopper GPUs with manual GQA head repeat for PyTorch <2.5 compatibility
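The manual GQA head repeat in the last bullet can be illustrated in pure Python; this is the list analogue of `k.repeat_interleave(n_rep, dim=1)`, which is needed because `scaled_dot_product_attention` only gained an `enable_gqa` flag in PyTorch 2.5 (the function name here is an assumption, not from the PR):

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times so the K/V head count matches the
    query head count before SDPA. Mirrors repeat_interleave semantics:
    each head is duplicated in place, e.g. [h0, h1] -> [h0, h0, h1, h1].
    """
    return [head for head in kv_heads for _ in range(n_rep)]

# With 2 KV heads and 6 query heads, each KV head is repeated 3 times.
expanded = repeat_kv_heads(["k0", "k1"], 3)
```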