PR #697

open

feat: depth recurrence + cosine recovery TTT

by Danishlynx
val_bpb: 1.1194
Tags: Architecture · Optimizer · Artifact Size

Training Techniques

Architecture
depth recurrence
Repeats layers 4 and 5 to expand 11 physical layers into 13 virtual layers, with a learnable scale parameter per repetition and U-Net skip connections adapted to the virtual layer count.
parameters: {"repeat_layers":[4,5],"physical_layers":11,"virtual_layers":13}
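The virtual-layer expansion above can be sketched as a simple schedule mapping virtual layer indices to physical layers; `build_layer_schedule` and its argument names are illustrative, not taken from the PR's code:

```python
def build_layer_schedule(physical_layers, repeat_layers):
    """Map each virtual layer index to a physical layer index.

    Layers listed in `repeat_layers` are executed twice in sequence;
    in the real model each repetition carries its own learnable scale.
    """
    schedule = []
    for layer in range(physical_layers):
        schedule.append(layer)
        if layer in repeat_layers:
            schedule.append(layer)  # second pass through the same weights
    return schedule

# 11 physical layers with layers 4 and 5 repeated -> 13 virtual layers
schedule = build_layer_schedule(11, {4, 5})
```

The forward pass would then iterate over `schedule` instead of the physical layer list, so repeated layers share weights while the per-repetition scales stay distinct.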
Test-Time Training
score-first TTT
parameters: {"recovery_epochs":20,"recovery_lr":0.001}
LR Schedule
cosine recovery
parameters: {"epochs":20,"learning_rate":0.001}
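A minimal sketch of the cosine recovery schedule with the parameters above; `cosine_lr` and the `min_lr` floor are assumptions for illustration, not names from the PR:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.001, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at the last step."""
    progress = step / max(total_steps - 1, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rates over the 20 recovery epochs: starts at 1e-3, decays to ~0.
recovery_lrs = [cosine_lr(epoch, 20) for epoch in range(20)]
```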
Evaluation
sliding window eval
parameters: null
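Sliding-window evaluation can be sketched as a window plan where each window sees full context but only newly covered tokens are scored; `sliding_window_spans` and its defaults are illustrative, as the PR does not state its window or stride:

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Plan overlapping evaluation windows over a long token sequence.

    Each span is (ctx_start, ctx_end, n_scored): the window attends over
    [ctx_start, ctx_end) but only the n_scored tokens past the previous
    window's end contribute to the reported loss, so every token is
    scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Scoring each token once while giving it up to a full window of left context is what distinguishes this from naive chunked evaluation.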
Other
other
Runs additional cosine-learning-rate epochs on all scored data after standard score-first TTT to repair int6 quantization damage, then re-scores with sliding window evaluation.
parameters: {"ttt_recovery_epochs":20,"ttt_recovery_lr":0.001}

Novel Contributions

  • Depth recurrence by repeating layers 4-5 to expand 11 physical layers into 13 virtual layers
  • Per-repetition learnable scale parameters for recurrent depth
  • U-Net skip connections adapted for the virtual layer count
  • Enhanced test-time training with a cosine recovery phase after score-first TTT
  • Recovery phase uses additional cosine-LR epochs on all scored data to repair int6 quantization damage
  • Fallback from FlashAttention 3 to SDPA for non-Hopper GPUs with manual GQA head repeat for PyTorch <2.5 compatibility
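The manual GQA head repeat in the last bullet can be illustrated in pure Python; this is the list analogue of `k.repeat_interleave(n_rep, dim=1)`, which is needed because `scaled_dot_product_attention` only gained an `enable_gqa` flag in PyTorch 2.5 (the function name here is an assumption, not from the PR):

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times so the K/V head count matches the
    query head count before SDPA. Mirrors repeat_interleave semantics:
    each head is duplicated in place, e.g. [h0, h1] -> [h0, h0, h1, h1].
    """
    return [head for head in kv_heads for _ in range(n_rep)]

# With 2 KV heads and 6 query heads, each KV head is repeated 3 times.
expanded = repeat_kv_heads(["k0", "k1"], 3)
```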