PR #2107

open

[Non-record] 1.1999 BPB on a single H100 GPU (~1.064 BPB on 8x): Adaptive Recurrent Transformer Architecture

by TidalTunesView on GitHub
val_bpb
1.0643
Architecture
Transformer
Optimizer
Artifact Size

Training Techniques

Architecture
depth recurrence
Adaptive recurrent transformer with routing MLPs that decide when/how to recurse through additional transformer blocks.
parameters: {"short_branch":"skip extra recurrent cycles","full_branch":"run original recurrent path"}
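To illustrate the two branches named above, here is a minimal sketch of router-controlled recurrence in plain Python. It is not the PR's implementation: `art_forward`, `toy_block`, `toy_router`, the hard 0.5 threshold, and the cycle counts are all hypothetical, chosen only to show a router deciding between skipping and running the extra recurrent cycles.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def art_forward(
    x: Vector,
    block: Callable[[Vector], Vector],
    router: Callable[[Vector], float],
    base_cycles: int = 1,      # assumed: cycles that always run
    extra_cycles: int = 2,     # assumed: cycles the router may skip
    threshold: float = 0.5,    # assumed: hard decision threshold
) -> Tuple[Vector, str]:
    """Apply a shared block, then let a router choose the branch."""
    for _ in range(base_cycles):
        x = block(x)
    if router(x) >= threshold:
        # full_branch: run the original recurrent path
        for _ in range(extra_cycles):
            x = block(x)
        return x, "full_branch"
    # short_branch: skip extra recurrent cycles
    return x, "short_branch"


def toy_block(x: Vector) -> Vector:
    """Stand-in for a transformer block (hypothetical)."""
    return [0.5 * v + 1.0 for v in x]


def toy_router(x: Vector) -> float:
    """Stand-in for a routing MLP: sigmoid of the mean activation."""
    mean = sum(x) / len(x)
    return 1.0 / (1.0 + math.exp(-mean))


if __name__ == "__main__":
    print(art_forward([0.0, 0.0], toy_block, toy_router))
    print(art_forward([-10.0, -10.0], toy_block, toy_router))
```

A data-dependent Python `if` like this one is only for readability; a static-graph variant would need to express the same choice without changing the traced computation.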
Gated Attention
Uses gated attention variants in selected ART implementations.
parameters: {"enabled":true}
MoE
Includes MoE-router-based ART variants as well as sparse-head and shared-MoE implementations.
parameters: null
RoPE
Includes a saved-RoPE variant with a fix applied in one ART branch.
parameters: null
Quantization
GPTQ
bits: null
scope: model
Test-Time Training
score-first TTT
parameters: {"phased":true,"lora_rank":80,"chunk_size":48}
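The ordering implied by "score-first" can be sketched as follows: each chunk is scored with the current parameters before the model adapts on it, so no token is evaluated by a model that has already seen it. This is a toy under stated assumptions, not the PR's code: a smoothed count model stands in for the LoRA adapter, `CountModel` and `score_first_ttt` are hypothetical names, and treating `chunk_size` as a token count is an assumption.

```python
import math
from typing import Iterator, List


def iter_chunks(tokens: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


class CountModel:
    """Toy stand-in for adaptable weights (the PR adapts LoRA parameters)."""

    def __init__(self, vocab_size: int) -> None:
        self.counts = [1] * vocab_size  # add-one smoothing
        self.total = vocab_size

    def nll_bits(self, token: int) -> float:
        """Negative log-likelihood of one token, in bits."""
        return -math.log2(self.counts[token] / self.total)

    def adapt(self, chunk: List[int]) -> None:
        """Update the model on tokens that have already been scored."""
        for token in chunk:
            self.counts[token] += 1
            self.total += 1


def score_first_ttt(tokens: List[int], model: CountModel,
                    chunk_size: int = 48) -> float:
    """Return total bits, scoring every chunk before adapting on it."""
    total_bits = 0.0
    for chunk in iter_chunks(tokens, chunk_size):
        total_bits += sum(model.nll_bits(t) for t in chunk)  # 1) score
        model.adapt(chunk)                                   # 2) then adapt
    return total_bits


if __name__ == "__main__":
    print(score_first_ttt([0] * 96, CountModel(vocab_size=2)))
```

Dividing the returned bit total by the number of bytes in the evaluated text would give a bits-per-byte figure for this toy model.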
Regularization
weight decay
parameters: {"ttt_weight_decay":0.5}
logit softcap
parameters: null
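Since no parameters are listed for the softcap, the following is only a sketch of the common tanh form, `cap * tanh(logit / cap)`, which bounds logits to `(-cap, cap)` while leaving small values almost unchanged. The cap of 30.0 is an assumed placeholder, not a value from this PR.

```python
import math
from typing import List


def softcap(logits: List[float], cap: float = 30.0) -> List[float]:
    """Smoothly bound each logit to the open interval (-cap, cap).

    cap=30.0 is an illustrative assumption, not the PR's setting.
    """
    return [cap * math.tanh(z / cap) for z in logits]


if __name__ == "__main__":
    print(softcap([0.1, 10.0, 1000.0]))
```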
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
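One plausible reading of `warmdown_frac` is that the learning rate stays at its peak for the first 15% of training and then decays over the final 85%. The sketch below assumes a linear decay to zero; the decay shape, the floor, and the function name `lr_scale` are assumptions, since the card lists only the fraction.

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Return a multiplier in [0, 1] for the peak learning rate.

    Assumes a linear warmdown to zero over the last warmdown_frac of training.
    """
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown phase
    progress = step / total_steps
    if progress < 1.0 - warmdown_frac:
        return 1.0  # constant phase before the warmdown starts
    return max(0.0, (1.0 - progress) / warmdown_frac)


if __name__ == "__main__":
    for step in (0, 150, 575, 1000):
        print(step, lr_scale(step, total_steps=1000))
```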
Weight Averaging
EMA
parameters: {"activate_at":0.55}
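A minimal sketch of delayed EMA, assuming `activate_at` is a fraction of total training steps: nothing is averaged before 55% of training, the average is initialised from the current weights at activation, and it is updated exponentially afterwards. The decay of 0.999 and the name `update_ema` are hypothetical; the card does not state them.

```python
from typing import List, Optional


def update_ema(
    ema: Optional[List[float]],
    weights: List[float],
    step: int,
    total_steps: int,
    activate_at: float = 0.55,
    decay: float = 0.999,  # assumed value, not from the PR
) -> Optional[List[float]]:
    """Return the updated EMA weights, or None while EMA is inactive."""
    if step / total_steps < activate_at:
        return ema  # not active yet; leave the state untouched
    if ema is None:
        return list(weights)  # first active step: copy current weights
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


if __name__ == "__main__":
    state = None
    for step, w in [(50, [1.0]), (55, [1.0]), (60, [3.0])]:
        state = update_ema(state, w, step, total_steps=100, decay=0.5)
        print(step, state)
```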
Other
other
CaseOps / SmearGate / LQER / phased-TTT stack used in the core example.
parameters: {"caseops_enabled":true,"smear_gate_enabled":true,"lqer_enabled":true,"phased_ttt_enabled":true}

Novel Contributions

  • Adaptive Recurrent Transformer architecture with learned recurrent-depth control
  • Static-graph Simple ART branch selection for efficient distributed execution
  • Soft ART and other ART variants with routing/MoE-style recurrence control
  • A non-record research package documenting ART directions, with evidence logs