PR #2107

open

[Non-record] 1.1999 BPB on a single H100 GPU (~1.064 BPB on 8x): Adaptive Recurrent Transformer Architecture

by TidalTunesView on GitHub

val_bpb

1.0643

Architecture

Transformer

Optimizer

—

Artifact Size

—

Training Techniques

Architecture

depth recurrence

Adaptive Recurrent Transformer (ART) with routing MLPs that decide when and how to recurse through additional transformer blocks.

parameters: {"short_branch":"skip extra recurrent cycles","full_branch":"run original recurrent path"}
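The two branches listed above can be sketched as follows. This is a toy illustration, not the PR's code: the blocks are stand-in residual functions on a list of floats, the routing MLP is reduced to a single logistic unit, and `make_block`, `router_score`, `art_forward`, `threshold`, and `extra_cycles` are hypothetical names. It also assumes one routing decision per sequence, which the summary does not specify.

```python
import math

def make_block(scale):
    """Stand-in for a transformer block: a residual update on a list of floats."""
    def block(h):
        return [x + scale * math.tanh(x) for x in h]
    return block

def router_score(h, w, b):
    """Toy routing MLP, reduced to one logistic unit over the mean activation."""
    mean = sum(h) / len(h)
    return 1.0 / (1.0 + math.exp(-(w * mean + b)))

def art_forward(h, recurrent_blocks, extra_cycles, w, b, threshold=0.5):
    """Run the recurrent blocks once, then let the router choose a branch.

    short_branch: skip the extra recurrent cycles.
    full_branch: run the original recurrent path (all extra cycles).
    """
    for block in recurrent_blocks:
        h = block(h)
    if router_score(h, w, b) < threshold:
        return h, "short_branch"
    for _ in range(extra_cycles):
        for block in recurrent_blocks:
            h = block(h)
    return h, "full_branch"

blocks = [make_block(0.1), make_block(0.2)]
hidden = [0.5, -0.25, 1.0]
short_out, short_name = art_forward(hidden, blocks, extra_cycles=2, w=0.0, b=-10.0)
full_out, full_name = art_forward(hidden, blocks, extra_cycles=2, w=0.0, b=10.0)
```

Both branches share the first pass through the blocks, so the short branch saves only the extra cycles.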

Gated Attention

Uses gated attention variants in selected ART implementations.

parameters: {"enabled":true}

MoE

Includes MoE-router based ART variants and sparse-head / shared-MoE implementations.

parameters: null

RoPE

Includes a saved-RoPE variant with a fix applied in one ART branch.

parameters: null

Quantization

GPTQ

bits: null

scope: model

Test-Time Training

score-first TTT

parameters: {"phased":true,"lora_rank":80,"chunk_size":48}
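A minimal sketch of the score-first ordering, assuming it means each chunk is scored with weights that have not yet been adapted on that chunk. The toy unigram model stands in for the adapted parameters (the listed `lora_rank` suggests low-rank adapters, which are not modeled here), and `UnigramModel`, `chunks`, and `score_first_ttt` are hypothetical names. The result is bits per token, not the bits per byte reported as val_bpb.

```python
import math
from collections import Counter

def chunks(tokens, chunk_size=48):
    """Yield consecutive chunks; chunk_size=48 matches the listed parameter."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

class UnigramModel:
    """Toy stand-in for the weights that test-time training adapts."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll_bits(self, chunk):
        # Negative log-likelihood in bits, with add-one smoothing.
        denom = self.total + self.vocab_size
        return sum(-math.log2((self.counts[t] + 1) / denom) for t in chunk)

    def adapt(self, chunk):
        self.counts.update(chunk)
        self.total += len(chunk)

def score_first_ttt(tokens, vocab_size, chunk_size=48):
    model = UnigramModel(vocab_size)
    total_bits = 0.0
    for chunk in chunks(tokens, chunk_size):
        total_bits += model.nll_bits(chunk)  # score first, before seeing this chunk
        model.adapt(chunk)                   # then adapt on the scored chunk
    return total_bits / len(tokens)
```

Because scoring always precedes adaptation, no token contributes to the weights used to score it.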

Regularization

weight decay

parameters: {"ttt_weight_decay":0.5}

logit softcap

parameters: null
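For reference, a common form of logit softcapping is `cap * tanh(logit / cap)`. The sketch below assumes that form; the cap value of 30.0 is a placeholder, since the summary lists no parameters.

```python
import math

def softcap(logits, cap=30.0):
    """Squash logits smoothly into (-cap, cap); near zero the map is close to identity."""
    return [cap * math.tanh(x / cap) for x in logits]

capped = softcap([-1000.0, -1.0, 0.0, 1.0, 1000.0])
```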

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
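One plausible reading of `warmdown_frac` is that the learning rate holds at its peak for the first 15% of steps and then decays over the final 85%. The sketch assumes a linear decay to zero and no warmup, neither of which the summary confirms; `lr_multiplier` is a hypothetical helper.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85):
    """Return the factor applied to the peak learning rate at a given step."""
    progress = step / total_steps
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    # Linear decay from 1.0 at warmdown_start to 0.0 at the final step.
    return max(0.0, (1.0 - progress) / warmdown_frac)

schedule = [lr_multiplier(s, 1000) for s in range(0, 1001, 25)]
```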

Weight Averaging

EMA

parameters: {"activate_at":0.55}
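A minimal sketch of late-activated EMA, assuming `activate_at` is the fraction of training after which averaging begins. The decay value, the initialization from the current weights at activation, and the `LateEMA` class itself are assumptions for illustration.

```python
class LateEMA:
    """Exponential moving average of weights that starts partway through training."""

    def __init__(self, total_steps, activate_at=0.55, decay=0.999):
        self.start_step = int(activate_at * total_steps)
        self.decay = decay
        self.shadow = None  # no averaged weights until activation

    def update(self, step, weights):
        if step < self.start_step:
            return  # averaging not active yet
        if self.shadow is None:
            self.shadow = list(weights)  # initialize from the current weights
            return
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * w for s, w in zip(self.shadow, weights)]

ema = LateEMA(total_steps=100, activate_at=0.55, decay=0.5)
ema.update(54, [1.0])   # before activation: ignored
ema.update(55, [1.0])   # activation: shadow copies the weights
ema.update(56, [3.0])   # first averaged step
```

Starting the average late keeps early, high-loss weights out of the averaged model.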

Other

other

CaseOps / SmearGate / LQER / phased-TTT stack used in the core example.

parameters: {"caseops_enabled":true,"smear_gate_enabled":true,"lqer_enabled":true,"phased_ttt_enabled":true}

Novel Contributions

Adaptive Recurrent Transformer architecture with learned recurrent-depth control
Static-graph Simple ART branch selection for efficient distributed execution
Soft ART and other ART variants with routing/MoE-style recurrence control
A non-record research package that documents ART directions and includes evidence logs