PR #1944

Status: open

adaLN_recurrence [val_bpb=1.255 on 4 x H100]

by dmitriymyan1
val_bpb: 1.2551
Architecture: Transformer
Optimizer:
Artifact Size: 15.26 MB

Training Techniques

Architecture
depth recurrence
Adds recurrence to the Parallel Residuals + Mini Depth Recurrence baseline: weight-tied recurrent layers are run twice, with adaLN modulation distinguishing the first and second passes.
parameters: {"layers":[4,5]}
weight tying
Uses weight-tied recurrent layers so the same layers are reused across recurrence passes.
parameters: {"layers":[4,5]}
adaLN
Applies adaptive layer norm conditioned on recurrence iteration via lightweight per-channel affine modulation.
parameters: null
Initialization
zero init
Zero-initializes the modulation projection so the adaLN acts as an identity at init and training starts identically to the baseline.
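
The combination above (adaLN conditioned on the recurrence pass, with a zero-initialized projection) can be sketched roughly as follows. This is a minimal PyTorch illustration, not the PR's actual code; the class and parameter names (`AdaLNRecurrence`, `pass_emb`, `num_passes`) are hypothetical:

```python
import torch
import torch.nn as nn

class AdaLNRecurrence(nn.Module):
    """Hypothetical sketch: adaptive LayerNorm conditioned on the recurrence
    pass index, so weight-tied layers can behave differently per pass."""

    def __init__(self, dim: int, num_passes: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # One learned embedding per recurrence pass, fed through a
        # zero-initialized projection. At init, scale = shift = 0, so the
        # modulation is an identity and training starts exactly as the
        # unmodulated baseline would.
        self.pass_emb = nn.Embedding(num_passes, dim)
        self.proj = nn.Linear(dim, 2 * dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, pass_idx: torch.Tensor) -> torch.Tensor:
        # Lightweight per-channel affine modulation: (1 + scale) * norm(x) + shift
        scale, shift = self.proj(self.pass_emb(pass_idx)).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift
```

In use, the same tied block would be called once per pass with a different `pass_idx`, letting the shared weights specialize their normalization statistics for the first vs second traversal.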

Novel Contributions

  • Adds adaLN conditioned on recurrence iteration to the Parallel Residuals + Mini Depth Recurrence baseline
  • Enables tied recurrent layers to distinguish first vs second pass with lightweight per-channel affine modulation
  • Uses zero-initialized projection to preserve baseline initialization behavior
  • Reports a smoke-test result of 1.2551 val_bpb on 4xH100