val_bpb: 1.2551
Architecture: Transformer
Optimizer: —
Artifact Size: 15.26 MB
Training Techniques
Architecture
depth recurrence
Carries over the depth recurrence of the Parallel Residuals + Mini Depth Recurrence baseline: weight-tied recurrent layers are run for two passes, and adaLN modulation distinguishes the first pass from the second (sketched in the code after this section).
parameters: {"layers":[4,5]}
weight tying
Weight-tied recurrent layers are reused across both recurrence passes; adaLN modulation lets the shared weights behave differently on the first and second pass.
parameters: {"layers":[4,5]}
adaLN
Applies adaptive layer norm conditioned on the recurrence iteration, using lightweight per-channel affine modulation of the tied layers.
parameters: null
Initialization
zero init
Zero-initialized projection so training starts identically to the baseline.
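The sketch below shows, in minimal form, how these pieces could fit together, assuming a PyTorch implementation: weight-tied blocks are run for two recurrence passes, an adaLN module conditions per-channel scale and shift on the pass index, and the modulation projection is zero-initialized so the model starts out identical to the unmodulated baseline. Class and parameter names (IterationAdaLN, RecurrentBlock, n_passes) are illustrative, not the experiment's actual identifiers.

```python
# Minimal sketch (PyTorch assumed): weight-tied depth recurrence whose shared
# layers are modulated by adaLN conditioned on the recurrence pass.
# All names here are hypothetical, not the experiment's actual identifiers.
import torch
import torch.nn as nn


class IterationAdaLN(nn.Module):
    """LayerNorm whose per-channel scale/shift depend on the recurrence pass.

    A zero-initialized projection maps a learned pass embedding to (gamma, beta),
    so at initialization the modulation is the identity and the model matches
    the unmodulated baseline.
    """

    def __init__(self, dim: int, n_passes: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.pass_embed = nn.Embedding(n_passes, dim)
        self.to_gamma_beta = nn.Linear(dim, 2 * dim)
        nn.init.zeros_(self.to_gamma_beta.weight)  # zero init: starts as plain LayerNorm
        nn.init.zeros_(self.to_gamma_beta.bias)

    def forward(self, x: torch.Tensor, pass_idx: int) -> torch.Tensor:
        cond = self.pass_embed(torch.tensor(pass_idx, device=x.device))
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma) + beta


class RecurrentBlock(nn.Module):
    """One transformer block reused across recurrence passes (weights tied)."""

    def __init__(self, dim: int, n_heads: int, n_passes: int = 2):
        super().__init__()
        self.adaln1 = IterationAdaLN(dim, n_passes)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.adaln2 = IterationAdaLN(dim, n_passes)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, pass_idx: int) -> torch.Tensor:
        h = self.adaln1(x, pass_idx)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.adaln2(x, pass_idx))
        return x


def run_recurrence(blocks: nn.ModuleList, x: torch.Tensor, n_passes: int = 2) -> torch.Tensor:
    # The same tied blocks (e.g. layers 4-5 in the card's parameters) are run
    # n_passes times; adaLN lets them behave differently on each pass.
    for pass_idx in range(n_passes):
        for block in blocks:
            x = block(x, pass_idx)
    return x
```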
Novel Contributions
- Adds adaLN conditioned on recurrence iteration to the Parallel Residuals + Mini Depth Recurrence baseline
- Enables tied recurrent layers to distinguish first vs second pass with lightweight per-channel affine modulation
- Uses zero-initialized projection to preserve baseline initialization behavior (see the check sketched after this list)
- Reports a smoke-test result of 1.2551 val_bpb on 4xH100
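As a quick check, assuming the hypothetical IterationAdaLN from the sketch above, the zero-init claim can be illustrated directly: at initialization the modulation projection outputs zeros, so adaLN reduces to a plain LayerNorm on every pass and the recurrent model computes the same function as the baseline.

```python
# Sketch of the zero-init property, reusing the hypothetical IterationAdaLN above:
# with the modulation projection zeroed, adaLN is identical to plain LayerNorm
# on both passes, so training starts from the baseline's function.
import torch
import torch.nn as nn

dim = 64
x = torch.randn(2, 16, dim)
adaln = IterationAdaLN(dim, n_passes=2)
baseline_norm = nn.LayerNorm(dim, elementwise_affine=False)

for pass_idx in range(2):
    assert torch.allclose(adaln(x, pass_idx), baseline_norm(x), atol=1e-6)
```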