val_bpb: 1.2551
Architecture: Transformer
Optimizer: —
Artifact Size: 15.26 MB
Training Techniques
Architecture
depth recurrence
Carries over the depth recurrence of the Parallel Residuals + Mini Depth Recurrence baseline: weight-tied recurrent layers are run for two passes, and adaLN modulation distinguishes the first pass from the second (sketched in the code after this section).
parameters: {"layers":[4,5]}
weight tying
Weight-tied recurrent layers are reused across both recurrence passes; adaLN modulation lets the shared weights behave differently on the first and second pass.
parameters: {"layers":[4,5]}
adaLN
Applies adaptive layer norm conditioned on the recurrence iteration, using lightweight per-channel affine modulation of the tied layers.
parameters: null
Initialization
zero init
Zero-initialized projection so training starts identically to the baseline.
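The sketch below shows, in minimal form, how these pieces could fit together, assuming a PyTorch implementation: weight-tied blocks are run for two recurrence passes, an adaLN module conditions per-channel scale and shift on the pass index, and the modulation projection is zero-initialized so the model starts out identical to the unmodulated baseline. Class and parameter names (IterationAdaLN, RecurrentBlock, n_passes) are illustrative, not the experiment's actual identifiers.

```python
# Minimal sketch (PyTorch assumed): weight-tied depth recurrence whose shared
# layers are modulated by adaLN conditioned on the recurrence pass.
# All names here are hypothetical, not the experiment's actual identifiers.
import torch
import torch.nn as nn


class IterationAdaLN(nn.Module):
    """LayerNorm whose per-channel scale/shift depend on the recurrence pass.

    A zero-initialized projection maps a learned pass embedding to (gamma, beta),
    so at initialization the modulation is the identity and the model matches
    the unmodulated baseline.
    """

    def __init__(self, dim: int, n_passes: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.pass_embed = nn.Embedding(n_passes, dim)
        self.to_gamma_beta = nn.Linear(dim, 2 * dim)
        nn.init.zeros_(self.to_gamma_beta.weight)  # zero init: starts as plain LayerNorm
        nn.init.zeros_(self.to_gamma_beta.bias)

    def forward(self, x: torch.Tensor, pass_idx: int) -> torch.Tensor:
        cond = self.pass_embed(torch.tensor(pass_idx, device=x.device))
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma) + beta


class RecurrentBlock(nn.Module):
    """One transformer block reused across recurrence passes (weights tied)."""

    def __init__(self, dim: int, n_heads: int, n_passes: int = 2):
        super().__init__()
        self.adaln1 = IterationAdaLN(dim, n_passes)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.adaln2 = IterationAdaLN(dim, n_passes)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, pass_idx: int) -> torch.Tensor:
        h = self.adaln1(x, pass_idx)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.adaln2(x, pass_idx))
        return x


def run_recurrence(blocks: nn.ModuleList, x: torch.Tensor, n_passes: int = 2) -> torch.Tensor:
    # The same tied blocks (e.g. layers 4-5 in the card's parameters) are run
    # n_passes times; adaLN lets them behave differently on each pass.
    for pass_idx in range(n_passes):
        for block in blocks:
            x = block(x, pass_idx)
    return x
```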
Novel Contributions
- Adds adaLN conditioned on recurrence iteration to the Parallel Residuals + Mini Depth Recurrence baseline
- Enables tied recurrent layers to distinguish first vs second pass with lightweight per-channel affine modulation
- Uses zero-initialized projection to preserve baseline initialization behavior (see the check sketched after this list)
- Reports a smoke-test result of 1.2551 val_bpb on 4xH100
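As a quick check, assuming the hypothetical IterationAdaLN from the sketch above, the zero-init claim can be illustrated directly: at initialization the modulation projection outputs zeros, so adaLN reduces to a plain LayerNorm on every pass and the recurrent model computes the same function as the baseline.

```python
# Sketch of the zero-init property, reusing the hypothetical IterationAdaLN above:
# with the modulation projection zeroed, adaLN is identical to plain LayerNorm
# on both passes, so training starts from the baseline's function.
import torch
import torch.nn as nn

dim = 64
x = torch.randn(2, 16, dim)
adaln = IterationAdaLN(dim, n_passes=2)
baseline_norm = nn.LayerNorm(dim, elementwise_affine=False)

for pass_idx in range(2):
    assert torch.allclose(adaln(x, pass_idx), baseline_norm(x), atol=1e-6)
```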