val_bpb: 1.5890
Architecture: Transformer
Optimizer: —
Artifact Size: 20.4MB
Training Techniques
Architecture
depth recurrence
Loops 3 unique transformer blocks 3 times instead of stacking 9 unique blocks, keeping an effective depth of 9 while cutting the unique block parameters to a third.
parameters: {"unique_layers":3,"recurrence_count":3}
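A minimal sketch of how depth recurrence can be expressed, assuming a generic shape-preserving block interface; the `make_block`/`forward` names and toy block functions are illustrative, not the original implementation:

```python
UNIQUE_LAYERS = 3
RECURRENCE_COUNT = 3

def make_block(delta):
    # Stand-in for a transformer block: any shape-preserving function.
    return lambda x: [v + delta for v in x]

# Only these 3 unique "blocks" hold parameters.
blocks = [make_block(d) for d in (1, 2, 3)]

def forward(x, blocks, recurrence_count):
    # Apply the same stack of unique blocks `recurrence_count` times:
    # effective depth = len(blocks) * recurrence_count (here 3 * 3 = 9).
    for _ in range(recurrence_count):
        for block in blocks:
            x = block(x)
    return x
```

With 3 blocks looped 3 times, an input passes through 9 block applications while only 3 blocks' worth of parameters exist.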
Compression
custom
level: null
Other
other
The parameter budget saved by recurrence is reallocated to wider layers; experiments sweep width, layer count, recurrence count, and head count.
parameters: {"width_range":[512,1152],"layers_range":[2,6],"recurrence_range":[2,6],"head_range":[4,16]}
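One way to enumerate such a sweep is a filtered grid over the reported ranges; the concrete grid points below are assumptions, since the source only gives range endpoints:

```python
from itertools import product

# Assumed grid points within the reported ranges; the actual sweep
# granularity is not given in the source.
widths = [512, 768, 896, 1024, 1152]
layer_counts = [2, 3, 4, 5, 6]
recurrence_counts = [2, 3, 4, 5, 6]
head_counts = [4, 8, 12, 16]

configs = [
    {"width": w, "layers": l, "recurrence": r, "heads": h}
    for w, l, r, h in product(widths, layer_counts,
                              recurrence_counts, head_counts)
    if w % h == 0  # width must divide evenly across attention heads
]
```

The divisibility filter drops shapes like d896 with 12 heads, which standard multi-head attention cannot form.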
Novel Contributions
- Introduces depth recurrence with 3 unique transformer layers looped 3 times.
- Reallocates saved parameter budget to wider layers.
- Reports that 3 unique layers with 3 recurrences is the best-performing shape among tested configurations.
- Finds that wider models perform better with sufficient data, with d1024 outperforming d896 and d768.
- Identifies head-count preferences at different widths, such as 8 heads at d1024 and 12 heads at d768.
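The width reallocation in the second bullet can be illustrated with rough block-parameter arithmetic; the 12·d² per-block estimate and the specific width pairing below are illustrative assumptions, not the paper's accounting:

```python
def block_params(d_model):
    # Rough count for one transformer block: ~4*d^2 for the attention
    # projections plus ~8*d^2 for a 4x-expanded MLP (norms/embeddings ignored).
    return 12 * d_model ** 2

def unique_params(d_model, unique_layers):
    # Only the unique blocks hold parameters; recurrence adds none.
    return unique_layers * block_params(d_model)

nine_unique = unique_params(768, 9)   # 9 unique blocks at d768
three_wide = unique_params(1152, 3)   # 3 unique blocks widened to d1152

# Widening 768 -> 1152 (1.5x) while dropping 9 -> 3 unique blocks still
# uses fewer unique block parameters: (1.5**2) / 3 = 0.75 of the baseline.
```

Under this estimate, recurrence frees enough budget to widen the model substantially at equal or lower unique parameter count.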