PR #91

open

Depth recurrence: 3 unique layers x 3 loops, 1.589 BPB

by koushikkethamakka
val_bpb
1.5890
Architecture
Transformer
Artifact Size
20.4MB

Training Techniques

Architecture
depth recurrence
Uses 3 unique transformer blocks looped 3 times instead of 9 unique blocks, keeping the effective depth of 9 block applications while reducing the number of unique parameters.
parameters: {"unique_layers":3,"recurrence_count":3}
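The looping scheme above can be sketched in a few lines. This is a minimal, framework-free illustration of weight sharing via depth recurrence; `Block` is a stand-in for a real transformer block (attention + MLP), and all names here are illustrative, not from the PR's code.

```python
# Depth recurrence sketch: 3 unique blocks applied in 3 loops.
# A real Block would hold attention/MLP weights; here it just records usage.

class Block:
    def __init__(self, idx):
        self.idx = idx      # stands in for this block's parameter set
        self.calls = 0      # how many times the shared weights are reused

    def __call__(self, x):
        self.calls += 1
        return x + [self.idx]   # placeholder for the block's computation

UNIQUE_LAYERS = 3
RECURRENCE_COUNT = 3

blocks = [Block(i) for i in range(UNIQUE_LAYERS)]

def forward(x):
    # Effective depth = UNIQUE_LAYERS * RECURRENCE_COUNT = 9 applications,
    # but only UNIQUE_LAYERS parameter sets exist.
    for _ in range(RECURRENCE_COUNT):
        for block in blocks:
            x = block(x)
    return x

out = forward([])
```

Each of the 3 blocks is applied 3 times, so the token representation passes through 9 block applications while the model stores only 3 blocks' worth of weights.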
Compression
custom
level: null
Other
other
The parameter budget saved by recurrence is reallocated to wider layers; experiments sweep width, layer count, recurrence count, and head count.
parameters: {"width_range":[512,1152],"layers_range":[2,6],"recurrence_range":[2,6],"head_range":[4,16]}

Novel Contributions

  • Introduces depth recurrence with 3 unique transformer layers looped 3 times.
  • Reallocates saved parameter budget to wider layers.
  • Reports that 3 unique layers with 3 recurrences is the best-performing shape among tested configurations.
  • Finds that wider models perform better with sufficient data, with d1024 outperforming d896 and d768.
  • Identifies head-count preferences at different widths, such as 8 heads at d1024 and 12 heads at d768.
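The budget-reallocation argument can be checked with rough arithmetic. The sketch below uses the standard approximation of ~12·d² parameters per transformer block (4·d² for attention, 8·d² for the MLP) and picks d768 and d1024 as illustrative widths from the sweep ranges; these are assumptions for the comparison, not the PR's exact models.

```python
# Rough parameter-budget arithmetic for "trade depth savings for width".
# block_params approximates a transformer block as 12 * d^2 parameters
# (4*d^2 attention projections + 8*d^2 MLP with a 4x hidden expansion).

def block_params(d):
    return 12 * d * d

baseline = 9 * block_params(768)    # 9 unique blocks at width d768
recurrent = 3 * block_params(1024)  # 3 unique blocks (looped 3x) at width d1024

# The recurrent model stores fewer block parameters despite being wider,
# since it pays for 3 parameter sets instead of 9.
```

Under this approximation, 3 unique blocks at d1024 store roughly 60% of the block parameters of 9 unique blocks at d768, which is the slack that lets the sweep push width up.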