val_bpb: 1.5890
Architecture: Transformer
Optimizer: —
Artifact Size: 20.4MB
Training Techniques
Architecture
depth recurrence
Loops 3 unique transformer blocks 3 times instead of stacking 9 unique blocks, keeping an effective depth of 9 while cutting the unique block parameters to a third.
parameters: {"unique_layers":3,"recurrence_count":3}
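A minimal sketch of how depth recurrence can be expressed, assuming a generic shape-preserving block interface; the `make_block`/`forward` names and toy block functions are illustrative, not the original implementation:

```python
UNIQUE_LAYERS = 3
RECURRENCE_COUNT = 3

def make_block(delta):
    # Stand-in for a transformer block: any shape-preserving function.
    return lambda x: [v + delta for v in x]

# Only these 3 unique "blocks" hold parameters.
blocks = [make_block(d) for d in (1, 2, 3)]

def forward(x, blocks, recurrence_count):
    # Apply the same stack of unique blocks `recurrence_count` times:
    # effective depth = len(blocks) * recurrence_count (here 3 * 3 = 9).
    for _ in range(recurrence_count):
        for block in blocks:
            x = block(x)
    return x
```

With 3 blocks looped 3 times, an input passes through 9 block applications while only 3 blocks' worth of parameters exist.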
Compression
custom
level: null
Other
other
The parameter budget saved by recurrence is reallocated to wider layers; experiments sweep width, layer count, recurrence count, and head count.
parameters: {"width_range":[512,1152],"layers_range":[2,6],"recurrence_range":[2,6],"head_range":[4,16]}
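One way to enumerate such a sweep is a filtered grid over the reported ranges; the concrete grid points below are assumptions, since the source only gives range endpoints:

```python
from itertools import product

# Assumed grid points within the reported ranges; the actual sweep
# granularity is not given in the source.
widths = [512, 768, 896, 1024, 1152]
layer_counts = [2, 3, 4, 5, 6]
recurrence_counts = [2, 3, 4, 5, 6]
head_counts = [4, 8, 12, 16]

configs = [
    {"width": w, "layers": l, "recurrence": r, "heads": h}
    for w, l, r, h in product(widths, layer_counts,
                              recurrence_counts, head_counts)
    if w % h == 0  # width must divide evenly across attention heads
]
```

The divisibility filter drops shapes like d896 with 12 heads, which standard multi-head attention cannot form.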
Novel Contributions
- Introduces depth recurrence with 3 unique transformer layers looped 3 times.
- Reallocates saved parameter budget to wider layers.
- Reports that 3 unique layers with 3 recurrences is the best-performing shape among tested configurations.
- Finds that wider models perform better with sufficient data, with d1024 outperforming d896 and d768.
- Identifies head-count preferences at different widths, such as 8 heads at d1024 and 12 heads at d768.
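The width reallocation in the second bullet can be illustrated with rough block-parameter arithmetic; the 12·d² per-block estimate and the specific width pairing below are illustrative assumptions, not the paper's accounting:

```python
def block_params(d_model):
    # Rough count for one transformer block: ~4*d^2 for the attention
    # projections plus ~8*d^2 for a 4x-expanded MLP (norms/embeddings ignored).
    return 12 * d_model ** 2

def unique_params(d_model, unique_layers):
    # Only the unique blocks hold parameters; recurrence adds none.
    return unique_layers * block_params(d_model)

nine_unique = unique_params(768, 9)   # 9 unique blocks at d768
three_wide = unique_params(1152, 3)   # 3 unique blocks widened to d1152

# Widening 768 -> 1152 (1.5x) while dropping 9 -> 3 unique blocks still
# uses fewer unique block parameters: (1.5**2) / 3 = 0.75 of the baseline.
```

Under this estimate, recurrence frees enough budget to widen the model substantially at equal or lower unique parameter count.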