PR #1193
Non-record: Universal Transformer + Adaptive Density (val_bpb 1.4390)
by dentity007
val_bpb
1.4390
Architecture
Transformer
Optimizer
—
Artifact Size
2.87MB
Training Techniques
Architecture
depth recurrence
Single shared transformer block looped 12 times with per-iteration parameters.
parameters: {"layers":12}
weight tying
Shared-weight block reused across iterations.
parameters: null
iteration embeddings
Per-iteration parameters include an iteration embedding and scaling/mixing terms.
parameters: null
Other
other
Adaptive density curriculum that anneals from 50% sparsity to fully dense over the course of training.
parameters: {"sparse_ratio":0.5}
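A schedule like the one described above can be sketched in a few lines. The linear decay and the step counts here are illustrative assumptions; the PR only specifies the starting `sparse_ratio` of 0.5 and that training ends dense.

```python
# Hedged sketch of the adaptive density curriculum: start with half of the
# model treated as sparse and anneal to fully dense by the end of training.
# The linear schedule and total step count are assumptions for illustration.

START_SPARSE_RATIO = 0.5  # from parameters: {"sparse_ratio": 0.5}

def sparse_ratio(step, total_steps, start=START_SPARSE_RATIO):
    # Linearly decay the sparse fraction to 0.0 (dense) as training progresses.
    frac = min(step / total_steps, 1.0)
    return start * (1.0 - frac)

ratios = [sparse_ratio(s, 1000) for s in (0, 500, 1000)]
print(ratios)  # [0.5, 0.25, 0.0]
```

At step 0 the model runs at the full 50% sparsity; by the final step the sparse fraction has reached zero, matching the sparse-to-dense transition the curriculum describes.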
Test-Time Training
full TTT
parameters: null
Novel Contributions
- Universal Transformer-style shared-weight block looped 12 times
- Per-iteration parameters for recurrence (attn_scale, mlp_scale, resid_mix, iteration_embed)
- 50% sparse-to-dense curriculum
- Demonstration of depth-recurrence benefits consistent with the findings of prior PR #363
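The recurrence described above can be sketched as follows. The toy math, the stand-in block, and how the named per-iteration parameters (`attn_scale`, `mlp_scale`, `resid_mix`, `iteration_embed`) are applied are all assumptions for illustration, not the PR's actual code; a real implementation would reuse one `nn.Module` transformer block across iterations.

```python
import math

# Hedged sketch of the Universal-Transformer-style loop: one shared block
# applied LAYERS times, with a small set of fresh parameters per iteration.

LAYERS = 12  # single shared block looped 12 times ({"layers": 12})

def shared_block(x):
    # Stand-in for the shared transformer block (attention + MLP).
    # attn_scale would act inside a real block (e.g. on attention logits);
    # it is carried in the params below but unused by this toy stand-in.
    return [math.tanh(v) for v in x]

def make_iteration_params(n_iters, d):
    # Per-iteration parameters: scaling/mixing terms plus an iteration
    # embedding added to the hidden state (values here are arbitrary).
    return [
        {
            "attn_scale": 1.0,
            "mlp_scale": 1.0,
            "resid_mix": 0.5,
            "iteration_embed": [0.01 * i] * d,
        }
        for i in range(n_iters)
    ]

def forward(x, params):
    for p in params:
        # Inject the iteration embedding so the shared block can tell
        # which step of the recurrence it is executing.
        h = [v + e for v, e in zip(x, p["iteration_embed"])]
        h = [p["mlp_scale"] * v for v in shared_block(h)]
        # Mix the block output back into the residual stream.
        x = [p["resid_mix"] * hv + (1.0 - p["resid_mix"]) * xv
             for hv, xv in zip(h, x)]
    return x

out = forward([0.1, -0.2, 0.3, 0.0], make_iteration_params(LAYERS, 4))
print(len(out))  # hidden size is unchanged by the recurrence
```

Because the block's weights are shared, parameter count stays close to a one-layer model (consistent with the 2.87MB artifact) while the per-iteration scalars and embeddings let each of the 12 passes behave slightly differently.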