PR #1193

open

Non-record: Universal Transformer + Adaptive Density (val_bpb 1.4390)

by dentity007
val_bpb: 1.4390
Architecture: Transformer
Optimizer:
Artifact Size: 2.87 MB

Training Techniques

Architecture
depth recurrence
Single shared transformer block looped 12 times with per-iteration parameters.
parameters: {"layers":12}
weight tying
Shared-weight block reused across iterations.
parameters: null
iteration embeddings
Per-iteration parameters include an iteration embedding and scaling/mixing terms.
parameters: null
Other
other
Adaptive density curriculum that transitions from 50% sparse to dense during training.
parameters: {"sparse_ratio":0.5}
Test-Time Training
full TTT
parameters: null
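
The looped shared-block idea above can be illustrated with a toy sketch. This is not the PR's code: the function name, the use of a single weight matrix as a stand-in for the attention and MLP sublayers, and the exact way the per-iteration parameters (iteration_embed, attn_scale, mlp_scale, resid_mix) are applied are all assumptions for illustration; only the "one shared block looped 12 times with per-iteration parameters" structure comes from the PR description.

```python
import numpy as np

def looped_forward(x, W, iter_embed, attn_scale, mlp_scale, resid_mix, layers=12):
    """Universal Transformer-style forward pass: one shared block, looped.

    x:          (d,) input activations
    W:          (d, d) weights of the single shared block, reused every iteration
    iter_embed: (layers, d) per-iteration embedding added before the block
    attn_scale, mlp_scale, resid_mix: (layers,) per-iteration scalars
    """
    for i in range(layers):
        h = x + iter_embed[i]                    # tell the block which iteration it is in
        a = attn_scale[i] * (h @ W)              # stand-in for the attention sublayer
        m = mlp_scale[i] * np.tanh((h + a) @ W)  # stand-in for the MLP sublayer (same shared W)
        # Learned per-iteration residual mix between the old and new activations.
        x = resid_mix[i] * x + (1.0 - resid_mix[i]) * (h + a + m)
    return x
```

Because W is shared, parameter count stays near that of a one-layer model while effective depth is 12; only the small per-iteration tensors grow with the loop count.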

Novel Contributions

  • Universal Transformer-style shared-weight block looped 12 times
  • Per-iteration parameters for recurrence (attn_scale, mlp_scale, resid_mix, iteration_embed)
  • 50% sparse-to-dense curriculum
  • Demonstration of depth-recurrence benefits, consistent with the findings of prior PR #363
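
The 50% sparse-to-dense curriculum can be sketched as a density schedule. The linear ramp and the function name below are assumptions; the PR only states that training transitions from 50% sparse (i.e. half the weights active) to fully dense.

```python
def density_schedule(step, total_steps, start_sparse_ratio=0.5):
    """Fraction of weights kept active at a given training step.

    Ramps linearly (an assumed schedule shape) from 1 - start_sparse_ratio
    at step 0 to 1.0 (fully dense) at total_steps.
    """
    frac = min(step / total_steps, 1.0)          # progress through training, clamped
    start_density = 1.0 - start_sparse_ratio     # 0.5 for the PR's sparse_ratio of 0.5
    return start_density + frac * (1.0 - start_density)
```

At step 0 this keeps 50% of weights active, 75% at the halfway point, and 100% from the final step onward.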