PR #1193

open

Non-record: Universal Transformer + Adaptive Density (val_bpb 1.4390)

by dentity007
val_bpb: 1.4390
Architecture: Transformer
Optimizer:
Artifact Size: 2.87 MB

Training Techniques

Architecture
depth recurrence
Single shared transformer block looped 12 times with per-iteration parameters.
parameters: {"layers":12}
weight tying
Shared-weight block reused across iterations.
parameters: null
iteration embeddings
Per-iteration parameters include an iteration embedding and scaling/mixing terms.
parameters: null
Other
other
Adaptive density curriculum that transitions from 50% sparse to dense during training.
parameters: {"sparse_ratio":0.5}
Test-Time Training
full TTT
parameters: null
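
The looped shared-block idea above can be illustrated with a toy sketch. This is not the PR's code: the function name, the use of a single weight matrix as a stand-in for the attention and MLP sublayers, and the exact way the per-iteration parameters (iteration_embed, attn_scale, mlp_scale, resid_mix) are applied are all assumptions for illustration; only the "one shared block looped 12 times with per-iteration parameters" structure comes from the PR description.

```python
import numpy as np

def looped_forward(x, W, iter_embed, attn_scale, mlp_scale, resid_mix, layers=12):
    """Universal Transformer-style forward pass: one shared block, looped.

    x:          (d,) input activations
    W:          (d, d) weights of the single shared block, reused every iteration
    iter_embed: (layers, d) per-iteration embedding added before the block
    attn_scale, mlp_scale, resid_mix: (layers,) per-iteration scalars
    """
    for i in range(layers):
        h = x + iter_embed[i]                    # tell the block which iteration it is in
        a = attn_scale[i] * (h @ W)              # stand-in for the attention sublayer
        m = mlp_scale[i] * np.tanh((h + a) @ W)  # stand-in for the MLP sublayer (same shared W)
        # Learned per-iteration residual mix between the old and new activations.
        x = resid_mix[i] * x + (1.0 - resid_mix[i]) * (h + a + m)
    return x
```

Because W is shared, parameter count stays near that of a one-layer model while effective depth is 12; only the small per-iteration tensors grow with the loop count.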

Novel Contributions

  • Universal Transformer-style shared-weight block looped 12 times
  • Per-iteration parameters for recurrence (attn_scale, mlp_scale, resid_mix, iteration_embed)
  • 50% sparse-to-dense curriculum
  • Demonstration of depth-recurrence benefits, consistent with the findings of prior PR #363
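
The 50% sparse-to-dense curriculum can be sketched as a density schedule. The linear ramp and the function name below are assumptions; the PR only states that training transitions from 50% sparse (i.e. half the weights active) to fully dense.

```python
def density_schedule(step, total_steps, start_sparse_ratio=0.5):
    """Fraction of weights kept active at a given training step.

    Ramps linearly (an assumed schedule shape) from 1 - start_sparse_ratio
    at step 0 to 1.0 (fully dense) at total_steps.
    """
    frac = min(step / total_steps, 1.0)          # progress through training, clamped
    start_density = 1.0 - start_sparse_ratio     # 0.5 for the PR's sparse_ratio of 0.5
    return start_density + frac * (1.0 - start_density)
```

At step 0 this keeps 50% of weights active, 75% at the halfway point, and 100% from the final step onward.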