PR #1293

open

Non-record: Universal Transformer with Adaptive Computation Time

val_bpb
1.2409
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79 MB

Training Techniques

Architecture
depth recurrence
Universal Transformer with shared layers reused across multiple passes, plus adaptive computation time halting.
parameters: {"layers":9,"passes":2}
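The depth recurrence above (9 shared layers traversed in 2 passes, per the parameters) can be sketched in a few lines. This is a toy stand-in, not the PR's model: the width, the residual "block", and the weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical model width; the PR reports 9 shared layers, 2 passes

# One weight matrix per shared layer (stand-in for a full transformer block).
layers = [rng.standard_normal((D, D)) * 0.05 for _ in range(9)]

def block(x, W):
    # Toy residual block: residual stream plus a small nonlinear map.
    return x + np.tanh(x @ W)

def universal_forward(x, passes=2):
    # Depth recurrence: the same 9 layers are traversed `passes` times,
    # giving effective depth 9 * passes with no new unique parameters.
    for _ in range(passes):
        for W in layers:
            x = block(x, W)
    return x

x = rng.standard_normal((4, D))
y = universal_forward(x, passes=2)
```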
weight tying
The input embedding and the output projection share one weight matrix.
parameters: null
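A minimal sketch of weight tying, with hypothetical vocabulary size and width: the output logit projection reuses the embedding matrix transposed, so no separate unembedding matrix is stored.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, D = 50, 16  # hypothetical sizes for illustration

E = rng.standard_normal((vocab, D)) * 0.02  # token embedding matrix

def embed(ids):
    return E[ids]

def logits(h):
    # Weight tying: the unembedding is E.T, shared with the embedding.
    return h @ E.T

h = embed(np.array([3, 7]))
out = logits(h)
```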
GQA
Grouped query attention with fewer key/value heads than query heads.
parameters: {"heads":8,"kv_heads":4}
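The 8-query-head / 4-KV-head configuration above can be sketched in plain numpy; the head dimension and sequence length are made up for illustration, and the broadcast-by-repeat is one common way to express KV sharing.

```python
import numpy as np

rng = np.random.default_rng(0)
T, heads, kv_heads, hd = 5, 8, 4, 4  # head counts from the PR; T, hd hypothetical

q = rng.standard_normal((heads, T, hd))
k = rng.standard_normal((kv_heads, T, hd))
v = rng.standard_normal((kv_heads, T, hd))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Grouped-query attention: each group of heads // kv_heads query heads
# shares one K/V head, halving the KV cache here (8 -> 4 KV heads).
group = heads // kv_heads
k_rep = np.repeat(k, group, axis=0)  # (heads, T, hd)
v_rep = np.repeat(v, group, axis=0)
att = softmax(q @ k_rep.transpose(0, 2, 1) / np.sqrt(hd))
out = att @ v_rep
```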
RoPE
Rotary positional embeddings.
parameters: null
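A minimal rotary-embedding sketch, assuming the usual base of 10000 and a half-split pairing of feature dimensions (the PR does not specify its exact RoPE variant). Each (even, odd) feature pair is rotated by a position-dependent angle, so relative offsets show up as phase differences in q·k dot products.

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotate each feature pair (x1[i], x2[i]) by angle pos * base**(-i/half).
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(0).standard_normal((6, 8))
y = rope(x)
```

Because each pair undergoes a pure rotation, per-token vector norms are preserved.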
ReLU²
Squared ReLU MLP activation.
parameters: null
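The activation itself is one line; a tiny sketch:

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0) ** 2, used in place of GELU in the MLP.
    return np.maximum(x, 0.0) ** 2

out = relu2(np.array([-2.0, -0.5, 0.0, 1.0, 3.0]))
# negative inputs map to 0; positive inputs are squared
```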
Regularization
logit softcap
Logits are bounded with a soft cap (typically cap · tanh(logits / cap)).
parameters: null
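A sketch of tanh soft-capping. The cap value of 15.0 is a hypothetical choice; the PR reports no parameters for this technique.

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Smoothly bounds logits to (-cap, cap); approximately the identity
    # for |logits| << cap. cap=15.0 is an assumed value, not from the PR.
    return cap * np.tanh(logits / cap)

z = softcap(np.array([-100.0, -1.0, 0.0, 1.0, 100.0]))
```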
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
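Muon's core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it as the update direction. The sketch below follows the coefficients from the public Muon reference implementation; it is a generic sketch, not code from this PR.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Approximately orthogonalize G: normalize, then apply a quintic
    # Newton-Schulz iteration that drives all singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315  # tuned coefficients (Muon reference)
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).standard_normal((8, 12))
O = newton_schulz_orth(G)
```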
Quantization
int8
bits: 8
scope: all
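A sketch of symmetric per-tensor int8 quantization. The card reports bits=8 and scope=all but not the scaling scheme, so the max-abs scale below is an assumption.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric quantization: scale so the largest magnitude maps to 127,
    # round to int8, and keep one float scale for dequantization.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# round-trip error is bounded by half the quantization step
```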
Compression
zlib
level: null
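The artifact compression step is plain zlib; since the level is unreported (null), the sketch below uses zlib's default. The int8 payload here is synthetic, standing in for quantized weights.

```python
import zlib
import numpy as np

# Synthetic stand-in for a quantized weight tensor.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.standard_normal(10000) * 10), -127, 127).astype(np.int8)

blob = zlib.compress(q.tobytes())  # level unreported; zlib default used
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
```

Lossless round trip; the low-entropy int8 distribution is what makes the deflate step pay off.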

Novel Contributions

  • Universal Transformer with Adaptive Computation Time (ACT)
  • Shared-layer depth recurrence to increase effective depth without increasing unique parameters
  • Per-token halting mechanism with ponder cost for adaptive multi-pass computation
  • Demonstration that recursion can improve BPB at equal parameter budget
  • Exploration of hybrid partial recursion as a next-step improvement
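The per-token halting mechanism with ponder cost described above can be sketched as follows. All hyperparameters (max passes, threshold tau, ponder weight) and the toy shared block are hypothetical; this shows the ACT bookkeeping, not the PR's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 8                    # hypothetical sequence length and width
max_passes, tau = 4, 0.99      # hypothetical ACT settings
ponder_weight = 0.01           # hypothetical ponder-cost coefficient

x = rng.standard_normal((T, D))
W = rng.standard_normal((D, D)) * 0.1  # stand-in for the shared layer stack
w_h = rng.standard_normal(D) * 0.5     # halting head

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

state = x.copy()
cum_halt = np.zeros(T)    # accumulated halting probability per token
remainder = np.ones(T)    # probability mass not yet spent per token
output = np.zeros_like(x)
n_updates = np.zeros(T)   # passes actually used per token
wsum = np.zeros(T)        # sanity check: per-token weights should sum to 1

for step in range(max_passes):
    active = cum_halt < tau
    if not active.any():
        break
    state = state + np.tanh(state @ W)  # one more pass through shared layers
    p = sigmoid(state @ w_h)            # per-token halting probability
    n_updates += active
    # A token halts when its cumulative probability would cross tau
    # (or on the final pass); it then contributes its remainder mass.
    done = active & ((cum_halt + p >= tau) | (step == max_passes - 1))
    weight = np.where(done, remainder, np.where(active, p, 0.0))
    output += weight[:, None] * state
    wsum += weight
    remainder = np.where(done, 0.0, remainder - np.where(active, p, 0.0))
    cum_halt = np.where(done, 1.0, cum_halt + np.where(active, p, 0.0))

# Ponder cost penalizes extra passes, pushing easy tokens to halt early.
ponder_cost = ponder_weight * n_updates.mean()
```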