PR #1293

open

Non-record: Universal Transformer with Adaptive Computation Time

val_bpb
1.2409
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79 MB

Training Techniques

Architecture
depth recurrence
Universal Transformer with shared layers reused across multiple passes, plus adaptive computation time halting.
parameters: {"layers":9,"passes":2}
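The depth recurrence above (9 shared layers traversed in 2 passes, per the parameters) can be sketched in a few lines. This is a toy stand-in, not the PR's model: the width, the residual "block", and the weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical model width; the PR reports 9 shared layers, 2 passes

# One weight matrix per shared layer (stand-in for a full transformer block).
layers = [rng.standard_normal((D, D)) * 0.05 for _ in range(9)]

def block(x, W):
    # Toy residual block: residual stream plus a small nonlinear map.
    return x + np.tanh(x @ W)

def universal_forward(x, passes=2):
    # Depth recurrence: the same 9 layers are traversed `passes` times,
    # giving effective depth 9 * passes with no new unique parameters.
    for _ in range(passes):
        for W in layers:
            x = block(x, W)
    return x

x = rng.standard_normal((4, D))
y = universal_forward(x, passes=2)
```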
weight tying
The input embedding and the output projection share one weight matrix.
parameters: null
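A minimal sketch of weight tying, with hypothetical vocabulary size and width: the output logit projection reuses the embedding matrix transposed, so no separate unembedding matrix is stored.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, D = 50, 16  # hypothetical sizes for illustration

E = rng.standard_normal((vocab, D)) * 0.02  # token embedding matrix

def embed(ids):
    return E[ids]

def logits(h):
    # Weight tying: the unembedding is E.T, shared with the embedding.
    return h @ E.T

h = embed(np.array([3, 7]))
out = logits(h)
```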
GQA
Grouped query attention with fewer key/value heads than query heads.
parameters: {"heads":8,"kv_heads":4}
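The 8-query-head / 4-KV-head configuration above can be sketched in plain numpy; the head dimension and sequence length are made up for illustration, and the broadcast-by-repeat is one common way to express KV sharing.

```python
import numpy as np

rng = np.random.default_rng(0)
T, heads, kv_heads, hd = 5, 8, 4, 4  # head counts from the PR; T, hd hypothetical

q = rng.standard_normal((heads, T, hd))
k = rng.standard_normal((kv_heads, T, hd))
v = rng.standard_normal((kv_heads, T, hd))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Grouped-query attention: each group of heads // kv_heads query heads
# shares one K/V head, halving the KV cache here (8 -> 4 KV heads).
group = heads // kv_heads
k_rep = np.repeat(k, group, axis=0)  # (heads, T, hd)
v_rep = np.repeat(v, group, axis=0)
att = softmax(q @ k_rep.transpose(0, 2, 1) / np.sqrt(hd))
out = att @ v_rep
```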
RoPE
Rotary positional embeddings.
parameters: null
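A minimal rotary-embedding sketch, assuming the usual base of 10000 and a half-split pairing of feature dimensions (the PR does not specify its exact RoPE variant). Each (even, odd) feature pair is rotated by a position-dependent angle, so relative offsets show up as phase differences in q·k dot products.

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotate each feature pair (x1[i], x2[i]) by angle pos * base**(-i/half).
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(0).standard_normal((6, 8))
y = rope(x)
```

Because each pair undergoes a pure rotation, per-token vector norms are preserved.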
ReLU²
Squared ReLU MLP activation.
parameters: null
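The activation itself is one line; a tiny sketch:

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0) ** 2, used in place of GELU in the MLP.
    return np.maximum(x, 0.0) ** 2

out = relu2(np.array([-2.0, -0.5, 0.0, 1.0, 3.0]))
# negative inputs map to 0; positive inputs are squared
```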
Regularization
logit softcap
Logits are bounded with a soft cap (typically cap · tanh(logits / cap)).
parameters: null
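A sketch of tanh soft-capping. The cap value of 15.0 is a hypothetical choice; the PR reports no parameters for this technique.

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Smoothly bounds logits to (-cap, cap); approximately the identity
    # for |logits| << cap. cap=15.0 is an assumed value, not from the PR.
    return cap * np.tanh(logits / cap)

z = softcap(np.array([-100.0, -1.0, 0.0, 1.0, 100.0]))
```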
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
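Muon's core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it as the update direction. The sketch below follows the coefficients from the public Muon reference implementation; it is a generic sketch, not code from this PR.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Approximately orthogonalize G: normalize, then apply a quintic
    # Newton-Schulz iteration that drives all singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315  # tuned coefficients (Muon reference)
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).standard_normal((8, 12))
O = newton_schulz_orth(G)
```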
Quantization
int8
bits: 8
scope: all
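A sketch of symmetric per-tensor int8 quantization. The card reports bits=8 and scope=all but not the scaling scheme, so the max-abs scale below is an assumption.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric quantization: scale so the largest magnitude maps to 127,
    # round to int8, and keep one float scale for dequantization.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# round-trip error is bounded by half the quantization step
```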
Compression
zlib
level: null
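The artifact compression step is plain zlib; since the level is unreported (null), the sketch below uses zlib's default. The int8 payload here is synthetic, standing in for quantized weights.

```python
import zlib
import numpy as np

# Synthetic stand-in for a quantized weight tensor.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.standard_normal(10000) * 10), -127, 127).astype(np.int8)

blob = zlib.compress(q.tobytes())  # level unreported; zlib default used
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
```

Lossless round trip; the low-entropy int8 distribution is what makes the deflate step pay off.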

Novel Contributions

  • Universal Transformer with Adaptive Computation Time (ACT)
  • Shared-layer depth recurrence to increase effective depth without increasing unique parameters
  • Per-token halting mechanism with ponder cost for adaptive multi-pass computation
  • Demonstration that recursion can improve BPB at equal parameter budget
  • Exploration of hybrid partial recursion as a next-step improvement
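The per-token halting mechanism with ponder cost described above can be sketched as follows. All hyperparameters (max passes, threshold tau, ponder weight) and the toy shared block are hypothetical; this shows the ACT bookkeeping, not the PR's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 8                    # hypothetical sequence length and width
max_passes, tau = 4, 0.99      # hypothetical ACT settings
ponder_weight = 0.01           # hypothetical ponder-cost coefficient

x = rng.standard_normal((T, D))
W = rng.standard_normal((D, D)) * 0.1  # stand-in for the shared layer stack
w_h = rng.standard_normal(D) * 0.5     # halting head

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

state = x.copy()
cum_halt = np.zeros(T)    # accumulated halting probability per token
remainder = np.ones(T)    # probability mass not yet spent per token
output = np.zeros_like(x)
n_updates = np.zeros(T)   # passes actually used per token
wsum = np.zeros(T)        # sanity check: per-token weights should sum to 1

for step in range(max_passes):
    active = cum_halt < tau
    if not active.any():
        break
    state = state + np.tanh(state @ W)  # one more pass through shared layers
    p = sigmoid(state @ w_h)            # per-token halting probability
    n_updates += active
    # A token halts when its cumulative probability would cross tau
    # (or on the final pass); it then contributes its remainder mass.
    done = active & ((cum_halt + p >= tau) | (step == max_passes - 1))
    weight = np.where(done, remainder, np.where(active, p, 0.0))
    output += weight[:, None] * state
    wsum += weight
    remainder = np.where(done, 0.0, remainder - np.where(active, p, 0.0))
    cum_halt = np.where(done, 1.0, cum_halt + np.where(active, p, 0.0))

# Ponder cost penalizes extra passes, pushing easy tokens to halt early.
ponder_cost = ponder_weight * n_updates.mean()
```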