PR #1110
Notable Non-Record: Universal Transformer — 1.2249 BPB — Depth Recurrence with Iteration Embeddings
by gowtham0992
val_bpb: 1.2249
Architecture: Transformer
Optimizer: —
Artifact Size: 4.95 MB
Training Techniques
Architecture
depth recurrence
Three unique transformer blocks are shared across 4 iterations, yielding 12 effective layers.
parameters: {"unique_blocks":3,"iterations":4,"effective_layers":12}
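A minimal sketch of one plausible unrolling — the cyclic block order is an assumption; the PR may interleave the shared blocks differently:

```python
# Hedged sketch: unroll 3 unique (shared-weight) blocks over 4 iterations
# into 12 effective layers. A cyclic order [0,1,2,0,1,2,...] is assumed.
def effective_layer_schedule(unique_blocks: int, iterations: int) -> list[int]:
    """Index of the shared block executed at each effective layer."""
    return [layer % unique_blocks
            for layer in range(unique_blocks * iterations)]

schedule = effective_layer_schedule(3, 4)  # 12 entries cycling 0, 1, 2
```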
weight tying
The same block weights are reused across multiple effective layers/iterations.
parameters: {"shared_blocks":3}
U-Net skip connections
U-Net-style skip connections adapted for the looped recurrent structure.
parameters: null
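The card does not spell out how the U-Net skips attach to the loop; one common scheme (a sketch, with the mirrored pairing assumed) stashes activations in the first half of effective layers and adds them back in the second half:

```python
# Hedged sketch: the encoder half saves the residual stream; the decoder
# half adds the mirrored saved activation back before its block. The exact
# pairing, and whether skips are scaled or projected in the PR, is assumed.
def looped_unet_forward(x, block, effective_layers=12):
    saved = []
    half = effective_layers // 2
    for layer in range(effective_layers):
        if layer < half:
            saved.append(x)       # first half: stash activations
        else:
            x = x + saved.pop()   # second half: mirrored skip connection
        x = block(x, layer)       # stand-in for one of the shared blocks
    return x
```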
GQA
Uses grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
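A self-contained sketch of the grouping, where only the head counts (8 query heads, 4 KV heads) come from the card — shapes and the head-to-group mapping are assumptions:

```python
import numpy as np

# Hedged GQA sketch: 8 query heads share 4 KV heads, so each KV head
# serves heads // kv_heads = 2 query heads.
def grouped_query_attention(q, k, v):
    heads, T, d = q.shape            # q: (8, T, d)
    kv_heads = k.shape[0]            # k, v: (4, T, d)
    group = heads // kv_heads
    out = np.empty_like(q)
    for h in range(heads):
        kh = h // group                                   # shared KV head
        scores = q[h] @ k[kh].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                # softmax over keys
        out[h] = w @ v[kh]
    return out
```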
LeakyReLU
The MLP activation is squared LeakyReLU with negative slope 0.5, i.e. LeakyReLU(x)^2.
parameters: {"slope":0.5}
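The activation written out (the square makes it non-negative with a zero derivative at the origin):

```python
# Sketch of LeakyReLU(x; slope=0.5) squared, per the card's parameters.
def squared_leaky_relu(x: float, slope: float = 0.5) -> float:
    y = x if x >= 0.0 else slope * x
    return y * y
```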
BigramHash
BigramHash feature with 2048 buckets.
parameters: {"dimensions":2048}
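A hedged sketch of the idea: hash each (previous token, current token) bigram into one of 2048 buckets whose embeddings supplement the unigram embedding. The mixing constant below is illustrative, not the PR's actual hash:

```python
# Hedged sketch: map a token bigram to one of 2048 feature buckets.
# The multiplier 1_000_003 is an arbitrary illustrative mixing constant.
def bigram_bucket(prev_token: int, token: int, buckets: int = 2048) -> int:
    h = (prev_token * 1_000_003 + token) & 0xFFFF_FFFF
    return h % buckets
```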
SmearGate
SmearGate is included as part of the model design.
parameters: null
iteration embeddings
Learnable per-iteration embeddings are added before each block execution to distinguish recurrence steps.
parameters: {"vectors":12,"dimension":512}
iteration scales
Learnable per-iteration scales modulate residual contribution per effective layer.
parameters: {"vectors":12,"dimension":512}
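The two mechanisms above can be sketched together. The (12, 512) shapes come from the card; the exact injection points (embedding added before the block, scale applied to the block's residual output) are assumptions:

```python
import numpy as np

# Hedged sketch: before each of the 12 effective layers, add a learnable
# 512-dim iteration embedding so the shared block can tell recurrence
# steps apart, then scale the block's residual contribution with a
# learnable per-layer vector.
effective_layers, dim = 12, 512
rng = np.random.default_rng(0)
iter_embed = rng.normal(0.0, 0.02, (effective_layers, dim))  # learnable
iter_scale = np.ones((effective_layers, dim))                # learnable

def block(x, block_idx):
    """Stand-in for one of the 3 shared transformer blocks."""
    return x

def recurrent_forward(x, unique_blocks=3):
    for layer in range(effective_layers):
        h = x + iter_embed[layer]              # tag the recurrence step
        x = x + iter_scale[layer] * block(h, layer % unique_blocks)
    return x
```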
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
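A sketch of the EMA update with the card's decay of 0.997, applied per parameter once per step; SWA would instead keep a plain running mean of checkpoints:

```python
# Hedged EMA sketch: blend the running average toward the current
# parameters with decay 0.997 (from the card).
def ema_update(ema, params, decay=0.997):
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```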
Quantization
GPTQ
bits: 6
scope: model weights
Evaluation
sliding window eval
parameters: null
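A hedged sketch of the standard sliding-window scheme: slide a fixed-size context window with overlap and score only the tokens past the overlap, so each token is predicted with (up to) a full window of context. The window and stride values are illustrative, not from the PR:

```python
# Hedged sketch: compute which token spans each window scores.
def sliding_eval_spans(n_tokens, window=1024, stride=512):
    """Yield (start, end, score_from), score_from relative to start."""
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end, 0 if start == 0 else window - stride))
        if end == n_tokens:
            break
        start += stride
    return spans
```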
Novel Contributions
- Adds learnable per-iteration embeddings, inspired by the Universal Transformer's timestep embeddings, to the depth-recurrent stack.
- Adds per-iteration learnable scales to vary residual impact across iterations.
- Uses 3 shared transformer blocks across 4 iterations to reach 12 effective layers, cutting the block parameter count to a quarter of an untied 12-layer stack.
- Produces a much smaller artifact while remaining reproducible from the provided training script.