PR #1110
Notable Non-Record: Universal Transformer — 1.2249 BPB — Depth Recurrence with Iteration Embeddings
by gowtham0992
val_bpb: 1.2249
Architecture: Transformer
Optimizer: —
Artifact Size: 4.95 MB
Training Techniques
Architecture
depth recurrence
Three unique transformer blocks are shared across 4 iterations, yielding 12 effective layers.
parameters: {"unique_blocks":3,"iterations":4,"effective_layers":12}
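A minimal sketch of one plausible unrolling — the cyclic block order is an assumption; the PR may interleave the shared blocks differently:

```python
# Hedged sketch: unroll 3 unique (shared-weight) blocks over 4 iterations
# into 12 effective layers. A cyclic order [0,1,2,0,1,2,...] is assumed.
def effective_layer_schedule(unique_blocks: int, iterations: int) -> list[int]:
    """Index of the shared block executed at each effective layer."""
    return [layer % unique_blocks
            for layer in range(unique_blocks * iterations)]

schedule = effective_layer_schedule(3, 4)  # 12 entries cycling 0, 1, 2
```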
weight tying
The same block weights are reused across multiple effective layers/iterations.
parameters: {"shared_blocks":3}
U-Net skip connections
U-Net-style skip connections adapted for the looped recurrent structure.
parameters: null
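The card does not spell out how the U-Net skips attach to the loop; one common scheme (a sketch, with the mirrored pairing assumed) stashes activations in the first half of effective layers and adds them back in the second half:

```python
# Hedged sketch: the encoder half saves the residual stream; the decoder
# half adds the mirrored saved activation back before its block. The exact
# pairing, and whether skips are scaled or projected in the PR, is assumed.
def looped_unet_forward(x, block, effective_layers=12):
    saved = []
    half = effective_layers // 2
    for layer in range(effective_layers):
        if layer < half:
            saved.append(x)       # first half: stash activations
        else:
            x = x + saved.pop()   # second half: mirrored skip connection
        x = block(x, layer)       # stand-in for one of the shared blocks
    return x
```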
GQA
Uses grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
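A self-contained sketch of the grouping, where only the head counts (8 query heads, 4 KV heads) come from the card — shapes and the head-to-group mapping are assumptions:

```python
import numpy as np

# Hedged GQA sketch: 8 query heads share 4 KV heads, so each KV head
# serves heads // kv_heads = 2 query heads.
def grouped_query_attention(q, k, v):
    heads, T, d = q.shape            # q: (8, T, d)
    kv_heads = k.shape[0]            # k, v: (4, T, d)
    group = heads // kv_heads
    out = np.empty_like(q)
    for h in range(heads):
        kh = h // group                                   # shared KV head
        scores = q[h] @ k[kh].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                # softmax over keys
        out[h] = w @ v[kh]
    return out
```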
LeakyReLU
The MLP activation is squared LeakyReLU with negative slope 0.5, i.e. LeakyReLU(x)^2.
parameters: {"slope":0.5}
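The activation written out (the square makes it non-negative with a zero derivative at the origin):

```python
# Sketch of LeakyReLU(x; slope=0.5) squared, per the card's parameters.
def squared_leaky_relu(x: float, slope: float = 0.5) -> float:
    y = x if x >= 0.0 else slope * x
    return y * y
```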
BigramHash
BigramHash feature with 2048 buckets.
parameters: {"dimensions":2048}
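A hedged sketch of the idea: hash each (previous token, current token) bigram into one of 2048 buckets whose embeddings supplement the unigram embedding. The mixing constant below is illustrative, not the PR's actual hash:

```python
# Hedged sketch: map a token bigram to one of 2048 feature buckets.
# The multiplier 1_000_003 is an arbitrary illustrative mixing constant.
def bigram_bucket(prev_token: int, token: int, buckets: int = 2048) -> int:
    h = (prev_token * 1_000_003 + token) & 0xFFFF_FFFF
    return h % buckets
```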
SmearGate
SmearGate is included as part of the model design.
parameters: null
iteration embeddings
Learnable per-iteration embeddings are added before each block execution to distinguish recurrence steps.
parameters: {"vectors":12,"dimension":512}
iteration scales
Learnable per-iteration scales modulate residual contribution per effective layer.
parameters: {"vectors":12,"dimension":512}
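The two mechanisms above can be sketched together. The (12, 512) shapes come from the card; the exact injection points (embedding added before the block, scale applied to the block's residual output) are assumptions:

```python
import numpy as np

# Hedged sketch: before each of the 12 effective layers, add a learnable
# 512-dim iteration embedding so the shared block can tell recurrence
# steps apart, then scale the block's residual contribution with a
# learnable per-layer vector.
effective_layers, dim = 12, 512
rng = np.random.default_rng(0)
iter_embed = rng.normal(0.0, 0.02, (effective_layers, dim))  # learnable
iter_scale = np.ones((effective_layers, dim))                # learnable

def block(x, block_idx):
    """Stand-in for one of the 3 shared transformer blocks."""
    return x

def recurrent_forward(x, unique_blocks=3):
    for layer in range(effective_layers):
        h = x + iter_embed[layer]              # tag the recurrence step
        x = x + iter_scale[layer] * block(h, layer % unique_blocks)
    return x
```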
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
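A sketch of the EMA update with the card's decay of 0.997, applied per parameter once per step; SWA would instead keep a plain running mean of checkpoints:

```python
# Hedged EMA sketch: blend the running average toward the current
# parameters with decay 0.997 (from the card).
def ema_update(ema, params, decay=0.997):
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```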
Quantization
GPTQ
bits: 6
scope: model weights
Evaluation
sliding window eval
parameters: null
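A hedged sketch of the standard sliding-window scheme: slide a fixed-size context window with overlap and score only the tokens past the overlap, so each token is predicted with (up to) a full window of context. The window and stride values are illustrative, not from the PR:

```python
# Hedged sketch: compute which token spans each window scores.
def sliding_eval_spans(n_tokens, window=1024, stride=512):
    """Yield (start, end, score_from), score_from relative to start."""
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end, 0 if start == 0 else window - stride))
        if end == n_tokens:
            break
        start += stride
    return spans
```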
Novel Contributions
- Adds learnable per-iteration embeddings, inspired by the Universal Transformer's timestep embeddings, to the depth-recurrent stack.
- Adds per-iteration learnable scales to vary residual impact across iterations.
- Uses 3 shared transformer blocks across 4 iterations to reach 12 effective layers, cutting the block parameter count to a quarter of an untied 12-layer stack.
- Produces a much smaller artifact while remaining reproducible from the provided training script.