PR #1088

open

Non-Record Universal Transformer submission (2× attention layers, 3-layer MLP, depth scheduling).

by serdardoesml
val_bpb: 1.2542
Architecture: Transformer
Optimizer:
Artifact Size: 15,982,324 bytes

Training Techniques

Architecture
depth recurrence
Shared Universal Transformer-style recurrent block with weights reused across depth.
parameters: {"layers":null}
attention
Two attention layers before the MLP to enable circuits like induction heads.
parameters: {"layers":2}
MLP3x
Three-layer MLP with an added fully connected layer between up and down projections.
parameters: {"layers":3}
weight tying
Main weights are shared across depth while norms remain independent.
parameters: null
bias to pre-norms
Added bias terms to pre-norm layers to act like a depth embedding.
parameters: null
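The block layout described above can be sketched structurally in plain Python. This is a hypothetical illustration of what is shared versus per-depth, not the submission's code: one set of block weights (two attention layers plus a three-layer MLP) is reused at every depth, while each depth keeps its own pre-norm with an independent bias. Names like `SharedBlock` and `DepthNorm` are invented for the sketch.

```python
class SharedBlock:
    """One set of transformer-block weights, reused at every depth.

    Stand-in attributes mirror the described layout: two attention
    layers followed by a three-layer MLP (up -> mid -> down, where
    `mlp_mid` is the extra fully connected layer).
    """
    def __init__(self):
        self.attn1 = object()     # placeholders for real weight tensors
        self.attn2 = object()
        self.mlp_up = object()
        self.mlp_mid = object()   # the added FC layer of the 3-layer MLP
        self.mlp_down = object()


class DepthNorm:
    """Per-depth pre-norm with its own bias (acts like a depth embedding)."""
    def __init__(self, depth_index):
        self.depth_index = depth_index
        self.scale = 1.0
        self.bias = 0.0           # independent bias at each depth


class UniversalTransformer:
    def __init__(self, max_depth):
        self.block = SharedBlock()                             # shared across depth
        self.norms = [DepthNorm(d) for d in range(max_depth)]  # independent per depth

    def layers(self, depth):
        # The same block object is applied `depth` times;
        # only the norm (and its bias) changes per iteration.
        return [(self.norms[d], self.block) for d in range(depth)]


model = UniversalTransformer(max_depth=6)
pairs = model.layers(depth=6)
```

Because every entry in `pairs` holds the identical `SharedBlock` object, increasing depth adds compute but no parameters; only the lightweight per-depth norms grow with `max_depth`.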
Quantization
QAT
bits: 8
scope: model weights
int8
bits: 8
scope: final roundtrip
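An int8 round trip of the kind used in QAT can be sketched as a symmetric per-tensor quantize-dequantize step. This is a minimal, assumed scheme; the submission's exact scaling and noise injection for its noisy QAT may differ.

```python
def int8_fake_quant(weights):
    """Quantize-dequantize round trip (symmetric, per-tensor int8).

    Returns both the dequantized floats (as seen by a QAT forward
    pass) and the raw int8 codes (as stored in the final artifact).
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0                      # map the largest weight to 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    deq = [v * scale for v in q]                 # the "roundtrip" values
    return deq, q


deq, q = int8_fake_quant([0.5, -1.27, 0.003])
```

Training against the round-tripped values (rather than the raw floats) is what lets the model adapt to the quantization error before the final int8 export.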
Compression
zlib
level: null
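The final artifact compression is standard zlib over the serialized weight bytes. A minimal sketch (the byte pattern and compression level here are illustrative; the submission does not specify a level):

```python
import zlib

# Hypothetical int8 weight bytes standing in for the real artifact;
# quantized weights tend to have repeated values, which zlib exploits.
weights = bytes([0, 0, 0, 5, 5, 5, 250, 250] * 1000)

compressed = zlib.compress(weights, level=9)   # level is an assumption
restored = zlib.decompress(compressed)
assert restored == weights                     # lossless round trip
```

Since quantization is lossy but zlib is lossless, the reported artifact size reflects the int8 codes after entropy coding, with no further accuracy cost.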
LR Schedule
layer/depth schedule
parameters: {"schedule":"0:2,2000:6"}
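The schedule string above reads naturally as "step:depth" breakpoints, i.e. depth 2 from step 0 and depth 6 from step 2000 onward. A small parser under that assumed interpretation:

```python
def parse_depth_schedule(spec):
    """Parse "step:depth" pairs, e.g. "0:2,2000:6".

    Returns a function mapping a training step to the active depth.
    The step:depth interpretation is inferred from the parameter
    format, not confirmed by the submission.
    """
    points = sorted(
        (int(step), int(depth))
        for step, depth in (item.split(":") for item in spec.split(","))
    )

    def depth_at(step):
        current = points[0][1]
        for s, d in points:       # last breakpoint at or before `step` wins
            if step >= s:
                current = d
        return current

    return depth_at


depth_at = parse_depth_schedule("0:2,2000:6")
```

Training at low depth early is cheap per step; the schedule then switches the shared block to its full recurrence depth for the remainder of the run.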
Other
other
Compiled all scheduled depths up front during warmup/priming to avoid recompiles when switching depths.
parameters: null
other
Removed U-Net style extra skip connections for simplicity.
parameters: null
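The up-front compilation trick can be sketched as priming a per-depth compile cache during warmup, so that switching depth mid-training hits a cached artifact instead of triggering a recompile. `compile_fn` stands in for torch.compile-style machinery; the submission's actual mechanism may differ.

```python
class CompileCache:
    """Cache of compiled step functions keyed by recurrence depth."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}
        self.compile_count = 0    # tracks how many real compiles happened

    def get(self, depth):
        if depth not in self.cache:
            self.cache[depth] = self.compile_fn(depth)
            self.compile_count += 1
        return self.cache[depth]

    def prime(self, depths):
        # Compile every depth the schedule will ever use, up front.
        for d in depths:
            self.get(d)


cache = CompileCache(compile_fn=lambda depth: f"compiled(depth={depth})")
cache.prime([2, 6])          # during warmup/priming
step_fn = cache.get(6)       # later depth switch: cache hit, no recompile
```

With the schedule's two depths (2 and 6) both primed during warmup, the mid-run depth switch at step 2000 costs nothing beyond a dictionary lookup.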

Novel Contributions

  • Universal Transformer-style shared recurrent block with two attention layers before the MLP
  • Three-layer MLP instead of a standard wider MLP
  • Independent norms across depth with bias added to pre-norms as a depth embedding
  • Noisy QAT for quantized BPB improvement
  • Depth/layer scheduling with early low-depth training
  • Precompiling all scheduled depths during warmup to avoid recompiles
  • Removal of U-Net skip connections for shared-weight simplicity