PR #1088

open

Non-Record Universal Transformer submission (2× attention layers, 3-layer MLP, depth scheduling).

by serdardoesml
val_bpb: 1.2542
Architecture: Transformer
Optimizer:
Artifact Size: 15,982,324 bytes

Training Techniques

Architecture
depth recurrence
Shared Universal Transformer-style recurrent block with weights reused across depth.
parameters: {"layers":null}
attention
Two attention layers before the MLP to enable circuits like induction heads.
parameters: {"layers":2}
MLP3x
Three-layer MLP with an added fully connected layer between up and down projections.
parameters: {"layers":3}
weight tying
Main weights are shared across depth while norms remain independent.
parameters: null
bias to pre-norms
Added bias terms to pre-norm layers to act like a depth embedding.
parameters: null
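The block layout described above can be sketched structurally in plain Python. This is a hypothetical illustration of what is shared versus per-depth, not the submission's code: one set of block weights (two attention layers plus a three-layer MLP) is reused at every depth, while each depth keeps its own pre-norm with an independent bias. Names like `SharedBlock` and `DepthNorm` are invented for the sketch.

```python
class SharedBlock:
    """One set of transformer-block weights, reused at every depth.

    Stand-in attributes mirror the described layout: two attention
    layers followed by a three-layer MLP (up -> mid -> down, where
    `mlp_mid` is the extra fully connected layer).
    """
    def __init__(self):
        self.attn1 = object()     # placeholders for real weight tensors
        self.attn2 = object()
        self.mlp_up = object()
        self.mlp_mid = object()   # the added FC layer of the 3-layer MLP
        self.mlp_down = object()


class DepthNorm:
    """Per-depth pre-norm with its own bias (acts like a depth embedding)."""
    def __init__(self, depth_index):
        self.depth_index = depth_index
        self.scale = 1.0
        self.bias = 0.0           # independent bias at each depth


class UniversalTransformer:
    def __init__(self, max_depth):
        self.block = SharedBlock()                             # shared across depth
        self.norms = [DepthNorm(d) for d in range(max_depth)]  # independent per depth

    def layers(self, depth):
        # The same block object is applied `depth` times;
        # only the norm (and its bias) changes per iteration.
        return [(self.norms[d], self.block) for d in range(depth)]


model = UniversalTransformer(max_depth=6)
pairs = model.layers(depth=6)
```

Because every entry in `pairs` holds the identical `SharedBlock` object, increasing depth adds compute but no parameters; only the lightweight per-depth norms grow with `max_depth`.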
Quantization
QAT
bits: 8
scope: model weights
int8
bits: 8
scope: final roundtrip
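An int8 round trip of the kind used in QAT can be sketched as a symmetric per-tensor quantize-dequantize step. This is a minimal, assumed scheme; the submission's exact scaling and noise injection for its noisy QAT may differ.

```python
def int8_fake_quant(weights):
    """Quantize-dequantize round trip (symmetric, per-tensor int8).

    Returns both the dequantized floats (as seen by a QAT forward
    pass) and the raw int8 codes (as stored in the final artifact).
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0                      # map the largest weight to 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    deq = [v * scale for v in q]                 # the "roundtrip" values
    return deq, q


deq, q = int8_fake_quant([0.5, -1.27, 0.003])
```

Training against the round-tripped values (rather than the raw floats) is what lets the model adapt to the quantization error before the final int8 export.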
Compression
zlib
level: null
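The final artifact compression is standard zlib over the serialized weight bytes. A minimal sketch (the byte pattern and compression level here are illustrative; the submission does not specify a level):

```python
import zlib

# Hypothetical int8 weight bytes standing in for the real artifact;
# quantized weights tend to have repeated values, which zlib exploits.
weights = bytes([0, 0, 0, 5, 5, 5, 250, 250] * 1000)

compressed = zlib.compress(weights, level=9)   # level is an assumption
restored = zlib.decompress(compressed)
assert restored == weights                     # lossless round trip
```

Since quantization is lossy but zlib is lossless, the reported artifact size reflects the int8 codes after entropy coding, with no further accuracy cost.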
LR Schedule
layer/depth schedule
parameters: {"schedule":"0:2,2000:6"}
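The schedule string above reads naturally as "step:depth" breakpoints, i.e. depth 2 from step 0 and depth 6 from step 2000 onward. A small parser under that assumed interpretation:

```python
def parse_depth_schedule(spec):
    """Parse "step:depth" pairs, e.g. "0:2,2000:6".

    Returns a function mapping a training step to the active depth.
    The step:depth interpretation is inferred from the parameter
    format, not confirmed by the submission.
    """
    points = sorted(
        (int(step), int(depth))
        for step, depth in (item.split(":") for item in spec.split(","))
    )

    def depth_at(step):
        current = points[0][1]
        for s, d in points:       # last breakpoint at or before `step` wins
            if step >= s:
                current = d
        return current

    return depth_at


depth_at = parse_depth_schedule("0:2,2000:6")
```

Training at low depth early is cheap per step; the schedule then switches the shared block to its full recurrence depth for the remainder of the run.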
Other
other
Compiled all scheduled depths up front during warmup/priming to avoid recompiles when switching depths.
parameters: null
other
Removed U-Net style extra skip connections for simplicity.
parameters: null
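The up-front compilation trick can be sketched as priming a per-depth compile cache during warmup, so that switching depth mid-training hits a cached artifact instead of triggering a recompile. `compile_fn` stands in for torch.compile-style machinery; the submission's actual mechanism may differ.

```python
class CompileCache:
    """Cache of compiled step functions keyed by recurrence depth."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}
        self.compile_count = 0    # tracks how many real compiles happened

    def get(self, depth):
        if depth not in self.cache:
            self.cache[depth] = self.compile_fn(depth)
            self.compile_count += 1
        return self.cache[depth]

    def prime(self, depths):
        # Compile every depth the schedule will ever use, up front.
        for d in depths:
            self.get(d)


cache = CompileCache(compile_fn=lambda depth: f"compiled(depth={depth})")
cache.prime([2, 6])          # during warmup/priming
step_fn = cache.get(6)       # later depth switch: cache hit, no recompile
```

With the schedule's two depths (2 and 6) both primed during warmup, the mid-run depth switch at step 2000 costs nothing beyond a dictionary lookup.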

Novel Contributions

  • Universal Transformer-style shared recurrent block with two attention layers before the MLP
  • Three-layer MLP instead of a standard wider MLP
  • Independent norms across depth with bias added to pre-norms as a depth embedding
  • Noisy QAT for quantized BPB improvement
  • Depth/layer scheduling with early low-depth training
  • Precompiling all scheduled depths during warmup to avoid recompiles
  • Removal of U-Net skip connections for shared-weight simplicity