PR #739

open

T5: Phase-Based Depth Recurrence + MLA + Graduated Precision (Non-Record)

by Jonas-T5
val_bpb
1.5000
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
~13 MB

Training Techniques

Architecture
depth recurrence
8 unique transformer blocks are reused across 4 specialized phases (5 repetitions each) for 40 effective layers, so each phase learns its own role instead of cycling all blocks uniformly.
parameters: {"unique_blocks":8,"phases":4,"repetitions":5,"effective_depth":40,"width":512}
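The layer schedule implied by these parameters can be sketched as follows. This is a hypothetical reading (the PR does not spell out the block-to-phase mapping): the 8 unique blocks are partitioned into 4 phases of 2 blocks each, and each phase's blocks are unrolled 5 times, giving 4 · 2 · 5 = 40 effective layers; `build_schedule` is an illustrative name.

```python
# Hypothetical sketch of the phase-based recurrence schedule: 8 unique
# blocks partitioned into 4 phases of 2, each phase unrolled 5 times.
def build_schedule(unique_blocks=8, phases=4, repetitions=5):
    per_phase = unique_blocks // phases        # blocks owned by each phase
    schedule = []
    for p in range(phases):
        phase_blocks = range(p * per_phase, (p + 1) * per_phase)
        for _ in range(repetitions):           # unroll this phase's blocks
            schedule.extend(phase_blocks)
    return schedule                            # 40 block indices

schedule = build_schedule()
```

Unlike uniform ALBERT-style cycling, a given block only ever appears within its own phase, which is what allows the phase-specific specialization claimed above.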
MLA
Multi-Head Latent Attention uses low-rank KV compression to replace GQA and reduce attention parameters.
parameters: null
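A minimal sketch of the MLA idea, with assumed shapes (the PR reports no latent dimension; `d_latent=128` and all variable names here are illustrative): the hidden state is first compressed to a small latent, which is what gets cached, then expanded to per-head keys and values, cutting KV projection parameters relative to a full K/V projection.

```python
import numpy as np

# Assumed dimensions for illustration; only d_model=512 comes from the PR.
d_model, n_heads, d_head, d_latent = 512, 8, 64, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02       # compress
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

x = rng.standard_normal((16, d_model))   # 16 tokens
c = x @ W_down                           # cached latent, 16 x 128
k = c @ W_up_k                           # expanded keys
v = c @ W_up_v                           # expanded values

full_params = 2 * d_model * n_heads * d_head                   # plain K,V
mla_params = d_model * d_latent + 2 * d_latent * n_heads * d_head
```

With these numbers the KV projections shrink from 524,288 to 196,608 parameters, and the cache stores the 128-dim latent rather than 512-dim keys and values per token.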
Quantization
mixed FP4/Int6 QAT
bits: null
scope: early-phase layers FP4, late-phase layers Int6
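A hedged sketch of the graduated fake-quantization used in QAT: early-phase weights are rounded to a 4-bit grid and late-phase weights to a 6-bit grid. Real FP4 is a non-uniform floating-point format; the uniform symmetric grid below is a simplification for illustration.

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric signed quantize-dequantize onto 2^(bits-1) - 1 levels
    # per sign; a uniform-grid stand-in for the true FP4/Int6 formats.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
w_fp4_like = fake_quant(w, 4)   # early-phase layers: coarse grid
w_int6 = fake_quant(w, 6)       # late-phase layers: finer grid
```

The 6-bit grid has 4x more levels, so late-phase layers keep a smaller round-trip error than the FP4-like early phase.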
FP8
bits: 8
scope: all persistent state (master weights, optimizer momentum)
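Stochastic rounding is what makes low-precision persistent state viable: rounding up with probability equal to the fractional distance keeps repeated small updates unbiased in expectation, where round-to-nearest would silently drop them. A minimal sketch (rounding to integers for clarity rather than to the FP8 grid):

```python
import numpy as np

def stochastic_round(x, rng):
    # Round down, then round up with probability equal to the fraction,
    # so E[stochastic_round(x)] == x.
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(0)
x = np.full(100_000, 0.25)
rounded = stochastic_round(x, rng)
# round-to-nearest would map every 0.25 to 0.0; stochastic rounding
# preserves the mean (~0.25) across many values
```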
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"with":"AdamW","orthogonalization":"Newton-Schulz for 2D weights"}
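The Newton-Schulz step at the heart of Muon can be sketched as below, using the quintic iteration and coefficients from the public Muon implementation (in practice it runs on GPU in bf16; this NumPy version is for illustration only):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes the singular values of the
    # 2D gradient matrix toward 1, approximately orthogonalizing it.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                            # work on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))          # a 2D weight gradient
O = newton_schulz(G)
# singular values of O are all pulled into a narrow band near 1
```

Per the listing, Muon handles the 2D weight matrices this way while AdamW covers the remaining parameters (embeddings, norms, scalars).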
Weight Averaging
EMA
parameters: null
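A minimal sketch of EMA weight averaging (the decay value is an assumption; the PR reports no parameters): a shadow copy of the weights is updated after each optimizer step and used for evaluation.

```python
def ema_update(ema, weights, decay=0.999):
    # decay=0.999 is an assumed value, not from the PR.
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

weights = [1.0, 2.0]
ema = list(weights)                        # shadow copy for evaluation
for step in range(10):
    weights = [w + 0.1 for w in weights]   # stand-in for an optimizer step
    ema = ema_update(ema, weights)
# ema trails the raw weights, smoothing out step-to-step noise
```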
Initialization
DeepNorm init
Output projections are scaled by (8·N)^(-1/4), where N is the effective depth, for stability at large depth.
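Concretely, with the 40 effective layers above the extra gain works out to roughly 0.24. A sketch, assuming the gain multiplies a standard 1/sqrt(d) fan-in init (the PR does not state the base initializer):

```python
import numpy as np

N = 40                          # effective depth from the architecture
gain = (8 * N) ** -0.25         # DeepNorm-style down-scaling, ~0.24

rng = np.random.default_rng(0)
d = 512
# Assumed base init: 1/sqrt(d) fan-in scaling, shrunk further by the gain
# so residual-branch outputs stay small in a 40-layer stack.
W_out = rng.standard_normal((d, d)) * (d ** -0.5) * gain
```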
Regularization
Z-Loss
parameters: null
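A sketch of the z-loss auxiliary term (the coefficient is an assumption; no parameters are reported): it penalizes the squared log-partition function of the output logits, discouraging them from drifting to large magnitudes.

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    # coeff=1e-4 is an assumed value. z is the log-partition function
    # log(sum(exp(logits))) per token; penalizing z^2 keeps logits bounded.
    z = np.log(np.exp(logits).sum(axis=-1))
    return coeff * (z ** 2).mean()

logits = np.array([[1.0, 2.0, 0.5], [0.1, -0.3, 0.2]])
aux = z_loss(logits)    # added to the cross-entropy loss during training
```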
QK-Clip
parameters: null
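A hedged sketch of one common QK-Clip formulation (the PR gives no parameters, so the threshold and the exact variant are assumptions): when the maximum pre-softmax attention logit exceeds a cap tau, queries and keys are rescaled so their product falls back under it, preventing attention-logit blow-ups.

```python
import numpy as np

def qk_clip(q, k, tau=100.0):
    # tau=100.0 is an assumed threshold. The sqrt splits the correction
    # evenly between the query and key sides.
    logits = q @ k.T
    max_logit = np.abs(logits).max()
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5
        q, k = q * scale, k * scale
    return q, k

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64)) * 5   # deliberately large activations
k = rng.standard_normal((8, 64)) * 5
q2, k2 = qk_clip(q, k)                 # rescaled so max |logit| <= tau
```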

Novel Contributions

  • Phase-based depth recurrence with 8 unique blocks repeated across 4 specialized phases
  • 40 effective layers at full d=512 width with only 24M unique parameters
  • Multi-Head Latent Attention (MLA) for low-rank KV compression
  • Graduated precision scheme using FP4 for early layers and Int6 for late layers
  • FP8 training with stochastic rounding for persistent state on H100
  • Phase-specialized recurrence intended to outperform uniform ALBERT-style cycling