PR #739

open

T5: Phase-Based Depth Recurrence + MLA + Graduated Precision (Non-Record)

by Jonas-T5
val_bpb
1.5000
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
~13 MB

Training Techniques

Architecture
depth recurrence
8 unique transformer blocks are reused across 4 specialized phases (5 repetitions each) for 40 effective layers, so each phase learns its own role instead of cycling all blocks uniformly.
parameters: {"unique_blocks":8,"phases":4,"repetitions":5,"effective_depth":40,"width":512}
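The layer schedule implied by these parameters can be sketched as follows. This is a hypothetical reading (the PR does not spell out the block-to-phase mapping): the 8 unique blocks are partitioned into 4 phases of 2 blocks each, and each phase's blocks are unrolled 5 times, giving 4 · 2 · 5 = 40 effective layers; `build_schedule` is an illustrative name.

```python
# Hypothetical sketch of the phase-based recurrence schedule: 8 unique
# blocks partitioned into 4 phases of 2, each phase unrolled 5 times.
def build_schedule(unique_blocks=8, phases=4, repetitions=5):
    per_phase = unique_blocks // phases        # blocks owned by each phase
    schedule = []
    for p in range(phases):
        phase_blocks = range(p * per_phase, (p + 1) * per_phase)
        for _ in range(repetitions):           # unroll this phase's blocks
            schedule.extend(phase_blocks)
    return schedule                            # 40 block indices

schedule = build_schedule()
```

Unlike uniform ALBERT-style cycling, a given block only ever appears within its own phase, which is what allows the phase-specific specialization claimed above.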
MLA
Multi-Head Latent Attention uses low-rank KV compression to replace GQA and reduce attention parameters.
parameters: null
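A minimal sketch of the MLA idea, with assumed shapes (the PR reports no latent dimension; `d_latent=128` and all variable names here are illustrative): the hidden state is first compressed to a small latent, which is what gets cached, then expanded to per-head keys and values, cutting KV projection parameters relative to a full K/V projection.

```python
import numpy as np

# Assumed dimensions for illustration; only d_model=512 comes from the PR.
d_model, n_heads, d_head, d_latent = 512, 8, 64, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02       # compress
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

x = rng.standard_normal((16, d_model))   # 16 tokens
c = x @ W_down                           # cached latent, 16 x 128
k = c @ W_up_k                           # expanded keys
v = c @ W_up_v                           # expanded values

full_params = 2 * d_model * n_heads * d_head                   # plain K,V
mla_params = d_model * d_latent + 2 * d_latent * n_heads * d_head
```

With these numbers the KV projections shrink from 524,288 to 196,608 parameters, and the cache stores the 128-dim latent rather than 512-dim keys and values per token.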
Quantization
mixed FP4/Int6 QAT
bits: null
scope: early-phase layers FP4, late-phase layers Int6
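A hedged sketch of the graduated fake-quantization used in QAT: early-phase weights are rounded to a 4-bit grid and late-phase weights to a 6-bit grid. Real FP4 is a non-uniform floating-point format; the uniform symmetric grid below is a simplification for illustration.

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric signed quantize-dequantize onto 2^(bits-1) - 1 levels
    # per sign; a uniform-grid stand-in for the true FP4/Int6 formats.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
w_fp4_like = fake_quant(w, 4)   # early-phase layers: coarse grid
w_int6 = fake_quant(w, 6)       # late-phase layers: finer grid
```

The 6-bit grid has 4x more levels, so late-phase layers keep a smaller round-trip error than the FP4-like early phase.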
FP8
bits: 8
scope: all persistent state (master weights, optimizer momentum)
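Stochastic rounding is what makes low-precision persistent state viable: rounding up with probability equal to the fractional distance keeps repeated small updates unbiased in expectation, where round-to-nearest would silently drop them. A minimal sketch (rounding to integers for clarity rather than to the FP8 grid):

```python
import numpy as np

def stochastic_round(x, rng):
    # Round down, then round up with probability equal to the fraction,
    # so E[stochastic_round(x)] == x.
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(0)
x = np.full(100_000, 0.25)
rounded = stochastic_round(x, rng)
# round-to-nearest would map every 0.25 to 0.0; stochastic rounding
# preserves the mean (~0.25) across many values
```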
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"with":"AdamW","orthogonalization":"Newton-Schulz for 2D weights"}
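The Newton-Schulz step at the heart of Muon can be sketched as below, using the quintic iteration and coefficients from the public Muon implementation (in practice it runs on GPU in bf16; this NumPy version is for illustration only):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes the singular values of the
    # 2D gradient matrix toward 1, approximately orthogonalizing it.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                            # work on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))          # a 2D weight gradient
O = newton_schulz(G)
# singular values of O are all pulled into a narrow band near 1
```

Per the listing, Muon handles the 2D weight matrices this way while AdamW covers the remaining parameters (embeddings, norms, scalars).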
Weight Averaging
EMA
parameters: null
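A minimal sketch of EMA weight averaging (the decay value is an assumption; the PR reports no parameters): a shadow copy of the weights is updated after each optimizer step and used for evaluation.

```python
def ema_update(ema, weights, decay=0.999):
    # decay=0.999 is an assumed value, not from the PR.
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

weights = [1.0, 2.0]
ema = list(weights)                        # shadow copy for evaluation
for step in range(10):
    weights = [w + 0.1 for w in weights]   # stand-in for an optimizer step
    ema = ema_update(ema, weights)
# ema trails the raw weights, smoothing out step-to-step noise
```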
Initialization
DeepNorm init
Output projections are scaled by (8·N)^(-1/4), where N is the effective depth, for stability at large depth.
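Concretely, with the 40 effective layers above the extra gain works out to roughly 0.24. A sketch, assuming the gain multiplies a standard 1/sqrt(d) fan-in init (the PR does not state the base initializer):

```python
import numpy as np

N = 40                          # effective depth from the architecture
gain = (8 * N) ** -0.25         # DeepNorm-style down-scaling, ~0.24

rng = np.random.default_rng(0)
d = 512
# Assumed base init: 1/sqrt(d) fan-in scaling, shrunk further by the gain
# so residual-branch outputs stay small in a 40-layer stack.
W_out = rng.standard_normal((d, d)) * (d ** -0.5) * gain
```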
Regularization
Z-Loss
parameters: null
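A sketch of the z-loss auxiliary term (the coefficient is an assumption; no parameters are reported): it penalizes the squared log-partition function of the output logits, discouraging them from drifting to large magnitudes.

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    # coeff=1e-4 is an assumed value. z is the log-partition function
    # log(sum(exp(logits))) per token; penalizing z^2 keeps logits bounded.
    z = np.log(np.exp(logits).sum(axis=-1))
    return coeff * (z ** 2).mean()

logits = np.array([[1.0, 2.0, 0.5], [0.1, -0.3, 0.2]])
aux = z_loss(logits)    # added to the cross-entropy loss during training
```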
QK-Clip
parameters: null
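A hedged sketch of one common QK-Clip formulation (the PR gives no parameters, so the threshold and the exact variant are assumptions): when the maximum pre-softmax attention logit exceeds a cap tau, queries and keys are rescaled so their product falls back under it, preventing attention-logit blow-ups.

```python
import numpy as np

def qk_clip(q, k, tau=100.0):
    # tau=100.0 is an assumed threshold. The sqrt splits the correction
    # evenly between the query and key sides.
    logits = q @ k.T
    max_logit = np.abs(logits).max()
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5
        q, k = q * scale, k * scale
    return q, k

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64)) * 5   # deliberately large activations
k = rng.standard_normal((8, 64)) * 5
q2, k2 = qk_clip(q, k)                 # rescaled so max |logit| <= tau
```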

Novel Contributions

  • Phase-based depth recurrence with 8 unique blocks repeated across 4 specialized phases
  • 40 effective layers at full d=512 width with only 24M unique parameters
  • Multi-Head Latent Attention (MLA) for low-rank KV compression
  • Graduated precision scheme using FP4 for early layers and Int6 for late layers
  • FP8 training with stochastic rounding for persistent state on H100
  • Phase-specialized recurrence intended to outperform uniform ALBERT-style cycling