val_bpb: 1.1980
Architecture: Transformer
Optimizer: Muon + Adam
Artifact Size: 12.83 MB
Training Techniques
Architecture
depth recurrence
Replaced unique transformer blocks with shared blocks repeated multiple times to increase effective depth.
parameters: {"blocks":3,"repeats":4,"effective_layers":12}
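The recurrence can be sketched as a stack of 3 unique blocks applied 4 times, giving 12 effective layers from 3 layers' worth of parameters. This is a minimal illustration, not the submission's actual code; `nn.TransformerEncoderLayer` stands in for whatever block the model really uses.

```python
import torch
import torch.nn as nn

class RecurrentTransformer(nn.Module):
    """Depth recurrence: 3 shared blocks applied 4 times -> 12 effective layers."""
    def __init__(self, dim=256, blocks=3, repeats=4):
        super().__init__()
        # Only `blocks` unique parameter sets exist, regardless of effective depth.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(blocks)
        )
        self.repeats = repeats

    def forward(self, x):
        for _ in range(self.repeats):   # repeat the whole shared stack
            for block in self.blocks:   # 3 shared blocks per pass
                x = block(x)
        return x                        # 3 * 4 = 12 block applications
```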
Cross-Repeat Skip
Adds a weighted residual connection from each block's output in the previous repeat to its input in the current repeat, making the recurrence stateful.
parameters: null
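One way to realize such a cross-repeat skip is to keep the previous repeat's hidden state and mix it back in with a learned weight. The names and the exact mixing rule below are assumptions, shown for a single shared block for brevity:

```python
import torch
import torch.nn as nn

class CrossRepeatSkip(nn.Module):
    """Stateful recurrence: each repeat also sees a weighted copy of the
    hidden state the block produced on the previous repeat (assumed form)."""
    def __init__(self, block, repeats=4):
        super().__init__()
        self.block = block
        self.repeats = repeats
        # One learned skip weight per repeat (the first repeat has no predecessor).
        self.skip_w = nn.Parameter(torch.zeros(repeats))

    def forward(self, x):
        prev = None
        for r in range(self.repeats):
            if prev is not None:
                # Weighted residual from the previous repeat's output.
                x = x + self.skip_w[r] * prev
            x = self.block(x)
            prev = x
        return x
```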
XSA
Exclusive self-attention applied to the last 4 layers.
parameters: {"layers":4}
value embeddings
Two extra embedding tables mixed into the residual stream at each effective layer with learned scales.
parameters: {"tables":2}
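A sketch of the idea: each extra table is a plain token-embedding lookup added to the residual stream with a learned per-layer scale. The class and parameter names, and zero-initialized scales, are assumptions, not the submission's code:

```python
import torch
import torch.nn as nn

class ValueEmbeddingMixer(nn.Module):
    """Two extra token-embedding tables mixed into the residual stream
    with a learned scale per (table, effective layer) pair."""
    def __init__(self, vocab=50304, dim=256, tables=2, eff_layers=12):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(vocab, dim) for _ in range(tables))
        self.scales = nn.Parameter(torch.zeros(tables, eff_layers))

    def forward(self, x, tokens, layer_idx):
        # x: residual stream (B, T, dim); tokens: input ids (B, T)
        for t, table in enumerate(self.tables):
            x = x + self.scales[t, layer_idx] * table(tokens)
        return x
```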
loop embedding
Learned per-layer vector added to the residual stream before each block, acting as a depth-wise positional encoding.
parameters: null
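A minimal sketch, assuming one learned `dim`-sized vector per effective layer broadcast-added to the residual stream (the real injection point may differ):

```python
import torch
import torch.nn as nn

class LoopEmbedding(nn.Module):
    """Depth-wise positional encoding: one learned vector per effective
    layer, added before that layer's block (assumed form)."""
    def __init__(self, eff_layers=12, dim=256):
        super().__init__()
        self.vecs = nn.Parameter(torch.zeros(eff_layers, dim))

    def forward(self, x, layer_idx):
        return x + self.vecs[layer_idx]  # broadcasts over (B, T, dim)
```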
KV head count
Grouped-query attention: 8 query heads share 4 KV heads, halving KV-cache size.
parameters: {"heads":8,"kv_heads":4}
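With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A common way to implement this is to expand the KV heads before standard attention (newer PyTorch versions can instead do this natively via `enable_gqa=True` in `scaled_dot_product_attention`):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch.
    q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd)."""
    group = n_heads // n_kv_heads          # queries per KV head (2 here)
    k = k.repeat_interleave(group, dim=1)  # expand KV heads to match q
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```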
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.012,"scalar_lr":0.012,"tied_embed_lr":0.015,"grad_clip_norm":0.3}
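Muon + Adam setups typically partition parameters by shape: 2D weight matrices go to Muon, while embeddings, scalars, and vectors go to Adam. The selection rule below is an assumption that mirrors the reported learning rates (`matrix_lr` 0.012, `scalar_lr` 0.012, `tied_embed_lr` 0.015); the submission's exact grouping may differ:

```python
import torch
import torch.nn as nn

def split_param_groups(model):
    """Partition parameters for a Muon + Adam split (assumed rule):
    matrices -> Muon (lr 0.012); scalars/vectors -> Adam (lr 0.012);
    tied embeddings -> Adam (lr 0.015)."""
    matrix, other, embed = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:
            embed.append(p)      # tied embedding table
        elif p.ndim >= 2:
            matrix.append(p)     # weight matrices: Muon
        else:
            other.append(p)      # biases, gains, learned scales: Adam
    return matrix, other, embed
```

Gradient clipping at the reported `grad_clip_norm` of 0.3 would then be applied across all groups, e.g. with `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.3)`.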
Weight Averaging
SWA
parameters: {"collected_only_at_full_depth":true}
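Given the progressive depth schedule, `collected_only_at_full_depth` means the average skips checkpoints taken while the model is still running at 2 or 3 repeats. A minimal running-mean sketch (the gating interface is assumed):

```python
import torch

class DepthGatedSWA:
    """Stochastic weight averaging that only folds in weights collected
    while the model is at full recurrence depth (4 repeats)."""
    def __init__(self, full_depth_repeats=4):
        self.full = full_depth_repeats
        self.n = 0
        self.avg = None  # dict: name -> averaged tensor

    def update(self, model, current_repeats):
        if current_repeats != self.full:
            return  # skip shallow-phase weights
        state = {k: v.detach().float().clone()
                 for k, v in model.state_dict().items()}
        if self.avg is None:
            self.avg = state
        else:
            for k in self.avg:  # running mean over collected checkpoints
                self.avg[k] += (state[k] - self.avg[k]) / (self.n + 1)
        self.n += 1
```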
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"window":1024,"stride":256}
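With window 1024 and stride 256, each evaluation window re-reads overlapping context but only scores the tokens past the overlap, so every token is scored exactly once with up to 768 tokens of prior context. A sketch of the standard bookkeeping (the submission's exact scheme may differ):

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Return (start, end, first_scored) triples: positions from
    `first_scored` to `end` contribute to the loss; earlier positions
    in the window are context only, so nothing is double-counted."""
    plans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        first_scored = start if start == 0 else start + (window - stride)
        plans.append((start, end, first_scored))
        if end == n_tokens:
            break
        start += stride
    return plans
```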
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
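A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps. The sketch below matches the reported `warmdown_iters=3000`; whether there is also a warmup phase is not stated, so none is assumed:

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Multiplier on the base lr: flat, then linear decay to zero."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0                                 # flat phase: full lr
    return (total_iters - step) / warmdown_iters   # linear ramp to 0
```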
Quantization
int8
bits: 8
scope: all
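Quantizing every weight tensor to int8 (one byte per weight plus a scale) is what shrinks the artifact toward the reported 12.83 MB. A sketch of symmetric per-tensor quantization; the submission's exact scheme (per-tensor vs. per-channel, rounding mode) is not specified:

```python
def quantize_int8(weights):
    """Symmetric int8: store integer codes plus one float scale,
    reconstructing w ~ code * scale."""
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]
```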
Other
other
Progressive depth training schedule that increases recurrence depth during training from 2 repeats to 3 repeats to 4 repeats.
parameters: {"phases":[{"repeats":2,"eff_depth":6},{"repeats":3,"eff_depth":9},{"repeats":4,"eff_depth":12}]}
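The phase sequence above (2 → 3 → 4 repeats, i.e. 6 → 9 → 12 effective layers) can be driven by a simple step-to-depth lookup. The step boundaries between phases are not reported, so `phase_ends` is a free parameter in this sketch:

```python
def repeats_at(step, phase_ends):
    """Progressive depth schedule: 2 -> 3 -> 4 repeats.
    `phase_ends` lists the steps at which phases 1 and 2 end
    (boundaries are an assumption, not reported values)."""
    phases = [2, 3, 4]  # repeats per phase, per the reported parameters
    for end, r in zip(phase_ends, phases):
        if step < end:
            return r
    return phases[-1]   # final phase: full depth
```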
other
DDP race condition fix for phase switching using all_reduce synchronization across ranks.
parameters: null
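The described fix amounts to making all ranks agree on the phase before switching, so no replica changes depth on a different iteration than its peers. A sketch using `all_reduce`; the MIN reduction is an assumption (any deterministic reduction that makes ranks agree would serve), and the helper falls back to the local value when not running distributed:

```python
import torch
import torch.distributed as dist

def agreed_phase(local_phase):
    """Synchronize the depth phase across DDP ranks via all_reduce,
    so every replica switches recurrence depth on the same iteration."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_phase  # single-process fallback
    t = torch.tensor([local_phase])
    dist.all_reduce(t, op=dist.ReduceOp.MIN)  # ranks agree on the min phase
    return int(t.item())
```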
Novel Contributions
- Progressive depth training schedule that increases recurrence depth during training
- DDP phase-switch synchronization fix using all_reduce
- Stateful depth recurrence with Cross-Repeat Skip
- Use of XSA in the last 4 layers
- Value embeddings mixed into the residual stream
- Loop embedding as depth-wise positional encoding