val_bpb: 1.0889
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.83 MB
Training Techniques
Architecture
depth recurrence
Replaced the stack of unique transformer blocks with 3 weight-shared blocks repeated across depth; the repeat count grows progressively over training to reach 15 effective layers.
parameters: {"shared_blocks":3,"repeats":[2,3,4,5],"effective_layers":15}
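A minimal sketch of what this recurrence might look like, assuming the `repeats` list is a per-stage schedule that ends at 5 repeats of the 3 shared blocks (3 × 5 = 15 effective layers); the block internals and schedule semantics here are illustrative guesses, not the submission's code:

```python
def make_block(scale):
    # Stand-in for a transformer block: a simple affine map on a scalar.
    return lambda x: x * scale + 0.1

# 3 shared (weight-tied) blocks reused at every depth position.
shared_blocks = [make_block(s) for s in (0.9, 1.0, 1.1)]

def forward(x, repeats):
    # Apply the same 3 blocks `repeats` times: depth recurrence with tying.
    for _ in range(repeats):
        for block in shared_blocks:
            x = block(x)
    return x

# Hypothetical progressive schedule: later stages unroll more repeats,
# ending at 5 * 3 = 15 effective layers.
for repeats in (2, 3, 4, 5):
    effective_layers = repeats * len(shared_blocks)
    y = forward(1.0, repeats)
```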
weight tying
Shared weights across repeated blocks instead of unique layers.
parameters: null
U-Net skip connections
Removed the baseline U-Net skip connections; the Cross-Repeat Skip (below) replaced them.
parameters: null
Value Residual
Added value embeddings mixed into the residual stream at each effective layer.
parameters: {"tables":2}
other
Cross-Repeat Skip: each block receives a weighted residual from its output in the previous repeat, making recurrence stateful.
parameters: {"learned_scales":true}
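The mechanism described above can be sketched as follows; this is one plausible reading of "weighted residual from its output in the previous repeat" (per-block learned scales, state carried between repeats), not a reproduction of the submission's implementation:

```python
def block(x):
    # Stand-in transformer block.
    return 0.9 * x + 0.1

def forward_with_cross_repeat_skip(x, repeats, n_blocks=3):
    scales = [0.5] * n_blocks         # learned per-block scales (assumed init)
    prev_outputs = [None] * n_blocks  # each block's output on the last repeat
    for _ in range(repeats):
        for b in range(n_blocks):
            if prev_outputs[b] is not None:
                # Skip connection from the same block's previous repeat:
                # this is what makes the recurrence stateful.
                x = x + scales[b] * prev_outputs[b]
            x = block(x)
            prev_outputs[b] = x
    return x
```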
other
Loop embedding: learned per-layer vector added before each block as depth-wise positional encoding.
parameters: null
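A minimal sketch of the loop embedding as described: one learned vector per effective layer, added to the activations before the block runs, so weight-tied blocks can condition on their depth position. The table shape and initialization are assumptions:

```python
dim = 4
n_effective_layers = 15

# One learned vector per effective layer (placeholder values).
loop_emb = [[0.01 * layer] * dim for layer in range(n_effective_layers)]

def apply_block_at_depth(x, layer, block):
    # Add the depth-wise positional signal, then run the (shared) block.
    x = [xi + e for xi, e in zip(x, loop_emb[layer])]
    return block(x)
```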
Quantization
int8
bits: 8
scope: model weights
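A sketch of symmetric per-tensor int8 weight quantization, the simplest scheme consistent with "bits: 8, scope: model weights"; the submission's exact scheme (per-tensor vs. per-channel, rounding mode) is not specified:

```python
def quantize_int8(weights):
    # Scale so the largest-magnitude weight maps to 127.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float weights from int8 codes.
    return [qi * scale for qi in q]
```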
Weight Averaging
SWA
parameters: {"checkpoints":38}
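Stochastic weight averaging over saved checkpoints reduces to a uniform mean of each parameter across checkpoints. A toy version, with a "checkpoint" represented as a dict of parameter lists (the real run averaged 38 model checkpoints):

```python
def average_checkpoints(checkpoints):
    # Uniform SWA: elementwise mean of every parameter over all checkpoints.
    n = len(checkpoints)
    avg = {}
    for name in checkpoints[0]:
        length = len(checkpoints[0][name])
        avg[name] = [sum(ck[name][i] for ck in checkpoints) / n
                     for i in range(length)]
    return avg
```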
Evaluation
sliding window eval
parameters: {"stride":256,"window":1024}
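With `window=1024` and `stride=256`, a sliding-window evaluation scores each window's final `stride` tokens so that every token (after the first window) sees up to `window - 1` tokens of context. The scoring rule below is an assumption about how these two parameters are used, not confirmed detail:

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    # Yield (start, end, score_from) spans: tokens before `score_from`
    # (relative to the window) are context only and are not scored.
    spans = []
    start = 0
    while start + window <= n_tokens:
        score_from = 0 if start == 0 else window - stride
        spans.append((start, start + window, score_from))
        start += stride
    return spans
```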
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
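A "warmdown" schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps. A sketch, where `total_iters` and the linear shape are assumptions (only `warmdown_iters=3000` is given):

```python
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=3000):
    # Constant LR until the warmdown window begins.
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    # Linear decay from base_lr down to 0 across the warmdown window.
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac
```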
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Adam":true}
Compression
zlib
level: null
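Since the level is unspecified, a sketch of artifact compression using `zlib.compress` at its default level (the stdlib maps the default to level 6):

```python
import zlib

def compress_artifact(raw: bytes) -> bytes:
    # Default compression level; the submission's level is not recorded.
    return zlib.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```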
Novel Contributions
- Progressive depth recurrence scaling study with shared-weight recurrence
- Cross-Repeat Skip to make recurrence stateful
- Value embeddings mixed into the residual stream
- Loop embedding as depth-wise positional encoding
- Large-scale SWA over 38 checkpoints
- Hedge Mixer evaluation adapted from prior submissions