val_bpb: 1.2066
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.73 MB
Training Techniques
Architecture
depth recurrence
8 physical layers are looped twice to form 16 logical layers.
parameters: {"physical_layers":8,"loops":2,"logical_layers":16}
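A minimal sketch of the loop structure these parameters imply: the same 8 physical layers are applied twice in sequence, giving 16 logical layers. The layer internals below are toy placeholders, not the actual model:

```python
def depth_recurrent_forward(x, physical_layers, loops=2):
    """Run the shared stack of physical layers `loops` times,
    yielding loops * len(physical_layers) logical layer applications."""
    for _ in range(loops):
        for layer in physical_layers:
            x = layer(x)
    return x

# Toy check: 8 shared "layers" looped twice -> 16 logical applications.
calls = []
layers = [lambda x, i=i: calls.append(i) or x + 1 for i in range(8)]
out = depth_recurrent_forward(0, layers, loops=2)
```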
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of standard ReLU^2 to improve gradient flow for negative pre-activations.
parameters: {"negative_slope":0.5}
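The activation as described, written out elementwise. Unlike ReLU², which is identically zero (and gradient-free) for negative inputs, LeakyReLU(0.5)² stays nonzero there:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)^2: for x >= 0 returns x^2; for x < 0 returns
    (0.5 * x)^2, keeping a nonzero gradient where ReLU^2 is flat."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```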
Partial RoPE
Applies rotary positional embeddings to only 16 of the 64 head dimensions; the remaining dimensions carry no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
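A sketch of partial RoPE under these parameters: only the first 16 head dimensions are rotated, the rest pass through unchanged. The "rotate-half" pairing and the base of 10000 are assumptions (both are common RoPE conventions), not stated by the card:

```python
import math

def partial_rope(q, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of one head's
    query/key vector by position-dependent angles; the remaining
    dims are returned untouched."""
    out = list(q)
    half = rotary_dims // 2
    for i in range(half):
        theta = pos * base ** (-2 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = q[i], q[i + half]  # rotate-half pairing (assumed)
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out
```

Rotation preserves the norm of the rotated slice, so the unrotated 48 dimensions are the only part of the vector that attention sees position-free.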
untied embeddings
Input and output embeddings are untied (no weight tying) for greater expressivity.
parameters: null
depth embeddings
Learned embeddings are added at each logical depth step to provide depth awareness.
parameters: {"logical_layers":16}
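One plausible placement, sketched below as an assumption: a learned vector per logical depth is added to the hidden state before each of the 16 logical layer applications, so the shared weights can condition on which pass they are in:

```python
def forward_with_depth_emb(x, physical_layers, depth_embs, loops=2):
    """Add a learned per-depth embedding at each of the
    loops * len(physical_layers) logical steps of the recurrence."""
    d = 0
    for _ in range(loops):
        for layer in physical_layers:
            x = layer(x + depth_embs[d])  # depth_embs has one entry per logical layer
            d += 1
    return x
```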
XSA
Cross-layer state aggregation with zero-initialized skip connections across logical layers.
parameters: null
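A hedged sketch of what zero-initialized cross-layer aggregation can look like: each block also receives a learned weighted sum of all earlier layer outputs, with the weights initialized to zero so training starts from a plain residual stack. The exact aggregation rule is an assumption; the card only states the zero-init skip structure:

```python
def xsa_forward(x, blocks, skip_weights):
    """Cross-layer state aggregation: block i sees its input plus a
    weighted sum of all earlier states. skip_weights[i] has i+1
    entries and is zero-initialized, so the untrained model reduces
    to an ordinary sequential stack."""
    states = [x]
    for i, block in enumerate(blocks):
        agg = sum(w * s for w, s in zip(skip_weights[i], states))
        x = block(x + agg)
        states.append(x)
    return x
```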
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"muon_momentum_warmup_start":0.9,"muon_momentum_warmup_steps":1500}
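The momentum warmup implied by these parameters, assuming linear interpolation (the card gives only the endpoints and step count): momentum ramps from 0.9 to 0.95 over the first 1500 steps, then holds:

```python
def muon_momentum(step, start=0.9, end=0.95, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```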
Compression
zlib
level: null
Regularization
logit softcap
parameters: {"value":15}
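Assuming the usual tanh form of logit softcapping (the card gives only the cap value of 15), each logit is smoothly squashed into (-15, 15) while staying approximately identity near zero:

```python
import math

def softcap(logit, cap=15.0):
    """Soft-bound a logit to (-cap, cap): cap * tanh(logit / cap).
    For |logit| << cap this is approximately the identity."""
    return cap * math.tanh(logit / cap)
```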
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":20}
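A sketch of the schedule these parameters suggest, assuming the common trapezoidal shape: a 20-step linear warmup, a flat plateau, then a linear warmdown to zero over the final 3500 iterations. `total_steps` is a hypothetical argument, not a value from the card:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3500):
    """Trapezoidal LR multiplier: linear warmup, constant plateau,
    linear decay to 0 over the last `warmdown_iters` steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return steps_left / warmdown_iters
    return 1.0
```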
Quantization
int8
bits: 8
scope: all
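How int8 plus zlib can combine to shrink the artifact, as a minimal sketch: symmetric per-tensor quantization (scale = max|w| / 127 is an assumed scheme; the card states only int8 scope "all" and zlib) followed by compression of the packed bytes:

```python
import zlib

def compress_weights(weights):
    """Symmetric int8 quantization, then zlib at max compression.
    Returns the dequantization scale and the compressed blob."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero tensors
    q = bytes((round(w / scale) & 0xFF) for w in weights)  # int8 as unsigned bytes
    return scale, zlib.compress(q, 9)

def decompress_weights(scale, blob):
    raw = zlib.decompress(blob)
    # reinterpret unsigned bytes as signed int8, then dequantize
    return [(b - 256 if b > 127 else b) * scale for b in raw]
```

Quantization bounds the per-weight error by half a scale step; zlib then exploits redundancy in the int8 byte stream to get under the size budget.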
Novel Contributions
- Depth recurrence with 8 physical layers looped into 16 logical layers
- LeakyReLU(0.5)^2 activation to improve gradient flow
- Partial RoPE on 1/4 of head dimensions
- Untied input and output embeddings
- Depth embeddings for logical layer awareness
- Cross-layer state aggregation with zero-initialized skips
- Muon optimization with hardware-aligned MLP multiplier
- Int8 quantization plus zlib compression to fit the 16 MB track