PR #1628
openSP8192 Depth Recurrence + Parallel Residuals + TTT (1.1921 BPB)
by yu314-coder
val_bpb
1.1921
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 share weights and are looped 3 times per forward pass, expanding 11 physical layers into 17 virtual layers.
parameters: {"physical_layers":11,"virtual_layers":17,"shared_layers":[3,4,5],"loops":3}
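A minimal sketch of the layer schedule this implies (function and variable names are illustrative, not from the PR): the shared block [3, 4, 5] is emitted 3 times in sequence, so 11 physical layers yield 17 virtual layers.

```python
def build_layer_schedule(physical_layers=11, shared=(3, 4, 5), loops=3):
    """Return the sequence of physical layer indices run per forward pass."""
    schedule = []
    for i in range(physical_layers):
        if i == shared[0]:
            # Emit the whole shared block `loops` times in a row.
            schedule.extend(list(shared) * loops)
        elif i in shared:
            continue  # already emitted as part of the shared block
        else:
            schedule.append(i)
    return schedule

schedule = build_layer_schedule()
```

With the PR's parameters this produces 17 entries over 11 distinct layers.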
Parallel residuals
GPT-J-style parallel residual connections, in which attention and MLP read the same input and are both summed into the residual stream, are used from layer 7 onward.
parameters: {"start_layer":7}
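A toy numeric sketch of the parallel-residual shape versus the usual sequential block; the `attn`/`mlp` stand-ins here are illustrative scalars, not the PR's sublayers.

```python
def parallel_block(x, attn, mlp):
    # GPT-J style: both sublayers read the same input and their outputs
    # are summed into the residual stream in one step.
    return x + attn(x) + mlp(x)

def sequential_block(x, attn, mlp):
    # Standard pre-norm transformer ordering, for comparison.
    x = x + attn(x)
    return x + mlp(x)

attn = lambda x: 0.5 * x   # toy sublayer
mlp = lambda x: 0.25 * x   # toy sublayer
```

The parallel form lets attention and MLP run concurrently since neither sees the other's output.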
Partial RoPE
Only a subset of head dimensions use rotary embeddings; the rest remain position-free.
parameters: {"dimensions":16,"total_dimensions":64}
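A sketch of partial RoPE on a single 64-dim head vector, rotating only the first 16 dimensions pairwise; this is the standard pairwise rotation formulation, assumed rather than taken from the PR's code.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rot_dims` dimensions only.

    Pairs (2i, 2i+1) are rotated by angle pos / base**(2i / rot_dims);
    the remaining dimensions pass through position-free.
    """
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

v = [1.0] * 64
rotated = partial_rope(v, pos=5)
```

Rotation preserves the norm of the rotated slice, and the last 48 dimensions are untouched.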
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"slope":0.5}
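A minimal sketch of the activation, assuming "squared" means the LeakyReLU output is squared outright (the PR does not specify whether the sign is preserved):

```python
def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared; note squaring
    # makes the output non-negative on both sides (an assumption here).
    y = x if x > 0 else slope * x
    return y * y
```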
KV head count
Uses grouped-query attention: multiple query heads share each key/value head.
parameters: {"heads":8,"kv_heads":4}
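With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A sketch of the usual head-to-group mapping (the function name is illustrative):

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """Map a query head to the key/value head it shares under
    grouped-query attention."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return query_head // group_size
```

Halving the KV heads halves the KV-cache size while keeping all 8 query projections.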
U-Net skip connections
Encoder-decoder style skip connections with sigmoid-gated skip weights.
parameters: null
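A scalar sketch of a sigmoid-gated skip connection: a learned gate logit is squashed to (0, 1) and scales the encoder activation before it joins the decoder stream. The exact placement in the PR's U-Net wiring is assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(decoder_x, encoder_x, gate_logit):
    # The gate lets training smoothly turn each skip connection
    # up or down rather than adding encoder features at full strength.
    return decoder_x + sigmoid(gate_logit) * encoder_x
```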
Weight Averaging
EMA
parameters: {"decay":0.9965,"start_fraction":0.5}
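A simplified scalar sketch of the EMA with a delayed start: averaging only begins at the 50% mark of training, with decay 0.9965 (real weights are tensors; this uses scalars for clarity).

```python
def ema_weights(checkpoints, decay=0.9965, start_fraction=0.5):
    """Exponential moving average over the second half of training."""
    start = int(len(checkpoints) * start_fraction)
    ema = checkpoints[start]
    for w in checkpoints[start + 1:]:
        ema = decay * ema + (1.0 - decay) * w
    return ema
```

Skipping the first half keeps noisy early-training weights out of the average.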
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scope":"matrices"}
Adam
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
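A sketch of the usual parameter routing for Muon-style setups, assumed to match this PR's "matrices vs. embeddings/scalars" split: 2-D+ non-embedding weights go to Muon, everything else to Adam. Names are illustrative.

```python
def split_params(named_shapes):
    """Partition parameters between Muon (matrices) and Adam
    (embeddings and scalar/vector parameters)."""
    muon, adam = [], []
    for name, shape in named_shapes:
        if len(shape) >= 2 and "embed" not in name:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam

params = [("embed.weight", (50304, 768)),
          ("attn.w_qkv", (768, 2304)),
          ("ln.gain", (768,))]
muon, adam = split_params(params)
```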
Compression
zlib
level: 9
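Maximum-compression zlib is a one-liner with the stdlib; a sketch of how the artifact bytes would be packed to stay under the 16 MB limit (the wrapper name is illustrative):

```python
import zlib

def compress_artifact(raw: bytes) -> bytes:
    # Level 9 trades compression time for the smallest output.
    return zlib.compress(raw, level=9)

blob = compress_artifact(b"weights " * 1000)
```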
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"momentum":0.9}
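A hypothetical sketch of "score-first" TTT on a toy one-parameter model: each chunk is scored with the current weights *before* the model adapts on it, so the reported loss never benefits from its own chunk's updates. The real PR adapts a transformer; the toy squared-error model and all names here are assumptions.

```python
def score_first_ttt(chunks, lr=0.005, epochs=3, momentum=0.9):
    w, v = 0.0, 0.0          # model parameter and momentum buffer
    scores = []
    for chunk in chunks:
        # 1) Score first: loss under the current (pre-update) weights.
        scores.append(sum((x - w) ** 2 for x in chunk) / len(chunk))
        # 2) Then adapt on the chunk with momentum SGD.
        for _ in range(epochs):
            grad = sum(2.0 * (w - x) for x in chunk) / len(chunk)
            v = momentum * v + grad
            w -= lr * v
    return scores, w

scores, w = score_first_ttt([[1.0] * 8, [1.0] * 8])
```

After adapting on the first chunk, the second chunk's pre-update score is already lower.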
LR Schedule
warmdown
parameters: {"fraction":0.72}
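A sketch of a warmdown schedule under the assumption that `fraction: 0.72` means the final 72% of training decays the learning rate linearly to zero after a constant phase:

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_fraction=0.72):
    """Constant LR, then a linear decay to zero over the final
    `warmdown_fraction` of training (assumed interpretation)."""
    warmdown_start = total_steps * (1.0 - warmdown_fraction)
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)
```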
Quantization
int8
bits: 8
scope: per-row
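A sketch of symmetric per-row int8 quantization: each row gets one floating-point scale, and values are clamped to [-127, 127]. This is the common scheme implied by "per-row", not the PR's exact code.

```python
def quantize_row(row):
    """One fp scale per row; symmetric mapping into [-127, 127]."""
    amax = max(abs(x) for x in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_row(row)
```

Per-row scales keep a large value in one row from crushing the precision of every other row.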
Sequence Length
sequence_length
train_length: 524288
eval_length: null
Novel Contributions
- SP8192 tokenizer for improved compression per byte
- Depth recurrence with 11 physical layers expanded to 17 virtual layers
- GPT-J style parallel residuals in later layers
- Partial RoPE applied to 16/64 head dimensions
- EMA with delayed start during training
- Score-first chunk-based test-time training
- Muon optimizer for matrix parameters with Adam for scalars
- Artifact compressed to fit under the 16MB limit