PR #1780 (open): Add progressive recurrence SP8192 record submission

by wisebreadloaf
val_bpb: 1.0806
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
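
A minimal sketch of the 6-bit matrices / 8-bit embeddings split, using plain round-to-nearest per-channel quantization as a stand-in (GPTQ itself does error-compensated, column-by-column rounding; the shapes and helper below are placeholders, not from the submission):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-row quantization to `bits` bits (illustration only, not GPTQ proper).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

mat_q = fake_quantize(torch.randn(768, 768), bits=6)     # weight matrices at 6 bits
emb_q = fake_quantize(torch.randn(50304, 768), bits=8)   # embeddings at 8 bits
```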
Architecture
depth recurrence
Progressive 3-layer recurrence schedule with a partial recurrence phase before full recurrence activation.
parameters: {"layers":3,"phase1_frac":0.35,"phase2_frac":0.55}
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
LeakyReLU activation is used in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Rotary position embeddings are applied partially.
parameters: {"dimensions":16}
GQA
Grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4}
U-Net skip connections
Skip gates / U-Net-style skip connections are enabled.
parameters: null
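
A sketch of gated U-Net-style skips, where the first half of the blocks push activations onto a stack and the second half mix them back in through learned gates (pairing and gating details are assumptions):

```python
import torch
import torch.nn as nn

class UNetSkips(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gates = nn.Parameter(torch.ones(len(blocks) // 2))

    def forward(self, x):
        stack, n = [], len(self.blocks)
        for i, block in enumerate(self.blocks):
            if i < n // 2:
                stack.append(x)                                # encoder half: remember input
            else:
                x = x + self.gates[i - n // 2] * stack.pop()   # decoder half: gated skip
            x = block(x)
        return x
```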
parallel residuals
Attention and MLP read from the same pre-residual input in later layers.
parameters: {"start_layer":7}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Compression
lzma
level: null
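
The compression level is unspecified above; a sketch of packing the quantized state dict with LZMA so the artifact lands under the 16 MB limit (the preset is an assumption):

```python
import io
import lzma
import torch

def pack_artifact(state_dict, path: str, preset: int = 9 | lzma.PRESET_EXTREME) -> None:
    buf = io.BytesIO()
    torch.save(state_dict, buf)                  # serialize the (quantized) weights
    with lzma.open(path, "wb", preset=preset) as f:
        f.write(buf.getvalue())                  # LZMA-compress the serialized blob
```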
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs":3,"learning_rate":0.005,"momentum":0.9,"gradient_clip":1}
LR Schedule
cosine decay
parameters: null
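
A generic cosine-decay schedule for reference (warmup and the final LR floor are not specified above):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```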
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null

Novel Contributions

  • Progressive recurrence schedule for the SP8192 3-layer recurrence stack
  • Verified 3-seed record submission under the 16 MB limit
  • Packed submission path with exact train_gpt.py included
  • Combination of SP8192, parallel residuals, QK-gain 5.25, and legal score-first TTT