PR #1780 (open): Add progressive recurrence SP8192 record submission

by wisebreadloaf
val_bpb: 1.0806
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
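
A minimal sketch of the 6-bit matrices / 8-bit embeddings split, using plain round-to-nearest per-channel quantization as a stand-in (GPTQ itself does error-compensated, column-by-column rounding; the shapes and helper below are placeholders, not from the submission):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-row quantization to `bits` bits (illustration only, not GPTQ proper).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

mat_q = fake_quantize(torch.randn(768, 768), bits=6)     # weight matrices at 6 bits
emb_q = fake_quantize(torch.randn(50304, 768), bits=8)   # embeddings at 8 bits
```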
Architecture
depth recurrence
Progressive 3-layer recurrence schedule with a partial recurrence phase before full recurrence activation.
parameters: {"layers":3,"phase1_frac":0.35,"phase2_frac":0.55}
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
LeakyReLU activation is used in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Rotary position embeddings are applied partially.
parameters: {"dimensions":16}
GQA
Grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4}
U-Net skip connections
Skip gates / U-Net-style skip connections are enabled.
parameters: null
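
A sketch of gated U-Net-style skips, where the first half of the blocks push activations onto a stack and the second half mix them back in through learned gates (pairing and gating details are assumptions):

```python
import torch
import torch.nn as nn

class UNetSkips(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gates = nn.Parameter(torch.ones(len(blocks) // 2))

    def forward(self, x):
        stack, n = [], len(self.blocks)
        for i, block in enumerate(self.blocks):
            if i < n // 2:
                stack.append(x)                                # encoder half: remember input
            else:
                x = x + self.gates[i - n // 2] * stack.pop()   # decoder half: gated skip
            x = block(x)
        return x
```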
parallel residuals
Attention and MLP read from the same pre-residual input in later layers.
parameters: {"start_layer":7}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Compression
lzma
level: null
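
The compression level is unspecified above; a sketch of packing the quantized state dict with LZMA so the artifact lands under the 16 MB limit (the preset is an assumption):

```python
import io
import lzma
import torch

def pack_artifact(state_dict, path: str, preset: int = 9 | lzma.PRESET_EXTREME) -> None:
    buf = io.BytesIO()
    torch.save(state_dict, buf)                  # serialize the (quantized) weights
    with lzma.open(path, "wb", preset=preset) as f:
        f.write(buf.getvalue())                  # LZMA-compress the serialized blob
```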
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs":3,"learning_rate":0.005,"momentum":0.9,"gradient_clip":1}
LR Schedule
cosine decay
parameters: null
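
A generic cosine-decay schedule for reference (warmup and the final LR floor are not specified above):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```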
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null

Novel Contributions

  • Progressive recurrence schedule for the SP8192 3-layer recurrence stack
  • Verified 3-seed record submission under the 16 MB limit
  • Packed submission path with exact train_gpt.py included
  • Combination of SP8192, parallel residuals, QK-gain 5.25, and legal score-first TTT