PR #1541

open

Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean)

val_bpb: 1.07785
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
parallel residuals
Cross-lane routing: attention and MLP outputs are routed into both residual lanes via learned scalars; the final output is taken from the MLP lane.
parameters: {"start_layer":7,"new_scalar_params":66}
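A minimal sketch of the cross-lane routing described above; the lane layout, scalar names, and initialization are illustrative assumptions, not the PR's code (the PR only reports 66 new scalar parameters from layer 7 onward):

```python
import numpy as np

def parallel_residual_block(x_attn, x_mlp, attn_fn, mlp_fn, scalars):
    """One block with two residual lanes and learned cross-lane routing.

    scalars = (a_to_attn, a_to_mlp, m_to_attn, m_to_mlp) are the learned
    routing weights; names and layout are illustrative.
    """
    a = attn_fn(x_attn)
    m = mlp_fn(x_mlp)
    aa, am, ma, mm = scalars
    new_attn = x_attn + aa * a + ma * m  # attention lane
    new_mlp = x_mlp + am * a + mm * m    # MLP lane (read out at the end)
    return new_attn, new_mlp
```

Per the description, the model's final output would be read from the MLP lane after the last block.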
depth recurrence
Virtual-layer recurrence reuses physical layers to create a deeper effective network.
parameters: {"physical_layers":11,"virtual_layers":17}
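One way to realize 17 virtual layers over 11 physical ones is a reuse schedule; the PR does not state the reuse pattern, so the centered-band repetition below is purely an assumption:

```python
def recurrence_schedule(physical_layers=11, virtual_layers=17):
    """Map virtual layer indices onto physical layer indices.

    Assumption: the extra forward passes repeat a centered band of
    middle layers once. The forward pass would then be
    `for i in recurrence_schedule(): x = layers[i](x)`.
    """
    extra = virtual_layers - physical_layers   # 6 reused passes here
    start = (physical_layers - extra) // 2     # centered reuse band
    schedule = list(range(physical_layers))
    band = list(range(start, start + extra))
    # replay the band immediately after its first pass
    return schedule[:start + extra] + band + schedule[start + extra:]
```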
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5,"power":2}
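Taking the entry literally (slope 0.5, power 2), the activation is LeakyReLU followed by squaring; note a sign-preserving variant (`y * |y|`) is also common but is not what the parameters state:

```python
def leaky_relu_squared(x, slope=0.5):
    """Squared LeakyReLU, per the listed slope=0.5 and power=2.

    Literal reading: apply LeakyReLU, then square the result.
    """
    y = x if x > 0 else slope * x
    return y * y
```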
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
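A sketch of partial RoPE with the listed 16-of-64 split; the rotary base and the choice of which dimensions rotate (the first 16 here) are assumptions:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of each head's dimensions; pass the
    rest through unchanged. x: (seq, head_dim); 16/64 per the entry."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # (half,)
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```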
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"matrix_lr":0.03}
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
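Logit softcapping with the listed value of 30 is the standard tanh bound (as popularized by Gemma-style models):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for small
    inputs, saturating at +/- cap. cap=30 matches the listed value."""
    return cap * math.tanh(logit / cap)
```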
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"momentum":0.9}
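A toy sketch of the "score-first" ordering: each chunk is scored with the current weights *before* any optimizer step uses it, so no token is ever scored by weights that have already trained on it. Hyperparameters match the listed values; the interpretation of `epochs` as per-chunk update steps, and the toy loss interface, are assumptions:

```python
def score_first_ttt(chunks, loss_and_grad, params,
                    lr=0.005, epochs=3, momentum=0.9):
    """Score-first test-time training over a list of chunks.

    For each chunk: record its loss under the pre-update weights, then
    take `epochs` momentum-SGD steps on that chunk before moving on.
    `loss_and_grad(params, chunk)` returns (loss, grads) for a toy model.
    """
    velocity = [0.0] * len(params)
    scores = []
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                    # scored before adaptation
        for _ in range(epochs):                # then adapt on this chunk
            _, grads = loss_and_grad(params, chunk)
            for i, g in enumerate(grads):
                velocity[i] = momentum * velocity[i] + g
                params[i] -= lr * velocity[i]
    return scores, params
```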
Quantization
GPTQ
bits: 6
scope: matrices
int8
bits: 8
scope: embeddings
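A plausible sketch of the 8-bit embedding storage (the PR lists int8 for embeddings and GPTQ int6 for matrices; the exact scheme, here symmetric per-row quantization, is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization of a 2-D weight table."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover float weights from int8 codes and per-row scales."""
    return q.astype(np.float32) * scale
```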
Weight Averaging
EMA
parameters: {"decay":0.9965}
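The EMA update with the listed decay is the usual exponential moving average of weights, sketched here over a flat parameter list:

```python
def ema_update(ema_params, params, decay=0.9965):
    """In-place EMA of weights: ema <- decay * ema + (1 - decay) * w.

    decay=0.9965 matches the listed value; called once per train step.
    """
    for i, w in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * w
    return ema_params
```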
LR Schedule
warmdown
parameters: {"warmdown":0.72}
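Assuming `warmdown: 0.72` means the final 72% of steps decay linearly to zero after a constant phase (the constant-then-linear shape is an assumption about this schedule):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Constant LR, then linear decay to zero over the final
    `warmdown_frac` of training steps."""
    decay_start = (1.0 - warmdown_frac) * total_steps
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)
```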
Evaluation
sliding window eval
parameters: {"prefix_only":true}
Compression
Brotli
level: 11

Novel Contributions

  • Improved parallel residuals with learned cross-lane routing
  • Muon momentum reduced to 0.97 with retuned matrix learning rate 0.03
  • Legal score-first test-time training under Track B compliance
  • SP8192 with GPTQ SDClip and mixed int6/int8 artifact compression
  • 3-layer depth recurrence and tuned QK gain