PR #1541

open

Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean)

val_bpb: 1.07785
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
parallel residuals
Cross-lane routing: attention and MLP outputs are routed into both residual lanes via learned scalars; the final output is taken from the MLP lane.
parameters: {"start_layer":7,"new_scalar_params":66}
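A minimal sketch of the cross-lane routing described above; the lane layout, scalar names, and initialization are illustrative assumptions, not the PR's code (the PR only reports 66 new scalar parameters from layer 7 onward):

```python
import numpy as np

def parallel_residual_block(x_attn, x_mlp, attn_fn, mlp_fn, scalars):
    """One block with two residual lanes and learned cross-lane routing.

    scalars = (a_to_attn, a_to_mlp, m_to_attn, m_to_mlp) are the learned
    routing weights; names and layout are illustrative.
    """
    a = attn_fn(x_attn)
    m = mlp_fn(x_mlp)
    aa, am, ma, mm = scalars
    new_attn = x_attn + aa * a + ma * m  # attention lane
    new_mlp = x_mlp + am * a + mm * m    # MLP lane (read out at the end)
    return new_attn, new_mlp
```

Per the description, the model's final output would be read from the MLP lane after the last block.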
depth recurrence
Virtual-layer recurrence reuses physical layers to create a deeper effective network.
parameters: {"physical_layers":11,"virtual_layers":17}
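One way to realize 17 virtual layers over 11 physical ones is a reuse schedule; the PR does not state the reuse pattern, so the centered-band repetition below is purely an assumption:

```python
def recurrence_schedule(physical_layers=11, virtual_layers=17):
    """Map virtual layer indices onto physical layer indices.

    Assumption: the extra forward passes repeat a centered band of
    middle layers once. The forward pass would then be
    `for i in recurrence_schedule(): x = layers[i](x)`.
    """
    extra = virtual_layers - physical_layers   # 6 reused passes here
    start = (physical_layers - extra) // 2     # centered reuse band
    schedule = list(range(physical_layers))
    band = list(range(start, start + extra))
    # replay the band immediately after its first pass
    return schedule[:start + extra] + band + schedule[start + extra:]
```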
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5,"power":2}
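Taking the entry literally (slope 0.5, power 2), the activation is LeakyReLU followed by squaring; note a sign-preserving variant (`y * |y|`) is also common but is not what the parameters state:

```python
def leaky_relu_squared(x, slope=0.5):
    """Squared LeakyReLU, per the listed slope=0.5 and power=2.

    Literal reading: apply LeakyReLU, then square the result.
    """
    y = x if x > 0 else slope * x
    return y * y
```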
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
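A sketch of partial RoPE with the listed 16-of-64 split; the rotary base and the choice of which dimensions rotate (the first 16 here) are assumptions:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of each head's dimensions; pass the
    rest through unchanged. x: (seq, head_dim); 16/64 per the entry."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # (half,)
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```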
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"matrix_lr":0.03}
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
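Logit softcapping with the listed value of 30 is the standard tanh bound (as popularized by Gemma-style models):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for small
    inputs, saturating at +/- cap. cap=30 matches the listed value."""
    return cap * math.tanh(logit / cap)
```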
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"momentum":0.9}
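A toy sketch of the "score-first" ordering: each chunk is scored with the current weights *before* any optimizer step uses it, so no token is ever scored by weights that have already trained on it. Hyperparameters match the listed values; the interpretation of `epochs` as per-chunk update steps, and the toy loss interface, are assumptions:

```python
def score_first_ttt(chunks, loss_and_grad, params,
                    lr=0.005, epochs=3, momentum=0.9):
    """Score-first test-time training over a list of chunks.

    For each chunk: record its loss under the pre-update weights, then
    take `epochs` momentum-SGD steps on that chunk before moving on.
    `loss_and_grad(params, chunk)` returns (loss, grads) for a toy model.
    """
    velocity = [0.0] * len(params)
    scores = []
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                    # scored before adaptation
        for _ in range(epochs):                # then adapt on this chunk
            _, grads = loss_and_grad(params, chunk)
            for i, g in enumerate(grads):
                velocity[i] = momentum * velocity[i] + g
                params[i] -= lr * velocity[i]
    return scores, params
```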
Quantization
GPTQ
bits: 6
scope: matrices
int8
bits: 8
scope: embeddings
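A plausible sketch of the 8-bit embedding storage (the PR lists int8 for embeddings and GPTQ int6 for matrices; the exact scheme, here symmetric per-row quantization, is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization of a 2-D weight table."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover float weights from int8 codes and per-row scales."""
    return q.astype(np.float32) * scale
```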
Weight Averaging
EMA
parameters: {"decay":0.9965}
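The EMA update with the listed decay is the usual exponential moving average of weights, sketched here over a flat parameter list:

```python
def ema_update(ema_params, params, decay=0.9965):
    """In-place EMA of weights: ema <- decay * ema + (1 - decay) * w.

    decay=0.9965 matches the listed value; called once per train step.
    """
    for i, w in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * w
    return ema_params
```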
LR Schedule
warmdown
parameters: {"warmdown":0.72}
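Assuming `warmdown: 0.72` means the final 72% of steps decay linearly to zero after a constant phase (the constant-then-linear shape is an assumption about this schedule):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Constant LR, then linear decay to zero over the final
    `warmdown_frac` of training steps."""
    decay_start = (1.0 - warmdown_frac) * total_steps
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)
```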
Evaluation
sliding window eval
parameters: {"prefix_only":true}
Compression
Brotli
level: 11

Novel Contributions

  • Improved parallel residuals with learned cross-lane routing
  • Muon momentum reduced to 0.97 with retuned matrix learning rate 0.03
  • Legal score-first test-time training under Track B compliance
  • SP8192 with GPTQ SDClip and mixed int6/int8 artifact compression
  • 3-layer depth recurrence and tuned QK gain