PR #1396

open

Record: Combined 3-Layer Recurrence + Parallel Residuals + Polar Express + Brotli — val_bpb 1.1067 (3-seed mean)

by erichroepke
val_bpb: 1.1067
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.87 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
depth recurrence
3-layer recurrence applied after the max layer, replaying layers 3, 4, and 5.
parameters: {"layers":[3,4,5]}
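One possible reading of this entry, expressed as a layer execution order. This is only a sketch: it assumes an 11-layer stack (borrowed from the XSA entry's layer count) and assumes the replay happens immediately after the highest replayed layer. `layer_schedule` is a hypothetical helper, not a function from the PR, and the actual ordering may differ.

```python
def layer_schedule(num_layers, replay_layers):
    """Return the order in which layer indices are executed.

    Assumption: the replayed layers run a second time right after the
    highest-numbered replayed layer has run for the first time.
    """
    last = max(replay_layers)
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == last:
            # Second pass over the same weights (no new parameters).
            schedule.extend(replay_layers)
    return schedule


order = layer_schedule(11, [3, 4, 5])
```

Under these assumptions the stack executes 14 layer applications while storing weights for only 11 layers.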
parallel residuals
Attention and MLP are computed in parallel from a specified layer onward.
parameters: {"start_layer":7}
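The difference between the sequential and parallel forms can be sketched with plain callables. Normalization layers are omitted for brevity, and `sequential_block`, `parallel_block`, and `run_block` are illustrative names, not identifiers from the PR.

```python
def sequential_block(x, attn, mlp):
    # The MLP reads the output of the attention residual update.
    h = x + attn(x)
    return h + mlp(h)


def parallel_block(x, attn, mlp):
    # Both branches read the same input; their outputs are summed.
    return x + attn(x) + mlp(x)


def run_block(layer_idx, x, attn, mlp, start_layer=7):
    # Layers at or beyond start_layer use the parallel form.
    block = parallel_block if layer_idx >= start_layer else sequential_block
    return block(x, attn, mlp)
```

With toy branches `attn = lambda v: 2 * v` and `mlp = lambda v: 3 * v`, layer 6 maps 1.0 to 12.0 (sequential) while layer 7 maps it to 6.0 (parallel).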
RoPE
Uses RoPE with reduced dimensions.
parameters: {"dimensions":16}
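A minimal sketch of rotating only the first 16 dimensions of a head vector and passing the rest through unchanged. The adjacent-pair convention, the base of 10000, and the 64-dimensional head used in the usage line are assumptions; the PR may pair dimensions or choose frequencies differently.

```python
import math


def rope_partial(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary position encoding to the first rot_dims entries of vec."""
    assert rot_dims % 2 == 0 and rot_dims <= len(vec)
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Frequency decreases geometrically across the rotated pairs.
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out  # entries from rot_dims onward are untouched


head = [float(k) for k in range(64)]  # hypothetical head dimension
rotated = rope_partial(head, pos=5)
```

Because each pair undergoes a pure rotation, the vector norm is preserved and position 0 leaves the vector unchanged.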
XSA
XSA applied across all layers.
parameters: {"layers":11}
U-Net skip connections
U-Net style skip connections with skip gates.
parameters: null
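Since the entry lists no parameters, the following is only a guess at the general shape: encoder-half outputs are stacked, and each decoder-half layer adds back the matching output scaled by a gate. The scalar gate form and the name `unet_forward` are assumptions made for illustration.

```python
def unet_forward(x, encoder, decoder, gates):
    """Run encoder layers, then decoder layers with gated skip connections.

    Assumption: one scalar gate per decoder layer, and no more decoder
    layers than encoder layers.
    """
    skips = []
    for layer in encoder:
        x = layer(x)
        skips.append(x)
    for layer, gate in zip(decoder, gates):
        # The deepest encoder output pairs with the first decoder layer.
        x = x + gate * skips.pop()
        x = layer(x)
    return x
```

A gate of 0.0 disables its skip path entirely, which lets training decide how much of each earlier activation to reuse.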
KV head count
Uses 4 KV heads.
parameters: {"kv_heads":4}
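With fewer KV heads than query heads, several query heads share one key/value head. The mapping can be sketched as below; the count of 8 query heads in the usage line is hypothetical, since the entry only states the KV head count.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads=4):
    """Return the index of the KV head that a given query head reads from."""
    assert num_query_heads % num_kv_heads == 0
    group = num_query_heads // num_kv_heads  # query heads per KV head
    return query_head // group


mapping = [kv_head_for(h, num_query_heads=8) for h in range(8)]
```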
Value Residual
Value embeddings are used on later layers.
parameters: {"layers":[9,10]}
Optimizer
Muon
weight_decay: 0.105
momentum: null
other_params: {"muon_eq_r":1}
Compression
Brotli
level: null
Evaluation
sliding window eval
parameters: null
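A common way to implement sliding window evaluation is to advance a fixed-size context by a stride and score only the tokens not already scored by an earlier window. The sketch below computes those spans; the window and stride values are placeholders because the entry gives no parameters, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) spans covering every token once.

    Tokens in [start, end) form the context; only [score_from, end)
    contributes to the loss.
    """
    assert 0 < stride <= window
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(n_tokens=10, window=4, stride=2)
```

After the first window, each scored token has at least `window - stride` tokens of preceding context.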
Weight Averaging
EMA
parameters: {"decay":0.997}
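The EMA update with the listed decay can be sketched on a flat list of weights. Real code would apply it per parameter tensor; `ema_update` is an illustrative name.

```python
def ema_update(avg, weights, decay=0.997):
    """Blend the running average toward the current weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


avg = ema_update([0.0, 1.0], [1.0, 1.0])
```

With a decay of 0.997, each step moves the average 0.3% of the way toward the current weights.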
Regularization
weight decay
parameters: {"value":0.105}
LN scale
parameters: {"enabled":true}
Other
other
Polar Express Newton-Schulz with minimax-optimal coefficients and MuonEq-R row-norm equalization before Newton-Schulz.
parameters: {"steps":4}
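The pipeline described here (equalize row norms, then orthogonalize with Newton-Schulz) can be sketched in pure Python. Two caveats: this uses the classic cubic Newton-Schulz update rather than the Polar Express minimax-optimal coefficients, which are not reproduced here, and treating "row-norm equalization" as scaling each row to unit L2 norm is an assumption. All function names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def equalize_rows(g, eps=1e-12):
    # Assumed reading of MuonEq-R: scale every row to unit L2 norm.
    return [[v / (math.sqrt(sum(t * t for t in row)) + eps) for v in row] for row in g]


def newton_schulz(g, steps=4):
    # Frobenius normalization keeps all singular values at or below 1,
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # Classic cubic update: X <- 1.5 X - 0.5 (X X^T) X.
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x


def orthogonalized_update(g, steps=4):
    return newton_schulz(equalize_rows(g), steps)
```

With only 4 steps the result is an approximation to the orthogonal factor; the cubic iteration needs more steps to converge for ill-conditioned inputs, which is the gap that tuned coefficient schedules aim to close.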

Novel Contributions

  • First submission combining 3-layer depth recurrence with parallel residuals.
  • Merged techniques from PR #1344 and PR #1392 into a single stack that had not been tested together.
  • Identified 2.13 MB of unused artifact headroom under the 16 MB cap.
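The headroom figure follows directly from the artifact size reported above and the 16 MB cap, as this quick check shows (variable names are illustrative):

```python
CAP_MB = 16.0        # submission size cap stated in the PR
ARTIFACT_MB = 13.87  # reported artifact size

headroom_mb = round(CAP_MB - ARTIFACT_MB, 2)
```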