PR #1396

open

Record: Combined 3-Layer Recurrence + Parallel Residuals + Polar Express + Brotli — val_bpb 1.1067 (3-seed mean)

by erichroepke

val_bpb

1.1067

Architecture

Transformer

Optimizer

Muon

Artifact Size

13.87 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: all

Architecture

depth recurrence

3-layer recurrence applied after the max layer, replaying layers 3, 4, and 5.

parameters: {"layers":[3,4,5]}
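
The replay order described above can be sketched as a layer schedule. This is a minimal illustration, not the PR's code: it assumes the replayed block runs a second time immediately after layer 5 (the highest replayed layer), and it assumes an 11-layer stack, taken from the XSA entry. `layer_schedule` is a hypothetical helper name.

```python
def layer_schedule(num_layers, replay):
    """Return the order in which layer indices run in one forward pass.

    Assumption: the replayed layers run once more, with the same weights,
    right after the highest replayed layer; the remaining layers follow.
    """
    last = max(replay)
    order = list(range(last + 1))               # first pass: layers 0..last
    order += list(replay)                       # second pass over shared weights
    order += list(range(last + 1, num_layers))  # rest of the stack
    return order


schedule = layer_schedule(num_layers=11, replay=[3, 4, 5])
print(schedule)
```

Under these assumptions the forward pass executes 14 layer applications while storing weights for only 11 layers, which is why recurrence adds depth without adding artifact size.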

parallel residuals

Attention and MLP are computed in parallel from a specified layer onward.

parameters: {"start_layer":7}
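
A toy sketch of the difference between sequential and parallel residual blocks follows. It uses scalars and stand-in `attn`/`mlp` functions, omits normalization, and assumes 0-indexed layers with `start_layer` inclusive; none of these details are confirmed by the PR.

```python
def sequential_block(x, attn, mlp):
    # Standard block: the MLP sees the output of the attention sub-layer.
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    # Parallel block: attention and MLP both read the same input.
    return x + attn(x) + mlp(x)


def run_stack(x, num_layers, start_layer, attn, mlp):
    # Assumption: layers with index >= start_layer use the parallel form.
    for i in range(num_layers):
        block = parallel_block if i >= start_layer else sequential_block
        x = block(x, attn, mlp)
    return x


# Stand-in sub-layers so the two forms give visibly different results.
attn = lambda x: 2 * x
mlp = lambda x: 3 * x

print(sequential_block(1, attn, mlp))  # 12
print(parallel_block(1, attn, mlp))    # 6
```

With the listed parameter, a call such as `run_stack(x, 11, 7, attn, mlp)` would use the parallel form for layers 7 onward, assuming an 11-layer stack.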

RoPE

Uses RoPE with reduced dimensions.

parameters: {"dimensions":16}
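
One way to read "reduced dimensions" is partial RoPE: only the first 16 entries of each head vector are rotated and the rest pass through unchanged. The sketch below assumes that reading, a head dimension of 64, a base of 10000, and the "rotate-half" pairing (entry `i` paired with entry `i + 8`); all of these are illustrative assumptions.

```python
import math


def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of `vec`; leave the rest as-is."""
    assert rope_dims % 2 == 0 and rope_dims <= len(vec)
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        # Each pair gets its own frequency, decreasing with i.
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i) for i in range(64)]  # hypothetical 64-dim head vector
rotated = partial_rope(head, position=5)
print(rotated[16:] == head[16:])  # True: untouched dimensions pass through
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the unrotated dimensions carry position-independent content.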

XSA

XSA applied across all layers.

parameters: {"layers":11}

U-Net skip connections

U-Net style skip connections with skip gates.

parameters: null

KV head count

Uses 4 KV heads.

parameters: {"kv_heads":4}

Value Residual

Value embeddings are used on later layers.

parameters: {"layers":[9,10]}

Optimizer

Muon

weight_decay: 0.105

momentum: null

other_params: {"muon_eq_r":1}

Compression

Brotli

level: null

Evaluation

sliding window eval

parameters: null
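
Since no parameters are listed, the following is only a generic sketch of sliding-window evaluation: overlapping windows advance by a stride, and each token is scored exactly once, in the first window that contains it. The window and stride values and the helper name `sliding_windows` are hypothetical.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end, score_from) spans.

    Tokens in [score_from, end) are scored; tokens in [start, score_from)
    only provide context because an earlier window already scored them.
    """
    assert 0 < stride <= window
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(num_tokens=10, window=4, stride=2)
print(spans)
```

A smaller stride gives later tokens more context per prediction at the cost of more forward passes.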

Weight Averaging

EMA

parameters: {"decay":0.997}
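
The EMA update with the listed decay can be sketched as below. Weights are plain lists of floats for illustration; how often the PR applies the update and which weights it averages are not stated here.

```python
def ema_update(ema, weights, decay=0.997):
    """Blend the running average toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [0.0, 0.0]
for _ in range(3):
    # Hypothetical constant weights, just to show the average drifting.
    ema = ema_update(ema, [1.0, 2.0])
print(ema)
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997) ≈ 333 updates.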

Regularization

weight decay

parameters: {"value":0.105}

LN scale

parameters: {"enabled":true}

Other

other

Polar Express Newton-Schulz with minimax-optimal coefficients and MuonEq-R row-norm equalization before Newton-Schulz.

parameters: {"steps":4}
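
The overall shape of this step (equalize row norms, then orthogonalize with a few Newton-Schulz iterations) can be sketched in plain Python. Two substitutions are made and should be read as assumptions: the iteration uses the classic cubic coefficients (1.5, -0.5) rather than the Polar Express minimax-optimal coefficients, which are not given here, and "row-norm equalization" is interpreted as scaling each row to unit L2 norm.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def equalize_rows(g, eps=1e-12):
    # Assumed reading of MuonEq-R: give every row unit L2 norm.
    out = []
    for row in g:
        norm = math.sqrt(sum(v * v for v in row)) + eps
        out.append([v / norm for v in row])
    return out


def newton_schulz(g, steps=4, a=1.5, b=-0.5):
    # Scale so all singular values are at most 1, then iterate
    # X <- a*X + b*(X X^T) X, which pushes singular values toward 1.
    fro = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[a * xv + b * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x


grad = [[3.0, 0.0, 4.0], [0.0, 5.0, 12.0]]  # hypothetical 2x3 gradient
update = newton_schulz(equalize_rows(grad), steps=4)
gram = matmul(update, transpose(update))
print(gram)  # close to the 2x2 identity, not exact after only 4 steps
```

Tuned per-step coefficients aim to get closer to an orthogonal update within the same small step budget; the cubic version here only illustrates the mechanics.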

Novel Contributions

First submission combining 3-layer depth recurrence with parallel residuals.
Merged techniques from PR #1344 and PR #1392, which had not previously been tested together, into a single stack.
Identified 2.13 MB of unused artifact headroom under the 16 MB cap.
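
The headroom figure follows directly from the reported artifact size and the cap. A quick check, assuming both values use the same definition of MB:

```python
CAP_MB = 16.0        # size cap stated above
ARTIFACT_MB = 13.87  # reported artifact size

headroom_mb = round(CAP_MB - ARTIFACT_MB, 2)
print(headroom_mb)  # 2.13
```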