val_bpb: 1.0901
Architecture: GPT-2
Optimizer: Muon
Artifact Size: 15,976,317 bytes
Training Techniques
Architecture
U-Net skip connections
Encoder-decoder style skip connections with learnable gating to inject shallow states into deeper layers.
parameters: {"layers":null}
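A minimal PyTorch sketch of what such a learnable skip gate could look like (the scalar gate shape and zero init are assumptions, not taken from the artifact):

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Mixes a shallow hidden state into a deeper one via a learnable gate."""
    def __init__(self):
        super().__init__()
        # zero init => sigmoid(0) = 0.5, so half of the shallow signal passes initially
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, deep, shallow):
        return deep + torch.sigmoid(self.gate) * shallow
```

In a U-Net layout, activations from the first half of the stack would be cached and re-injected through such gates in the mirrored second half.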
parallel residuals
Splits later layers into parallel attention and MLP lanes and merges them with a learnable gate.
parameters: {"parallel_start_layer":7}
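A sketch of such a parallel block, assuming a per-channel sigmoid gate that trades off the two lanes (the exact merge rule is not specified in the report):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Runs attention and MLP on the same input and merges with a learnable gate."""
    def __init__(self, attn, mlp, dim):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        # per-channel gate; zero init gives an even 0.5 / 0.5 mix at start
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        g = torch.sigmoid(self.gate)
        return x + g * self.attn(x) + (1 - g) * self.mlp(x)
```

Per the parameters above, blocks from layer 7 onward would use this form while earlier layers keep the usual sequential residual structure.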
depth recurrence
Traverses selected middle layers multiple times within a single forward pass.
parameters: {"layers":[3,4,5],"start_step":3000}
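The traversal order can be sketched in plain Python (the number of extra loops, here `n_loops=2`, is an assumption; only the layer set and the activation step are given above):

```python
def forward_with_recurrence(blocks, x, recur_layers=(3, 4, 5), n_loops=2,
                            step=0, start_step=3000):
    """Run a forward pass, replaying the selected middle layers after start_step."""
    for i, blk in enumerate(blocks):
        x = blk(x)
        # after finishing the recurrent span once, replay it (n_loops - 1) more times
        if i == max(recur_layers) and step >= start_step:
            for _ in range(n_loops - 1):
                for j in recur_layers:
                    x = blocks[j](x)
    return x
```

Before step 3000 this reduces to a plain forward pass, so the recurrence is switched on mid-training without changing the parameter count.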
Value Residual
Injects auxiliary, token-indexed value embeddings into the attention values of selected layers.
parameters: {"layers":[9,10]}
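One plausible form, sketched here as a learned convex mix between the projected values and a token-indexed value embedding (the mixing rule and init are assumptions):

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """Mixes a learned per-token value embedding into attention values."""
    def __init__(self, vocab_size, head_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, head_dim)
        self.lamb = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, v, idx):
        # v: (T, head_dim) projected attention values; idx: (T,) token ids
        return (1 - self.lamb) * v + self.lamb * self.embed(idx)
```

Per the parameters above, only layers 9 and 10 would apply this mix; the other layers use their value projections unchanged.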
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"ns_steps":5,"warmup_momentum_start":0.92,"warmup_steps":1500}
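The `ns_steps: 5` entry refers to the Newton-Schulz orthogonalization at the heart of Muon; the standard quintic iteration from the published Muon implementation (shown in float32 here rather than bfloat16) looks like this:

```python
import torch

def zeropower_via_newtonschulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # published Muon coefficients
    X = G.float()
    X = X / (X.norm() + 1e-7)          # bring spectral norm below ~1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```

Muon applies this to the momentum buffer of each matrix-shaped weight before the update step; the momentum warmup noted above would ramp momentum from 0.92 to 0.99 over the first 1500 steps.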
AdamW
weight_decay: null
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"eps":1e-8,"scalar_lr":0.02}
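The `scalar_lr` entry suggests the usual split: Muon takes the matrix-shaped weights, while AdamW handles scalar and vector parameters (gains, biases). A sketch of that grouping, with `torch.optim.SGD` standing in for the Muon optimizer, which is not part of torch:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))

# Muon handles 2-D matrix weights; everything else goes to AdamW.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
scalar_params = [p for p in model.parameters() if p.ndim < 2]

# stand-in for Muon (momentum 0.99, weight decay 0.095, 5 Newton-Schulz steps)
muon_like = torch.optim.SGD(matrix_params, lr=0.02, momentum=0.99,
                            weight_decay=0.095)
adamw = torch.optim.AdamW(scalar_params, lr=0.02, betas=(0.9, 0.95),
                          eps=1e-8, weight_decay=0.0)
```

The null `weight_decay` and `momentum` fields above are consistent with AdamW applying no decay to these parameters and using its beta coefficients in place of classical momentum.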
Quantization
GPTQ
bits: 6
scope: all
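For intuition, the 6-bit grid can be sketched with simple per-row round-to-nearest quantization; real GPTQ additionally compensates rounding error column by column using second-order (Hessian) information, which this sketch omits:

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Simplified per-output-channel symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                         # 6 bits: grid [-32, 31]
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

With `scope: all`, every weight matrix is stored on such a grid, which is what makes the 16MB artifact budget reachable before compression.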
Regularization
weight decay
parameters: {"value":0.095}
Compression
custom (Brotli)
level: 11
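Level 11 is Brotli's maximum quality setting; assuming the `brotli` Python package, packing the artifact would look like:

```python
import brotli  # assumes the `brotli` (or compatible) package is installed

raw = bytes(range(256)) * 64          # stand-in for the serialized artifact bytes
packed = brotli.compress(raw, quality=11)  # quality 11 = maximum compression
restored = brotli.decompress(packed)
```

Maximum quality is slow to compress but costs nothing at load time, a reasonable trade for a write-once artifact measured against a size limit.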
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
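The window bookkeeping can be sketched as follows: each window sees up to `context_length` tokens of context, but only the final `stride` positions are newly scored, so every token is evaluated exactly once with near-full context (the assumption that only the trailing stride is scored follows standard strided-perplexity practice):

```python
def sliding_window_targets(n_tokens, context_length=2048, stride=64):
    """Yield (start, end, n_scored) spans covering n_tokens without double-counting."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - context_length)  # context for this window
        end = min(pos + stride, n_tokens)
        yield start, end, end - pos                    # only the new tokens are scored
        pos = end
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes; stride 64 against a 2048-token context is near the context-rich end of that trade-off.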
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"fraction":0.667}
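Reading `fraction: 0.667` as the share of training spent in linear decay (an assumption; the report does not define the field), the schedule is constant LR followed by a linear ramp to zero:

```python
def lr_schedule(step, total_steps, base_lr, warmdown_frac=0.667):
    """Constant LR, then linear decay to zero over the final warmdown fraction."""
    warmdown_steps = int(total_steps * warmdown_frac)
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```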
Novel Contributions
- U-Net style skip connections with learnable gating
- Parallel residual lanes for attention and MLP in later layers
- Depth recurrence over middle layers
- Value embedding enhancements in attention
- Muon optimization for matrix weights with Newton-Schulz steps
- GPTQ 6-bit post-training quantization
- Selective pruning to fit under the 16MB artifact limit
- Brotli-based artifact compression
- Sliding window evaluation with stride 64