PR #1584 (open)

Record: Improved Parallel Residuals + Systems Optimization — val_bpb 1.0752 (3-seed mean)

by codemath3000
val_bpb: 1.0752
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Architecture
weight tying
Tied embeddings are used in the base architecture.
parameters: null
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
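The "16/64" split above means only the first 16 of each head's 64 dimensions get rotated; the rest pass through unchanged. A minimal NumPy sketch of that idea (function name and the 10000 frequency base are illustrative assumptions, not taken from the PR):

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of each head dimension (the
    "16/64" split); leave the remaining dims untouched."""
    d = x.shape[-1]
    assert rot_dims % 2 == 0 and rot_dims <= d
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

Because rotation is norm-preserving, the untouched 48 dimensions and all vector norms come out identical to the input.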
depth recurrence
Loops layers 3–5, with the recurrence activated once 35% of training steps have elapsed.
parameters: {"layers":[3,5],"activated_at_frac":0.35}
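A hedged sketch of such a step-gated recurrence, using the parameters above (the block callbacks and the exact looping rule — one extra pass over layers 3–5 — are assumptions, not the PR's code):

```python
def forward_with_recurrence(blocks, x, step, total_steps,
                            loop_span=(3, 5), activate_frac=0.35):
    """Run the layer stack; once training passes activate_frac of the
    step budget, loop the span [3, 5] one extra time (assumed rule)."""
    recur = step >= activate_frac * total_steps
    for i, block in enumerate(blocks):
        x = block(x)
        if recur and i == loop_span[1]:
            # re-run the recurrent span on the current activations
            for j in range(loop_span[0], loop_span[1] + 1):
                x = blocks[j](x)
    return x
```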
U-Net skip connections
Dual-lane parallel residual architecture with lane-specific skip behavior.
parameters: {"start_layer":8}
layerwise LN scale
Uses layerwise layer-norm scaling.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"sharded_reduce_scatter_all_gather":true,"newton_schulz_steps":5}
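The `newton_schulz_steps: 5` setting refers to Muon's approximate orthogonalization of the momentum-smoothed gradient. A single-process sketch using the quintic iteration and coefficients from the public Muon reference implementation (the sharded reduce-scatter/all-gather distribution and the PR's fused kernel are not shown):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton–Schulz
    iteration used by Muon (coefficients from the reference impl)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep the small dim first
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After 5 steps the singular values cluster loosely around 1, which is all Muon needs.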
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
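The EMA update with decay 0.9965 is the standard exponential moving average over parameters, applied in place:

```python
def ema_update(ema_params, params, decay=0.9965):
    # In-place EMA: ema <- decay * ema + (1 - decay) * param.
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p
```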
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs_per_chunk":3,"learning_rate":0.01,"momentum":0.9}
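"Score-first" means each 32000-token chunk is evaluated with the current weights before the model trains on it, so the reported loss never sees weights adapted to that chunk. A hedged sketch of the control flow (`model_loss` and `model_step` are hypothetical callbacks standing in for the real eval and SGD-with-momentum update; lr 0.01, momentum 0.9, 3 epochs per chunk as in the card):

```python
def score_first_ttt(model_loss, model_step, chunks, epochs_per_chunk=3):
    """Per chunk: score first (legal causal evaluation), then adapt."""
    losses = []
    for chunk in chunks:
        losses.append(model_loss(chunk))   # score before any adaptation
        for _ in range(epochs_per_chunk):
            model_step(chunk)              # then train on the chunk
    return losses
```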
Evaluation
sliding window eval
parameters: null
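Since the model trains at 8192 tokens but is evaluated at 32000, a sliding window gives every token bounded left context while scoring each token exactly once. A sketch under assumed window/stride values (illustrative, not from the PR; `score_fn(ctx, n)` is a hypothetical callback returning the summed NLL of the last `n` tokens of `ctx`):

```python
def sliding_window_nll(score_fn, tokens, window=8192, stride=4096):
    """Slide a fixed-size window over a long sequence; each call scores
    only the tokens not already covered by a previous window."""
    total = 0.0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        n_new = end - prev_end            # tokens not yet scored
        total += score_fn(tokens[begin:end], n_new)
        prev_end = end
        if end == len(tokens):
            break
    return total
```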
LR Schedule
warmdown
parameters: {"warmdown_frac":0.667}
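With `warmdown_frac: 0.667`, the learning rate holds at its base value for the first third of training, then decays linearly to zero over the final two thirds:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.667):
    """Constant LR, then linear decay to zero over the final
    warmdown_frac of the step budget."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```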
Sequence Length
sequence_length
train_length: 8192
eval_length: 32000
Other
other
Systems-level throughput optimizations: fused Muon kernel, batched EMA, and reusable numpy loader preallocation.
parameters: {"fused_muon_kernel":true,"batched_ema":true,"loader_prealloc":true}
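Of the three optimizations, the loader preallocation is the simplest to illustrate: one numpy buffer is allocated up front and refilled each step, instead of allocating fresh arrays per batch. A sketch of that idea only (the fused Muon kernel and foreach-batched EMA are GPU-side and not shown; all names here are illustrative):

```python
import numpy as np

def make_loader(data, batch_size, seq_len):
    """Reusable preallocated (batch, seq+1) buffer, refilled per step."""
    buf = np.empty((batch_size, seq_len + 1), dtype=data.dtype)
    rng = np.random.default_rng(0)
    def next_batch():
        starts = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
        for row, s in enumerate(starts):
            buf[row] = data[s:s + seq_len + 1]   # overwrite, don't allocate
        return buf[:, :-1], buf[:, 1:]           # inputs, shifted targets
    return next_batch
```

Successive batches are views of the same buffer, so per-step allocation (and the attendant allocator churn) disappears.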

Novel Contributions

  • Systems-level optimization of the dual-lane parallel residual architecture without changing the ML recipe
  • Fused Muon kernel combining momentum update, Nesterov extrapolation, row normalization, and Newton-Schulz orthogonalization
  • Batched EMA using foreach operations
  • Reusable numpy preallocated data loader buffer
  • Extra training steps within the same 600s budget due to improved throughput
  • Mixed int6/int8 GPTQ quantization with SDClip and byte-shuffle/Brotli artifact compression
  • Score-first chunk-based TTT with legal causal evaluation
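For the mixed int6/int8 quantization above, a minimal sketch of the bit-width/scale bookkeeping behind the split (per-column symmetric quantize/dequantize only; real GPTQ additionally performs Hessian-weighted error compensation, and SDClip plus byte-shuffle/Brotli are separate artifact-compression steps not shown here):

```python
import numpy as np

def quantize_symmetric(W, bits):
    """Per-column symmetric fake-quantization: round W to a grid of
    2^bits levels and dequantize back, returning the lossy weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # guard all-zero columns
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q * scale

# Per the card: attention/MLP matrices at 6 bits, token embeddings at 8 bits.
```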