PR #1583

open

Record: SP8192 + Systems Optimization — val_bpb 1.0801 (3-seed mean)

by codemath3000
val_bpb: 1.0801
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Loops layers 3-5 to expand the 11 physical layers into 17 virtual layers.
parameters: {"layers":3,"virtual_layers":17,"physical_layers":11}
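The parameters are mutually consistent: if layers 3-5 each run 3 times while the other 8 layers run once, the model executes 8 + 3×3 = 17 virtual layers. A minimal sketch of the unrolled execution schedule (0-indexed layer ids assumed; function name hypothetical):

```python
def virtual_layer_schedule(n_physical=11, loop_start=3, loop_end=5, n_loops=3):
    """Unroll the execution order: the block [loop_start, loop_end] runs
    n_loops times in sequence; every other layer runs once."""
    pre = list(range(loop_start))
    looped = list(range(loop_start, loop_end + 1)) * n_loops
    post = list(range(loop_end + 1, n_physical))
    return pre + looped + post
```

With the PR's parameters this yields a 17-entry schedule over 11 physical layers.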
U-Net skip connections
U-Net-style skip connections, each gated by a sigmoid.
parameters: null
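A minimal sketch of one sigmoid-gated skip, assuming a scalar learned gate per connection and the additive form `x + sigmoid(g) * skip` (the PR does not specify the gate granularity or blend form):

```python
import math

def gated_skip(decoder_act, encoder_act, gate_logit):
    """Blend a saved early-layer activation into a later-layer activation
    through a sigmoid gate: x + sigmoid(g) * skip."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [x + g * s for x, s in zip(decoder_act, encoder_act)]
```

At initialization (gate_logit = 0) the gate passes half the skip signal; training can open or close it per connection.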
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
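With slope 0.5 the negative half-line keeps half its magnitude, a much milder nonlinearity than the usual 0.01 slope:

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0 else slope * x
```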
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
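A single-head sketch of partial RoPE, assuming the rotated dims are the first 16 of the 64-dim head and the half-split rotation layout (both assumptions; the PR does not say which dims are rotated or how they are paired):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` dims of each position's head vector;
    pass the remaining dims through unchanged.  x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

The 48 unrotated dims carry position-independent content; only the 16 rotated dims encode relative position.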
weight tying
Tied input and output embeddings.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"newton_schulz_steps":5,"flat_buffer_all_reduce":true}
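Muon orthogonalizes each 2-D momentum/gradient matrix with a few Newton-Schulz iterations before applying it as an update (here `newton_schulz_steps: 5`). A NumPy sketch using the quintic iteration and coefficients from the public Muon reference implementation; this illustrates the math only, not this PR's fused kernel:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G: drive its singular values toward 1
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                           # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X
```

After five steps the singular values land in a band around 1 rather than converging exactly, which is sufficient for the update direction.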
Weight Averaging
EMA
parameters: {"decay":0.9965}
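A decay of 0.9965 gives an averaging horizon of roughly 1/(1-0.9965) ≈ 286 steps. A scalar sketch of the update that the PR batches with foreach ops (in PyTorch the whole loop collapses to one batched lerp over all parameter tensors):

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step, in place: ema <- decay * ema + (1 - decay) * params.
    Equivalently lerp(ema, params, 1 - decay), which maps onto a single
    batched foreach call instead of a per-tensor Python loop."""
    for i in range(len(ema)):
        ema[i] = decay * ema[i] + (1.0 - decay) * params[i]
```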
Evaluation
sliding window eval
parameters: {"batch_size":128}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3,"chunk_length":32000}
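Score-first TTT evaluates each chunk with the current weights before adapting on it, so every chunk is scored by a model that has not yet seen it. A sketch of the loop (function names hypothetical; the optimizer details — lr 0.005, momentum 0.9 — live inside `train_fn`):

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_length=32000, epochs=3):
    """Per chunk: score first with the current weights, then train on the
    chunk for `epochs` passes before moving to the next chunk."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens), chunk_length):
        chunk = tokens[start:start + chunk_length]
        total_loss += score_fn(chunk) * len(chunk)   # score before adapting
        total_tokens += len(chunk)
        for _ in range(epochs):                      # then adapt on the chunk
            train_fn(chunk)
    return total_loss / total_tokens
```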
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
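GPTQ rounds weights to a uniform per-channel grid and then compensates the rounding error column by column using second-order (Hessian) information. The sketch below shows only the grid/bit-width part — how 6-bit matrices and 8-bit embeddings differ — not GPTQ's error compensation:

```python
import numpy as np

def quantize_per_channel(W, bits):
    """Uniform symmetric per-row quantization to `bits` bits.
    GPTQ adds Hessian-weighted rounding-error compensation on top of
    this basic round-to-grid step."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale
```

Six bits gives a 64-level grid (range -32..31), trading ~25% smaller weights for coarser rounding than the 8-bit grid used for embeddings.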
Compression
Brotli
level: 11

Novel Contributions

  • Systems-level optimization of the PR #1493 stack without changing the ML recipe
  • Fused Muon kernel for faster training steps
  • Batched EMA updates using foreach ops
  • Muon preallocation and foreach weight updates
  • Superchunk sliding-window evaluation with strided views
  • Rank-0-only GPTQ serialization
  • Increased evaluation batch size to 128
  • Legal test-time training under the contest rules
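The superchunk sliding-window evaluation above can be sketched with a zero-copy strided view: load one long contiguous token buffer, then expose overlapping windows without materializing copies. A NumPy sketch (the PR presumably uses the torch equivalent; window and stride values here are illustrative, not the PR's):

```python
import numpy as np

def sliding_windows(tokens, window, stride):
    """Overlapping evaluation windows as a zero-copy strided view over
    one contiguous 1-D token buffer (a "superchunk")."""
    n = (len(tokens) - window) // stride + 1
    itemsize = tokens.strides[0]
    return np.lib.stride_tricks.as_strided(
        tokens, shape=(n, window), strides=(stride * itemsize, itemsize))
```

Because every window aliases the same buffer, batching windows for evaluation costs no extra memory beyond the superchunk itself.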