PR #1583

open

Record: SP8192 + Systems Optimization — val_bpb 1.0801 (3-seed mean)

by codemath3000
val_bpb: 1.0801
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Loops layers 3-5 to expand the 11 physical layers into 17 virtual layers.
parameters: {"layers":3,"virtual_layers":17,"physical_layers":11}
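The parameters are mutually consistent: if layers 3-5 each run 3 times while the other 8 layers run once, the model executes 8 + 3×3 = 17 virtual layers. A minimal sketch of the unrolled execution schedule (0-indexed layer ids assumed; function name hypothetical):

```python
def virtual_layer_schedule(n_physical=11, loop_start=3, loop_end=5, n_loops=3):
    """Unroll the execution order: the block [loop_start, loop_end] runs
    n_loops times in sequence; every other layer runs once."""
    pre = list(range(loop_start))
    looped = list(range(loop_start, loop_end + 1)) * n_loops
    post = list(range(loop_end + 1, n_physical))
    return pre + looped + post
```

With the PR's parameters this yields a 17-entry schedule over 11 physical layers.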
U-Net skip connections
U-Net-style skip connections, each gated by a sigmoid.
parameters: null
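A minimal sketch of one sigmoid-gated skip, assuming a scalar learned gate per connection and the additive form `x + sigmoid(g) * skip` (the PR does not specify the gate granularity or blend form):

```python
import math

def gated_skip(decoder_act, encoder_act, gate_logit):
    """Blend a saved early-layer activation into a later-layer activation
    through a sigmoid gate: x + sigmoid(g) * skip."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [x + g * s for x, s in zip(decoder_act, encoder_act)]
```

At initialization (gate_logit = 0) the gate passes half the skip signal; training can open or close it per connection.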
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
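With slope 0.5 the negative half-line keeps half its magnitude, a much milder nonlinearity than the usual 0.01 slope:

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0 else slope * x
```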
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
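A single-head sketch of partial RoPE, assuming the rotated dims are the first 16 of the 64-dim head and the half-split rotation layout (both assumptions; the PR does not say which dims are rotated or how they are paired):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` dims of each position's head vector;
    pass the remaining dims through unchanged.  x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

The 48 unrotated dims carry position-independent content; only the 16 rotated dims encode relative position.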
weight tying
Tied input and output embeddings.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"newton_schulz_steps":5,"flat_buffer_all_reduce":true}
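Muon orthogonalizes each 2-D momentum/gradient matrix with a few Newton-Schulz iterations before applying it as an update (here `newton_schulz_steps: 5`). A NumPy sketch using the quintic iteration and coefficients from the public Muon reference implementation; this illustrates the math only, not this PR's fused kernel:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G: drive its singular values toward 1
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                           # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X
```

After five steps the singular values land in a band around 1 rather than converging exactly, which is sufficient for the update direction.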
Weight Averaging
EMA
parameters: {"decay":0.9965}
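A decay of 0.9965 gives an averaging horizon of roughly 1/(1-0.9965) ≈ 286 steps. A scalar sketch of the update that the PR batches with foreach ops (in PyTorch the whole loop collapses to one batched lerp over all parameter tensors):

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step, in place: ema <- decay * ema + (1 - decay) * params.
    Equivalently lerp(ema, params, 1 - decay), which maps onto a single
    batched foreach call instead of a per-tensor Python loop."""
    for i in range(len(ema)):
        ema[i] = decay * ema[i] + (1.0 - decay) * params[i]
```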
Evaluation
sliding window eval
parameters: {"batch_size":128}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3,"chunk_length":32000}
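Score-first TTT evaluates each chunk with the current weights before adapting on it, so every chunk is scored by a model that has not yet seen it. A sketch of the loop (function names hypothetical; the optimizer details — lr 0.005, momentum 0.9 — live inside `train_fn`):

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_length=32000, epochs=3):
    """Per chunk: score first with the current weights, then train on the
    chunk for `epochs` passes before moving to the next chunk."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens), chunk_length):
        chunk = tokens[start:start + chunk_length]
        total_loss += score_fn(chunk) * len(chunk)   # score before adapting
        total_tokens += len(chunk)
        for _ in range(epochs):                      # then adapt on the chunk
            train_fn(chunk)
    return total_loss / total_tokens
```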
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
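GPTQ rounds weights to a uniform per-channel grid and then compensates the rounding error column by column using second-order (Hessian) information. The sketch below shows only the grid/bit-width part — how 6-bit matrices and 8-bit embeddings differ — not GPTQ's error compensation:

```python
import numpy as np

def quantize_per_channel(W, bits):
    """Uniform symmetric per-row quantization to `bits` bits.
    GPTQ adds Hessian-weighted rounding-error compensation on top of
    this basic round-to-grid step."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale
```

Six bits gives a 64-level grid (range -32..31), trading ~25% smaller weights for coarser rounding than the 8-bit grid used for embeddings.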
Compression
Brotli
level: 11

Novel Contributions

  • Systems-level optimization of the PR #1493 stack without changing the ML recipe
  • Fused Muon kernel for faster training steps
  • Batched EMA updates using foreach ops
  • Muon preallocation and foreach weight updates
  • Superchunk sliding-window evaluation with strided views
  • Rank-0-only GPTQ serialization
  • Increased evaluation batch size to 128
  • Legal test-time training under the contest rules
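The superchunk sliding-window evaluation above can be sketched with a zero-copy strided view: load one long contiguous token buffer, then expose overlapping windows without materializing copies. A NumPy sketch (the PR presumably uses the torch equivalent; window and stride values here are illustrative, not the PR's):

```python
import numpy as np

def sliding_windows(tokens, window, stride):
    """Overlapping evaluation windows as a zero-copy strided view over
    one contiguous 1-D token buffer (a "superchunk")."""
    n = (len(tokens) - window) // stride + 1
    itemsize = tokens.strides[0]
    return np.lib.stride_tricks.as_strided(
        tokens, shape=(n, window), strides=(stride * itemsize, itemsize))
```

Because every window aliases the same buffer, batching windows for evaluation costs no extra memory beyond the superchunk itself.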