PR #1584 (open)

Record: Improved Parallel Residuals + Systems Optimization — val_bpb 1.0752 (3-seed mean)

by codemath3000
val_bpb: 1.0752
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Architecture
weight tying
Tied embeddings are used in the base architecture.
parameters: null
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
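The "16/64" split above means only the first 16 of each head's 64 dimensions get rotated; the rest pass through unchanged. A minimal NumPy sketch of that idea (function name and the 10000 frequency base are illustrative assumptions, not taken from the PR):

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of each head dimension (the
    "16/64" split); leave the remaining dims untouched."""
    d = x.shape[-1]
    assert rot_dims % 2 == 0 and rot_dims <= d
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

Because rotation is norm-preserving, the untouched 48 dimensions and all vector norms come out identical to the input.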
depth recurrence
Loops layers 3–5, with the recurrence activated once 35% of training steps have elapsed.
parameters: {"layers":[3,5],"activated_at_frac":0.35}
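A hedged sketch of such a step-gated recurrence, using the parameters above (the block callbacks and the exact looping rule — one extra pass over layers 3–5 — are assumptions, not the PR's code):

```python
def forward_with_recurrence(blocks, x, step, total_steps,
                            loop_span=(3, 5), activate_frac=0.35):
    """Run the layer stack; once training passes activate_frac of the
    step budget, loop the span [3, 5] one extra time (assumed rule)."""
    recur = step >= activate_frac * total_steps
    for i, block in enumerate(blocks):
        x = block(x)
        if recur and i == loop_span[1]:
            # re-run the recurrent span on the current activations
            for j in range(loop_span[0], loop_span[1] + 1):
                x = blocks[j](x)
    return x
```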
U-Net skip connections
Dual-lane parallel residual architecture with lane-specific skip behavior.
parameters: {"start_layer":8}
layerwise LN scale
Uses layerwise layer-norm scaling.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"sharded_reduce_scatter_all_gather":true,"newton_schulz_steps":5}
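The `newton_schulz_steps: 5` setting refers to Muon's approximate orthogonalization of the momentum-smoothed gradient. A single-process sketch using the quintic iteration and coefficients from the public Muon reference implementation (the sharded reduce-scatter/all-gather distribution and the PR's fused kernel are not shown):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton–Schulz
    iteration used by Muon (coefficients from the reference impl)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep the small dim first
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After 5 steps the singular values cluster loosely around 1, which is all Muon needs.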
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
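The EMA update with decay 0.9965 is the standard exponential moving average over parameters, applied in place:

```python
def ema_update(ema_params, params, decay=0.9965):
    # In-place EMA: ema <- decay * ema + (1 - decay) * param.
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p
```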
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs_per_chunk":3,"learning_rate":0.01,"momentum":0.9}
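"Score-first" means each 32000-token chunk is evaluated with the current weights before the model trains on it, so the reported loss never sees weights adapted to that chunk. A hedged sketch of the control flow (`model_loss` and `model_step` are hypothetical callbacks standing in for the real eval and SGD-with-momentum update; lr 0.01, momentum 0.9, 3 epochs per chunk as in the card):

```python
def score_first_ttt(model_loss, model_step, chunks, epochs_per_chunk=3):
    """Per chunk: score first (legal causal evaluation), then adapt."""
    losses = []
    for chunk in chunks:
        losses.append(model_loss(chunk))   # score before any adaptation
        for _ in range(epochs_per_chunk):
            model_step(chunk)              # then train on the chunk
    return losses
```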
Evaluation
sliding window eval
parameters: null
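Since the model trains at 8192 tokens but is evaluated at 32000, a sliding window gives every token bounded left context while scoring each token exactly once. A sketch under assumed window/stride values (illustrative, not from the PR; `score_fn(ctx, n)` is a hypothetical callback returning the summed NLL of the last `n` tokens of `ctx`):

```python
def sliding_window_nll(score_fn, tokens, window=8192, stride=4096):
    """Slide a fixed-size window over a long sequence; each call scores
    only the tokens not already covered by a previous window."""
    total = 0.0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        n_new = end - prev_end            # tokens not yet scored
        total += score_fn(tokens[begin:end], n_new)
        prev_end = end
        if end == len(tokens):
            break
    return total
```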
LR Schedule
warmdown
parameters: {"warmdown_frac":0.667}
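With `warmdown_frac: 0.667`, the learning rate holds at its base value for the first third of training, then decays linearly to zero over the final two thirds:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.667):
    """Constant LR, then linear decay to zero over the final
    warmdown_frac of the step budget."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```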
Sequence Length
sequence_length
train_length: 8192
eval_length: 32000
Other
other
Systems-level throughput optimizations: fused Muon kernel, batched EMA, and reusable numpy loader preallocation.
parameters: {"fused_muon_kernel":true,"batched_ema":true,"loader_prealloc":true}
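Of the three optimizations, the loader preallocation is the simplest to illustrate: one numpy buffer is allocated up front and refilled each step, instead of allocating fresh arrays per batch. A sketch of that idea only (the fused Muon kernel and foreach-batched EMA are GPU-side and not shown; all names here are illustrative):

```python
import numpy as np

def make_loader(data, batch_size, seq_len):
    """Reusable preallocated (batch, seq+1) buffer, refilled per step."""
    buf = np.empty((batch_size, seq_len + 1), dtype=data.dtype)
    rng = np.random.default_rng(0)
    def next_batch():
        starts = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
        for row, s in enumerate(starts):
            buf[row] = data[s:s + seq_len + 1]   # overwrite, don't allocate
        return buf[:, :-1], buf[:, 1:]           # inputs, shifted targets
    return next_batch
```

Successive batches are views of the same buffer, so per-step allocation (and the attendant allocator churn) disappears.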

Novel Contributions

  • Systems-level optimization of the dual-lane parallel residual architecture without changing the ML recipe
  • Fused Muon kernel combining momentum update, Nesterov extrapolation, row normalization, and Newton-Schulz orthogonalization
  • Batched EMA using foreach operations
  • Reusable numpy preallocated data loader buffer
  • Extra training steps within the same 600s budget due to improved throughput
  • Mixed int6/int8 GPTQ quantization with SDClip and byte-shuffle/Brotli artifact compression
  • Score-first chunk-based TTT with legal causal evaluation
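For the mixed int6/int8 quantization above, a minimal sketch of the bit-width/scale bookkeeping behind the split (per-column symmetric quantize/dequantize only; real GPTQ additionally performs Hessian-weighted error compensation, and SDClip plus byte-shuffle/Brotli are separate artifact-compression steps not shown here):

```python
import numpy as np

def quantize_symmetric(W, bits):
    """Per-column symmetric fake-quantization: round W to a grid of
    2^bits levels and dequantize back, returning the lossy weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # guard all-zero columns
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q * scale

# Per the card: attention/MLP matrices at 6 bits, token embeddings at 8 bits.
```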