PR #1072

open

Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)

val_bpb: 1.1170
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.95 MB

Training Techniques

Architecture
LeakyReLU
Uses squared LeakyReLU (negative slope 0.5) as the MLP activation, fused with the up-projection and down-projection for efficiency.
parameters: {"slope":0.5,"squared":true}
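For reference, the activation itself can be written unfused in a few lines. This is a minimal sketch, assuming the square is applied to the LeakyReLU output (the order the name suggests); the function name is hypothetical and the PR's fused Triton kernel is not reproduced here.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Unfused reference: LeakyReLU with the given negative slope, then square.

    Assumption: the square follows the LeakyReLU, so the output is never negative.
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```

With slope 0.5, a negative input contributes a quarter of what the same positive magnitude would, since (0.5·x)² = 0.25·x².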
XSA
Applies XSA in all 11 layers.
parameters: {"layers":11}
BigramHash
Uses BigramHash embeddings.
parameters: {"dimensions":4096}
VE128
Uses value embeddings in later layers.
parameters: {"layers":[9,10]}
SmearGate
Includes SmearGate in the architecture.
parameters: null
Partial RoPE
Uses partial rotary position embeddings.
parameters: {"dimensions":"16/64"}
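A rough sketch of partial RoPE, reading "16/64" as 16 rotated dimensions out of a 64-dimensional head. That reading, the half-split pairing of dimensions, the base of 10000, and the function name are all assumptions for illustration, not details confirmed by the PR.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec by position-dependent angles.

    Entries at index rot_dims and beyond pass through unchanged.
    Assumptions: dimension i is paired with i + rot_dims // 2, base is 10000.
    """
    half = rot_dims // 2
    out = list(vec)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved, and the 48 unrotated dimensions carry no positional signal.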
U-Net skip connections
Uses U-Net style skip connections with encoder-decoder structure.
parameters: {"encoder":5,"decoder":6}
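One way to picture the 5/6 split is as a stack: encoder layers push their outputs and decoder layers pop them. The sketch below assumes last-in, first-out pairing and that the final decoder layer receives no skip; neither detail is stated in the PR, and the function name is hypothetical.

```python
def skip_pairs(encoder=5, decoder=6):
    """Return (decoder_layer_index, encoder_source_index or None) pairs.

    Assumption: LIFO pairing, so the first decoder layer reads the last
    encoder layer's output; any decoder layer left over gets no skip input.
    """
    stack = list(range(encoder))  # encoder layer indices, in push order
    pairs = []
    for d in range(decoder):
        layer = encoder + d
        src = stack.pop() if stack else None
        pairs.append((layer, src))
    return pairs
```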
GQA
Uses grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
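With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A small sketch of that mapping, assuming contiguous grouping (the usual convention, though not confirmed by the PR); the function name is hypothetical.

```python
def kv_head_for_query(q_head: int, query_heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query head index to the KV head it shares.

    Assumption: query heads are grouped contiguously, group size = query_heads // kv_heads.
    """
    group = query_heads // kv_heads
    return q_head // group
```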
Regularization
logit softcap
parameters: {"value":30}
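A logit softcap is commonly implemented with a scaled tanh. The sketch below assumes that form; the PR only records the cap value of 30, and the function name is hypothetical.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into (-cap, cap).

    Assumption: the tanh form cap * tanh(logit / cap); near zero it is close to the identity.
    """
    return cap * math.tanh(logit / cap)
```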
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
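The formula can be evaluated per layer as follows. This sketch assumes a 0-based layer index, so the first layer is unscaled; the function name is hypothetical.

```python
import math

def ln_scale(layer: int) -> float:
    """Scale factor 1/sqrt(layer + 1). Assumption: layer is 0-based."""
    return 1.0 / math.sqrt(layer + 1)

# Scale factors for an 11-layer stack (layers 0..10).
scales = [ln_scale(i) for i in range(11)]
```

Under this reading, deeper layers are damped progressively, with layer 3 scaled by exactly one half.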
magnitude pruning
parameters: {"type":"selective ±1 pruning"}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true,"overlapped_reduce_scatter_all_gather":true,"ddp":false}
Weight Averaging
EMA
parameters: {"decay":0.997}
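The EMA update with decay 0.997 can be sketched as below, using flat lists of floats to stand in for model parameters. The function name and the list representation are illustrative assumptions.

```python
def ema_update(ema, params, decay=0.997):
    """Return the new EMA: decay * old_ema + (1 - decay) * current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

At this decay, each step moves the average 0.3% of the way toward the current weights.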
SWA
parameters: {"every_steps":50}
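A minimal sketch of snapshot averaging every 50 steps, keeping a running mean of the sampled weights. The class name, the flat-list parameter representation, and the choice to start sampling from the first multiple of 50 are assumptions; the PR does not say when SWA begins or how it is combined with EMA.

```python
class SWA:
    """Running mean of parameter snapshots taken every `every_steps` steps."""

    def __init__(self, every_steps=50):
        self.every_steps = every_steps
        self.count = 0
        self.avg = None

    def maybe_update(self, step, params):
        """Fold params into the average if step is a multiple of every_steps."""
        if step % self.every_steps != 0:
            return False
        self.count += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # Incremental mean: avg += (x - avg) / n
            self.avg = [a + (p - a) / self.count for a, p in zip(self.avg, params)]
        return True
```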
Quantization
GPTQ
bits: 6
scope: all
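To make the 6-bit budget concrete, here is a plain round-to-nearest quantizer with a symmetric integer range. This is deliberately not GPTQ: GPTQ additionally uses a Hessian estimate to compensate rounding error across weights. The per-tensor scale, the symmetric range of ±31, and the function names are assumptions for illustration.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization (a baseline, not GPTQ).

    Assumption: one scale per tensor and an integer range of [-qmax, qmax],
    where qmax = 2**(bits - 1) - 1 (31 for 6 bits).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floats."""
    return [v * scale for v in q]
```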
QAT
bits: null
scope: all
Compression
lzma
level: 9
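In Python's standard library, level 9 corresponds to `preset=9` in the `lzma` module. A small sketch, assuming the artifact is available as a bytes object; the function name is hypothetical, and the PR's actual container format and serialization are not shown.

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Compress a serialized artifact with LZMA at the highest standard preset."""
    return lzma.compress(blob, preset=9)
```

The result round-trips through `lzma.decompress`, so the compressed size can be checked against a size budget before submission.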
Evaluation
sliding window eval
parameters: {"stride":16,"context_length":2048}
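The window layout for sliding evaluation can be sketched as follows: each window spans up to 2048 tokens, advances by 16, and scores only the tokens not yet scored, so later tokens are always predicted with nearly full context. The function name and the tuple layout are hypothetical, and how the PR batches these windows is not shown.

```python
def sliding_windows(n_tokens, context_length=2048, stride=16):
    """Return (start, end, score_from) triples covering n_tokens.

    The model would run on tokens[start:end] and only positions
    score_from..end-1 would contribute to the loss, so each token is scored once.
    """
    windows = []
    scored_to = 0
    end = min(context_length, n_tokens)
    while True:
        start = max(0, end - context_length)
        windows.append((start, end, scored_to))
        scored_to = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return windows
```

A smaller stride means more forward passes per token scored, which is the cost paid for the longer effective context.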
Sequence Length
sequence_length
train_length: null
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
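A warmdown schedule holds the learning rate constant and then decays it over the final steps. The sketch below assumes a linear decay to zero over the last 3000 steps; the decay shape, the final value, the `total_steps` argument, and the function name are assumptions not given in the PR.

```python
def lr_multiplier(step, total_steps, warmdown_steps=3000):
    """Multiplier applied to the base learning rate.

    Assumption: 1.0 until the last warmdown_steps, then linear decay to 0.
    """
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(0.0, remaining / warmdown_steps)
```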

Novel Contributions

  • Fused Triton MLP kernel that computes the linear projection, LeakyReLU(0.5), and squaring in a single GPU pass
  • Online Hessian GPTQ accumulation during training via periodic uncompiled forward passes
  • Selective ±1 pruning to fit the artifact under the 16 MB limit
  • Parallel Muon training with overlapped communication
  • Sliding window evaluation with stride 16
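The online Hessian accumulation listed above can be illustrated in isolation. GPTQ relies on a per-layer statistic proportional to XᵀX, where the rows of X are the inputs seen by a linear layer; accumulating it during training amounts to adding each sampled batch's contribution as it arrives. The sketch below uses plain nested lists, and the function name, the omission of any constant factor or normalization, and the sampling schedule are assumptions rather than the PR's implementation.

```python
def accumulate_hessian(H, X):
    """Add X^T X into H in place and return H.

    H is a d x d list of lists; X is a batch whose rows are d-dimensional
    input activations to one linear layer. Assumption: no scaling or
    averaging is applied here.
    """
    for row in X:
        for i, xi in enumerate(row):
            for j, xj in enumerate(row):
                H[i][j] += xi * xj
    return H
```

Calling this on a batch captured every so often during training removes the need for a separate calibration pass after training ends.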