PR #1484 (closed)

Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark)

by AlirezaAlampour
val_bpb: 1.6656
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~8.8 MB

Training Techniques

Architecture
U-Net skip connections
Encoder-decoder style skip connections with learned skip weights and per-block residual mixing from input embeddings.
parameters: null
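The U-Net arrangement can be sketched as follows: the first half of the blocks act as an "encoder" whose outputs are stored, and each "decoder" block mixes in the mirror-image encoder output through a learned scalar skip weight. This is an illustrative numpy sketch, not the PR's code; `blocks` stands in for Transformer blocks, and the exact pairing and mixing scheme are assumptions.

```python
import numpy as np

def unet_transformer_sketch(x0, blocks, skip_weights):
    """Hypothetical U-Net-style forward pass over a stack of blocks.
    The encoder half saves activations; the decoder half adds the
    mirror-image saved activation, scaled by a learned skip weight."""
    n = len(blocks)
    half = n // 2
    stored = []
    x = x0
    for i in range(half):            # "encoder" half: run and save activations
        x = blocks[i](x)
        stored.append(x)
    for i in range(half, n):         # "decoder" half: mix in the paired skip
        skip = stored[n - 1 - i]     # mirror-image pairing of blocks
        x = blocks[i](x + skip_weights[i - half] * skip)
    return x
```

The per-block residual mixing from input embeddings mentioned above would add a second learned term from `x0` at each block; it is omitted here for brevity.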
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"layers":9,"dim":512,"heads":8,"kv_heads":4}
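With 8 query heads sharing 4 KV heads (the parameters listed above), each KV head serves a group of 2 query heads, halving KV cache and projection size. A minimal numpy sketch of the attention math, assuming standard softmax attention without masking:

```python
import numpy as np

def gqa_sketch(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: q is (T, heads*d); k and v are the smaller
    (T, kv_heads*d). Each KV head is shared by heads // kv_heads query heads."""
    T, hd = q.shape
    d = hd // heads
    group = heads // kv_heads                       # query heads per KV head
    q = q.reshape(T, heads, d).transpose(1, 0, 2)   # (heads, T, d)
    k = k.reshape(T, kv_heads, d).transpose(1, 0, 2)
    v = v.reshape(T, kv_heads, d).transpose(1, 0, 2)
    k = np.repeat(k, group, axis=0)                 # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (heads, T, T)
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    attn = scores / scores.sum(-1, keepdims=True)   # softmax over keys
    out = attn @ v                                  # (heads, T, d)
    return out.transpose(1, 0, 2).reshape(T, hd)
```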
LeakyReLU²
Leaky ReLU squared activation in the MLP.
parameters: {"negative_slope":0.5,"squared":true}
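A plausible form of the activation, given the listed parameters: apply LeakyReLU with negative slope 0.5, then square. Whether the square preserves the sign (so negative inputs stay negative) is an assumption here; a sign-free square is the other common variant.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Leaky ReLU followed by a sign-preserving square (assumed form).
    Positive inputs: x -> x**2. Negative inputs: x -> -(slope*x)**2."""
    y = np.where(x >= 0, x, negative_slope * x)  # LeakyReLU
    return np.sign(y) * y * y                    # square, keep the sign
```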
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
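Muon's distinctive step is orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying it as an update. The quintic coefficients below come from the public Muon implementation; this numpy version is a sketch of the iteration only, not the PR's optimizer code, and the Frobenius-norm initialization is the usual cheap stand-in for the spectral norm.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (drive its singular values toward 1)
    via an odd quintic polynomial iteration, as used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]  # iterate on the wide orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # X <- aX + bX^3 + cX^5 (matrix sense)
    return X.T if transposed else X
```

Per the card, Adam handles the embeddings, which are not well suited to this matrix-shaped update.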
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings"}
Quantization
int8
bits: 8
scope: all weight matrices
STE QAT
bits: 8
scope: forward pass
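The two quantization entries combine as int8 fake quantization in the forward pass with a straight-through estimator (STE) for the backward pass. The title mentions per-row scaling; a minimal numpy sketch of the forward transform, with the exact scaling convention being an assumption:

```python
import numpy as np

def fake_quant_int8_per_row(W):
    """Per-row int8 fake quantization for QAT: each row gets its own scale so
    its max magnitude maps to 127. The forward pass uses the dequantized
    weights; with an STE, gradients flow through as if this were identity
    (in autograd frameworks: W + (quantize(W) - W).detach())."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    W_int8 = np.clip(np.round(W / scale), -127, 127)  # integer grid
    return W_int8 * scale                             # dequantized forward weights
```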
Weight Averaging
EMA
parameters: null
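EMA weight averaging keeps a slow-moving copy of the parameters for evaluation. The decay value below is a hypothetical placeholder; the card does not state it.

```python
import numpy as np

def ema_update(avg_params, params, decay=0.999):
    """One EMA step: avg <- decay * avg + (1 - decay) * current.
    decay=0.999 is an assumed placeholder, not from the PR."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```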
Compression
zlib
level: 9
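Checkpoint compression at the listed zlib level 9 can be sketched as below; pickle as the serialization format is an assumption. Int8 weights compress well, which is consistent with the ~8.8 MB artifact size.

```python
import pickle
import zlib

import numpy as np

def compress_checkpoint(params):
    """Serialize parameters and compress with zlib at level 9 (max compression),
    matching the listed setting. Pickle is an assumed serialization choice."""
    return zlib.compress(pickle.dumps([np.asarray(p) for p in params]), level=9)

def decompress_checkpoint(blob):
    """Inverse: decompress and deserialize the parameter list."""
    return pickle.loads(zlib.decompress(blob))
```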
Regularization
logit softcap
parameters: {"cap":30,"activation":"tanh"}
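Logit soft-capping with the listed parameters squashes logits smoothly into (-30, 30) while staying near-identity for small values:

```python
import numpy as np

def softcap_logits(logits, cap=30.0):
    """tanh soft-capping with cap=30, per the listed parameters:
    cap * tanh(logits / cap) bounds logits in (-cap, cap)."""
    return cap * np.tanh(logits / cap)
```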
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"wallclock_based":true}
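A wallclock-based warmdown holds the learning rate and then decays it linearly to zero over the final fraction of the time budget, keyed to elapsed wall time rather than step count. The decay fraction and linear shape below are assumptions; only "warmdown, wallclock-based" comes from the card.

```python
def warmdown_lr(elapsed_s, total_s, base_lr, warmdown_frac=0.4):
    """Sketch of a wallclock-based warmdown schedule: constant base_lr until
    (1 - warmdown_frac) of the time budget has passed, then linear decay to 0.
    warmdown_frac=0.4 is a hypothetical placeholder."""
    start = (1.0 - warmdown_frac) * total_s
    if elapsed_s <= start:
        return base_lr
    frac_left = max(0.0, (total_s - elapsed_s) / (total_s - start))
    return base_lr * frac_left
```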

Novel Contributions

  • U-Net skip connections in a Transformer language model
  • LeakyReLU squared activation
  • Muon optimizer with Newton-Schulz orthogonalization
  • Int8 per-row QAT with straight-through estimator
  • EMA weight averaging
  • zlib-compressed serialized checkpoint
  • Training and evaluation on a single NVIDIA DGX Spark consumer Blackwell GPU