PR #1484 (closed)

Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark)

by AlirezaAlampour
val_bpb: 1.6656
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~8.8 MB

Training Techniques

Architecture
U-Net skip connections
Encoder-decoder style skip connections with learned skip weights and per-block residual mixing from input embeddings.
parameters: null
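The U-Net arrangement can be sketched as follows: the first half of the blocks act as an "encoder" whose outputs are stored, and each "decoder" block mixes in the mirror-image encoder output through a learned scalar skip weight. This is an illustrative numpy sketch, not the PR's code; `blocks` stands in for Transformer blocks, and the exact pairing and mixing scheme are assumptions.

```python
import numpy as np

def unet_transformer_sketch(x0, blocks, skip_weights):
    """Hypothetical U-Net-style forward pass over a stack of blocks.
    The encoder half saves activations; the decoder half adds the
    mirror-image saved activation, scaled by a learned skip weight."""
    n = len(blocks)
    half = n // 2
    stored = []
    x = x0
    for i in range(half):            # "encoder" half: run and save activations
        x = blocks[i](x)
        stored.append(x)
    for i in range(half, n):         # "decoder" half: mix in the paired skip
        skip = stored[n - 1 - i]     # mirror-image pairing of blocks
        x = blocks[i](x + skip_weights[i - half] * skip)
    return x
```

The per-block residual mixing from input embeddings mentioned above would add a second learned term from `x0` at each block; it is omitted here for brevity.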
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"layers":9,"dim":512,"heads":8,"kv_heads":4}
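With 8 query heads sharing 4 KV heads (the parameters listed above), each KV head serves a group of 2 query heads, halving KV cache and projection size. A minimal numpy sketch of the attention math, assuming standard softmax attention without masking:

```python
import numpy as np

def gqa_sketch(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: q is (T, heads*d); k and v are the smaller
    (T, kv_heads*d). Each KV head is shared by heads // kv_heads query heads."""
    T, hd = q.shape
    d = hd // heads
    group = heads // kv_heads                       # query heads per KV head
    q = q.reshape(T, heads, d).transpose(1, 0, 2)   # (heads, T, d)
    k = k.reshape(T, kv_heads, d).transpose(1, 0, 2)
    v = v.reshape(T, kv_heads, d).transpose(1, 0, 2)
    k = np.repeat(k, group, axis=0)                 # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (heads, T, T)
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    attn = scores / scores.sum(-1, keepdims=True)   # softmax over keys
    out = attn @ v                                  # (heads, T, d)
    return out.transpose(1, 0, 2).reshape(T, hd)
```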
LeakyReLU²
Leaky ReLU squared activation in the MLP.
parameters: {"negative_slope":0.5,"squared":true}
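A plausible form of the activation, given the listed parameters: apply LeakyReLU with negative slope 0.5, then square. Whether the square preserves the sign (so negative inputs stay negative) is an assumption here; a sign-free square is the other common variant.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Leaky ReLU followed by a sign-preserving square (assumed form).
    Positive inputs: x -> x**2. Negative inputs: x -> -(slope*x)**2."""
    y = np.where(x >= 0, x, negative_slope * x)  # LeakyReLU
    return np.sign(y) * y * y                    # square, keep the sign
```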
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
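Muon's distinctive step is orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying it as an update. The quintic coefficients below come from the public Muon implementation; this numpy version is a sketch of the iteration only, not the PR's optimizer code, and the Frobenius-norm initialization is the usual cheap stand-in for the spectral norm.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (drive its singular values toward 1)
    via an odd quintic polynomial iteration, as used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]  # iterate on the wide orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # X <- aX + bX^3 + cX^5 (matrix sense)
    return X.T if transposed else X
```

Per the card, Adam handles the embeddings, which are not well suited to this matrix-shaped update.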
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings"}
Quantization
int8
bits: 8
scope: all weight matrices
STE QAT
bits: 8
scope: forward pass
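The two quantization entries combine as int8 fake quantization in the forward pass with a straight-through estimator (STE) for the backward pass. The title mentions per-row scaling; a minimal numpy sketch of the forward transform, with the exact scaling convention being an assumption:

```python
import numpy as np

def fake_quant_int8_per_row(W):
    """Per-row int8 fake quantization for QAT: each row gets its own scale so
    its max magnitude maps to 127. The forward pass uses the dequantized
    weights; with an STE, gradients flow through as if this were identity
    (in autograd frameworks: W + (quantize(W) - W).detach())."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    W_int8 = np.clip(np.round(W / scale), -127, 127)  # integer grid
    return W_int8 * scale                             # dequantized forward weights
```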
Weight Averaging
EMA
parameters: null
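EMA weight averaging keeps a slow-moving copy of the parameters for evaluation. The decay value below is a hypothetical placeholder; the card does not state it.

```python
import numpy as np

def ema_update(avg_params, params, decay=0.999):
    """One EMA step: avg <- decay * avg + (1 - decay) * current.
    decay=0.999 is an assumed placeholder, not from the PR."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```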
Compression
zlib
level: 9
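Checkpoint compression at the listed zlib level 9 can be sketched as below; pickle as the serialization format is an assumption. Int8 weights compress well, which is consistent with the ~8.8 MB artifact size.

```python
import pickle
import zlib

import numpy as np

def compress_checkpoint(params):
    """Serialize parameters and compress with zlib at level 9 (max compression),
    matching the listed setting. Pickle is an assumed serialization choice."""
    return zlib.compress(pickle.dumps([np.asarray(p) for p in params]), level=9)

def decompress_checkpoint(blob):
    """Inverse: decompress and deserialize the parameter list."""
    return pickle.loads(zlib.decompress(blob))
```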
Regularization
logit softcap
parameters: {"cap":30,"activation":"tanh"}
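Logit soft-capping with the listed parameters squashes logits smoothly into (-30, 30) while staying near-identity for small values:

```python
import numpy as np

def softcap_logits(logits, cap=30.0):
    """tanh soft-capping with cap=30, per the listed parameters:
    cap * tanh(logits / cap) bounds logits in (-cap, cap)."""
    return cap * np.tanh(logits / cap)
```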
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"wallclock_based":true}
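A wallclock-based warmdown holds the learning rate and then decays it linearly to zero over the final fraction of the time budget, keyed to elapsed wall time rather than step count. The decay fraction and linear shape below are assumptions; only "warmdown, wallclock-based" comes from the card.

```python
def warmdown_lr(elapsed_s, total_s, base_lr, warmdown_frac=0.4):
    """Sketch of a wallclock-based warmdown schedule: constant base_lr until
    (1 - warmdown_frac) of the time budget has passed, then linear decay to 0.
    warmdown_frac=0.4 is a hypothetical placeholder."""
    start = (1.0 - warmdown_frac) * total_s
    if elapsed_s <= start:
        return base_lr
    frac_left = max(0.0, (total_s - elapsed_s) / (total_s - start))
    return base_lr * frac_left
```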

Novel Contributions

  • U-Net skip connections in a Transformer language model
  • LeakyReLU squared activation
  • Muon optimizer with Newton-Schulz orthogonalization
  • Int8 per-row QAT with straight-through estimator
  • EMA weight averaging
  • zlib-compressed serialized checkpoint
  • Training and evaluation on a single NVIDIA DGX Spark consumer Blackwell GPU