PR #1606
Non-Record v2: 7L UNet + Int8 QAT + EMA + Long Train — 1.3969 BPB (DGX Spark)
by AlirezaAlampour
val_bpb: 1.3969
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.5 MB
Training Techniques
Architecture
U-Net skip connections
U-Net style encoder/decoder skip connections with learned per-block residual mixing.
parameters: {"layers":7,"d":512,"heads":8,"kv_heads":4,"mlp_mult":4}
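The PR does not show the forward pass, but a 7-layer U-Net stack with learned per-block residual mixing can be sketched as below. The 3-encoder / 1-bottleneck / 3-decoder split and the scalar mixing form are assumptions for illustration; `mix` stands in for the learned mixing parameters.

```python
import numpy as np

def unet_transformer_forward(x, blocks, mix):
    """Sketch of a 7-block U-Net style transformer stack: 3 encoder blocks,
    1 bottleneck, 3 decoder blocks. Decoder block i mixes in the matching
    encoder activation with a learned scalar mix[i] (residual mixing)."""
    skips = []
    for blk in blocks[:3]:                    # encoder half: save activations
        x = blk(x)
        skips.append(x)
    x = blocks[3](x)                          # bottleneck
    for i, blk in enumerate(blocks[4:]):      # decoder half: reuse skips in reverse
        x = x + mix[i] * skips[-(i + 1)]      # learned per-block residual mixing
        x = blk(x)
    return x

# toy usage: stand-in blocks acting on a (seq, d) activation
d = 8
blocks = [lambda h: np.tanh(h) for _ in range(7)]
mix = np.full(3, 0.5)                         # hypothetical learned mix weights
out = unet_transformer_forward(np.ones((4, d)), blocks, mix)
print(out.shape)  # (4, 8)
```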
weight tying
Tied embeddings / tied output projection.
parameters: null
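Weight tying means the output projection reuses the token embedding matrix, so the logit layer adds no parameters of its own; a minimal sketch (shapes are illustrative):

```python
import numpy as np

# Tied embeddings: the logit projection is the embedding matrix transposed,
# so embedding and unembedding share one parameter tensor.
vocab, d = 100, 16
emb = np.random.default_rng(4).normal(size=(vocab, d))   # token embeddings
h = np.random.default_rng(5).normal(size=(3, d))         # final hidden states
logits = h @ emb.T                                       # shares emb's weights
print(logits.shape)  # (3, 100)
```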
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
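With 8 query heads and 4 KV heads, each KV head is shared by two query heads, halving the KV cache. A minimal numpy sketch of the sharing (shapes and the repeat-based broadcast are illustrative, not the PR's code):

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_heads, T, dh); k, v: (n_kv_heads, T, dh). Each KV head serves
    n_heads // n_kv_heads query heads, shrinking the KV cache."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)           # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))               # 8 query heads (per the config)
k = rng.normal(size=(4, 5, 64))               # 4 KV heads
out = gqa_attention(q, k, rng.normal(size=(4, 5, 64)))
print(out.shape)  # (8, 5, 64)
```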
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"negative_slope":0.5,"squared":true}
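By analogy with the ReLU² MLP activation, "LeakyReLU squared" presumably applies a LeakyReLU with slope 0.5 and squares the result; whether the negative branch's sign is preserved is not specified in the PR, so this sketch squares plainly:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Squared LeakyReLU: LeakyReLU with slope 0.5, then square.
    (Sign handling on the negative branch is an assumption here.)"""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 3.0])))  # [1. 0. 9.]
```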
Quantization
int8 QAT
bits: 8
scope: all weight matrices
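Quantization-aware training inserts a quantize/dequantize roundtrip into the forward pass so the network learns to tolerate int8 rounding. A minimal symmetric per-tensor sketch (the PR's exact scheme, granularity, and straight-through backward are not shown here):

```python
import numpy as np

def fake_quant_int8(w):
    """Symmetric per-tensor int8 fake quantization: quantize to int8 and
    dequantize back, so training sees the rounding error. A real QAT setup
    would use a straight-through estimator in the backward pass."""
    scale = np.abs(w).max() / 127.0           # assumes w is not all-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale, q, scale

w = np.random.default_rng(1).normal(size=(4, 4)).astype(np.float32)
w_dq, q, scale = fake_quant_int8(w)

# roundtrip validation: serialized int8 + scale must reproduce w_dq exactly
assert np.array_equal(w_dq, q.astype(np.float32) * scale)
print(np.abs(w - w_dq).max())  # bounded by about half a quantization step
```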
Weight Averaging
EMA
parameters: {"decay":0.997}
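EMA with decay 0.997 keeps a slow-moving average of the weights; the averaged copy, not the raw weights, is serialized into the final checkpoint. A minimal sketch of the update:

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average of weights (decay 0.997 per the PR):
    ema <- decay * ema + (1 - decay) * weights, applied each step."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

# toy usage: averaging a single scalar "weight" held at 1.0
ema = [0.0]
for step in range(1000):
    ema = ema_update(ema, [1.0])
print(round(ema[0], 3))  # -> 0.95, slowly approaching 1.0
```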
Optimizer
Muon
weight_decay: null
momentum: 0.9382982028913158
other_params: {"Newton_Schulz_orthogonalization":true,"adam_for_embeddings":true}
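Muon orthogonalizes each 2D weight update with a Newton-Schulz iteration (embeddings fall back to Adam per the config). A sketch of the orthogonalization step, using the quintic coefficients from the public Muon implementation; the momentum accumulation and the optimizer loop itself are omitted:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Newton-Schulz iteration used by Muon to approximately orthogonalize
    a 2D update matrix: pushes all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315         # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)        # Frobenius-normalize first
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x   # a*X + b*(XX^T)X + c*(XX^T)^2 X
    return x

g = np.random.default_rng(2).normal(size=(6, 6))
o = newton_schulz_orthogonalize(g)
# singular values of o cluster near 1, i.e. o is close to orthogonal
print(np.round(np.linalg.svd(o, compute_uv=False), 2))
```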
Compression
zlib
level: null
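The PR does not specify the zlib level, but a small-artifact pipeline presumably compresses the serialized int8 tensors with the stdlib `zlib` module; a hypothetical sketch of the lossless roundtrip:

```python
import zlib
import numpy as np

# Hypothetical serialization path: int8 weight bytes packed with zlib.
q = np.clip(np.round(np.random.default_rng(3).normal(size=(512, 512)) * 40),
            -127, 127).astype(np.int8)
raw = q.tobytes()
packed = zlib.compress(raw)                   # level left at the default
assert zlib.decompress(packed) == raw         # lossless roundtrip
print(len(raw), len(packed))
```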
Regularization
logit softcap
parameters: {"cap":30}
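Logit softcapping smoothly bounds logits to (-cap, cap) via `cap * tanh(logits / cap)`, regularizing against extreme confident predictions while staying near-identity for small logits. A minimal sketch with the PR's cap of 30:

```python
import numpy as np

def softcap_logits(logits, cap=30.0):
    """Soft logit cap: cap * tanh(logits / cap). Near-identity for
    |logits| << cap, saturating smoothly toward +/- cap."""
    return cap * np.tanh(logits / cap)

print(softcap_logits(np.array([1.0, 100.0, -500.0])))  # ~[1.0, 29.9, -30.0]
```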
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":1558}
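A "warmdown" schedule holds the learning rate flat and then decays it linearly to zero over the final `warmdown_iters` steps (1558 per the PR). The total step count and base LR below are placeholders, not values from the PR:

```python
def warmdown_lr(step, total_steps, warmdown_iters=1558, base_lr=1.0):
    """Trapezoidal warmdown: constant base_lr, then a linear ramp to 0
    over the last warmdown_iters steps."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters

total = 3000   # hypothetical total step count for illustration
print(warmdown_lr(0, total), warmdown_lr(total, total))  # 1.0 0.0
```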
Novel Contributions
- 7-layer U-Net Transformer with learned skip connections and residual mixing
- Int8 quantization-aware training with roundtrip validation
- EMA weight averaging for final checkpoint serialization
- Longer 4-hour training budget to reach ~1000 steps on DGX Spark
- Depth-for-width tradeoff: fewer layers with a 4x-wider MLP for better low-step performance
- Cross-seed validation showing stable performance across three seeds