PR #1606
Non-Record v2: 7L UNet + Int8 QAT + EMA + Long Train — 1.3969 BPB (DGX Spark)
by AlirezaAlampour
val_bpb: 1.3969
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.5 MB
Training Techniques
Architecture
U-Net skip connections
U-Net style encoder/decoder skip connections with learned per-block residual mixing.
parameters: {"layers":7,"d":512,"heads":8,"kv_heads":4,"mlp_mult":4}
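The PR does not show the forward pass, but a 7-layer U-Net stack with learned per-block residual mixing can be sketched as below. The 3-encoder / 1-bottleneck / 3-decoder split and the scalar mixing form are assumptions for illustration; `mix` stands in for the learned mixing parameters.

```python
import numpy as np

def unet_transformer_forward(x, blocks, mix):
    """Sketch of a 7-block U-Net style transformer stack: 3 encoder blocks,
    1 bottleneck, 3 decoder blocks. Decoder block i mixes in the matching
    encoder activation with a learned scalar mix[i] (residual mixing)."""
    skips = []
    for blk in blocks[:3]:                    # encoder half: save activations
        x = blk(x)
        skips.append(x)
    x = blocks[3](x)                          # bottleneck
    for i, blk in enumerate(blocks[4:]):      # decoder half: reuse skips in reverse
        x = x + mix[i] * skips[-(i + 1)]      # learned per-block residual mixing
        x = blk(x)
    return x

# toy usage: stand-in blocks acting on a (seq, d) activation
d = 8
blocks = [lambda h: np.tanh(h) for _ in range(7)]
mix = np.full(3, 0.5)                         # hypothetical learned mix weights
out = unet_transformer_forward(np.ones((4, d)), blocks, mix)
print(out.shape)  # (4, 8)
```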
weight tying
Tied embeddings / tied output projection.
parameters: null
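Weight tying means the output projection reuses the token embedding matrix, so the logit layer adds no parameters of its own; a minimal sketch (shapes are illustrative):

```python
import numpy as np

# Tied embeddings: the logit projection is the embedding matrix transposed,
# so embedding and unembedding share one parameter tensor.
vocab, d = 100, 16
emb = np.random.default_rng(4).normal(size=(vocab, d))   # token embeddings
h = np.random.default_rng(5).normal(size=(3, d))         # final hidden states
logits = h @ emb.T                                       # shares emb's weights
print(logits.shape)  # (3, 100)
```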
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
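With 8 query heads and 4 KV heads, each KV head is shared by two query heads, halving the KV cache. A minimal numpy sketch of the sharing (shapes and the repeat-based broadcast are illustrative, not the PR's code):

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_heads, T, dh); k, v: (n_kv_heads, T, dh). Each KV head serves
    n_heads // n_kv_heads query heads, shrinking the KV cache."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)           # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))               # 8 query heads (per the config)
k = rng.normal(size=(4, 5, 64))               # 4 KV heads
out = gqa_attention(q, k, rng.normal(size=(4, 5, 64)))
print(out.shape)  # (8, 5, 64)
```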
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"negative_slope":0.5,"squared":true}
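By analogy with the ReLU² MLP activation, "LeakyReLU squared" presumably applies a LeakyReLU with slope 0.5 and squares the result; whether the negative branch's sign is preserved is not specified in the PR, so this sketch squares plainly:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Squared LeakyReLU: LeakyReLU with slope 0.5, then square.
    (Sign handling on the negative branch is an assumption here.)"""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 3.0])))  # [1. 0. 9.]
```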
Quantization
int8 QAT
bits: 8
scope: all weight matrices
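Quantization-aware training inserts a quantize/dequantize roundtrip into the forward pass so the network learns to tolerate int8 rounding. A minimal symmetric per-tensor sketch (the PR's exact scheme, granularity, and straight-through backward are not shown here):

```python
import numpy as np

def fake_quant_int8(w):
    """Symmetric per-tensor int8 fake quantization: quantize to int8 and
    dequantize back, so training sees the rounding error. A real QAT setup
    would use a straight-through estimator in the backward pass."""
    scale = np.abs(w).max() / 127.0           # assumes w is not all-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale, q, scale

w = np.random.default_rng(1).normal(size=(4, 4)).astype(np.float32)
w_dq, q, scale = fake_quant_int8(w)

# roundtrip validation: serialized int8 + scale must reproduce w_dq exactly
assert np.array_equal(w_dq, q.astype(np.float32) * scale)
print(np.abs(w - w_dq).max())  # bounded by about half a quantization step
```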
Weight Averaging
EMA
parameters: {"decay":0.997}
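EMA with decay 0.997 keeps a slow-moving average of the weights; the averaged copy, not the raw weights, is serialized into the final checkpoint. A minimal sketch of the update:

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average of weights (decay 0.997 per the PR):
    ema <- decay * ema + (1 - decay) * weights, applied each step."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

# toy usage: averaging a single scalar "weight" held at 1.0
ema = [0.0]
for step in range(1000):
    ema = ema_update(ema, [1.0])
print(round(ema[0], 3))  # -> 0.95, slowly approaching 1.0
```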
Optimizer
Muon
weight_decay: null
momentum: 0.9382982028913158
other_params: {"Newton_Schulz_orthogonalization":true,"adam_for_embeddings":true}
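Muon orthogonalizes each 2D weight update with a Newton-Schulz iteration (embeddings fall back to Adam per the config). A sketch of the orthogonalization step, using the quintic coefficients from the public Muon implementation; the momentum accumulation and the optimizer loop itself are omitted:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Newton-Schulz iteration used by Muon to approximately orthogonalize
    a 2D update matrix: pushes all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315         # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)        # Frobenius-normalize first
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x   # a*X + b*(XX^T)X + c*(XX^T)^2 X
    return x

g = np.random.default_rng(2).normal(size=(6, 6))
o = newton_schulz_orthogonalize(g)
# singular values of o cluster near 1, i.e. o is close to orthogonal
print(np.round(np.linalg.svd(o, compute_uv=False), 2))
```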
Compression
zlib
level: null
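The PR does not specify the zlib level, but a small-artifact pipeline presumably compresses the serialized int8 tensors with the stdlib `zlib` module; a hypothetical sketch of the lossless roundtrip:

```python
import zlib
import numpy as np

# Hypothetical serialization path: int8 weight bytes packed with zlib.
q = np.clip(np.round(np.random.default_rng(3).normal(size=(512, 512)) * 40),
            -127, 127).astype(np.int8)
raw = q.tobytes()
packed = zlib.compress(raw)                   # level left at the default
assert zlib.decompress(packed) == raw         # lossless roundtrip
print(len(raw), len(packed))
```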
Regularization
logit softcap
parameters: {"cap":30}
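Logit softcapping smoothly bounds logits to (-cap, cap) via `cap * tanh(logits / cap)`, regularizing against extreme confident predictions while staying near-identity for small logits. A minimal sketch with the PR's cap of 30:

```python
import numpy as np

def softcap_logits(logits, cap=30.0):
    """Soft logit cap: cap * tanh(logits / cap). Near-identity for
    |logits| << cap, saturating smoothly toward +/- cap."""
    return cap * np.tanh(logits / cap)

print(softcap_logits(np.array([1.0, 100.0, -500.0])))  # ~[1.0, 29.9, -30.0]
```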
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":1558}
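A "warmdown" schedule holds the learning rate flat and then decays it linearly to zero over the final `warmdown_iters` steps (1558 per the PR). The total step count and base LR below are placeholders, not values from the PR:

```python
def warmdown_lr(step, total_steps, warmdown_iters=1558, base_lr=1.0):
    """Trapezoidal warmdown: constant base_lr, then a linear ramp to 0
    over the last warmdown_iters steps."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters

total = 3000   # hypothetical total step count for illustration
print(warmdown_lr(0, total), warmdown_lr(total, total))  # 1.0 0.0
```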
Novel Contributions
- 7-layer U-Net Transformer with learned skip connections and residual mixing
- Int8 quantization-aware training with roundtrip validation
- EMA weight averaging for final checkpoint serialization
- Longer 4-hour training budget to reach ~1000 steps on DGX Spark
- Depth-for-width tradeoff: fewer layers with a 4x-wider MLP for better low-step performance
- Cross-seed validation showing stable performance across three seeds