PR #2016

Status: open

NorMuon + Deeper U-Net with INT6 Fake Quantization

by sea-rod
val_bpb: 1.2302
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15,388,319 B

Training Techniques

Optimizer
NorMuon
weight_decay: 0.1
momentum: null
other_params: {"beta2":0.95}
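
For orientation, below is a minimal PyTorch sketch of a NorMuon-style update with decoupled weight decay. It illustrates the idea only (Muon's orthogonalized momentum update, rescaled per output neuron by a second-moment estimate with beta2 = 0.95); the newton_schulz coefficients, default learning rate, and the normuon_step signature are assumptions, not this PR's actual code.

    import torch

    def newton_schulz(G, steps=5, eps=1e-7):
        # Approximately orthogonalize a 2D update, as in Muon.
        # Coefficients are the commonly used quintic iteration.
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + eps)
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * (A @ A)) @ X
        return X.T if transposed else X

    @torch.no_grad()
    def normuon_step(p, buf, v, lr=0.02, beta1=0.95, beta2=0.95,
                     weight_decay=0.1, eps=1e-8):
        # p: 2D weight; buf: momentum buffer shaped like p;
        # v: per-neuron second-moment buffer shaped (p.size(0), 1).
        buf.mul_(beta1).add_(p.grad)              # momentum accumulation
        u = newton_schulz(buf)                    # orthogonalized update
        # Track a second moment per output neuron (per row) ...
        v.mul_(beta2).add_(u.pow(2).mean(dim=1, keepdim=True), alpha=1 - beta2)
        u = u / (v.sqrt() + eps)                  # ... and normalize rows by it
        p.mul_(1 - lr * weight_decay)             # decoupled weight decay
        p.add_(u, alpha=-lr)

The row-wise normalization is what distinguishes NorMuon from plain Muon here: after orthogonalization, each output neuron's update is rescaled so no single neuron dominates the step.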
Quantization
STE QAT
bits: 6
scope: attention and MLP activations
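
A straight-through-estimator fake quantizer of the kind described (symmetric INT6, applied to activations during training) can be sketched as follows; the class and function names are illustrative, not taken from the PR.

    import torch

    class FakeQuantSTE(torch.autograd.Function):
        # Symmetric per-tensor fake quantization with a straight-through
        # estimator: round to the INT grid in the forward pass, pass
        # gradients through unchanged in the backward pass.
        @staticmethod
        def forward(ctx, x, bits=6):
            qmax = 2 ** (bits - 1) - 1                       # 31 for INT6
            scale = x.detach().abs().max().clamp(min=1e-8) / qmax
            return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None                         # STE: identity grad

    def fake_quant(x, bits=6):
        return FakeQuantSTE.apply(x, bits)

In use, such a quantizer would wrap the attention and MLP activations, e.g. h = fake_quant(mlp(x), bits=6), so the forward pass sees INT6-rounded values while gradients flow through as if no quantization had occurred.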
Architecture
U-Net skip connections
Increased model depth and added more skip connections in the U-Net-style architecture.
parameters: {"layers":12,"encoder_layers":6,"decoder_layers":6,"skip_connections":6}
Regularization
weight decay
parameters: {"weight_decay":0.1}

Novel Contributions

  • Replaced Muon with NorMuon to balance per-neuron update magnitudes
  • Added decoupled weight decay inside the optimizer
  • Lowered q_gain initialization from 1.5 to 1.0 (see the sketch after this list)
  • Applied INT6 fake quantization to attention and MLP activations during training
  • Increased U-Net depth from 9 to 12 layers with more skip connections
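
The q_gain change is a one-line initialization tweak. The sketch below shows one plausible reading, in which q_gain is a learnable scalar multiplying the query projection; the module and attribute names are hypothetical, inferred only from the parameter name, and the repo's actual placement of the gain may differ.

    import torch
    import torch.nn as nn

    class GainedQueryProj(nn.Module):
        # Hypothetical query projection with a learnable scalar gain.
        def __init__(self, dim, q_gain_init=1.0):   # was 1.5 before this PR
            super().__init__()
            self.q_proj = nn.Linear(dim, dim, bias=False)
            self.q_gain = nn.Parameter(torch.tensor(q_gain_init))

        def forward(self, x):
            return self.q_gain * self.q_proj(x)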