PR #2016

Status: open

NorMuon + Deeper U-Net with INT6 Fake Quantization

by sea-rod
val_bpb: 1.2302
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15,388,319 B

Training Techniques

Optimizer
NorMuon
weight_decay: 0.1
momentum: null
other_params: {"beta2":0.95}
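
For orientation, below is a minimal PyTorch sketch of a NorMuon-style update with decoupled weight decay. It illustrates the idea only (Muon's orthogonalized momentum update, rescaled per output neuron by a second-moment estimate with beta2 = 0.95); the newton_schulz coefficients, default learning rate, and the normuon_step signature are assumptions, not this PR's actual code.

    import torch

    def newton_schulz(G, steps=5, eps=1e-7):
        # Approximately orthogonalize a 2D update, as in Muon.
        # Coefficients are the commonly used quintic iteration.
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + eps)
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * (A @ A)) @ X
        return X.T if transposed else X

    @torch.no_grad()
    def normuon_step(p, buf, v, lr=0.02, beta1=0.95, beta2=0.95,
                     weight_decay=0.1, eps=1e-8):
        # p: 2D weight; buf: momentum buffer shaped like p;
        # v: per-neuron second-moment buffer shaped (p.size(0), 1).
        buf.mul_(beta1).add_(p.grad)              # momentum accumulation
        u = newton_schulz(buf)                    # orthogonalized update
        # Track a second moment per output neuron (per row) ...
        v.mul_(beta2).add_(u.pow(2).mean(dim=1, keepdim=True), alpha=1 - beta2)
        u = u / (v.sqrt() + eps)                  # ... and normalize rows by it
        p.mul_(1 - lr * weight_decay)             # decoupled weight decay
        p.add_(u, alpha=-lr)

The row-wise normalization is what distinguishes NorMuon from plain Muon here: after orthogonalization, each output neuron's update is rescaled so no single neuron dominates the step.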
Quantization
STE QAT
bits: 6
scope: attention and MLP activations
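
A straight-through-estimator fake quantizer of the kind described (symmetric INT6, applied to activations during training) can be sketched as follows; the class and function names are illustrative, not taken from the PR.

    import torch

    class FakeQuantSTE(torch.autograd.Function):
        # Symmetric per-tensor fake quantization with a straight-through
        # estimator: round to the INT grid in the forward pass, pass
        # gradients through unchanged in the backward pass.
        @staticmethod
        def forward(ctx, x, bits=6):
            qmax = 2 ** (bits - 1) - 1                       # 31 for INT6
            scale = x.detach().abs().max().clamp(min=1e-8) / qmax
            return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None                         # STE: identity grad

    def fake_quant(x, bits=6):
        return FakeQuantSTE.apply(x, bits)

In use, such a quantizer would wrap the attention and MLP activations, e.g. h = fake_quant(mlp(x), bits=6), so the forward pass sees INT6-rounded values while gradients flow through as if no quantization had occurred.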
Architecture
U-Net skip connections
Increased model depth and added more skip connections in the U-Net-style architecture.
parameters: {"layers":12,"encoder_layers":6,"decoder_layers":6,"skip_connections":6}
Regularization
weight decay
parameters: {"weight_decay":0.1}

Novel Contributions

  • Replaced Muon with NorMuon to balance per-neuron update magnitudes
  • Added decoupled weight decay inside the optimizer
  • Lowered q_gain initialization from 1.5 to 1.0 (see the sketch after this list)
  • Applied INT6 fake quantization to attention and MLP activations during training
  • Increased U-Net depth from 9 to 12 layers with more skip connections
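
The q_gain change is a one-line initialization tweak. The sketch below shows one plausible reading, in which q_gain is a learnable scalar multiplying the query projection; the module and attribute names are hypothetical, inferred only from the parameter name, and the repo's actual placement of the gain may differ.

    import torch
    import torch.nn as nn

    class GainedQueryProj(nn.Module):
        # Hypothetical query projection with a learnable scalar gain.
        def __init__(self, dim, q_gain_init=1.0):   # was 1.5 before this PR
            super().__init__()
            self.q_proj = nn.Linear(dim, dim, bias=False)
            self.q_gain = nn.Parameter(torch.tensor(q_gain_init))

        def forward(self, x):
            return self.q_gain * self.q_proj(x)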