PR #510

open

Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970)

by SelfAnushView on GitHub
val_bpb
1.1989
Architecture
Transformer
Optimizer
MUD
Artifact Size
15.9 MB

Training Techniques

Quantization
mixed int5/int6
bits: null
scope: MLP weights (int5), attention weights (int6)
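A minimal fake-quantization sketch of the mixed int5/int6 scheme. The card only names the bit widths and scopes; symmetric per-tensor scaling below is an assumption:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Fake-quantize a weight tensor to a signed `bits`-bit grid.
    Symmetric per-tensor scaling is an assumption; the PR only states
    int5 for MLP weights and int6 for attention weights."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q5, s5 = quantize_symmetric(w, bits=5)   # MLP weights: int5
q6, s6 = quantize_symmetric(w, bits=6)   # attention weights: int6
```

With symmetric scaling the round-trip error is bounded by half a quantization step per element.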
Architecture
SmearGate + BigramHash
SmearGate and a BigramHash(10240, dim=128) hashed-bigram embedding added to the base transformer
parameters: {"BigramHash_size":10240,"BigramHash_dim":128,"layers":10,"hidden_dim":1536,"heads":8,"KV_heads":4}
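One plausible reading of BigramHash(10240, dim=128) is a hashed bigram-embedding table: each (previous token, current token) pair is hashed into a fixed-size table. The mixing constant and sentinel token below are illustrative, not from the PR:

```python
import numpy as np

TABLE_SIZE, DIM = 10240, 128   # BigramHash(10240, dim=128) from the card

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, (TABLE_SIZE, DIM)).astype(np.float32)

def bigram_hash(prev_tok, tok):
    # Mixing constant is illustrative; the PR does not specify the hash.
    return (prev_tok * 1000003 + tok) % TABLE_SIZE

def bigram_features(tokens):
    """Look up one hashed-bigram embedding per position (position 0
    pairs with a sentinel previous token, an assumption here)."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]           # shape (len(tokens), DIM)

feats = bigram_features([5, 17, 17, 42])
```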
MLP3x
MLP with 3x expansion and relu² activation
parameters: {"expansion_factor":3,"activation":"relu²"}
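The MLP3x block can be sketched directly from the listed parameters (expansion_factor=3, relu² activation, hidden_dim=1536); biases are omitted as an assumption:

```python
import numpy as np

def relu2(x):
    # relu² activation: square of ReLU, as listed in the card.
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """MLP with 3x expansion: the inner width is 3 * d_model
    (expansion_factor=3), with relu² between the two projections."""
    return relu2(x @ w_in) @ w_out

d = 1536                        # hidden_dim from the card
rng = np.random.default_rng(0)
w_in = rng.normal(0, 0.02, (d, 3 * d)).astype(np.float32)
w_out = rng.normal(0, 0.02, (3 * d, d)).astype(np.float32)
y = mlp3x(rng.normal(size=(2, d)).astype(np.float32), w_in, w_out)
```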
weight tying
Tied embeddings
parameters: null
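Tied embeddings reuse one matrix for both input embedding and output unembedding, a standard trick that roughly halves embedding parameters; a minimal sketch:

```python
import numpy as np

class TiedLM:
    """Tied embeddings: the unembedding projection reuses the input
    embedding matrix (standard weight tying; dims are illustrative)."""
    def __init__(self, vocab, dim, rng):
        self.embed = rng.normal(0, 0.02, (vocab, dim)).astype(np.float32)

    def logits(self, hidden):
        # Unembed with the transpose of the same matrix used to embed.
        return hidden @ self.embed.T

lm = TiedLM(vocab=256, dim=32, rng=np.random.default_rng(0))
out = lm.logits(np.zeros((1, 32), dtype=np.float32))
```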
U-Net skip connections
U-Net-style skip connections across the transformer stack
parameters: null
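A sketch of U-Net-style skips over a transformer stack: outputs from the first half of the blocks are saved and added back at the mirrored blocks in the second half. Additive (rather than concatenate-and-project) skips are an assumption; the card only says "U-Net style":

```python
import numpy as np

def forward_with_unet_skips(x, blocks):
    """Run blocks in order; save first-half outputs and add each back
    to the input of its mirror block (block i pairs with block n-1-i)."""
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i >= n // 2 and saved:
            x = x + saved.pop()        # long-range residual from the mirror
        x = block(x)
        if i < n // 2:
            saved.append(x)
    return x

# Toy "blocks" (block k adds k); 10 layers, matching the card.
blocks = [lambda x, k=k: x + k for k in range(10)]
y = forward_with_unet_skips(np.zeros(4), blocks)
```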
Optimizer
MUD
weight_decay: null
momentum: null
other_params: {"mud_whiten_replaces":"zeropower_via_newtonschulz5","passes":1,"eps":1e-7}
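One plausible reading of "triangular Gram preconditioning" with the listed settings (passes=1, eps=1e-7, replacing zeropower_via_newtonschulz5): form the gradient's Gram matrix, Cholesky-factor it, and whiten via a triangular solve so the result has near-orthonormal columns. This is a sketch under that reading, not the paper's exact Algorithm 2:

```python
import numpy as np

def whiten_triangular(G, eps=1e-7, passes=1):
    """Whiten a tall gradient matrix G via its Gram matrix:
    A = G.T @ G + eps*I = L @ L.T, then G <- G @ inv(L).T, so that
    (G L^{-T}).T (G L^{-T}) = L^{-1} A L^{-T} ~ I."""
    for _ in range(passes):
        A = G.T @ G + eps * np.eye(G.shape[1])   # regularized Gram matrix
        L = np.linalg.cholesky(A)                # lower-triangular factor
        G = np.linalg.solve(L, G.T).T            # triangular solve: G @ L^{-T}
    return G

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 16))                    # toy per-layer gradient
W = whiten_triangular(G)
```

Compared with the 5-step Newton-Schulz polynomial, this needs one Gram product, one factorization, and one solve per pass.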
Weight Averaging
SWA
parameters: {"start_frac":0.4}
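SWA with start_frac=0.4 averages checkpoints from 40% of training onward; equal weighting of the tail checkpoints is an assumption. A minimal sketch:

```python
def swa_average(checkpoints, start_frac=0.4):
    """Stochastic Weight Averaging sketch: average checkpoints from
    start_frac of training onward (start_frac=0.4 per the card).
    Checkpoints are dicts of parameter name -> value; equal weighting
    is an assumption."""
    start = int(len(checkpoints) * start_frac)
    tail = checkpoints[start:]
    avg = {}
    for name in tail[0]:
        avg[name] = sum(ckpt[name] for ckpt in tail) / len(tail)
    return avg

ckpts = [{"w": float(step)} for step in range(10)]
avg = swa_average(ckpts)      # averages checkpoints 4..9
```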
Evaluation
sliding window eval
parameters: {"stride":64}
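Sliding-window evaluation with stride=64 typically scores each token with (near-)maximal left context by advancing a full-length window 64 tokens at a time and counting loss only on the newly exposed positions; the exact windowing here is an assumption:

```python
def sliding_windows(tokens, context_len, stride=64):
    """Yield (window, n_scored) pairs: each window holds up to
    `context_len` tokens, and only the final `n_scored` positions
    (at most `stride`, per the card) contribute to the loss."""
    out = []
    pos = 0
    while pos < len(tokens):
        start = max(0, pos + stride - context_len)
        window = tokens[start:pos + stride]
        n_scored = min(stride, len(tokens) - pos)
        out.append((window, n_scored))
        pos += stride
    return out

# 200 tokens, a 128-token context (illustrative), stride 64.
wins = sliding_windows(list(range(200)), context_len=128, stride=64)
```

Every token is scored exactly once, while later windows reuse up to context_len - stride tokens of context.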

Novel Contributions

  • Replacing Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram preconditioning (Algorithm 2 from arxiv:2603.17970).
  • MUD is roughly 12x cheaper in FLOPs per step than Muon5, replacing Newton-Schulz's repeated Gram-matrix products and polynomial updates with a single factorization and triangular solve.
  • Demonstrates strong convergence in fewer steps, but per-step throughput on H100 GPUs is lower due to CUDA kernel inefficiencies.
  • Keeps all other training components identical to the Muon SOTA baseline for a direct comparison.
  • Detailed analysis of throughput differences across GPU architectures (A100/MI250/GH200 vs H100).
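For contrast, a sketch of the 5-step Newton-Schulz baseline that MUD replaces (the card's zeropower_via_newtonschulz5). The quintic coefficients follow common Muon implementations and should be treated as illustrative here:

```python
import numpy as np

def newtonschulz5(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration driving G toward an orthogonal
    matrix: each step costs two Gram products plus polynomial updates,
    versus MUD's single factorization and triangular solve."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from common variants
    X = G / (np.linalg.norm(G) + eps)   # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
W = newtonschulz5(rng.normal(size=(64, 16)))
```

After 5 steps the singular values are pushed toward 1, but only approximately, which is why the polynomial variant tolerates low-precision matmuls while the triangular solve gives exact whitening per pass.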