PR #510

open

Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970)

by SelfAnushView on GitHub
val_bpb
1.1989
Architecture
Transformer
Optimizer
MUD
Artifact Size
15.9 MB

Training Techniques

Quantization
mixed int5/int6
bits: null
scope: MLP weights (int5), attention weights (int6)
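A minimal fake-quantization sketch of the mixed int5/int6 scheme. The card only names the bit widths and scopes; symmetric per-tensor scaling below is an assumption:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Fake-quantize a weight tensor to a signed `bits`-bit grid.
    Symmetric per-tensor scaling is an assumption; the PR only states
    int5 for MLP weights and int6 for attention weights."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q5, s5 = quantize_symmetric(w, bits=5)   # MLP weights: int5
q6, s6 = quantize_symmetric(w, bits=6)   # attention weights: int6
```

With symmetric scaling the round-trip error is bounded by half a quantization step per element.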
Architecture
SmearGate + BigramHash
SmearGate and a BigramHash(10240, dim=128) hashed-bigram embedding added to the base transformer
parameters: {"BigramHash_size":10240,"BigramHash_dim":128,"layers":10,"hidden_dim":1536,"heads":8,"KV_heads":4}
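One plausible reading of BigramHash(10240, dim=128) is a hashed bigram-embedding table: each (previous token, current token) pair is hashed into a fixed-size table. The mixing constant and sentinel token below are illustrative, not from the PR:

```python
import numpy as np

TABLE_SIZE, DIM = 10240, 128   # BigramHash(10240, dim=128) from the card

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, (TABLE_SIZE, DIM)).astype(np.float32)

def bigram_hash(prev_tok, tok):
    # Mixing constant is illustrative; the PR does not specify the hash.
    return (prev_tok * 1000003 + tok) % TABLE_SIZE

def bigram_features(tokens):
    """Look up one hashed-bigram embedding per position (position 0
    pairs with a sentinel previous token, an assumption here)."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]           # shape (len(tokens), DIM)

feats = bigram_features([5, 17, 17, 42])
```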
MLP3x
MLP with 3x expansion and relu² activation
parameters: {"expansion_factor":3,"activation":"relu²"}
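The MLP3x block can be sketched directly from the listed parameters (expansion_factor=3, relu² activation, hidden_dim=1536); biases are omitted as an assumption:

```python
import numpy as np

def relu2(x):
    # relu² activation: square of ReLU, as listed in the card.
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """MLP with 3x expansion: the inner width is 3 * d_model
    (expansion_factor=3), with relu² between the two projections."""
    return relu2(x @ w_in) @ w_out

d = 1536                        # hidden_dim from the card
rng = np.random.default_rng(0)
w_in = rng.normal(0, 0.02, (d, 3 * d)).astype(np.float32)
w_out = rng.normal(0, 0.02, (3 * d, d)).astype(np.float32)
y = mlp3x(rng.normal(size=(2, d)).astype(np.float32), w_in, w_out)
```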
weight tying
Tied embeddings
parameters: null
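Tied embeddings reuse one matrix for both input embedding and output unembedding, a standard trick that roughly halves embedding parameters; a minimal sketch:

```python
import numpy as np

class TiedLM:
    """Tied embeddings: the unembedding projection reuses the input
    embedding matrix (standard weight tying; dims are illustrative)."""
    def __init__(self, vocab, dim, rng):
        self.embed = rng.normal(0, 0.02, (vocab, dim)).astype(np.float32)

    def logits(self, hidden):
        # Unembed with the transpose of the same matrix used to embed.
        return hidden @ self.embed.T

lm = TiedLM(vocab=256, dim=32, rng=np.random.default_rng(0))
out = lm.logits(np.zeros((1, 32), dtype=np.float32))
```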
U-Net skip connections
U-Net-style skip connections across the transformer stack
parameters: null
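A sketch of U-Net-style skips over a transformer stack: outputs from the first half of the blocks are saved and added back at the mirrored blocks in the second half. Additive (rather than concatenate-and-project) skips are an assumption; the card only says "U-Net style":

```python
import numpy as np

def forward_with_unet_skips(x, blocks):
    """Run blocks in order; save first-half outputs and add each back
    to the input of its mirror block (block i pairs with block n-1-i)."""
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i >= n // 2 and saved:
            x = x + saved.pop()        # long-range residual from the mirror
        x = block(x)
        if i < n // 2:
            saved.append(x)
    return x

# Toy "blocks" (block k adds k); 10 layers, matching the card.
blocks = [lambda x, k=k: x + k for k in range(10)]
y = forward_with_unet_skips(np.zeros(4), blocks)
```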
Optimizer
MUD
weight_decay: null
momentum: null
other_params: {"mud_whiten_replaces":"zeropower_via_newtonschulz5","passes":1,"eps":1e-7}
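One plausible reading of "triangular Gram preconditioning" with the listed settings (passes=1, eps=1e-7, replacing zeropower_via_newtonschulz5): form the gradient's Gram matrix, Cholesky-factor it, and whiten via a triangular solve so the result has near-orthonormal columns. This is a sketch under that reading, not the paper's exact Algorithm 2:

```python
import numpy as np

def whiten_triangular(G, eps=1e-7, passes=1):
    """Whiten a tall gradient matrix G via its Gram matrix:
    A = G.T @ G + eps*I = L @ L.T, then G <- G @ inv(L).T, so that
    (G L^{-T}).T (G L^{-T}) = L^{-1} A L^{-T} ~ I."""
    for _ in range(passes):
        A = G.T @ G + eps * np.eye(G.shape[1])   # regularized Gram matrix
        L = np.linalg.cholesky(A)                # lower-triangular factor
        G = np.linalg.solve(L, G.T).T            # triangular solve: G @ L^{-T}
    return G

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 16))                    # toy per-layer gradient
W = whiten_triangular(G)
```

Compared with the 5-step Newton-Schulz polynomial, this needs one Gram product, one factorization, and one solve per pass.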
Weight Averaging
SWA
parameters: {"start_frac":0.4}
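SWA with start_frac=0.4 averages checkpoints from 40% of training onward; equal weighting of the tail checkpoints is an assumption. A minimal sketch:

```python
def swa_average(checkpoints, start_frac=0.4):
    """Stochastic Weight Averaging sketch: average checkpoints from
    start_frac of training onward (start_frac=0.4 per the card).
    Checkpoints are dicts of parameter name -> value; equal weighting
    is an assumption."""
    start = int(len(checkpoints) * start_frac)
    tail = checkpoints[start:]
    avg = {}
    for name in tail[0]:
        avg[name] = sum(ckpt[name] for ckpt in tail) / len(tail)
    return avg

ckpts = [{"w": float(step)} for step in range(10)]
avg = swa_average(ckpts)      # averages checkpoints 4..9
```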
Evaluation
sliding window eval
parameters: {"stride":64}
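Sliding-window evaluation with stride=64 typically scores each token with (near-)maximal left context by advancing a full-length window 64 tokens at a time and counting loss only on the newly exposed positions; the exact windowing here is an assumption:

```python
def sliding_windows(tokens, context_len, stride=64):
    """Yield (window, n_scored) pairs: each window holds up to
    `context_len` tokens, and only the final `n_scored` positions
    (at most `stride`, per the card) contribute to the loss."""
    out = []
    pos = 0
    while pos < len(tokens):
        start = max(0, pos + stride - context_len)
        window = tokens[start:pos + stride]
        n_scored = min(stride, len(tokens) - pos)
        out.append((window, n_scored))
        pos += stride
    return out

# 200 tokens, a 128-token context (illustrative), stride 64.
wins = sliding_windows(list(range(200)), context_len=128, stride=64)
```

Every token is scored exactly once, while later windows reuse up to context_len - stride tokens of context.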

Novel Contributions

  • Replacing Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram preconditioning (Algorithm 2 from arxiv:2603.17970).
  • MUD is roughly 12x cheaper in FLOPs per step than Muon5, replacing Newton-Schulz's repeated Gram-matrix products and polynomial updates with a single factorization and triangular solve.
  • Demonstrates strong convergence in fewer steps, but per-step throughput on H100 GPUs is lower due to CUDA kernel inefficiencies.
  • Keeps all other training components identical to the Muon SOTA baseline for a direct comparison.
  • Detailed analysis of throughput differences across GPU architectures (A100/MI250/GH200 vs H100).
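For contrast, a sketch of the 5-step Newton-Schulz baseline that MUD replaces (the card's zeropower_via_newtonschulz5). The quintic coefficients follow common Muon implementations and should be treated as illustrative here:

```python
import numpy as np

def newtonschulz5(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration driving G toward an orthogonal
    matrix: each step costs two Gram products plus polynomial updates,
    versus MUD's single factorization and triangular solve."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from common variants
    X = G / (np.linalg.norm(G) + eps)   # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
W = newtonschulz5(rng.normal(size=(64, 16)))
```

After 5 steps the singular values are pushed toward 1, but only approximately, which is why the polynomial variant tolerates low-precision matmuls while the triangular solve gives exact whitening per pass.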