PR #438

open

Non-Record: Replace Muon optimizer with NorMuon for baseline (1xH100)

by stevenshinechen
val_bpb
1.3458
Architecture
Transformer
Optimizer
NorMuon
Artifact Size

Training Techniques

Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: {"beta2":0.95,"second_momentum_buffer":true,"newton_schulz":true}
Other
other
Replaces Muon with NorMuon, adding neuron-wise normalization of update magnitudes after Newton-Schulz orthogonalization and before Muon scale correction.
parameters: {"beta2":0.95}

Novel Contributions

  • Replaced the baseline's Muon optimizer with NorMuon
  • Added neuron-wise normalization of update magnitudes using a second-order momentum buffer
  • Applied the neuron-wise normalization after Newton-Schulz orthogonalization but before Muon's scale correction
  • Used an implementation adapted from the original NorMuon code, modified with float32 buffer handling and numerical stability tweaks
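
The ordering described above (orthogonalize, then normalize per neuron, then apply the scale correction) can be sketched in NumPy as follows. This is an illustrative sketch, not the PR's code. The function names, the `eps` values, the absence of bias correction, and the `max(1, rows/cols) ** 0.5` form of the scale correction are all assumptions. The Newton-Schulz coefficients are the ones commonly used in public Muon implementations. Only `beta2 = 0.95` and the float32 buffer come from the summary above.

```python
import numpy as np


def newton_schulz5(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with a quintic Newton-Schulz iteration."""
    # Coefficients commonly used in public Muon implementations (assumed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.astype(np.float32)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        # Work on the wide orientation so x @ x.T is the smaller Gram matrix.
        x = x.T
    x = x / (np.linalg.norm(x) + eps)  # Frobenius norm bounds the spectral norm by 1
    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    return x.T if transposed else x


def normuon_update(momentum_grad, second_moment, beta2=0.95, eps=1e-10):
    """Return (update, new_second_moment) for one weight matrix.

    momentum_grad: momentum-averaged gradient, shape (rows, cols)
    second_moment: float32 per-neuron buffer, shape (rows, 1)
    """
    # 1) Newton-Schulz orthogonalization, as in Muon.
    o = newton_schulz5(momentum_grad)

    # 2) Neuron-wise normalization: track the mean squared update of each
    #    output row in a float32 buffer and divide by its square root.
    row_mean_sq = np.mean(o * o, axis=1, keepdims=True, dtype=np.float32)
    second_moment = beta2 * second_moment + (1.0 - beta2) * row_mean_sq
    o = o / (np.sqrt(second_moment) + eps)

    # 3) Muon-style scale correction (assumed form), applied last.
    o = o * max(1.0, o.shape[0] / o.shape[1]) ** 0.5
    return o, second_moment
```

A caller would keep one `second_moment` buffer per weight matrix, initialized to `np.zeros((rows, 1), dtype=np.float32)`, and subtract `lr * update` from the weights. After step 2, every row of the update has the same root-mean-square magnitude on the first step, which is the neuron-wise balancing the PR describes.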