PR #438
open
Non-Record: Replace Muon optimizer with NorMuon for baseline (1xH100)
by stevenshinechen
val_bpb
1.3458
Architecture
Transformer
Optimizer
NorMuon
Artifact Size
—
Training Techniques
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: {"beta2":0.95,"second_momentum_buffer":true,"newton_schulz":true}
Other
other
Replaces Muon with NorMuon, which normalizes update magnitudes per neuron after Newton-Schulz orthogonalization and before Muon's scale correction.
parameters: {"beta2":0.95}
Novel Contributions
- Replaced Muon optimizer with NorMuon as a baseline improvement
- Added neuron-wise normalization of update magnitudes using a second-order momentum buffer
- Applied the neuron-wise normalization after Newton-Schulz orthogonalization but before Muon's scale correction
- Used a modified implementation based on the original NorMuon code with float32 buffer handling and numerical stability tweaks
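
The ordering described above (momentum, Newton-Schulz orthogonalization, neuron-wise normalization, then scale correction) can be illustrated with a minimal NumPy sketch. This is not the PR's code. The function names, the `momentum=0.95` default, the choice of rows as "neurons", the rescale back to the pre-normalization Frobenius norm, and the `max(1, rows/cols)**0.5` scale correction are all assumptions made for illustration; only `beta2=0.95` and the order of operations come from the summary.

```python
import numpy as np


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm <= 1 also bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so the Gram matrix is smaller
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def normuon_update(grad, momentum_buf, second_moment_buf,
                   momentum=0.95, beta2=0.95, eps=1e-10):
    """Hypothetical NorMuon-style update for one 2D weight; mutates both buffers in place."""
    # 1. First-moment (momentum) accumulation, as in plain Muon.
    momentum_buf *= momentum
    momentum_buf += grad

    # 2. Orthogonalize the momentum with Newton-Schulz.
    O = newton_schulz(momentum_buf)
    norm_before = np.linalg.norm(O)

    # 3. Neuron-wise normalization: track the mean squared update per row
    #    (assumption: one row per output neuron) in a second-moment buffer.
    row_ms = np.mean(O * O, axis=1, keepdims=True)
    second_moment_buf *= beta2
    second_moment_buf += (1.0 - beta2) * row_ms
    O = O / np.sqrt(np.maximum(second_moment_buf, eps))

    # Assumption: rescale so the overall update norm matches the pre-normalization norm.
    O = O * (norm_before / max(np.linalg.norm(O), eps))

    # 4. Muon-style scale correction, applied last (assumed form).
    rows, cols = O.shape
    return O * max(1.0, rows / cols) ** 0.5
```

A caller would keep `momentum_buf` (same shape as the weight) and `second_moment_buf` (shape `(rows, 1)`) as float32 arrays initialized to zero, and apply the result as `W -= lr * normuon_update(grad, momentum_buf, second_moment_buf)`. On the first step every row of the returned update has the same norm, which is the intended effect of the per-neuron normalization.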