PR #438

open

Non-Record: Replace Muon optimizer with NorMuon for baseline (1xH100)

by stevenshinechen

val_bpb

1.3458

Architecture

Transformer

Optimizer

NorMuon

Artifact Size

—

Training Techniques

Optimizer

NorMuon

weight_decay: null

momentum: null

other_params: {"beta2":0.95,"second_momentum_buffer":true,"newton_schulz":true}

Other

other

Replaces Muon with NorMuon, adding neuron-wise normalization of update magnitudes after Newton-Schulz orthogonalization and before Muon scale correction.

parameters: {"beta2":0.95}

Novel Contributions

Replaced Muon optimizer with NorMuon as a baseline improvement
Added neuron-wise normalization of update magnitudes using a second-order momentum buffer
Applied the NorMuon normalization after Newton-Schulz orthogonalization but before the Muon scale correction
Used a modified implementation based on the original NorMuon code, with float32 buffer handling and numerical stability tweaks
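
The ordering described above (momentum, Newton-Schulz orthogonalization, neuron-wise normalization, then Muon scale correction) can be sketched roughly as follows. This is a minimal NumPy illustration, not the PR's code: only beta2 = 0.95 and the float32 second-momentum buffer come from the summary. The function names, the heavy-ball momentum form, beta1 = 0.95, the eps values, the norm-preserving rescale, and the `max(1, rows/cols) ** 0.5` form of the scale correction are all assumptions made for illustration.

```python
import numpy as np

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.astype(np.float32)
    X = X / (np.linalg.norm(X) + eps)       # bound the spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def init_state(shape):
    """Hypothetical per-parameter state; both buffers are kept in float32."""
    return {
        "momentum": np.zeros(shape, dtype=np.float32),
        "second_momentum_buffer": np.zeros((shape[0], 1), dtype=np.float32),
    }

def normuon_update(grad, state, beta1=0.95, beta2=0.95, eps=1e-10):
    """Return the update direction for one 2D weight matrix (rows = neurons)."""
    g = grad.astype(np.float32)

    # 1. First-order momentum (assumed heavy-ball form; beta1 is an assumption).
    state["momentum"] = beta1 * state["momentum"] + g

    # 2. Newton-Schulz orthogonalization, as in Muon.
    O = newton_schulz5(state["momentum"])
    norm_before = np.linalg.norm(O)

    # 3. Neuron-wise normalization: EMA of each row's mean squared update.
    row_mean_sq = np.mean(O * O, axis=1, keepdims=True)
    v = beta2 * state["second_momentum_buffer"] + (1.0 - beta2) * row_mean_sq
    state["second_momentum_buffer"] = v
    O = O / (np.sqrt(v) + eps)

    # Assumption: rescale so the overall update norm matches the
    # pre-normalization norm, keeping the learning rate comparable to Muon.
    O = O * (norm_before / (np.linalg.norm(O) + eps))

    # 4. Muon scale correction (assumed form), applied last.
    rows, cols = O.shape
    return O * max(1.0, rows / cols) ** 0.5

# Usage: one step on a toy 8x4 weight matrix with a hypothetical learning rate.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)).astype(np.float32)
grad = rng.standard_normal((8, 4)).astype(np.float32)
state = init_state(W.shape)
W -= 0.02 * normuon_update(grad, state)
```

On the first step the second-momentum buffer starts at zero, so every row of the returned update ends up with the same RMS magnitude; later steps smooth that per-neuron scale over time through the beta2 average.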