PR #1682

open

Non-record: GradPower for Muon prefers p<1 in matched H100 ablation

by PapaFranku4647
val_bpb
1.2834
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,460,600 bytes

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"grad_power":0.9,"grad_power_formula":"g = sign(g) * abs(g) ** p","muon_grad_power":true}
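The grad_power_formula above is an elementwise, sign-preserving power. As a minimal sketch (plain Python on nested lists, not the PR's actual tensor code; the helper names `grad_power` and `dynamic_range` and the sample values are hypothetical), the transform and its effect on the spread of gradient magnitudes might look like this. Since a ratio of magnitudes a/b becomes (a/b)**p, an exponent p<1 compresses the range and p>1 stretches it.

```python
import math


def grad_power(grad, p):
    """Apply g -> sign(g) * abs(g) ** p to every entry of a 2-D gradient.

    `grad` is a list of rows; zeros stay zero and signs are preserved.
    """
    return [[math.copysign(abs(g) ** p, g) for g in row] for row in grad]


def dynamic_range(grad):
    """Ratio of the largest to the smallest nonzero magnitude in `grad`."""
    mags = [abs(g) for row in grad for g in row if g != 0.0]
    return max(mags) / min(mags)


# Hypothetical gradient values, chosen only to illustrate the effect of p.
example = [[1e-4, -1e-1], [5e-3, 2e-2]]

for p in (0.85, 0.9, 1.0, 1.2):  # endpoints match the tested_range in this PR
    transformed = grad_power(example, p)
    print(p, dynamic_range(transformed))
```

With p=1.0 the transform is the identity, which corresponds to vanilla Muon in this sketch.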
Compression
zlib
level: null
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
A gradient power transformation is applied to Muon matrix gradients before momentum and orthogonalization. The best exponent found was below 1 (p=0.9), not the paper default of p=1.2.
parameters: {"best_p":0.9,"tested_range":[0.85,1.2]}
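The ordering described above (power transform first, then momentum, then orthogonalization) can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: it uses plain Python lists instead of tensors, a simple heavy-ball momentum buffer with a hypothetical `beta=0.95` (the PR lists momentum as null), and the quintic Newton-Schulz iteration commonly used in public Muon implementations. All function names are hypothetical.

```python
import math


def grad_power(grad, p):
    """Elementwise g -> sign(g) * abs(g) ** p on a list-of-rows matrix."""
    return [[math.copysign(abs(g) ** p, g) for g in row] for row in grad]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize `g` with a quintic Newton-Schulz iteration.

    Coefficients are the ones commonly used in public Muon code (assumption:
    this PR's stack uses the same or a similar iteration).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so the Gram matrix is small
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * u + c * w for u, w in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + w for u, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def muon_grad_power_step(grad, buf, p=0.9, beta=0.95):
    """One update direction: power transform -> momentum -> orthogonalize."""
    g = grad_power(grad, p)                      # 1. transform the raw gradient
    buf = [[beta * m + v for m, v in zip(rm, rv)]
           for rm, rv in zip(buf, g)]            # 2. accumulate momentum
    update = newton_schulz(buf)                  # 3. orthogonalize the buffer
    return update, buf


# Hypothetical 2x2 gradient and a zero-initialized momentum buffer.
grad = [[0.5, -0.01], [0.2, -2.0]]
buf = [[0.0, 0.0], [0.0, 0.0]]
update, buf = muon_grad_power_step(grad, buf)
print(update)
```

Setting `p=1.0` in `muon_grad_power_step` skips the reshaping and gives the vanilla ordering of momentum followed by orthogonalization, which is the natural baseline for a matched comparison.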

Novel Contributions

  • Applied GradPower-style gradient exponentiation to Muon in the Parameter Golf training stack
  • Found that p<1 transfers better than the paper default p=1.2 in this Muon-heavy regime
  • Reported a matched H100 ablation in which p=0.9 beats vanilla Muon at a single seed (1337)
  • Ran a local 3-seed 4080 sweep showing a consistent improvement for p<1
  • Provided a negative-result note that p=1.2 is harmful in this setting