PR #1942

open

Submission/muoneqr weight decay

val_bpb
1.2257
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,879,569 bytes

Training Techniques

Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: {"backend_steps":3}
Regularization
weight decay
parameters: {"decay":0.085,"decoupled":true,"applied_to":"matrix parameters"}
Other
other
MuonEq-R: row-normalize each gradient matrix before Newton-Schulz orthogonalization inside Muon.
parameters: null
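
The MuonEq-R entry above can be illustrated with a minimal pure-Python sketch: each row of the gradient matrix is scaled to unit L2 norm, and the result is then passed to a Newton-Schulz orthogonalization. This is not the PR's code. The function names, the `eps` guard, and the nested-list matrix representation are assumptions made for illustration. For simplicity, the sketch uses the classic cubic Newton-Schulz iteration (X <- 1.5 X - 0.5 X X^T X) rather than whatever tuned polynomial the submission's Muon backend uses; only the default of 3 steps is taken from the `backend_steps` value listed above.

```python
import math


def row_normalize(grad, eps=1e-8):
    """Scale every row of a 2-D gradient (list of lists) to unit L2 norm."""
    out = []
    for row in grad:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / (norm + eps) for v in row])
    return out


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def newton_schulz(x, steps=3, eps=1e-8):
    """Approximately orthogonalize x with the cubic Newton-Schulz iteration.

    Assumption: cubic update X <- 1.5 X - 0.5 (X X^T) X, chosen for clarity.
    """
    # Dividing by the Frobenius norm keeps all singular values at or below 1,
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / (fro + eps) for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))      # X X^T
        correction = matmul(gram, x)        # (X X^T) X
        x = [[1.5 * v - 0.5 * c for v, c in zip(rx, rc)]
             for rx, rc in zip(x, correction)]
    return x


def muoneq_r_direction(grad, steps=3):
    """Row-normalize first, then orthogonalize (the ordering described above)."""
    return newton_schulz(row_normalize(grad), steps=steps)
```

Calling `muoneq_r_direction(grad)` on a 2-D gradient returns an update direction of the same shape. With only 3 steps the result is an approximation to an orthogonal matrix; more steps bring the rows closer to orthonormal.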
Architecture
weight tying
Tied embeddings used in the baseline configuration.
parameters: null
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • MuonEq-R row-normalization before Newton-Schulz orthogonalization
  • Decoupled Muon weight decay applied AdamW-style to matrix parameters
  • Improved SP-1024 baseline without architecture changes or added parameters
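
The decoupled weight decay in the second bullet above can be sketched as follows. This is a hedged illustration, not the submission's implementation: the function name, the learning-rate argument, and the choice to scale the decay by the learning rate (as AdamW implementations commonly do) are assumptions. Only the decay value 0.085 and the restriction to matrix parameters come from the summary above.

```python
def decoupled_decay_step(weight, update, lr, weight_decay=0.085):
    """Apply one step with AdamW-style decoupled weight decay to a 2-D weight.

    Computes w <- w * (1 - lr * weight_decay) - lr * update, so the decay
    shrinks the weights directly instead of being added to the gradient.
    `lr` is a hypothetical learning rate; the PR summary does not state one.
    """
    shrink = 1.0 - lr * weight_decay
    return [[shrink * w - lr * u for w, u in zip(row_w, row_u)]
            for row_w, row_u in zip(weight, update)]


# Usage: `update` would be the orthogonalized Muon direction for this matrix.
w = [[1.0, -2.0]]
w_next = decoupled_decay_step(w, [[1.0, 1.0]], lr=0.1)
```

Because the shrink factor is applied to the weights themselves, the decay strength does not depend on the scale of the orthogonalized update.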