Muon + Adam

Optimizer
Used in: 7 PRs
Best BPB: 1.1387
Avg BPB: 1.1914

Hyperparameters Across PRs

| pr_number | weight_decay | momentum | other_params |
|---|---|---|---|
| 148 | | | {"matrix_lr":0.012,"scalar_lr":0.012,"tied_embed_lr":0.015,"grad_clip_norm":0.3} |
| 319 | | | {"split_optimizer":true} |
| 537 | | | {"Muon":"used for hidden and attention parameters","Adam":"used for embeddings and scalar parameters"} |
| 550 | | | {"Muon_scope":"matrices","Adam_scope":"scalars"} |
| 575 | 0.04 | | {"MATRIX_LR":0.02,"SCALAR_LR":0.02,"TIED_EMBED_LR":0.05} |
| 637 | 0.04 | | {"embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02} |
| 835 | | | {"matrix_lr":0.012,"scalar_lr":0.012,"tied_embed_lr":0.015,"grad_clip_norm":0.3} |
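The split described in the PR 537 and PR 550 rows (Muon for hidden and attention matrices, Adam for embeddings and scalars) could be sketched roughly as below. This is an illustrative sketch only, not code from any of the PRs: the function `split_param_groups`, the parameter names, the shapes, and the name-and-rank routing rule are all assumptions. Only the three learning rates are taken from the table (the PR 575 row).

```python
from typing import Dict, List, Tuple

# Learning rates copied from the PR 575 row; everything else is hypothetical.
MATRIX_LR = 0.02
SCALAR_LR = 0.02
TIED_EMBED_LR = 0.05


def split_param_groups(shapes: Dict[str, Tuple[int, ...]]) -> List[dict]:
    """Partition parameter names into one Muon group and two Adam groups.

    Assumed routing rule: anything with "embed" in its name goes to Adam,
    remaining 2-D tensors go to Muon, and everything else goes to Adam.
    """
    muon, adam_embed, adam_scalar = [], [], []
    for name, shape in shapes.items():
        if "embed" in name:
            adam_embed.append(name)    # embeddings -> Adam
        elif len(shape) == 2:
            muon.append(name)          # hidden/attention matrices -> Muon
        else:
            adam_scalar.append(name)   # gains, biases, other scalars -> Adam
    return [
        {"optimizer": "muon", "params": muon, "lr": MATRIX_LR},
        {"optimizer": "adam", "params": adam_embed, "lr": TIED_EMBED_LR},
        {"optimizer": "adam", "params": adam_scalar, "lr": SCALAR_LR},
    ]


# Hypothetical parameter names and shapes for a tiny transformer.
example_shapes = {
    "tok_embed.weight": (1024, 512),
    "blocks.0.attn.qkv.weight": (1536, 512),
    "blocks.0.mlp.fc.weight": (2048, 512),
    "blocks.0.norm.gain": (512,),
}

groups = split_param_groups(example_shapes)
for group in groups:
    print(group["optimizer"], group["lr"], group["params"])
```

In a real training script each group would be handed to the corresponding optimizer; the table does not say how any of the PRs actually performs the routing, so the rule above should be read as one plausible choice.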