Muon + AdamW

Optimizer
Used in: 5 PRs
Best BPB: 1.1160
Avg BPB: 1.2892

Hyperparameters Across PRs

pr_number | weight_decay | momentum | other_params
525 | | |
531 | | | {"lr_matrix":0.02,"lr_embedding":0.03,"lr_scalar":0.02,"grad_accum_steps":8}
536 | | |
559 | 0.04 | |
583 | 0.045 | 0.99 | {"learning_rates":{"matrix":0.035,"tied_embed":0.045,"scalar":0.035},"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"grad_clip_norm":0.35,"warmdown_iters":2000,"warmup_steps":20,"batch_tokens":786432,"sequence_length":2048}
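
The PR 583 row records a momentum warmup (start 0.92, 1500 steps) toward a final momentum of 0.99, but not the shape of the ramp. The sketch below assumes a linear ramp, which is a common choice and not something the table confirms. The function name `muon_momentum` and the dict `PR_583` are hypothetical names introduced for illustration; only the numeric values come from the table. The last line also assumes `batch_tokens` counts tokens in one batch of fixed-length sequences.

```python
# Values copied from the PR 583 row. The linear schedule is an ASSUMPTION;
# the table gives only the start value, the step count, and the final momentum.
PR_583 = {
    "momentum": 0.99,
    "momentum_warmup_start": 0.92,
    "momentum_warmup_steps": 1500,
    "batch_tokens": 786432,
    "sequence_length": 2048,
}


def muon_momentum(step: int, cfg: dict = PR_583) -> float:
    """Linearly ramp momentum from the warmup start value to the final value."""
    # Fraction of the warmup completed, clamped to [0, 1].
    frac = min(max(step, 0) / cfg["momentum_warmup_steps"], 1.0)
    return (1.0 - frac) * cfg["momentum_warmup_start"] + frac * cfg["momentum"]


# Sequences per batch, assuming batch_tokens is a whole batch of
# sequence_length-token sequences.
sequences_per_batch = PR_583["batch_tokens"] // PR_583["sequence_length"]

if __name__ == "__main__":
    for step in (0, 750, 1500, 3000):
        print(step, round(muon_momentum(step), 4))
    print("sequences per batch:", sequences_per_batch)
```

Under these assumptions the momentum starts at 0.92, reaches 0.99 at step 1500, and stays there afterward. A training loop would typically write the returned value into the Muon parameter groups before each optimizer step.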