Muon and Adam for training; SGD with momentum for TTT
Category: Optimizer
Used in: 1 PR
Best BPB: 1.0944
Avg BPB: 1.0944
Submissions
Hyperparameters Across PRs
| pr_number | weight_decay | momentum | other_params |
|---|---|---|---|
| 644 | 0.04 | 0.9 | {"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1} |
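
The `other_params` cell mixes two groups of settings in one flat JSON object: keys prefixed with `SGD_` and keys without a prefix. The sketch below separates them, assuming the `SGD_` prefix marks the TTT-phase SGD settings and that the remaining keys belong to the Muon/Adam training phase. That reading is inferred from the title, not confirmed by the PR. The helper name `split_by_phase` is hypothetical.

```python
import json

# The other_params cell for PR 644, copied verbatim from the table above.
OTHER_PARAMS = (
    '{"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,'
    '"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,'
    '"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,'
    '"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}'
)


def split_by_phase(params, prefix="SGD_"):
    """Split a flat hyperparameter dict into (training, ttt) dicts.

    Assumption: keys starting with `prefix` configure the TTT-phase SGD
    optimizer; every other key configures the main training phase.
    The prefix is stripped from the TTT keys.
    """
    training, ttt = {}, {}
    for key, value in params.items():
        if key.startswith(prefix):
            ttt[key[len(prefix):]] = value
        else:
            training[key] = value
    return training, ttt


params = json.loads(OTHER_PARAMS)
training, ttt = split_by_phase(params)

print("training:", training)
print("ttt:", ttt)
```

Note that `grad_clip` appears in both groups with different values (0.3 without the prefix, 1 with it), so keeping the two dicts separate avoids one silently overwriting the other.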