Muon and Adam for training; SGD with momentum for TTT

Optimizer
Used in: 1 PR
Best BPB: 1.0944
Avg BPB: 1.0944

Hyperparameters Across PRs

pr_number | weight_decay | momentum | other_params
644 | 0.04 | 0.9 | {"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}
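
The other_params field mixes settings for the two phases named in the optimizer title. As a minimal sketch, the snippet below parses that JSON and separates the keys prefixed with `SGD_` from the rest. It assumes that the `SGD_` prefix marks the TTT-phase settings and that the unprefixed keys belong to the Muon/Adam training phase; the page does not state this explicitly. The function name `split_params` is hypothetical.

```python
import json

# other_params value copied verbatim from the table row above.
OTHER_PARAMS = (
    '{"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,'
    '"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,'
    '"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,'
    '"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}'
)

PREFIX = "SGD_"


def split_params(raw):
    """Split the JSON blob into (training, ttt) dictionaries.

    Assumption: keys starting with "SGD_" configure the TTT phase and
    all other keys configure the main training phase.
    """
    params = json.loads(raw)
    ttt = {k[len(PREFIX):]: v for k, v in params.items() if k.startswith(PREFIX)}
    train = {k: v for k, v in params.items() if not k.startswith(PREFIX)}
    return train, ttt


train_cfg, ttt_cfg = split_params(OTHER_PARAMS)
print("training:", train_cfg)
print("ttt:", ttt_cfg)
```

Grouping the keys this way makes it easier to compare the two phases side by side, for example the training `grad_clip` of 0.3 against the `SGD_grad_clip` of 1.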