Muon and Adam for training; SGD with momentum for TTT

Optimizer

Used in: 1 PR

Best BPB: 1.0944

Avg BPB: 1.0944

Submissions

pr_number	weight_decay	momentum	other_params
644	0.04	0.9	{"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}
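
The other_params column mixes two groups of settings in one flat JSON object. A minimal sketch of separating them follows, assuming (from the title, not from the PR itself) that keys prefixed with SGD_ configure the TTT phase and the remaining keys configure the Muon/Adam training phase. The helper name split_params is hypothetical and used only for illustration.

```python
import json

# The other_params value for PR 644, copied verbatim from the table above.
OTHER_PARAMS = (
    '{"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,'
    '"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,'
    '"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,'
    '"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}'
)


def split_params(raw, prefix="SGD_"):
    """Split a flat JSON parameter object into (training, ttt) dicts.

    Assumption: keys starting with `prefix` belong to the TTT phase.
    The prefix is stripped from the TTT keys for readability.
    """
    params = json.loads(raw)
    ttt = {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}
    training = {k: v for k, v in params.items() if not k.startswith(prefix)}
    return training, ttt


training, ttt = split_params(OTHER_PARAMS)

print("training:", training)
print("ttt:", ttt)
```

Read this way, the two phases carry separate gradient-clipping values (grad_clip 0.3 and SGD_grad_clip 1) and separate learning rates. The table does not say which optimizer the weight_decay and momentum columns apply to, so the sketch leaves them out.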