PR #1942

open

Submission/muoneqr weight decay

val_bpb

1.2257

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,879,569 bytes

Training Techniques

Optimizer

Muon

weight_decay: 0.085

momentum: null

other_params: {"backend_steps":3}

Regularization

weight decay

parameters: {"decay":0.085,"decoupled":true,"applied_to":"matrix parameters"}
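The decoupled decay listed above can be sketched as follows. This is a minimal illustration, not the submission's code: it assumes an AdamW-style rule in which the decay is scaled by the learning rate, and it treats "matrix parameters" as any 2-D array. The function name `apply_decoupled_weight_decay`, the dict layout, and the learning rate in the usage lines are hypothetical.

```python
import numpy as np

def apply_decoupled_weight_decay(params, updates, lr, weight_decay=0.085):
    """Return new parameters after one step with decoupled weight decay.

    Assumption: decay shrinks the weights directly (it is not added to the
    gradient) and only touches 2-D "matrix" parameters.
    """
    new_params = {}
    for name, p in params.items():
        if p.ndim == 2:
            # Decoupled decay: multiplicative shrink, independent of the update.
            p = p * (1.0 - lr * weight_decay)
        # Apply the optimizer's update direction afterwards.
        new_params[name] = p - lr * updates[name]
    return new_params

# Usage with hypothetical shapes and a hypothetical learning rate.
params = {"w": np.ones((2, 3)), "b": np.ones(3)}
updates = {"w": np.zeros((2, 3)), "b": np.zeros(3)}
out = apply_decoupled_weight_decay(params, updates, lr=0.1)
```

With a zero update, the matrix `w` shrinks by the factor `1 - lr * weight_decay` while the vector `b` is left unchanged, which is the behavior "applied_to: matrix parameters" suggests.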

Other

other

MuonEq-R: row-normalize each gradient matrix before Newton-Schulz orthogonalization inside Muon.

parameters: null
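The MuonEq-R step described above can be sketched in NumPy. This is an illustration under stated assumptions, not the PR's implementation: rows are normalized by their L2 norm with a small `eps`, and the Newton-Schulz iteration uses the quintic coefficients from the public Muon reference implementation, with `steps=3` matching `backend_steps` above. The function names are hypothetical.

```python
import numpy as np

def row_normalize(g, eps=1e-7):
    """Scale each row of g to (approximately) unit L2 norm."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / (norms + eps)

def newton_schulz(g, steps=3, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration.

    Coefficients follow the public Muon reference implementation (assumption:
    the submission uses the same ones).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize by the Frobenius norm so all singular values are at most 1.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so the Gram matrix is small
    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    return x.T if transposed else x

def muoneq_r_direction(grad, steps=3):
    """Row-normalize the gradient matrix, then orthogonalize it."""
    return newton_schulz(row_normalize(grad), steps=steps)

# Usage on a hypothetical 6x4 gradient matrix.
rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))
direction = muoneq_r_direction(grad)
```

Row normalization equalizes the scale of each row before the iteration, so rows with large gradient magnitude do not dominate the orthogonalized update.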

Architecture

weight tying

Tied embeddings used in the baseline configuration.

parameters: null
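Weight tying can be illustrated with a short NumPy sketch, assuming the usual meaning of the term: one embedding matrix serves both the input lookup and the output projection. The function `tied_forward`, its `hidden_fn` placeholder for the transformer body, and the sizes in the usage lines are hypothetical.

```python
import numpy as np

def tied_forward(token_ids, embedding, hidden_fn=lambda h: h):
    """Compute logits with a single shared embedding matrix.

    embedding has shape (vocab_size, d_model). hidden_fn stands in for the
    transformer body and defaults to the identity.
    """
    h = embedding[token_ids]   # input lookup: (seq_len, d_model)
    h = hidden_fn(h)           # placeholder for the transformer layers
    return h @ embedding.T     # output projection reuses the same matrix

# Usage with hypothetical sizes.
rng = np.random.default_rng(0)
embedding = rng.standard_normal((10, 4))
token_ids = np.array([1, 3, 5])
logits = tied_forward(token_ids, embedding)
```

Because the output projection is the transpose of the input table, no separate unembedding matrix is stored, which reduces parameter count.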

Sequence Length

sequence_length

train_length: 1024

eval_length: null