val_bpb
1.2257
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,879,569 bytes
Training Techniques
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: {"backend_steps":3}
Regularization
weight decay
parameters: {"decay":0.085,"decoupled":true,"applied_to":"matrix parameters"}
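The decoupled decay recorded above can be illustrated with a minimal sketch. It assumes the usual AdamW-style form, in which the weights are shrunk directly by a factor of `1 - lr * decay` instead of adding the decay term to the gradient. The function name `decoupled_weight_decay_step`, the `lr` argument, and the `update` argument (standing in for the Muon update direction) are hypothetical and not taken from the submission.

```python
import numpy as np

def decoupled_weight_decay_step(weight, update, lr, decay=0.085):
    """One parameter step with AdamW-style decoupled weight decay.

    `update` is a stand-in for the optimizer's update direction
    (hypothetical name; for Muon it would be the orthogonalized matrix).
    """
    # Decoupled decay: shrink the weights directly, independent of the gradient.
    decayed = weight * (1.0 - lr * decay)
    # Then apply the optimizer's update to the decayed weights.
    return decayed - lr * update
```

Per the record, this would apply only to matrix parameters. How other parameters are treated is not stated here.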
Other
other
MuonEq-R: row-normalize each gradient matrix before Newton-Schulz orthogonalization inside Muon.
parameters: null
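A rough sketch of the MuonEq-R idea described above: scale each row of the gradient matrix to unit L2 norm, then run Newton-Schulz orthogonalization. Several things here are assumptions and not confirmed by the record: that "row-normalize" means L2 normalization per row, that `backend_steps: 3` is the number of Newton-Schulz iterations, and the function names themselves. The iteration shown is the classic cubic Newton-Schulz step, swapped in for simplicity. The polynomial actually used by the submission's Muon implementation is not stated.

```python
import numpy as np

def row_normalize(grad, eps=1e-7):
    """Scale each row of a 2-D gradient to (approximately) unit L2 norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)

def newton_schulz(mat, steps=3, eps=1e-7):
    """Approximately orthogonalize `mat` with the classic cubic iteration.

    Assumption: this cubic variant is illustrative only, not the
    submission's exact polynomial.
    """
    # Divide by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    x = mat / (np.linalg.norm(mat) + eps)
    for _ in range(steps):
        # Pushes each singular value toward 1 while keeping singular vectors.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

def muoneq_r_direction(grad, steps=3):
    """Row-normalize first, then orthogonalize (hypothetical helper name)."""
    return newton_schulz(row_normalize(grad), steps=steps)
```

With only three steps the result is a coarse approximation of an orthogonal matrix. Raising `steps` in this sketch drives the singular values closer to 1.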
Architecture
weight tying
Tied embeddings used in the baseline configuration.
parameters: null
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- MuonEq-R row-normalization before Newton-Schulz orthogonalization
- Decoupled Muon weight decay applied AdamW-style to matrix parameters
- Improved SP-1024 baseline without architecture changes or added parameters