PR #1167
Status: open
Submission: val_bpb=1.3736 | 10 layers + Muon + mlp_mult=3
Update default NUM_LAYERS and MLP_MULT values
by Durlabhkumarjha
val_bpb: 1.3736
Architecture: Transformer
Optimizer: Muon
Artifact Size: not reported
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Architecture
MLP3x
Uses an MLP multiplier of 3.
parameters: {"mlp_mult":3}
Transformer
Uses 10 layers.
parameters: {"layers":10}
Novel Contributions
- 10-layer model configuration
- Muon optimizer
- MLP multiplier set to 3
- Updated default NUM_LAYERS and MLP_MULT values
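
The default changes listed above can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes the defaults are read from environment variables named NUM_LAYERS and MLP_MULT, and the `ModelConfig` class, the `load_config` helper, the `MODEL_DIM` variable, and its value of 512 are all hypothetical names and numbers introduced only for the example. It also assumes that "MLP multiplier" means the MLP hidden width is `mlp_mult` times the model width, which is the usual convention.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical config holder; the field names mirror the PR's parameters."""
    num_layers: int
    mlp_mult: int
    model_dim: int  # assumption: the PR does not state a model width

    @property
    def mlp_hidden_dim(self) -> int:
        # Assumed convention: MLP hidden width = multiplier * model width.
        return self.mlp_mult * self.model_dim


def load_config(env=None) -> ModelConfig:
    """Read settings from a mapping (os.environ by default).

    The fallbacks 10 and 3 are the values this PR reports; 512 is made up.
    """
    env = os.environ if env is None else env
    return ModelConfig(
        num_layers=int(env.get("NUM_LAYERS", 10)),
        mlp_mult=int(env.get("MLP_MULT", 3)),
        model_dim=int(env.get("MODEL_DIM", 512)),
    )


if __name__ == "__main__":
    cfg = load_config()
    print(cfg.num_layers, cfg.mlp_mult, cfg.mlp_hidden_dim)
```

Under these assumptions, a run with no overrides gets the 10-layer, mlp_mult=3 configuration, while setting NUM_LAYERS or MLP_MULT in the environment still overrides the defaults.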