PR #1167

open

Submission: val_bpb=1.3736 | 10 layers + Muon + mlp_mult=3

Update default NUM_LAYERS and MLP_MULT values

by Durlabhkumarjha
val_bpb: 1.3736
Architecture: Transformer
Optimizer: Muon
Artifact Size

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
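
The PR lists Muon without hyperparameters, so the following is only a minimal sketch of the orthogonalization step that Muon is generally known for: a quintic Newton-Schulz iteration applied to a 2-D gradient (or momentum) matrix. The coefficients are the ones used in the publicly available Muon reference code; the step count, the helper names (matmul, transpose, newton_schulz_orthogonalize), and the pure-Python matrix representation are assumptions made for illustration and are not taken from this submission.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration.

    Hypothetical helper: the step count and eps are illustrative defaults,
    not values reported by the PR.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference code
    flip = len(g) > len(g[0])          # iterate on the wide orientation
    x = transpose(g) if flip else [row[:] for row in g]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]  # Frobenius-normalize first
    for _ in range(steps):
        gram = matmul(x, transpose(x))          # A = X X^T
        gram2 = matmul(gram, gram)              # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]  # B = b*A + c*A^2
        bx = matmul(poly, x)
        x = [[a * u + w for u, w in zip(r1, r2)]
             for r1, r2 in zip(x, bx)]           # X = a*X + B X
    return transpose(x) if flip else x


# Toy 4x8 "gradient" with a fixed seed, purely for demonstration.
rng = random.Random(0)
grad = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
update = newton_schulz_orthogonalize(grad)
gram = matmul(update, transpose(update))  # diagonal should be roughly 1
```

The iteration does not converge to an exactly orthogonal matrix; it pushes the singular values into a band around 1, which is why the diagonal of gram is only approximately 1.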
Architecture
MLP3x
Uses an MLP multiplier of 3.
parameters: {"mlp_mult":3}
Transformer
Uses 10 layers.
parameters: {"layers":10}
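
To make the two architecture settings concrete, here is a rough sketch of how mlp_mult=3 and 10 layers translate into MLP width and weight count. The PR does not state the model dimension or the MLP variant, so model_dim=512 and the plain two-projection, bias-free MLP below are assumptions chosen only for illustration.

```python
def mlp_hidden_dim(model_dim: int, mlp_mult: int) -> int:
    """Hidden width of the MLP block: the multiplier times the model width."""
    return mlp_mult * model_dim


def mlp_params_per_layer(model_dim: int, mlp_mult: int) -> int:
    """Weights in one MLP block, assuming an up and a down projection with no biases."""
    hidden = mlp_hidden_dim(model_dim, mlp_mult)
    return 2 * model_dim * hidden


# Values from the PR summary.
NUM_LAYERS = 10
MLP_MULT = 3

# Assumed for illustration only; the PR does not report the model width.
MODEL_DIM = 512

hidden = mlp_hidden_dim(MODEL_DIM, MLP_MULT)
per_layer = mlp_params_per_layer(MODEL_DIM, MLP_MULT)
total_mlp = NUM_LAYERS * per_layer

print(f"hidden width: {hidden}")
print(f"MLP weights per layer: {per_layer:,}")
print(f"MLP weights across {NUM_LAYERS} layers: {total_mlp:,}")
```

Under these assumptions the MLP weight count scales linearly in both settings, so raising either the layer count or the multiplier grows the artifact proportionally.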

Novel Contributions

  • 10-layer model configuration
  • Muon optimizer
  • MLP multiplier set to 3
  • Updated default NUM_LAYERS and MLP_MULT values
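
A change to the NUM_LAYERS and MLP_MULT defaults could look something like the sketch below. It assumes the training script reads these names from environment variables and falls back to built-in defaults; the PR summary does not confirm that mechanism, and ModelConfig and load_config are hypothetical names introduced here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical config holder; defaults mirror the values listed in the PR."""
    num_layers: int = 10
    mlp_mult: int = 3


def load_config(env=None) -> ModelConfig:
    """Build a config from a mapping of overrides (os.environ when none is given)."""
    env = os.environ if env is None else env
    return ModelConfig(
        num_layers=int(env.get("NUM_LAYERS", ModelConfig.num_layers)),
        mlp_mult=int(env.get("MLP_MULT", ModelConfig.mlp_mult)),
    )


# No overrides: the submission's defaults apply.
default_cfg = load_config({})

# An explicit override still wins over the default.
override_cfg = load_config({"NUM_LAYERS": "12"})
```

Keeping the submission's values as defaults means a run with no overrides uses the listed configuration, while individual experiments can still change one setting at a time.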