PR #1190

Status: open

Non-record: 10L MLP3x + Muon | val_bpb=1.3365 | Single Colab GPU

by Durlabhkumarjha
val_bpb: 1.3365
Architecture: Transformer
Optimizer: Muon
Artifact Size: null

Training Techniques

Architecture: MLP3x
Reduced the MLP expansion to 3× and increased the depth to 10 layers in the Transformer.
parameters: {"layers": 10, "mlp_expansion": 3}
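A rough sketch of what this config change means for the MLP parameter budget. The model width `d` is an assumption (the PR does not state it), as is the 12-layer / 4×-expansion baseline; both helper functions are hypothetical and count only the two MLP projection matrices per layer:

```python
# Hypothetical parameter accounting for the {"layers": 10, "mlp_expansion": 3}
# change, compared against a common 12-layer, 4x-expansion baseline (assumed).
def mlp_params(d_model, expansion):
    # up-projection + down-projection weights; biases omitted for simplicity
    return d_model * (expansion * d_model) * 2

def stack_mlp_params(layers, d_model, expansion):
    return layers * mlp_params(d_model, expansion)

d = 768  # assumed model width; not stated in the PR
baseline = stack_mlp_params(12, d, 4)  # assumed baseline stack
variant = stack_mlp_params(10, d, 3)   # this submission's 10-layer, 3x stack
print(variant / baseline)  # → 0.625
```

The width cancels in the ratio: 10·3 / (12·4) = 0.625, i.e. the MLP stack carries 62.5% of the baseline's MLP parameters, which is consistent with targeting a single Colab GPU.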
Optimizer: Muon
weight_decay: null
momentum: null
other_params: null
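Since the PR leaves all Muon hyperparameters null, here is a minimal sketch of the optimizer's core step as described in the public Muon recipe: SGD momentum on a 2D weight's gradient, followed by approximate orthogonalization of the update via a Newton–Schulz iteration. The `lr` and `momentum` defaults below are assumptions, not values from this submission:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic iteration coefficients from the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # push singular values toward 1
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    # Assumed defaults; the PR reports weight_decay/momentum as null.
    buf = momentum * buf + grad      # momentum buffer
    update = newton_schulz(buf)      # orthogonalized update direction
    return W - lr * update, buf
```

This is a sketch of the update rule only; the real optimizer applies it per 2D weight matrix while non-matrix parameters (embeddings, norms) typically use a different optimizer.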
Sequence Length
train_length: 2048
eval_length: null
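To illustrate the train_length=2048 setting, a hypothetical packing helper that slices a token stream into fixed 2048-token training sequences (eval_length is null in the PR, so no eval-side handling is shown):

```python
# Hypothetical illustration of train_length=2048: pack a flat token stream
# into fixed-length training sequences, dropping the trailing partial chunk.
def pack_sequences(tokens, seq_len=2048):
    n = len(tokens) // seq_len
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(n)]

batches = pack_sequences(list(range(5000)))
print(len(batches), len(batches[0]))  # → 2 2048
```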

Novel Contributions

  • 10-layer Transformer variant
  • 3× MLP expansion
  • Muon optimizer on single-GPU Colab hardware
  • 2048 training sequence length