PR #1190
Non-record: 10L MLP3x + Muon | val_bpb=1.3365 | Single Colab GPU
by Durlabhkumarjha
val_bpb: 1.3365
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture: MLP3x
Reduced the MLP expansion to 3× and increased the depth to 10 layers in the Transformer.
parameters: {"layers": 10, "mlp_expansion": 3}
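The 3× expansion mainly trades MLP width for extra depth. A quick parameter-count sketch of that trade, in pure Python; `d_model=768` is an assumed hidden size, not stated in the PR:

```python
def mlp_params(d_model: int, expansion: int) -> int:
    """Parameters in a standard two-matrix Transformer MLP (biases omitted)."""
    d_ff = expansion * d_model
    return d_model * d_ff + d_ff * d_model  # up-projection + down-projection

def stack_mlp_params(layers: int, d_model: int, expansion: int) -> int:
    """Total MLP parameters across a stack of identical layers."""
    return layers * mlp_params(d_model, expansion)

# At an assumed d_model of 768, 10 layers at 3x expansion carry fewer
# MLP parameters than 8 layers at the conventional 4x expansion.
ten_layers_3x = stack_mlp_params(10, 768, 3)
eight_layers_4x = stack_mlp_params(8, 768, 4)
```

Under these assumptions `ten_layers_3x` is about 35.4M versus 37.7M for the conventional configuration, so the deeper-but-narrower variant is slightly cheaper per token in the MLPs.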
Optimizer: Muon
weight_decay: null
momentum: null
other_params: null
Sequence Length
train_length: 2048
eval_length: null
Novel Contributions
- 10-layer Transformer variant
- 3× MLP expansion
- Muon optimizer on single-GPU Colab hardware
- 2048 training sequence length
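Muon, listed above, is momentum SGD whose 2-D weight updates are approximately orthogonalized with a Newton-Schulz iteration. A minimal NumPy sketch; the quintic coefficients follow the public reference implementation, while the learning rate and momentum values are placeholders, since the PR records its hyperparameters as null:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:            # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update on a 2-D weight matrix. lr and momentum are
    assumed typical values, not taken from this PR."""
    buf = momentum * buf + grad   # momentum accumulation
    update = newton_schulz(buf)   # orthogonalized update direction
    return param - lr * update, buf
```

The orthogonalization pushes all singular values of the update toward 1, so every direction in the weight matrix moves at a comparable rate; this is the property that lets Muon use a single large learning rate for the hidden matrices.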