val_bpb
1.2194
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.88MB
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
Architecture
MLP3x
Increased MLP multiplier from 2 to 3.
parameters: {"mlp_mult":3}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Training on validation data enabled.
parameters: {"train_on_val":1}
Compression
zlib
level: null
Novel Contributions
- Optimizer hyperparameter tuning derived from analysis of top-scoring submissions
- Uses unmodified baseline training script with all changes applied via environment variables
- Longer training sequence length (2048)
- Higher Muon momentum with warmup schedule
- Gradient clipping enabled
- Increased MLP multiplier
- Training on validation data enabled