val_bpb: 1.2075
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.2 MB
Training Techniques
Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"lr": 0.02}
LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
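The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final 3000 steps. Only warmdown_steps is given in the config, so the function name, the total_steps parameter, and the hold-then-linear shape are illustrative assumptions:

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Hold base_lr, then decay linearly to zero over the last warmdown_steps.

    Assumed shape of the 'warmdown' schedule; only warmdown_steps=3000 and
    base_lr=0.02 come from the config above.
    """
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```

In practice this would be applied per optimizer step, e.g. by setting each param group's lr to warmdown_lr(step, total_steps).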
Other
- Systematic hyperparameter search across 33 experiments, using fixed-seed paired comparisons and one-variable-at-a-time validation, run across multiple GPU tiers.
  parameters: {"seed": 1337, "experiments": 33, "gpu_tiers": ["A40", "1xH100", "8xH100"]}
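The fixed-seed paired comparison might look like the sketch below: both configs are run under the same seed so that seed-induced noise cancels and a small metric delta (a few thousandths of a BPB) stays measurable. The function names and config shape are hypothetical:

```python
import random

def paired_delta(run_fn, config_a, config_b, seeds=(1337,)):
    """Average per-seed metric deltas between two configs.

    Reseeding before each run pairs the noise, so the difference
    reflects the config change rather than run-to-run variance.
    run_fn and the config dicts are illustrative assumptions.
    """
    deltas = []
    for seed in seeds:
        random.seed(seed)
        metric_a = run_fn(config_a)
        random.seed(seed)
        metric_b = run_fn(config_b)
        deltas.append(metric_b - metric_a)
    return sum(deltas) / len(deltas)
```

With a noisy stand-in for training (metric = noise + a tiny config-dependent bias), the paired delta recovers the bias exactly because the noise term is identical across the pair.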
- Compatibility fix for PyTorch 2.4: replaced the enable_gqa flag of scaled dot-product attention with a manual repeat_interleave expansion of the KV heads for GQA.
  parameters: {"framework": "PyTorch 2.4"}
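The enable_gqa flag of torch.nn.functional.scaled_dot_product_attention is not available in PyTorch 2.4, so grouped KV heads must be expanded manually to match the query head count before calling attention. A minimal NumPy sketch of that expansion (np.repeat along the head axis mirrors torch.repeat_interleave(kv, repeats, dim=1)); the function name and tensor shapes are illustrative assumptions:

```python
import numpy as np

def expand_kv_heads(kv, n_heads):
    """Expand grouped KV heads to match the query head count.

    kv: array of shape (batch, n_kv_heads, seq, head_dim).
    Equivalent to torch.repeat_interleave(kv, n_heads // n_kv_heads, dim=1):
    each KV head is duplicated consecutively for its query-head group.
    """
    n_kv_heads = kv.shape[1]
    assert n_heads % n_kv_heads == 0, "query heads must be a multiple of KV heads"
    return np.repeat(kv, n_heads // n_kv_heads, axis=1)
```

After this expansion, K and V have the same head dimension as Q and the standard (non-GQA) attention path can be used unchanged.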
- Scaled training and validation to 8xH100 SXM under a 600-second wallclock budget.
  parameters: {"training_time_seconds": 600, "gpus": 8}
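A wallclock-budgeted run (as opposed to a fixed step count) can be sketched as a loop that checks elapsed time before each step. The function name and step_fn callback are assumptions; only the 600-second budget comes from the entry above:

```python
import time

def train_for_wallclock(step_fn, budget_seconds=600.0):
    """Run training steps until the wallclock budget is exhausted.

    Uses a monotonic clock so the budget is unaffected by system
    clock adjustments. Returns the number of completed steps.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps
```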
Sequence Length
  train_length: 4096
  eval_length: null
Novel Contributions
- Methodical hyperparameter search with fixed-seed paired comparison for reliable small-delta measurement
- Validation that Muon optimizer with lr=0.02, momentum=0.99, and warmdown=3000 improves BPB
- Use of ROPE_BASE=200000 to improve performance
- Training with sequence length 4096 to improve BPB
- Insight that optimal hyperparameters transfer poorly across compute budgets and must be re-tuned at target scale
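The ROPE_BASE=200000 contribution above can be illustrated with the standard rotary-embedding frequency computation; a larger base slows the rotation of the higher dimensions, which tends to suit the longer 4096-token sequences. The exact RoPE variant used is not specified in this document, so the formulation below is the common one and the function name is hypothetical:

```python
import numpy as np

def rope_inv_freq(head_dim, base=200000.0):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)
    for i in [0, head_dim / 2).

    A larger base stretches the lowest frequencies, giving slower-rotating
    dimensions for long-range positions; base=200000 is the value named in
    the contributions list (vs. the common default of 10000).
    """
    return base ** (-np.arange(0, head_dim, 2) / head_dim)
```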