PR #141 (open)

Non-record: Systematic Hyperparameter Search (val_bpb=1.2075)

val_bpb: 1.2075
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.2 MB

Training Techniques

Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"lr": 0.02}

LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
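A "warmdown" schedule typically holds the learning rate constant and then decays it linearly to zero over the final steps. A minimal sketch under that assumption (the function name and `total_steps` are illustrative; only `base_lr=0.02` and `warmdown_steps=3000` come from this PR's config):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then linear warmdown to zero over the final steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    # Fraction of the warmdown window remaining, from 1.0 down to 0.0.
    remaining = (total_steps - step) / warmdown_steps
    return base_lr * remaining
```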
Other: Systematic hyperparameter search across 33 experiments with fixed-seed paired comparisons and one-variable-at-a-time validation across multiple GPU tiers.
  parameters: {"seed": 1337, "experiments": 33, "gpu_tiers": ["A40", "1xH100", "8xH100"]}
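The fixed-seed paired-comparison setup can be sketched as follows: run baseline and variant configs with the same seed so the measured delta reflects only the changed hyperparameter, not run-to-run noise. The `run_fn` and config dicts here are hypothetical, not the PR's actual harness:

```python
import random

def paired_comparison(run_fn, baseline_cfg, variant_cfg, seed=1337):
    """One-variable-at-a-time A/B test with a shared fixed seed.

    Reseeding before each run makes the stochastic parts identical,
    so small val_bpb deltas are attributable to the config change.
    """
    random.seed(seed)
    baseline_bpb = run_fn(baseline_cfg)
    random.seed(seed)
    variant_bpb = run_fn(variant_cfg)
    return variant_bpb - baseline_bpb  # negative delta = variant is better
```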
Other: Compatibility fix for PyTorch 2.4: replaced enable_gqa with a manual repeat_interleave for GQA.
  parameters: {"framework": "PyTorch 2.4"}
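The compatibility fix can be sketched like this: the `enable_gqa` flag of `F.scaled_dot_product_attention` is not available in PyTorch 2.4, so the key/value heads are expanded with `repeat_interleave` to match the query head count before calling SDPA. Function name and tensor shapes here are illustrative, not the PR's actual code:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention without enable_gqa (PyTorch 2.4 compatible).

    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d),
    where n_q_heads is a multiple of n_kv_heads.
    """
    n_rep = q.shape[1] // k.shape[1]  # queries sharing each KV head
    # Duplicate each KV head n_rep times along the head dimension.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```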
Other: Scaled training and validation on 8xH100 SXM for 600 seconds of wallclock time.
  parameters: {"training_time_seconds": 600, "gpus": 8}
Sequence Length:
  train_length: 4096
  eval_length: null

Novel Contributions

  • Methodical hyperparameter search with fixed-seed paired comparison for reliable small-delta measurement
  • Validation that Muon optimizer with lr=0.02, momentum=0.99, and warmdown=3000 improves BPB
  • Use of ROPE_BASE=200000 to improve performance
  • Training with sequence length 4096 to improve BPB
  • Insight that optimal hyperparameters transfer poorly across compute budgets and must be re-tuned at target scale
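For the ROPE_BASE=200000 contribution above, the base enters RoPE through the inverse-frequency table: a larger base slows the rotation of the higher-index frequency pairs, which is commonly used to stretch effective context. A minimal sketch of that computation (function name and `head_dim` are illustrative; only the base value comes from this PR):

```python
def rope_inv_freq(head_dim, base=200000.0):
    """Inverse frequencies for rotary position embeddings (RoPE).

    One frequency per pair of channels; larger base -> slower rotation
    for high-index pairs, i.e. longer positional wavelengths.
    """
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```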