val_bpb: 1.2075
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.2 MB
Training Techniques
Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"lr": 0.02}
LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
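The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final 3000 steps. Only warmdown_steps is given in the config, so the function name, the total_steps parameter, and the hold-then-linear shape are illustrative assumptions:

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Hold base_lr, then decay linearly to zero over the last warmdown_steps.

    Assumed shape of the 'warmdown' schedule; only warmdown_steps=3000 and
    base_lr=0.02 come from the config above.
    """
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```

In practice this would be applied per optimizer step, e.g. by setting each param group's lr to warmdown_lr(step, total_steps).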
Other
- Systematic hyperparameter search across 33 experiments, using fixed-seed paired comparisons and one-variable-at-a-time validation, run across multiple GPU tiers.
  parameters: {"seed": 1337, "experiments": 33, "gpu_tiers": ["A40", "1xH100", "8xH100"]}
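The fixed-seed paired comparison might look like the sketch below: both configs are run under the same seed so that seed-induced noise cancels and a small metric delta (a few thousandths of a BPB) stays measurable. The function names and config shape are hypothetical:

```python
import random

def paired_delta(run_fn, config_a, config_b, seeds=(1337,)):
    """Average per-seed metric deltas between two configs.

    Reseeding before each run pairs the noise, so the difference
    reflects the config change rather than run-to-run variance.
    run_fn and the config dicts are illustrative assumptions.
    """
    deltas = []
    for seed in seeds:
        random.seed(seed)
        metric_a = run_fn(config_a)
        random.seed(seed)
        metric_b = run_fn(config_b)
        deltas.append(metric_b - metric_a)
    return sum(deltas) / len(deltas)
```

With a noisy stand-in for training (metric = noise + a tiny config-dependent bias), the paired delta recovers the bias exactly because the noise term is identical across the pair.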
- Compatibility fix for PyTorch 2.4: replaced the enable_gqa flag of scaled dot-product attention with a manual repeat_interleave expansion of the KV heads for GQA.
  parameters: {"framework": "PyTorch 2.4"}
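The enable_gqa flag of torch.nn.functional.scaled_dot_product_attention is not available in PyTorch 2.4, so grouped KV heads must be expanded manually to match the query head count before calling attention. A minimal NumPy sketch of that expansion (np.repeat along the head axis mirrors torch.repeat_interleave(kv, repeats, dim=1)); the function name and tensor shapes are illustrative assumptions:

```python
import numpy as np

def expand_kv_heads(kv, n_heads):
    """Expand grouped KV heads to match the query head count.

    kv: array of shape (batch, n_kv_heads, seq, head_dim).
    Equivalent to torch.repeat_interleave(kv, n_heads // n_kv_heads, dim=1):
    each KV head is duplicated consecutively for its query-head group.
    """
    n_kv_heads = kv.shape[1]
    assert n_heads % n_kv_heads == 0, "query heads must be a multiple of KV heads"
    return np.repeat(kv, n_heads // n_kv_heads, axis=1)
```

After this expansion, K and V have the same head dimension as Q and the standard (non-GQA) attention path can be used unchanged.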
- Scaled training and validation to 8xH100 SXM under a 600-second wallclock budget.
  parameters: {"training_time_seconds": 600, "gpus": 8}
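A wallclock-budgeted run (as opposed to a fixed step count) can be sketched as a loop that checks elapsed time before each step. The function name and step_fn callback are assumptions; only the 600-second budget comes from the entry above:

```python
import time

def train_for_wallclock(step_fn, budget_seconds=600.0):
    """Run training steps until the wallclock budget is exhausted.

    Uses a monotonic clock so the budget is unaffected by system
    clock adjustments. Returns the number of completed steps.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps
```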
Sequence Length
  train_length: 4096
  eval_length: null
Novel Contributions
- Methodical hyperparameter search with fixed-seed paired comparison for reliable small-delta measurement
- Validation that Muon optimizer with lr=0.02, momentum=0.99, and warmdown=3000 improves BPB
- Use of ROPE_BASE=200000 to improve performance
- Training with sequence length 4096 to improve BPB
- Insight that optimal hyperparameters transfer poorly across compute budgets and must be re-tuned at target scale
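The ROPE_BASE=200000 contribution above can be illustrated with the standard rotary-embedding frequency computation; a larger base slows the rotation of the higher dimensions, which tends to suit the longer 4096-token sequences. The exact RoPE variant used is not specified in this document, so the formulation below is the common one and the function name is hypothetical:

```python
import numpy as np

def rope_inv_freq(head_dim, base=200000.0):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)
    for i in [0, head_dim / 2).

    A larger base stretches the lowest frequencies, giving slower-rotating
    dimensions for long-range positions; base=200000 is the value named in
    the contributions list (vs. the common default of 10000).
    """
    return base ** (-np.arange(0, head_dim, 2) / head_dim)
```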