PR #48

closed

[Submission] Warmdown Scheduling - 1.2430 BPB on 8×H100 SXM

by MajdiZamim
val_bpb: 1.2381
Architecture: Transformer
Optimizer:
Artifact Size: 15.85 MB

Training Techniques

LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"max_wallclock_seconds":600}
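The submission only specifies warmdown_iters=3000; the shape below is a minimal sketch, assuming a constant base LR followed by a cosine decay to zero over the final warmdown window (the base LR value and total step count are illustrative, not from the submission):

```python
import math

def warmdown_lr(step, max_steps, base_lr=1e-3, warmdown_iters=3000):
    """Constant LR, then cosine warmdown to zero over the last warmdown_iters.

    base_lr and max_steps are hypothetical; the submission only fixes
    warmdown_iters=3000.
    """
    warmdown_start = max_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    # Fraction of the way through the warmdown window, in [0, 1].
    frac = (step - warmdown_start) / warmdown_iters
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```

With a 10000-step horizon, the LR stays at base_lr until step 7000, halves by step 8500, and reaches zero at step 10000.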
Architecture
KV head count
Uses fewer KV heads than attention heads in the GPT-style model.
parameters: {"num_heads":8,"num_kv_heads":4,"layers":9,"model_dim":512}
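With num_heads=8 and num_kv_heads=4, each KV head is shared by a group of query heads, roughly halving the KV-cache size. A minimal sketch of the head mapping (the contiguous grouping is an assumption; the submission does not spell out the mapping):

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to its shared KV head (grouped-query attention).

    With 8 query heads and 4 KV heads, each KV head serves a group of
    8 // 4 = 2 consecutive query heads.
    """
    group_size = num_heads // num_kv_heads
    return q_head // group_size
```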
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Increased WARMDOWN_ITERS from the default 1200 to 3000 so the cosine warmdown actually triggers within the wallclock-limited training run.
  • Improved convergence by ensuring the learning rate decays during the final portion of training.
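To see why widening the window matters: the warmdown occupies the last warmdown_iters of the planned schedule, so if the wallclock limit stops the run short of that horizon, a narrow window may never be reached and the LR never decays. A toy calculation (the planned horizon and achieved step count are hypothetical numbers, not from the submission):

```python
def warmdown_steps_reached(planned_iters, achieved_iters, warmdown_iters):
    """Steps the run actually spends inside the warmdown window before the clock expires."""
    warmdown_start = planned_iters - warmdown_iters
    return max(0, achieved_iters - warmdown_start)

# Hypothetical: planned 10000 iters, wallclock expires at iter 8200.
narrow = warmdown_steps_reached(10000, 8200, 1200)  # window starts at 8800, never reached
wide = warmdown_steps_reached(10000, 8200, 3000)    # window starts at 7000, partly traversed
```

With the default 1200-iteration window the run ends 600 steps before decay would begin; the 3000-iteration window starts early enough that 1200 decay steps actually execute.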