val_bpb: 1.2381
Architecture: Transformer
Optimizer: —
Artifact Size: 15.85 MB
Training Techniques
LR Schedule: warmdown
parameters: {"warmdown_iters": 3000, "max_wallclock_seconds": 600}
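With these parameters, the warmdown schedule can be sketched as below. This is a minimal sketch: the constant-then-cosine shape and the planned total iteration count (`max_iters`) are assumptions for illustration, since the card only reports `warmdown_iters` and the wallclock cap.

```python
import math

def get_lr(it, max_iters, base_lr=1.0, warmdown_iters=3000):
    """Constant LR until the final warmdown_iters, then cosine decay to 0.
    max_iters and the exact decay shape are assumptions, not reported values."""
    start = max_iters - warmdown_iters
    if it < start:
        return base_lr
    progress = (it - start) / warmdown_iters  # goes 0 -> 1 over the warmdown
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

For example, with a hypothetical 10,000-iteration plan the LR stays at its base value through iteration 7,000, then decays smoothly to zero over the last 3,000 iterations.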
Architecture
KV head count: the GPT-style model uses fewer KV heads than attention heads (grouped-query attention; 4 KV heads serve 8 query heads).
parameters: {"num_heads": 8, "num_kv_heads": 4, "layers": 9, "model_dim": 512}
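The grouped-query attention implied by this config can be sketched as follows. This is a minimal NumPy illustration using the reported shapes (8 query heads, 4 KV heads, model_dim 512); the projection weights and the omitted causal mask are assumptions, not the model's actual code.

```python
import numpy as np

def gqa(x, wq, wk, wv, num_heads=8, num_kv_heads=4):
    """Sketch of grouped-query attention: consecutive query heads share a KV head.
    Causal masking is omitted for brevity."""
    B, T, D = x.shape
    hd = D // num_heads                      # head_dim = 512 // 8 = 64
    q = (x @ wq).reshape(B, T, num_heads, hd)
    k = (x @ wk).reshape(B, T, num_kv_heads, hd)
    v = (x @ wv).reshape(B, T, num_kv_heads, hd)
    group = num_heads // num_kv_heads        # 2 query heads per KV head
    k = np.repeat(k, group, axis=2)          # expand 4 KV heads to cover 8 query heads
    v = np.repeat(v, group, axis=2)
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))  # (B, H, T, hd)
    att = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    att = att - att.max(axis=-1, keepdims=True)              # stable softmax
    att = np.exp(att)
    att /= att.sum(axis=-1, keepdims=True)
    out = att @ v                                            # (B, H, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, D)
```

With 4 KV heads instead of 8, the K/V projections and the KV cache are half the size of full multi-head attention, at the cost of each KV head being shared by two query heads.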
Sequence Length
train_length: null
eval_length: null
Novel Contributions
- Increased WARMDOWN_ITERS from the default 1200 to 3000 so the cosine warmdown actually triggers within the wallclock-limited run.
- With the wider window, learning-rate decay reliably occurs over the final portion of training, improving convergence.