PR #2094

open

Non-record: WSD LR schedule on naive baseline (1×H100)

by Gitcatmeoww
val_bpb: 1.3430
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 14,279,445 bytes

Training Techniques

LR Schedule: WSD
parameters: {"stable_fraction":0.6,"decay_shape":"linear","min_lr_frac":0.1,"use_wsd":1}
Regularization: logit softcap
parameters: {"value":30}
Compression: zlib
level: null
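
For reference, a logit softcap with value 30 is commonly written as cap * tanh(logit / cap). The sketch below assumes that tanh form; this summary does not show the exact expression used in the PR, and the function name `softcap` is hypothetical.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to [-cap, cap].

    Assumption: the common tanh form, cap * tanh(logit / cap).
    The PR summary only states the cap value (30), not the formula.
    """
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    # Small logits pass through almost unchanged; large ones saturate near the cap.
    for raw in (-1000.0, -30.0, 0.0, 1.0, 30.0, 1000.0):
        print(f"{raw:>8.1f} -> {softcap(raw):.4f}")
```

Under this form the cap acts as a mild regularizer: logits well below 30 in magnitude are nearly untouched, while extreme values are squashed smoothly instead of being clipped.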

Novel Contributions

  • Adds an optional Warmup-Stable-Decay (WSD) learning-rate schedule patch gated by USE_WSD=1
  • Improves the repo-root naive baseline on a 1×H100 / 600s wallclock setup
  • Demonstrates a 3-seed mean post-quant val_bpb improvement over the naive baseline
  • Includes negative-result ablations from the same experiment loop
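
A minimal sketch of how a WSD multiplier gated by USE_WSD=1 might look, using the listed parameters (stable_fraction 0.6, linear decay, min_lr_frac 0.1). Several details are assumptions, not taken from the PR: `progress` is the fraction of the training budget consumed (steps or wallclock), the stable phase ends when `progress` reaches `stable_fraction`, `warmup_fraction` is a hypothetical parameter, and the 1.0 returned when the flag is off is only a placeholder for whatever schedule the baseline uses.

```python
import os


def wsd_multiplier(progress, warmup_fraction=0.05, stable_fraction=0.6, min_lr_frac=0.1):
    """Return the LR multiplier for a Warmup-Stable-Decay schedule.

    progress: fraction of the training budget consumed, in [0, 1].
    warmup_fraction: hypothetical; the PR summary lists no warmup length.
    """
    p = min(max(progress, 0.0), 1.0)
    # Warmup: ramp linearly from 0 up to the peak multiplier of 1.0.
    if warmup_fraction > 0.0 and p < warmup_fraction:
        return p / warmup_fraction
    # Stable: hold the peak LR until progress reaches stable_fraction.
    if p < stable_fraction or stable_fraction >= 1.0:
        return 1.0
    # Decay: go linearly from 1.0 down to min_lr_frac at progress == 1.
    decay_progress = (p - stable_fraction) / (1.0 - stable_fraction)
    return 1.0 - (1.0 - min_lr_frac) * decay_progress


def lr_multiplier(progress, environ=os.environ):
    """Use WSD only when USE_WSD=1; otherwise fall back to a placeholder."""
    if environ.get("USE_WSD", "0") == "1":
        return wsd_multiplier(progress)
    return 1.0  # placeholder: the baseline's own schedule is not shown here


if __name__ == "__main__":
    for p in (0.0, 0.025, 0.3, 0.6, 0.8, 1.0):
        print(p, round(lr_multiplier(p, {"USE_WSD": "1"}), 4))
```

Expressing the schedule in terms of `progress` keeps the sketch agnostic about whether the 600s budget is tracked by step count or by elapsed wallclock time.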