PR #2094

open

Non-record: WSD LR schedule on naive baseline (1×H100)

val_bpb

1.3430

Architecture

Transformer

Optimizer

AdamW

Artifact Size

14,279,445 bytes

Training Techniques

LR Schedule

WSD

parameters: {"stable_fraction":0.6,"decay_shape":"linear","min_lr_frac":0.1,"use_wsd":1}
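As a rough illustration of what these parameters imply, the sketch below computes a learning-rate multiplier for a WSD schedule with a linear decay. It is not the patch from this PR: the function name `wsd_lr_scale`, the `warmup_steps` argument, and the choice to measure progress in steps are assumptions made for the example. The run is wallclock-limited, so the actual patch may measure progress in elapsed time instead.

```python
import os


def wsd_lr_scale(step, total_steps, warmup_steps=0,
                 stable_fraction=0.6, min_lr_frac=0.1):
    """Return a multiplier in [min_lr_frac, 1.0] for the peak learning rate.

    Hypothetical helper: assumes the stable phase ends after
    `stable_fraction` of `total_steps` and the decay is linear.
    """
    # Warmup: ramp linearly up to the peak LR (warmup length is an assumption).
    if warmup_steps > 0 and step < warmup_steps:
        return (step + 1) / warmup_steps

    # Stable: hold the peak LR until stable_fraction of training is done.
    stable_end = round(stable_fraction * total_steps)
    if step < stable_end:
        return 1.0

    # Decay: go linearly from 1.0 down to min_lr_frac by the final step.
    decay_steps = max(1, total_steps - stable_end)
    progress = min(1.0, (step - stable_end) / decay_steps)
    return 1.0 - (1.0 - min_lr_frac) * progress


def lr_scale(step, total_steps):
    """Apply WSD only when USE_WSD=1; otherwise leave the LR unscaled.

    The fallback of 1.0 is a placeholder, not the baseline's real schedule.
    """
    if os.environ.get("USE_WSD", "0") == "1":
        return wsd_lr_scale(step, total_steps)
    return 1.0
```

With the values above and 1,000 total steps, the multiplier stays at 1.0 for the first 600 steps, then falls linearly to 0.1 at the final step.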

Regularization

logit softcap

parameters: {"value":30}
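The card does not state the softcap formula. A common form squashes logits with a scaled tanh so their magnitude never exceeds the cap. The sketch below assumes that form with a cap of 30; the function name `softcap` is hypothetical, and plain floats are used instead of tensors.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] using cap * tanh(logit / cap).

    Assumed formulation; small logits pass through almost unchanged,
    large ones saturate at the cap.
    """
    return cap * math.tanh(logit / cap)


# Apply elementwise to a small list of example logits.
capped = [softcap(x) for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```

Near zero the function is close to the identity, so the cap mainly affects outlier logits.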

Compression

zlib

level: null
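A `null` level is read here as "no explicit compression level recorded", which may correspond to zlib's default; that reading is an assumption. The sketch below shows a round trip with Python's standard `zlib` module at the default level and how a compressed size in bytes could be measured. The payload is a stand-in, not the PR's actual artifact.

```python
import zlib

# Stand-in payload; the real artifact would be the serialized model and code.
raw = b"example weights " * 4096

# No level argument: zlib.compress uses its default compression level.
compressed = zlib.compress(raw)

# The size that would be reported is the length of the compressed bytes.
artifact_size = len(compressed)

# Decompression must reproduce the original bytes exactly.
restored = zlib.decompress(compressed)
```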

Adds an optional Warmup-Stable-Decay (WSD) learning-rate schedule patch gated by USE_WSD=1
Improves on the repo-root naive baseline in a 1×H100 / 600s wallclock setup
Demonstrates a 3-seed mean post-quant val_bpb improvement over the naive baseline
Includes negative-result ablations from the same experiment loop