val_bpb
1.3430
Architecture
Transformer
Optimizer
AdamW
Artifact Size
14,279,445 bytes
Training Techniques
LR Schedule
WSD
parameters: {"stable_fraction":0.6,"decay_shape":"linear","min_lr_frac":0.1,"use_wsd":1}
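A minimal sketch of how a schedule with these parameters could be computed, not the patch itself. It assumes that progress is measured in optimizer steps, that decay begins once `stable_fraction` of the run has elapsed, and that warmup is counted inside that stable span. The names `wsd_lr`, `use_wsd_enabled`, `base_lr`, `warmup_steps`, and `total_steps` are hypothetical; the record gives no warmup length, so it defaults to zero here.

```python
def use_wsd_enabled(env):
    """Return True when the USE_WSD flag is set to "1" in an environment mapping."""
    return env.get("USE_WSD", "0") == "1"


def wsd_lr(step, total_steps, base_lr, warmup_steps=0,
           stable_fraction=0.6, min_lr_frac=0.1):
    """Learning rate at `step` for a Warmup-Stable-Decay schedule with linear decay.

    Assumption: the decay phase starts at stable_fraction * total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    decay_start = stable_fraction * total_steps

    if step < warmup_steps:
        # Linear warmup that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    if step <= decay_start:
        # Stable phase: hold the peak learning rate.
        return base_lr
    # Linear decay from base_lr down to min_lr_frac * base_lr at total_steps.
    frac = (step - decay_start) / (total_steps - decay_start)
    return base_lr * (1.0 - (1.0 - min_lr_frac) * frac)
```

With the recorded values, the rate stays at its peak for the first 60% of the run and then falls linearly to 10% of the peak by the final step. A wall-clock-limited run could equally drive the schedule from elapsed time rather than step count; that variant would replace `step / total_steps` with a time fraction.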
Regularization
logit softcap
parameters: {"value":30}
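For illustration, a logit softcap is commonly written as `cap * tanh(x / cap)`, which leaves small logits nearly unchanged and smoothly bounds large ones to the range (-cap, cap). The sketch below assumes that tanh form; the record only states the cap value of 30, and the function name `softcap` is hypothetical.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit with cap * tanh(logit / cap).

    Assumption: the tanh form of soft-capping; only the cap value comes from the record.
    """
    return cap * math.tanh(logit / cap)
```

Near zero the function is close to the identity, so typical logits are barely affected, while extreme values saturate toward plus or minus the cap.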
Compression
zlib
level: null
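A null level most plausibly means no explicit compression level was recorded. The sketch below assumes that reading and falls back to zlib's default level in that case. The helper name `compressed_size` and the idea of measuring a serialized payload this way are illustrative, not taken from the submission.

```python
import zlib


def compressed_size(payload, level=None):
    """Return the zlib-compressed size of `payload` in bytes.

    Assumption: level=None means "use zlib's default compression level".
    """
    if level is None:
        blob = zlib.compress(payload)
    else:
        blob = zlib.compress(payload, level)
    # Round-trip check: the compressed artifact must decompress to the original bytes.
    if zlib.decompress(blob) != payload:
        raise RuntimeError("zlib round trip failed")
    return len(blob)
```

For example, `compressed_size(bytes(1000))` is far smaller than 1000 because a run of zero bytes is highly compressible; real model weights typically compress much less.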
Novel Contributions
- Adds an optional Warmup-Stable-Decay (WSD) learning-rate schedule as a patch, enabled by setting USE_WSD=1
- Improves on the repo-root naive baseline in a 1×H100 / 600s wall-clock setup
- Demonstrates an improvement in mean post-quantization val_bpb over the naive baseline, averaged across 3 seeds
- Includes negative-result ablations from the same experiment loop