PR #1713

open

Non-record submission: baseline_sp1024, val_bpb=1.3479 (on a single H100), AbhiShet108

by AbhiShet108
val_bpb
1.3479
Architecture
Transformer
Optimizer
Artifact Size
14,672,726 bytes

Training Techniques

Other
Increased MATRIX_LR from 0.04 to 0.08 to test learning-rate sensitivity on a single H100.
parameters: {"MATRIX_LR":0.08}
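The override above can be sketched as a simple config merge. This is a hypothetical illustration, not the submission's actual training code: the names `BASELINE` and `with_overrides` are assumptions; only the `MATRIX_LR` key and its 0.04 → 0.08 values come from the PR.

```python
# Hypothetical sketch of the parameter override described in this submission.
# MATRIX_LR here stands for the learning rate applied to matrix parameters;
# all function and variable names are illustrative.

BASELINE = {"MATRIX_LR": 0.04}

def with_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of the base config with the given overrides applied."""
    cfg = dict(base)
    cfg.update(overrides)
    return cfg

# Doubling the matrix learning rate, as in the PR's parameters field.
cfg = with_overrides(BASELINE, {"MATRIX_LR": 0.08})
print(cfg["MATRIX_LR"])  # 0.08
```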

Novel Contributions

  • Learning-rate sensitivity experiment with MATRIX_LR doubled from 0.04 to 0.08
  • Single-H100 non-record baseline run for the 10min/16MB track
  • Documentation of the compute and storage constraints affecting the run