PR #751

open

Single A100 QAT Performance Fix (fresh review cycle)

by Shuvam-Banerji-SealView on GitHub
val_bpb
1.5252
Architecture
modded-nanogpt-derived Transformer
Optimizer
Artifact Size
15.77 MB

Training Techniques

Quantization
QAT
bits: null
scope: all
Other
other
Replaced torch.quantile-based clip factor estimation in CastedLinear with w.abs().amax(dim=1) to avoid Triton compiler performance penalties
parameters: null
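A minimal sketch of the clip-factor swap described above. The actual CastedLinear internals are not shown in this card, and the quantile parameterization (`q=1.0`) is an assumption for illustration; the point is that `amax` is a cheap fused reduction, while `torch.quantile` requires a sort that Triton codegen handles poorly:

```python
import torch

def clip_factor_amax(w: torch.Tensor) -> torch.Tensor:
    # New approach: per-output-row clip scale from the max absolute weight.
    # A single cheap reduction, friendly to torch.compile/Triton.
    return w.abs().amax(dim=1, keepdim=True)

def clip_factor_quantile(w: torch.Tensor, q: float = 1.0) -> torch.Tensor:
    # Previous approach (hypothetical q value): torch.quantile needs a sort,
    # which incurs the Triton performance penalty the PR avoids.
    return torch.quantile(w.abs(), q, dim=1, keepdim=True)

w = torch.randn(4, 16)
# At q=1.0 the quantile degenerates to the max, so the two agree exactly.
assert torch.allclose(clip_factor_amax(w), clip_factor_quantile(w))
```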
other
Reduced the gradient accumulation budget to 131K tokens so single-A100 QAT training fits within the 10-minute wallclock cap
parameters: {"gradient_accum_tokens":131000}
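The accumulation sizing reduces to simple arithmetic. The sequence length and micro-batch size below are hypothetical stand-ins (the card only reports the ~131K-token target), but they show how the accumulation step count would be derived:

```python
# Hypothetical run shape; only the 131K-token target comes from the PR card.
seq_len = 1024            # tokens per sequence (assumed)
micro_batch = 16          # sequences per micro-batch on one A100 (assumed)
target_tokens = 131_000   # gradient-accumulation budget from the PR

tokens_per_micro = seq_len * micro_batch                       # 16384
accum_steps = max(1, round(target_tokens / tokens_per_micro))  # 8
effective_tokens = accum_steps * tokens_per_micro              # 131072
```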
Weight Averaging
SWA
parameters: null
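The SWA schedule for this run is not specified in the card; a generic running-mean update over parameter snapshots looks like this:

```python
def swa_update(avg, new, n_averaged):
    # Incremental mean over snapshots: avg <- avg + (new - avg) / (n + 1).
    # Numerically equivalent to averaging all n_averaged + 1 snapshots.
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]

avg = [0.0, 0.0]
for step, snapshot in enumerate([[1.0, 2.0], [3.0, 4.0]]):
    avg = swa_update(avg, snapshot, step)
# avg is now the elementwise mean of the two snapshots: [2.0, 3.0]
```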
Evaluation
sliding window eval
parameters: null
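The exact sliding-window scheme is not spelled out in the card; one common convention (assumed here) scores each token exactly once, using overlapping windows so every scored token gets maximal left context:

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int):
    # Yield (start, end, score_from) spans: the model sees [start, end)
    # but only tokens in [score_from, end) contribute to the eval loss,
    # so each token is scored once with as much left context as possible.
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

For example, `sliding_window_spans(10, 4, 2)` scores tokens 0-4 from the first window, then 2 new tokens from each subsequent window.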
Compression
zlib
level: null
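Since the compression level is unreported (`level: null`), a roundtrip sketch with zlib's default level; the payload is a stand-in for the serialized artifact:

```python
import zlib

payload = b"\x00" * 4096          # stand-in for serialized model weights
compressed = zlib.compress(payload)  # default level, as none is reported
assert zlib.decompress(compressed) == payload
assert len(compressed) < len(payload)  # highly compressible stand-in
```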

Novel Contributions

  • Switched CastedLinear clip factor estimation from torch.quantile to w.abs().amax(dim=1) to avoid a severe Triton performance penalty
  • Adjusted gradient accumulation sizing to 131K tokens so QAT training fits within the 10-minute single-A100 wallclock budget
  • Reported final submission val_bpb from the post-export sliding-window roundtrip metric rather than the intermediate train-time checkpoint metric
  • Aligned README and submission reporting/runtime wording with the measured single-A100 run