PR #751

open

Single A100 QAT Performance Fix (fresh review cycle)

by Shuvam-Banerji-SealView on GitHub
val_bpb
1.5252
Architecture
modded-nanogpt-derived Transformer
Optimizer
Artifact Size
15.77 MB

Training Techniques

Quantization
QAT
bits: null
scope: all
Other
other
Replaced torch.quantile-based clip factor estimation in CastedLinear with w.abs().amax(dim=1) to avoid Triton compiler performance penalties
parameters: null
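A minimal sketch of the clip-factor swap described above. The actual CastedLinear internals are not shown in this card, and the quantile parameterization (`q=1.0`) is an assumption for illustration; the point is that `amax` is a cheap fused reduction, while `torch.quantile` requires a sort that Triton codegen handles poorly:

```python
import torch

def clip_factor_amax(w: torch.Tensor) -> torch.Tensor:
    # New approach: per-output-row clip scale from the max absolute weight.
    # A single cheap reduction, friendly to torch.compile/Triton.
    return w.abs().amax(dim=1, keepdim=True)

def clip_factor_quantile(w: torch.Tensor, q: float = 1.0) -> torch.Tensor:
    # Previous approach (hypothetical q value): torch.quantile needs a sort,
    # which incurs the Triton performance penalty the PR avoids.
    return torch.quantile(w.abs(), q, dim=1, keepdim=True)

w = torch.randn(4, 16)
# At q=1.0 the quantile degenerates to the max, so the two agree exactly.
assert torch.allclose(clip_factor_amax(w), clip_factor_quantile(w))
```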
other
Reduced the gradient accumulation budget to 131K tokens so single-A100 QAT training fits within the 10-minute wallclock cap
parameters: {"gradient_accum_tokens":131000}
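The accumulation sizing reduces to simple arithmetic. The sequence length and micro-batch size below are hypothetical stand-ins (the card only reports the ~131K-token target), but they show how the accumulation step count would be derived:

```python
# Hypothetical run shape; only the 131K-token target comes from the PR card.
seq_len = 1024            # tokens per sequence (assumed)
micro_batch = 16          # sequences per micro-batch on one A100 (assumed)
target_tokens = 131_000   # gradient-accumulation budget from the PR

tokens_per_micro = seq_len * micro_batch                       # 16384
accum_steps = max(1, round(target_tokens / tokens_per_micro))  # 8
effective_tokens = accum_steps * tokens_per_micro              # 131072
```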
Weight Averaging
SWA
parameters: null
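The SWA schedule for this run is not specified in the card; a generic running-mean update over parameter snapshots looks like this:

```python
def swa_update(avg, new, n_averaged):
    # Incremental mean over snapshots: avg <- avg + (new - avg) / (n + 1).
    # Numerically equivalent to averaging all n_averaged + 1 snapshots.
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]

avg = [0.0, 0.0]
for step, snapshot in enumerate([[1.0, 2.0], [3.0, 4.0]]):
    avg = swa_update(avg, snapshot, step)
# avg is now the elementwise mean of the two snapshots: [2.0, 3.0]
```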
Evaluation
sliding window eval
parameters: null
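The exact sliding-window scheme is not spelled out in the card; one common convention (assumed here) scores each token exactly once, using overlapping windows so every scored token gets maximal left context:

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int):
    # Yield (start, end, score_from) spans: the model sees [start, end)
    # but only tokens in [score_from, end) contribute to the eval loss,
    # so each token is scored once with as much left context as possible.
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

For example, `sliding_window_spans(10, 4, 2)` scores tokens 0-4 from the first window, then 2 new tokens from each subsequent window.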
Compression
zlib
level: null
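Since the compression level is unreported (`level: null`), a roundtrip sketch with zlib's default level; the payload is a stand-in for the serialized artifact:

```python
import zlib

payload = b"\x00" * 4096          # stand-in for serialized model weights
compressed = zlib.compress(payload)  # default level, as none is reported
assert zlib.decompress(compressed) == payload
assert len(compressed) < len(payload)  # highly compressible stand-in
```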

Novel Contributions

  • Switched CastedLinear clip factor estimation from torch.quantile to w.abs().amax(dim=1) to avoid a severe Triton performance penalty
  • Adjusted gradient accumulation sizing to 131K tokens so QAT training fits within the 10-minute single-A100 wallclock budget
  • Reported final submission val_bpb from the post-export sliding-window roundtrip metric rather than the intermediate train-time checkpoint metric
  • Aligned README and submission reporting/runtime wording with the measured single-A100 run