PR #707

open

Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v2]

by Shuvam-Banerji-SealView on GitHub
val_bpb
1.4078
Architecture
modded-nanogpt-derived Transformer
Optimizer
Artifact Size
15.77 MB

Training Techniques

Quantization
QAT
bits: 6
scope: all
Weight Averaging
SWA
parameters: null
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: null
LR Schedule
LR scheduling tuned for single-device run
parameters: {"gradient_accum_tokens":131000,"iterations":2600}
Other
other
Replaced torch.quantile with clip-factor estimation based on w.abs().amax(dim=1) and w.amax().clamp_min, avoiding a severe Triton compiler performance penalty
parameters: null
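The torch.quantile replacement above can be sketched as follows. The PR does not give the exact formula, so the way the per-row and global amax values are combined here (a mean of ratios rescaled by the global peak) is an illustrative assumption; only the use of `w.abs().amax(dim=1)` and a `clamp_min` guard is taken from the PR text.

```python
import torch

def clip_factor_quantile(w: torch.Tensor, q: float = 0.999) -> torch.Tensor:
    # Baseline: quantile-based clip threshold. torch.quantile requires a sort,
    # which is the operation the PR reports as slow under the Triton compiler.
    return torch.quantile(w.abs().flatten(), q)

def clip_factor_amax(w: torch.Tensor) -> torch.Tensor:
    # Replacement sketch (combination formula is a guess, not from the PR):
    # estimate the clip factor from per-row absolute maxima relative to the
    # global maximum, using only cheap reduction kernels.
    row_amax = w.abs().amax(dim=1)                 # per-row peak magnitude
    global_amax = w.abs().amax().clamp_min(1e-8)   # guard against division by zero
    return (row_amax / global_amax).mean() * global_amax
```

Both functions return a scalar tensor usable as a clipping bound; the amax variant trades the exact quantile for reductions that compile to fast kernels.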

Novel Contributions

  • Adjusted hyperparameters from multi-device scales to single-A100 scales so the LR schedule completes properly
  • Replaced torch.quantile with amax-based clip factor estimation to avoid a severe Triton compiler performance penalty
  • Reduced gradient accumulation sizing to 131K tokens so training reaches 2600 iterations within the time budget
  • Addressed prior review feedback on unused dependencies and imports
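The 6-bit QAT listed under Training Techniques can be illustrated with a minimal symmetric fake-quantization step. This is a generic sketch of the technique, not code from the submission; the function name and the straight-through-estimator formulation are assumptions.

```python
import torch

def fake_quant_6bit(w: torch.Tensor, clip: torch.Tensor) -> torch.Tensor:
    # Symmetric 6-bit fake quantization: clip, snap to one of the signed
    # 6-bit levels, then dequantize. The straight-through estimator passes
    # gradients through the rounding step unchanged, which is what lets
    # QAT train with non-differentiable quantization in the forward pass.
    qmax = 2 ** (6 - 1) - 1                  # 31: largest signed 6-bit level
    scale = clip / qmax
    w_c = w.clamp(-clip, clip)               # clip to the estimated range
    w_q = (w_c / scale).round() * scale      # quantize-dequantize
    return w_c + (w_q - w_c).detach()        # STE: identity gradient w.r.t. w_c
```

With `scope: all` as in the metadata, such a step would be applied to every weight matrix in the forward pass while the master weights stay in full precision.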