PR #527

open

Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record)

by Shuvam-Banerji-SealView on GitHub
val_bpb: 1.4078
Architecture: Transformer
Optimizer:
Artifact Size: 15.77 MB

Training Techniques

  • Quantization: QAT (bits: 6, scope: null)
  • Weight Averaging: SWA (parameters: null)
  • Evaluation: sliding window eval (parameters: null)
  • LR Schedule: custom tuning from multi-device to single-device scale (parameters: null)
  • Other: replaced torch.quantile with w.abs().amax(dim=1).clamp_min to avoid a 30x compiler performance penalty in Triton (parameters: null)
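The amax-based replacement for torch.quantile can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name, the epsilon value, and the assumption of a symmetric per-row scale are all illustrative.

```python
import torch

def rowwise_scale(w: torch.Tensor, bits: int = 6, eps: float = 1e-8) -> torch.Tensor:
    # Per-row quantization scale from the absolute maximum. Unlike
    # torch.quantile (which requires a sort), amax lowers to a cheap
    # reduction kernel under torch.compile/Triton; clamp_min guards
    # against division by zero on all-zero rows.
    qmax = 2 ** (bits - 1) - 1  # 31 for symmetric int6
    return w.abs().amax(dim=1, keepdim=True).clamp_min(eps) / qmax
```

The key design point is that a max reduction is a single fused pass over each row, whereas quantile-based clipping forces a sort, which is where the reported compiler slowdown came from.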

Novel Contributions

  • Tuned hyperparameters from the multi-device configuration down to single-device (1x A100) scale to preserve a proper learning-rate schedule and fit within the 10-minute training constraint
  • Replaced torch.quantile with w.abs().amax(dim=1).clamp_min in CastedLinear to bypass a severe 30x GPU performance penalty caused by the Triton compiler
  • Constrained the gradient-accumulation batch to 131K tokens, allowing 2600 iterations on the descending part of the LR schedule and a proper decay
  • Used int6 quantization-aware training (QAT) to reduce artifact size while maintaining performance
  • Terminated training gracefully into SWA, followed by a final sliding-window evaluation
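The token budget implied by the gradient-accumulation constraint works out as a quick sanity check (assuming "131K" means 2**17 = 131,072 tokens, which the PR does not state explicitly):

```python
tokens_per_step = 131_072  # assumed gradient-accumulation batch (2**17 tokens)
steps = 2600               # iterations on the descending LR schedule
total = tokens_per_step * steps
print(total)               # total tokens seen during the decay phase
```

Around 341M tokens over the decay phase, under that assumption.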
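The int6 QAT mentioned above is typically implemented as fake quantization with a straight-through estimator. A minimal sketch, not the PR's implementation (the function name and epsilon are illustrative, and the per-row symmetric scheme is an assumption consistent with the amax-based scale):

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    # Symmetric 6-bit fake quantization with a straight-through estimator:
    # the forward pass sees quantized weights, while the backward pass
    # treats rounding as identity so gradients reach the FP weights.
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # STE: forward = w_q, backward = identity
```

Because the rounding error is added via a detached residual, the optimizer keeps updating full-precision weights while the loss reflects the 64-level weight grid that ships in the 15.77 MB artifact.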
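The graceful termination into SWA can be sketched with PyTorch's built-in averaging helper. The toy model, step counts, and averaging window below are illustrative, not taken from the submission:

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
swa_model = AveragedModel(model)  # maintains a running average of weights

for step in range(100):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step >= 80:  # fold only the tail of training into the average
        swa_model.update_parameters(model)

# swa_model then holds the averaged weights used for final evaluation
```

The averaged model, rather than the last iterate, is what would feed the final sliding-window evaluation.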