PR #1930

open

Non-record: GPT + QAT int8 — 1.2600 BPB (1x RTX 4090, 60min)

by CarlosItp
val_bpb: 1.2600
Architecture: GPT-2
Optimizer: —
Artifact Size: 15.86 MB

Training Techniques

Quantization
  method: STE QAT
  bits: 8
  scope: model weights

Novel Contributions

  • GPT-2 baseline trained with int8 quantization-aware training (QAT)
  • Straight-through estimator (STE) applied in CastedLinear.forward() to simulate int8 weights
  • Training the model to be robust to post-training int8 quantization
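
The STE-based fake quantization described above can be sketched as follows. This is a hypothetical reconstruction, not the PR's actual code: the class name `CastedLinear` comes from the contribution list, but the per-tensor symmetric scaling and the exact forward logic are assumptions. The key idea is that the forward pass uses weights rounded to an int8 grid, while `detach()` makes the rounding invisible to autograd, so gradients flow to the full-precision weights as if no quantization had occurred.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CastedLinear(nn.Linear):
    """Linear layer with simulated int8 weight quantization (QAT sketch).

    Forward: weights are fake-quantized to a symmetric int8 grid.
    Backward: the straight-through estimator (STE) treats the
    round/clamp as identity, so the full-precision weights receive
    ordinary gradients. (Hypothetical sketch of the PR's approach.)
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Per-tensor symmetric scale mapping max |w| to the int8 limit 127
        # (assumed granularity; per-channel scales are another common choice).
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        # Fake-quantize: round to the int8 grid, then dequantize back.
        w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
        # STE: forward sees w_q, backward sees d(w_ste)/dw = identity,
        # because the (w_q - w) correction is detached from the graph.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)
```

After training this way, the weights can be exported as true int8 tensors plus scales (consistent with the ~15.86 MB artifact reported above), since the model has already learned to tolerate the rounding error.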