PR #145
Closed · Non-record
QAT ablation: int8 QAT overhead exceeds quantization-gap recovery
by mrdavtan
val_bpb
1.2052
Architecture
Transformer
Optimizer
—
Artifact Size
15,868,103 bytes
Training Techniques
Quantization
QAT
bits: 8
scope: per-row weights
Evaluation
sliding window eval
parameters: {"stride":64}
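The stride-64 sliding-window evaluation can be sketched as follows. Only the stride comes from the PR; the window length of 256 and the helper name are illustrative assumptions. Each window advances by `stride` tokens and only its final `stride` tokens are scored (the first window scores everything), so every token is evaluated with at least `context - stride` tokens of left context:

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Plan overlapping eval windows as (start, end, n_scored) triples.
    Loss is accumulated only over the last `n_scored` tokens of each
    window, so every token is scored exactly once."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

Summing `n_scored` across windows recovers the full token count; dividing the accumulated negative log-likelihood (in nats) by `ln 2` times the number of bytes evaluated gives the reported bits-per-byte metric.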
LR Schedule
warmdown
parameters: {"warmdown_steps":1200}
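A warmdown schedule with `warmdown_steps: 1200` presumably holds the learning rate flat, then decays it to zero over the final 1,200 steps; a minimal sketch, assuming a linear decay shape (only the step count comes from the PR):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=1200):
    """Constant LR until `total_steps - warmdown_steps`, then linear
    decay to zero at the final step (linear shape is an assumption)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```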
Architecture
tied embeddings
Uses tied input/output embeddings as part of the baseline 9L×512d architecture.
parameters: {"layers":9,"dimensions":512}
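Tied embeddings mean the output head reuses the token-embedding matrix instead of learning a separate `vocab × d` projection; a minimal sketch (the vocabulary size and token ids are illustrative, only `d = 512` comes from the card):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 1000, 512
E = rng.normal(size=(vocab, d))   # token embedding table
x = E[[3, 17]]                    # embed tokens 3 and 17
logits = x @ E.T                  # output head reuses E (tied weights)
```

Tying saves a `vocab × d` parameter matrix, which matters at this small model scale where the embedding table is a large fraction of the artifact size.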
Other
other
Straight-through estimator fake-quantization inserted into linear layers during training to match export-time int8 quantization exactly.
parameters: {"qat_start_step":6000,"qat_fraction":0.3}
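The fake-quantization op can be sketched as below; the 99.9 clipping percentile is an illustrative placeholder (the PR matches the export percentile but does not state its value here):

```python
import numpy as np

def fake_quant_per_row(w, bits=8, pct=99.9):
    """Quantize-dequantize each weight row with a percentile-clipped,
    symmetric per-row scale, mirroring export-time int8 quantization.
    (pct=99.9 is an assumed placeholder.)"""
    qmax = 2 ** (bits - 1) - 1                                  # 127 for int8
    scale = np.percentile(np.abs(w), pct, axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)                            # guard zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)           # int grid
    return q * scale                                            # dequantize
```

During training the straight-through estimator treats this op as the identity in the backward pass (e.g. `w + (fake_quant(w) - w).detach()` in PyTorch), so gradients update the full-precision weights while the forward pass sees exactly the int8-dequantized values.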
Novel Contributions
- Clean ablation of per-row int8 quantization-aware training on the baseline 9L×512d model.
- Exact percentile-matching QAT implementation using the same clipping percentile and per-row scale as export quantization.
- Measured that torch.quantile-based QAT adds about 20% per-step overhead, reducing total training steps under the 10-minute budget.
- Identified that int8 QAT did not recover enough quantization gap to offset the lost training progress.
- Observed a torch.compile graph priming pitfall where pre-compiling both QAT and non-QAT paths slowed the non-QAT forward pass.
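The trade-off behind the overhead finding can be made concrete with hypothetical timings; the step time is illustrative, and only the ~20% overhead figure comes from the PR:

```python
budget_s = 600.0                  # fixed 10-minute training budget
base_step_s = 0.05                # hypothetical baseline seconds per step
qat_overhead = 0.20               # measured ~20% per-step QAT overhead

base_steps = budget_s / base_step_s                        # steps without QAT
qat_steps = budget_s / (base_step_s * (1 + qat_overhead))  # steps with QAT
lost_fraction = 1 - qat_steps / base_steps                 # ~1/6 fewer steps
```

Under these numbers QAT must close more of the int8 quantization gap than the loss increase caused by roughly one-sixth fewer optimizer steps; the PR's result is that it did not.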