PR #145
Closed · Non-record
QAT ablation: int8 QAT overhead exceeds quantization-gap recovery
by mrdavtan
val_bpb
1.2052
Architecture
Transformer
Optimizer
—
Artifact Size
15,868,103 bytes
Training Techniques
Quantization
QAT
bits: 8
scope: per-row weights
Evaluation
sliding window eval
parameters: {"stride":64}
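The stride-64 sliding-window evaluation can be sketched as follows. Only the stride comes from the PR; the window length of 256 and the helper name are illustrative assumptions. Each window advances by `stride` tokens and only its final `stride` tokens are scored (the first window scores everything), so every token is evaluated with at least `context - stride` tokens of left context:

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Plan overlapping eval windows as (start, end, n_scored) triples.
    Loss is accumulated only over the last `n_scored` tokens of each
    window, so every token is scored exactly once."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

Summing `n_scored` across windows recovers the full token count; dividing the accumulated negative log-likelihood (in nats) by `ln 2` times the number of bytes evaluated gives the reported bits-per-byte metric.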
LR Schedule
warmdown
parameters: {"warmdown_steps":1200}
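A warmdown schedule with `warmdown_steps: 1200` presumably holds the learning rate flat, then decays it to zero over the final 1,200 steps; a minimal sketch, assuming a linear decay shape (only the step count comes from the PR):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=1200):
    """Constant LR until `total_steps - warmdown_steps`, then linear
    decay to zero at the final step (linear shape is an assumption)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```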
Architecture
tied embeddings
Uses tied input/output embeddings as part of the baseline 9L×512d architecture.
parameters: {"layers":9,"dimensions":512}
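Tied embeddings mean the output head reuses the token-embedding matrix instead of learning a separate `vocab × d` projection; a minimal sketch (the vocabulary size and token ids are illustrative, only `d = 512` comes from the card):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 1000, 512
E = rng.normal(size=(vocab, d))   # token embedding table
x = E[[3, 17]]                    # embed tokens 3 and 17
logits = x @ E.T                  # output head reuses E (tied weights)
```

Tying saves a `vocab × d` parameter matrix, which matters at this small model scale where the embedding table is a large fraction of the artifact size.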
Other
other
Straight-through estimator fake-quantization inserted into linear layers during training to match export-time int8 quantization exactly.
parameters: {"qat_start_step":6000,"qat_fraction":0.3}
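The fake-quantization op can be sketched as below; the 99.9 clipping percentile is an illustrative placeholder (the PR matches the export percentile but does not state its value here):

```python
import numpy as np

def fake_quant_per_row(w, bits=8, pct=99.9):
    """Quantize-dequantize each weight row with a percentile-clipped,
    symmetric per-row scale, mirroring export-time int8 quantization.
    (pct=99.9 is an assumed placeholder.)"""
    qmax = 2 ** (bits - 1) - 1                                  # 127 for int8
    scale = np.percentile(np.abs(w), pct, axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)                            # guard zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)           # int grid
    return q * scale                                            # dequantize
```

During training the straight-through estimator treats this op as the identity in the backward pass (e.g. `w + (fake_quant(w) - w).detach()` in PyTorch), so gradients update the full-precision weights while the forward pass sees exactly the int8-dequantized values.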
Novel Contributions
- Clean ablation of per-row int8 quantization-aware training on the baseline 9L×512d model.
- Exact percentile-matching QAT implementation using the same clipping percentile and per-row scale as export quantization.
- Measured that torch.quantile-based QAT adds about 20% per-step overhead, reducing total training steps under the 10-minute budget.
- Identified that int8 QAT did not recover enough quantization gap to offset the lost training progress.
- Observed a torch.compile graph priming pitfall where pre-compiling both QAT and non-QAT paths slowed the non-QAT forward pass.
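The trade-off behind the overhead finding can be made concrete with hypothetical timings; the step time is illustrative, and only the ~20% overhead figure comes from the PR:

```python
budget_s = 600.0                  # fixed 10-minute training budget
base_step_s = 0.05                # hypothetical baseline seconds per step
qat_overhead = 0.20               # measured ~20% per-step QAT overhead

base_steps = budget_s / base_step_s                        # steps without QAT
qat_steps = budget_s / (base_step_s * (1 + qat_overhead))  # steps with QAT
lost_fraction = 1 - qat_steps / base_steps                 # ~1/6 fewer steps
```

Under these numbers QAT must close more of the int8 quantization gap than the loss increase caused by roughly one-sixth fewer optimizer steps; the PR's result is that it did not.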