PR #360

open

Non-record: QAT & EMA negative results on SOTA stack (val_bpb=1.1426)

by MultiFe22
val_bpb: 1.1426
Artifact Size: 15.99 MB

Training Techniques

Quantization
  • STE QAT (bits: 5, scope: MLP)
  • STE QAT (bits: 6, scope: attention)

Weight Averaging
  • EMA (parameters: {"decay": 0.9999, "start_step": 500})

Other
  • QAT warmup delay before enabling fake quantization (parameters: {"warmup_steps": 500})
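For reference, the forward pass behind STE QAT, including the warmup delay, can be sketched in plain Python. This is a minimal illustration, not the PR's implementation: in practice this would be a PyTorch autograd function whose backward is the identity (the straight-through estimator), and the function and parameter names here are assumptions.

```python
import math


def fake_quantize(weights, bits=5):
    """Symmetric per-tensor fake quantization: round weights onto a
    (2**bits - 1)-level signed grid, then map back to float.
    Under the straight-through estimator (STE), the backward pass
    treats this rounding as the identity so gradients flow through."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 15 for 5-bit signed
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:  # all-zero tensor: nothing to quantize
        return list(weights)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qat_forward(weights, step, bits=5, warmup_steps=500):
    """QAT warmup delay: train in full precision for the first
    `warmup_steps` steps, then enable fake quantization."""
    if step < warmup_steps:
        return list(weights)
    return fake_quantize(weights, bits=bits)
```

The warmup delay lets the weights settle near a reasonable scale before the quantization grid is imposed, which is the usual motivation for delaying fake quantization.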

Novel Contributions

  • Baseline reproduction of PR #180 at val_bpb 1.1426
  • Ablation showing that QAT improves artifact compression but reduces training throughput enough to hurt validation performance under the 10-minute budget
  • Ablation showing that EMA causes a severe throughput loss because the averaged weights are cloned to CPU on every step
  • Demonstration that step-budget-constrained training makes throughput-costly techniques counterproductive
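To make the EMA finding concrete, the weight-averaging update itself is just an in-place lerp and is cheap when done on-device; the cost in the ablated run came from the per-step CPU cloning, not the arithmetic. A minimal pure-Python sketch (names are illustrative, not the PR's code):

```python
def ema_update(avg, params, decay=0.9999):
    """Exponential moving average of parameters, updated in place:
    avg <- decay * avg + (1 - decay) * params.
    The PR's configuration starts averaging at step 500 and uses
    decay 0.9999; done in place on-device this is a trivial cost,
    whereas cloning the averaged tensors to CPU every step is not."""
    for i, p in enumerate(params):
        avg[i] = decay * avg[i] + (1.0 - decay) * p
    return avg
```

With decay 0.9999 the average tracks roughly the last ~10k steps of weights, which is why `start_step` matters little compared to the per-step overhead under a 10-minute budget.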