PR #360
Non-record: QAT & EMA negative results on SOTA stack (val_bpb=1.1426)
by MultiFe22
val_bpb: 1.1426
Architecture: —
Optimizer: —
Artifact Size: 15.99 MB
Training Techniques
Quantization
- STE QAT, bits: 5, scope: MLP
- STE QAT, bits: 6, scope: attention
Weight Averaging
- EMA, parameters: {"decay":0.9999,"start_step":500}
Other
- QAT warmup delay before enabling fake quantization, parameters: {"warmup_steps":500}
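The PR does not include its implementation, but a minimal sketch of straight-through-estimator (STE) fake quantization with the listed warmup delay could look like the following. The per-tensor symmetric quantization scheme and all function names here are assumptions, not the author's code:

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator: the forward pass sees quantized weights, while the
    # backward pass lets gradients flow through unchanged (detach trick).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()

def maybe_quantize(w: torch.Tensor, step: int, bits: int,
                   warmup_steps: int = 500) -> torch.Tensor:
    # QAT warmup delay: train in full precision for the first
    # warmup_steps, then switch on fake quantization.
    return w if step < warmup_steps else fake_quant_ste(w, bits)
```

Under this scheme the MLP weights would be passed through `maybe_quantize(w, step, bits=5)` and attention weights with `bits=6` on every forward pass, which is the extra per-step work the throughput ablation measures.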
Novel Contributions
- Baseline reproduction of PR #180 at val_bpb 1.1426
- Ablation showing that QAT improves artifact compression but reduces training throughput enough to hurt validation performance under the 10-minute budget
- Ablation showing that EMA causes severe throughput loss due to CPU cloning every step
- Demonstration that step-budget-constrained training makes throughput-costly techniques counterproductive
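The EMA throughput finding above hinges on where the shadow copy lives. A minimal sketch of the parameter EMA described in the metadata (decay 0.9999, start step 500) is below; the class and attribute names are assumptions, and keeping the shadow copy on the model's device with in-place updates is the standard way to avoid the per-step CPU clone the ablation identifies:

```python
import torch

class EMA:
    # Exponential moving average of model parameters. The shadow copy
    # stays on the same device as the model and is updated in place,
    # avoiding a per-step clone to CPU.
    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for name, p in model.named_parameters():
            # shadow <- decay * shadow + (1 - decay) * p
            self.shadow[name].mul_(self.decay).add_(p, alpha=1 - self.decay)
```

With `start_step=500`, `update` would only be called once training passes step 500, mirroring the QAT warmup delay.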