PR #360

open

Non-record: QAT & EMA negative results on SOTA stack (val_bpb=1.1426)

by MultiFe22
val_bpb: 1.1426
Artifact Size: 15.99 MB

Training Techniques

Quantization
  • STE QAT (bits: 5, scope: MLP)
  • STE QAT (bits: 6, scope: attention)

Weight Averaging
  • EMA (parameters: {"decay": 0.9999, "start_step": 500})

Other
  • QAT warmup delay before enabling fake quantization (parameters: {"warmup_steps": 500})
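For reference, the forward pass behind STE QAT, including the warmup delay, can be sketched in plain Python. This is a minimal illustration, not the PR's implementation: in practice this would be a PyTorch autograd function whose backward is the identity (the straight-through estimator), and the function and parameter names here are assumptions.

```python
import math


def fake_quantize(weights, bits=5):
    """Symmetric per-tensor fake quantization: round weights onto a
    (2**bits - 1)-level signed grid, then map back to float.
    Under the straight-through estimator (STE), the backward pass
    treats this rounding as the identity so gradients flow through."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 15 for 5-bit signed
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:  # all-zero tensor: nothing to quantize
        return list(weights)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qat_forward(weights, step, bits=5, warmup_steps=500):
    """QAT warmup delay: train in full precision for the first
    `warmup_steps` steps, then enable fake quantization."""
    if step < warmup_steps:
        return list(weights)
    return fake_quantize(weights, bits=bits)
```

The warmup delay lets the weights settle near a reasonable scale before the quantization grid is imposed, which is the usual motivation for delaying fake quantization.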

Novel Contributions

  • Baseline reproduction of PR #180 at val_bpb 1.1426
  • Ablation showing that QAT improves artifact compression but reduces training throughput enough to hurt validation performance under the 10-minute budget
  • Ablation showing that EMA causes a severe throughput loss because the averaged weights are cloned to CPU on every step
  • Demonstration that step-budget-constrained training makes throughput-costly techniques counterproductive
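To make the EMA finding concrete, the weight-averaging update itself is just an in-place lerp and is cheap when done on-device; the cost in the ablated run came from the per-step CPU cloning, not the arithmetic. A minimal pure-Python sketch (names are illustrative, not the PR's code):

```python
def ema_update(avg, params, decay=0.9999):
    """Exponential moving average of parameters, updated in place:
    avg <- decay * avg + (1 - decay) * params.
    The PR's configuration starts averaging at step 500 and uses
    decay 0.9999; done in place on-device this is a trivial cost,
    whereas cloning the averaged tensors to CPU every step is not."""
    for i, p in enumerate(params):
        avg[i] = decay * avg[i] + (1.0 - decay) * p
    return avg
```

With decay 0.9999 the average tracks roughly the last ~10k steps of weights, which is why `start_step` matters little compared to the per-step overhead under a 10-minute budget.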