PR #989
openQAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated)
by alexanderaperry-arch
val_bpb
1.1402
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,787,003 bytes
Training Techniques
Quantization
QAT
bits: 6
scope: all
STE QAT
bits: 6
scope: all
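A minimal numpy sketch of what 6-bit fake quantization with a straight-through estimator (STE) typically looks like; the symmetric per-tensor scaling scheme is an assumption for illustration, not taken from this PR's code.

```python
import numpy as np

def fake_quantize(w, bits=6):
    # Symmetric per-tensor fake quantization: snap weights to a
    # (2**bits)-level grid but keep them in float. Under an STE, the
    # backward pass treats round() as the identity, so full-precision
    # gradients flow through the quantized forward pass.
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```

With `scope: all`, a scheme like this would be applied to every weight tensor in the model rather than a subset of layers.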
Weight Averaging
SWA
parameters: {"start_step":4550}
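SWA as configured here keeps an equal-weight running average of checkpoints from `start_step` (4550) onward; a minimal sketch, with small arrays standing in for real weight tensors:

```python
import numpy as np

SWA_START = 4550  # start_step from this PR's SWA parameters

def swa_update(avg, cur, n_averaged):
    # Incremental equal-weight average: after the call, avg is the
    # mean of n_averaged + 1 checkpoints.
    return avg + (cur - avg) / (n_averaged + 1)

def run_swa(checkpoints, start=SWA_START):
    # checkpoints: iterable of (step, params) pairs seen during training.
    avg, n = None, 0
    for step, params in checkpoints:
        if step < start:
            continue
        p = np.asarray(params, dtype=float)
        avg = p.copy() if avg is None else swa_update(avg, p, n)
        n += 1
    return avg
```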
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adamw":true}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
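Muon's distinguishing step is orthogonalizing each 2-D momentum matrix with a quintic Newton-Schulz iteration before applying the update; the sketch below uses the coefficients from the public reference implementation, while the momentum/weight-decay plumbing and the AdamW handling of non-matrix parameters (the likely reading of `adamw: true`) are omitted.

```python
import numpy as np

def newtonschulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: drives the singular values of G
    # toward 1 (approximate orthogonalization) using only matmuls.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)        # Frobenius normalization
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Each iteration applies the polynomial f(x) = 3.4445x - 4.7750x³ + 2.0315x⁵ to the singular values while leaving the singular vectors untouched.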
Architecture
MLP3x
3x MLP width in the transformer stack
parameters: {"multiplier":3,"hidden_dim":1536}
GQA
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
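With `heads: 8` and `kv_heads: 4`, each pair of query heads shares one K/V head, halving the KV projection and cache size; a numpy sketch of the head mapping (dimensions other than the two head counts are illustrative):

```python
import numpy as np

H, KV, T, D = 8, 4, 5, 16   # heads / kv_heads from the card; T, D illustrative

def gqa_attention_scores(q, k):
    # q: (H, T, D) per-query-head activations; k: (KV, T, D) shared K heads.
    # Each consecutive group of H // KV query heads reads the same K head.
    k_shared = np.repeat(k, H // KV, axis=0)         # (H, T, D)
    return q @ k_shared.transpose(0, 2, 1) / np.sqrt(D)
```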
weight tying
Tied input and output embeddings
parameters: null
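Weight tying reuses the (vocab, d_model) input embedding as the output projection, which is significant under a hard artifact-size cap; a minimal sketch with illustrative dimensions:

```python
import numpy as np

V, D = 4096, 384   # illustrative vocab size and model width
emb = np.random.default_rng(2).standard_normal((V, D)).astype(np.float32) * 0.02

def embed(token_ids):
    return emb[np.asarray(token_ids)]     # (T, D) input embeddings

def lm_logits(h):
    # Tied head: project hidden states with the transposed embedding
    # matrix instead of a separate (D, V) output layer, saving V * D params.
    return h @ emb.T                      # (T, V)
```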
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
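Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows, counting each token's loss exactly once with near-full left context; a sketch of the window bookkeeping, where the 256-token window length is an assumption (only the stride appears in the card):

```python
def sliding_windows(n_tokens, max_len=256, stride=64):
    # Returns (begin, end, score_from) triples: the model sees tokens
    # [begin, end) but only tokens [score_from, end) contribute to BPB,
    # so every token is scored exactly once.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give more left context per scored token at the cost of proportionally more forward passes.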
Regularization
magnitude pruning
parameters: {"pct":10}
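Magnitude pruning at `pct: 10` zeroes the smallest 10% of weights by absolute value, and the resulting runs of zeros compress cheaply under zstd; a per-tensor sketch (whether this PR prunes globally or per-layer is not specified in the card):

```python
import numpy as np

def magnitude_prune(w, pct=10):
    # Zero the smallest pct% of entries by |value|. Ties at the threshold
    # may prune slightly more than pct% of entries.
    k = int(w.size * pct / 100)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```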
Novel Contributions
- Systematic 2x2 factorial ablation of QAT and SWA on the PR #180 stack
- 3-seed validation showing QAT without SWA outperforms the SWA control by 3.64 mBPB
- Evidence that SWA and QAT are antagonistic under the competition's short wallclock and artifact constraints
- Demonstration that QAT configurations require more aggressive pruning to fit under the 16MB limit
- Argument that post-quantization BPB is the relevant metric for QAT, not training val_bpb