PR #989
openQAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated)
by alexanderaperry-arch
val_bpb
1.1402
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,787,003 bytes
Training Techniques
Quantization
QAT
bits: 6
scope: all
STE QAT
bits: 6
scope: all
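A minimal numpy sketch of what 6-bit fake quantization with a straight-through estimator (STE) typically looks like; the symmetric per-tensor scaling scheme is an assumption for illustration, not taken from this PR's code.

```python
import numpy as np

def fake_quantize(w, bits=6):
    # Symmetric per-tensor fake quantization: snap weights to a
    # (2**bits)-level grid but keep them in float. Under an STE, the
    # backward pass treats round() as the identity, so full-precision
    # gradients flow through the quantized forward pass.
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```

With `scope: all`, a scheme like this would be applied to every weight tensor in the model rather than a subset of layers.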
Weight Averaging
SWA
parameters: {"start_step":4550}
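SWA as configured here keeps an equal-weight running average of checkpoints from `start_step` (4550) onward; a minimal sketch, with small arrays standing in for real weight tensors:

```python
import numpy as np

SWA_START = 4550  # start_step from this PR's SWA parameters

def swa_update(avg, cur, n_averaged):
    # Incremental equal-weight average: after the call, avg is the
    # mean of n_averaged + 1 checkpoints.
    return avg + (cur - avg) / (n_averaged + 1)

def run_swa(checkpoints, start=SWA_START):
    # checkpoints: iterable of (step, params) pairs seen during training.
    avg, n = None, 0
    for step, params in checkpoints:
        if step < start:
            continue
        p = np.asarray(params, dtype=float)
        avg = p.copy() if avg is None else swa_update(avg, p, n)
        n += 1
    return avg
```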
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adamw":true}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
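Muon's distinguishing step is orthogonalizing each 2-D momentum matrix with a quintic Newton-Schulz iteration before applying the update; the sketch below uses the coefficients from the public reference implementation, while the momentum/weight-decay plumbing and the AdamW handling of non-matrix parameters (the likely reading of `adamw: true`) are omitted.

```python
import numpy as np

def newtonschulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: drives the singular values of G
    # toward 1 (approximate orthogonalization) using only matmuls.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)        # Frobenius normalization
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Each iteration applies the polynomial f(x) = 3.4445x - 4.7750x³ + 2.0315x⁵ to the singular values while leaving the singular vectors untouched.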
Architecture
MLP3x
3x MLP width in the transformer stack
parameters: {"multiplier":3,"hidden_dim":1536}
GQA
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
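With `heads: 8` and `kv_heads: 4`, each pair of query heads shares one K/V head, halving the KV projection and cache size; a numpy sketch of the head mapping (dimensions other than the two head counts are illustrative):

```python
import numpy as np

H, KV, T, D = 8, 4, 5, 16   # heads / kv_heads from the card; T, D illustrative

def gqa_attention_scores(q, k):
    # q: (H, T, D) per-query-head activations; k: (KV, T, D) shared K heads.
    # Each consecutive group of H // KV query heads reads the same K head.
    k_shared = np.repeat(k, H // KV, axis=0)         # (H, T, D)
    return q @ k_shared.transpose(0, 2, 1) / np.sqrt(D)
```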
weight tying
Tied input and output embeddings
parameters: null
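Weight tying reuses the (vocab, d_model) input embedding as the output projection, which is significant under a hard artifact-size cap; a minimal sketch with illustrative dimensions:

```python
import numpy as np

V, D = 4096, 384   # illustrative vocab size and model width
emb = np.random.default_rng(2).standard_normal((V, D)).astype(np.float32) * 0.02

def embed(token_ids):
    return emb[np.asarray(token_ids)]     # (T, D) input embeddings

def lm_logits(h):
    # Tied head: project hidden states with the transposed embedding
    # matrix instead of a separate (D, V) output layer, saving V * D params.
    return h @ emb.T                      # (T, V)
```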
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
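Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows, counting each token's loss exactly once with near-full left context; a sketch of the window bookkeeping, where the 256-token window length is an assumption (only the stride appears in the card):

```python
def sliding_windows(n_tokens, max_len=256, stride=64):
    # Returns (begin, end, score_from) triples: the model sees tokens
    # [begin, end) but only tokens [score_from, end) contribute to BPB,
    # so every token is scored exactly once.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give more left context per scored token at the cost of proportionally more forward passes.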
Regularization
magnitude pruning
parameters: {"pct":10}
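Magnitude pruning at `pct: 10` zeroes the smallest 10% of weights by absolute value, and the resulting runs of zeros compress cheaply under zstd; a per-tensor sketch (whether this PR prunes globally or per-layer is not specified in the card):

```python
import numpy as np

def magnitude_prune(w, pct=10):
    # Zero the smallest pct% of entries by |value|. Ties at the threshold
    # may prune slightly more than pct% of entries.
    k = int(w.size * pct / 100)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```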
Novel Contributions
- Systematic 2x2 factorial ablation of QAT and SWA on the PR #180 stack
- 3-seed validation showing QAT without SWA outperforms the SWA control by 3.64 mBPB
- Evidence that SWA and QAT are antagonistic under the competition's short wallclock and artifact constraints
- Demonstration that QAT configurations require more aggressive pruning to fit under the 16MB limit
- Argument that post-quantization BPB is the relevant metric for QAT, not training val_bpb