PR #238 (open)

[Non-record] Quantization Findings: SWA Reversal + Int5 Failure

by kellyvv
val_bpb: 1.5164
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.5 MB

Training Techniques

Architecture
  • MLP3x: 3x-expanded MLP layers in the baseline architecture (parameters: {"multiplier":3})
  • SmearGate: uses SmearGate in the model architecture (parameters: null)
  • BigramHash: uses BigramHash in the model architecture (parameters: null)
  • KV head count: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"num_heads":8,"num_kv_heads":4})
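The grouped-query attention entry above (8 query heads sharing 4 KV heads) can be sketched in NumPy. The shapes and function names below are illustrative, not the submission's actual code; batch dimension is omitted for clarity.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves num_heads // num_kv_heads
    query heads. q: (num_heads, T, d); k, v: (num_kv_heads, T, d)."""
    num_heads, T, d = q.shape
    group = num_heads // k.shape[0]        # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=0)        # expand KV heads to (num_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 5, 16
q = rng.standard_normal((8, T, d))
k = rng.standard_normal((4, T, d))
v = rng.standard_normal((4, T, d))
out = gqa_attention(q, k, v)
print(out.shape)   # → (8, 5, 16)
```

The memory win is in the KV cache: only 4 KV heads are stored, halving cache size versus full multi-head attention with 8 KV heads.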
Quantization
  • int6 (bits: 6, scope: all)
  • mixed int5/int6 (bits: 5, scope: MLP)
  • STE QAT (bits: null, scope: all)
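As a rough illustration of the int6 and int5 entries above, here is a minimal symmetric per-tensor quantizer. The bit widths match the table, but the scaling scheme is an assumption; the submission's exact quantizer is not shown. With STE QAT, the backward pass would treat this round-trip as identity (gradients pass straight through the rounding).

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor quantization: scale to a signed int grid and round.
    qmax is 31 for int6, 15 for int5."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
for bits in (6, 5):
    q, s = fake_quant(w, bits)
    err = np.abs(dequant(q, s) - w).max()
    # halving the grid (int6 -> int5) roughly doubles the worst-case error
    print(bits, round(float(err), 4))
```

This makes the int5 risk concrete: dropping one bit doubles the quantization step size, which is exactly where the MLP-scope int5 failure reported below would bite.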
Weight Averaging
  • SWA (parameters: {"num_checkpoints":84,"every_steps":50,"start_step":6481})
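The SWA entry above averages 84 checkpoints taken every 50 steps starting at step 6481. A minimal sketch of that bookkeeping as a running mean over checkpoint weights; the class and parameter names are illustrative, only the schedule values come from the run:

```python
import numpy as np

class SWA:
    """Running average of checkpoint weights on the listed schedule."""
    def __init__(self, start_step=6481, every_steps=50, num_checkpoints=84):
        self.start_step = start_step
        self.every_steps = every_steps
        self.max_ckpts = num_checkpoints
        self.n = 0
        self.avg = None

    def maybe_update(self, step, weights):
        if step < self.start_step or (step - self.start_step) % self.every_steps:
            return
        if self.n >= self.max_ckpts:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64).copy() for k, v in weights.items()}
        else:
            for k, v in weights.items():
                self.avg[k] += (v - self.avg[k]) / self.n   # incremental mean

# toy usage: average a single noisy "weight" over the run's 10,670 steps
rng = np.random.default_rng(0)
swa = SWA()
for step in range(6481, 10671):
    w = {"w": np.array([1.0 + 0.1 * rng.standard_normal()])}
    swa.maybe_update(step, w)
print(swa.n, float(swa.avg["w"][0]))   # 84 checkpoints, mean near 1.0
```

Note the schedule fits the run exactly: steps 6481 through 10670 every 50 steps yield precisely 84 checkpoints.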
Compression
  • zstd (level: null)
Optimizer
  • Muon (weight_decay: 0.04, momentum: null, other_params: null)
LR Schedule
  • warmdown (parameters: {"warmdown_iters":3000})
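A warmdown schedule with warmdown_iters=3000 over the run's 10,670 total steps is commonly a constant LR followed by a linear decay to zero. A sketch under that assumption; base_lr and the constant-then-linear shape are assumptions, only warmdown_iters and the step count come from the run's parameters:

```python
def lr_at(step, base_lr=1.0, total_steps=10_670, warmdown_iters=3_000):
    """Constant LR, then linear warmdown to 0 over the final warmdown_iters steps."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters

print(lr_at(0), lr_at(7_670), lr_at(9_170), lr_at(10_670))   # 1.0 1.0 0.5 0.0
```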
Regularization
  • gradient clipping (parameters: {"grad_clip_norm":0.3})
Other
  • Training under compute constraints: 10,670 steps on 1xH100 (parameters: {"steps":10670,"hardware":"1xH100"})

Novel Contributions

  • Demonstrates that SWA can reverse the quantization gap, producing a lower int6+zstd BPB than the pre-quantization checkpoint
  • Shows that int5 quantization of MLP layers can be catastrophic for undertrained models, greatly increasing the quantization gap
  • Provides evidence that SWA and quantization can be synergistic rather than antagonistic
  • Argues that mixed int5/int6 quantization is not viable for compute-constrained training in this setting
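A toy illustration of why averaging can shrink the quantization gap: averaging noisy checkpoints reduces both the noise and the weight outliers that set the int step size. This is a NumPy sketch of the intuition under assumed Gaussian checkpoint noise, not a reproduction of the reported BPB result:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.standard_normal(50_000)

# 84 "checkpoints": the true weights plus independent per-checkpoint noise
ckpts = [true_w + 0.2 * rng.standard_normal(true_w.shape) for _ in range(84)]
swa_w = np.mean(ckpts, axis=0)

def quant_error(w, bits=6):
    """Round-trip through symmetric int quantization; MSE vs the true weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    deq = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return float(np.mean((deq - true_w) ** 2))

print("last ckpt:", quant_error(ckpts[-1]))
print("SWA      :", quant_error(swa_w))   # averaged weights quantize closer to truth
```

In this toy setting the averaged weights land much closer to the true weights after int6 round-tripping than any single checkpoint does, consistent with the synergy claim above.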