PR #238 (open)

[Non-record] Quantization Findings: SWA Reversal + Int5 Failure

by kellyvv
val_bpb: 1.5164
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.5 MB

Training Techniques

Architecture
  • MLP3x: 3x-expanded MLP layers in the baseline architecture (parameters: {"multiplier":3})
  • SmearGate: uses SmearGate in the model architecture (parameters: null)
  • BigramHash: uses BigramHash in the model architecture (parameters: null)
  • KV head count: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"num_heads":8,"num_kv_heads":4})
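The grouped-query attention entry above (8 query heads sharing 4 KV heads) can be sketched in NumPy. The shapes and function names below are illustrative, not the submission's actual code; batch dimension is omitted for clarity.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves num_heads // num_kv_heads
    query heads. q: (num_heads, T, d); k, v: (num_kv_heads, T, d)."""
    num_heads, T, d = q.shape
    group = num_heads // k.shape[0]        # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=0)        # expand KV heads to (num_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 5, 16
q = rng.standard_normal((8, T, d))
k = rng.standard_normal((4, T, d))
v = rng.standard_normal((4, T, d))
out = gqa_attention(q, k, v)
print(out.shape)   # → (8, 5, 16)
```

The memory win is in the KV cache: only 4 KV heads are stored, halving cache size versus full multi-head attention with 8 KV heads.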
Quantization
  • int6 (bits: 6, scope: all)
  • mixed int5/int6 (bits: 5, scope: MLP)
  • STE QAT (bits: null, scope: all)
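As a rough illustration of the int6 and int5 entries above, here is a minimal symmetric per-tensor quantizer. The bit widths match the table, but the scaling scheme is an assumption; the submission's exact quantizer is not shown. With STE QAT, the backward pass would treat this round-trip as identity (gradients pass straight through the rounding).

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor quantization: scale to a signed int grid and round.
    qmax is 31 for int6, 15 for int5."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
for bits in (6, 5):
    q, s = fake_quant(w, bits)
    err = np.abs(dequant(q, s) - w).max()
    # halving the grid (int6 -> int5) roughly doubles the worst-case error
    print(bits, round(float(err), 4))
```

This makes the int5 risk concrete: dropping one bit doubles the quantization step size, which is exactly where the MLP-scope int5 failure reported below would bite.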
Weight Averaging
  • SWA (parameters: {"num_checkpoints":84,"every_steps":50,"start_step":6481})
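The SWA entry above averages 84 checkpoints taken every 50 steps starting at step 6481. A minimal sketch of that bookkeeping as a running mean over checkpoint weights; the class and parameter names are illustrative, only the schedule values come from the run:

```python
import numpy as np

class SWA:
    """Running average of checkpoint weights on the listed schedule."""
    def __init__(self, start_step=6481, every_steps=50, num_checkpoints=84):
        self.start_step = start_step
        self.every_steps = every_steps
        self.max_ckpts = num_checkpoints
        self.n = 0
        self.avg = None

    def maybe_update(self, step, weights):
        if step < self.start_step or (step - self.start_step) % self.every_steps:
            return
        if self.n >= self.max_ckpts:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64).copy() for k, v in weights.items()}
        else:
            for k, v in weights.items():
                self.avg[k] += (v - self.avg[k]) / self.n   # incremental mean

# toy usage: average a single noisy "weight" over the run's 10,670 steps
rng = np.random.default_rng(0)
swa = SWA()
for step in range(6481, 10671):
    w = {"w": np.array([1.0 + 0.1 * rng.standard_normal()])}
    swa.maybe_update(step, w)
print(swa.n, float(swa.avg["w"][0]))   # 84 checkpoints, mean near 1.0
```

Note the schedule fits the run exactly: steps 6481 through 10670 every 50 steps yield precisely 84 checkpoints.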
Compression
  • zstd (level: null)
Optimizer
  • Muon (weight_decay: 0.04, momentum: null, other_params: null)
LR Schedule
  • warmdown (parameters: {"warmdown_iters":3000})
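A warmdown schedule with warmdown_iters=3000 over the run's 10,670 total steps is commonly a constant LR followed by a linear decay to zero. A sketch under that assumption; base_lr and the constant-then-linear shape are assumptions, only warmdown_iters and the step count come from the run's parameters:

```python
def lr_at(step, base_lr=1.0, total_steps=10_670, warmdown_iters=3_000):
    """Constant LR, then linear warmdown to 0 over the final warmdown_iters steps."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters

print(lr_at(0), lr_at(7_670), lr_at(9_170), lr_at(10_670))   # 1.0 1.0 0.5 0.0
```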
Regularization
  • gradient clipping (parameters: {"grad_clip_norm":0.3})
Other
  • Training under compute constraints: 10,670 steps on 1xH100 (parameters: {"steps":10670,"hardware":"1xH100"})

Novel Contributions

  • Demonstrates that SWA can reverse the quantization gap, producing a lower int6+zstd BPB than the pre-quantization checkpoint
  • Shows that int5 quantization of MLP layers can be catastrophic for undertrained models, greatly increasing the quantization gap
  • Provides evidence that SWA and quantization can be synergistic rather than antagonistic
  • Argues that mixed int5/int6 quantization is not viable for compute-constrained training in this setting
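A toy illustration of why averaging can shrink the quantization gap: averaging noisy checkpoints reduces both the noise and the weight outliers that set the int step size. This is a NumPy sketch of the intuition under assumed Gaussian checkpoint noise, not a reproduction of the reported BPB result:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.standard_normal(50_000)

# 84 "checkpoints": the true weights plus independent per-checkpoint noise
ckpts = [true_w + 0.2 * rng.standard_normal(true_w.shape) for _ in range(84)]
swa_w = np.mean(ckpts, axis=0)

def quant_error(w, bits=6):
    """Round-trip through symmetric int quantization; MSE vs the true weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    deq = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return float(np.mean((deq - true_w) ** 2))

print("last ckpt:", quant_error(ckpts[-1]))
print("SWA      :", quant_error(swa_w))   # averaged weights quantize closer to truth
```

In this toy setting the averaged weights land much closer to the true weights after int6 round-tripping than any single checkpoint does, consistent with the synergy claim above.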