PR #1664

open

WIP: Sequential GPTQ with Groupwise Int6 — improved post-training quantization on SP4096 base

by zoharb157
val_bpb: 1.0980
Architecture: Transformer
Optimizer:
Artifact Size: 16MB

Training Techniques

Quantization

  • GPTQ (bits: 6, scope: all)
  • mixed int6 (bits: 6, scope: all)
Other

  • Sequential cross-layer GPTQ propagation: quantize layers one at a time, inject quantized weights back into the model, and collect Hessians for later layers using quantized activations. (parameters: {"enabled": true})
  • Groupwise int6 scales with group size 128, using per-group fp16 scales instead of per-row scales. (parameters: {"group_size": 128})
  • Hessian-weighted scale selection that minimizes weighted reconstruction error using Hessian diagonal terms. (parameters: null)
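The sequential propagation step above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the per-layer quantizer here is plain round-to-nearest (real GPTQ compensates error column by column using the Hessian), and the layer shapes and Hessian proxy are assumptions. The key idea shown is that each layer's calibration activations come from the already-quantized layers before it.

```python
import numpy as np

def quantize_layer(W, H, bits=6):
    """Stand-in for a GPTQ solve: simple per-row symmetric round-to-nearest.
    H is where a real GPTQ solver would use the collected Hessian."""
    qmax = 2 ** (bits - 1) - 1                 # int6 -> levels in [-32, 31]
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                    # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized weights

def sequential_gptq(layers, calib_x, bits=6):
    """Quantize layers one at a time; later layers see activations produced
    by the already-quantized earlier layers, so their Hessians reflect
    upstream quantization error."""
    x = calib_x
    for i, W in enumerate(layers):
        H = x.T @ x / x.shape[0]               # Hessian proxy: E[x x^T] from current activations
        layers[i] = quantize_layer(W, H, bits) # inject quantized weights back
        x = x @ layers[i].T                    # propagate quantized activations forward
    return layers
```

The contrast with vanilla GPTQ is the last line: collecting Hessians from the original fp activations instead would let quantization error accumulate unmodeled across layers.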
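Groupwise int6 with group size 128 can be sketched like this (an illustrative sketch, not the PR's code; packing the 6-bit codes into int8 storage and the helper names are assumptions). Each run of 128 weights along the input dimension gets its own fp16 scale, rather than one scale per output row:

```python
import numpy as np

def groupwise_quantize(W, bits=6, group_size=128):
    """Per-group symmetric quantization with fp16 scales."""
    qmax = 2 ** (bits - 1) - 1
    out_dim, in_dim = W.shape
    assert in_dim % group_size == 0
    G = W.reshape(out_dim, in_dim // group_size, group_size)
    scales = (np.abs(G).max(axis=2, keepdims=True) / qmax).astype(np.float16)
    scales = np.where(scales == 0, np.float16(1.0), scales)
    # 6-bit codes stored in int8 here for simplicity
    q = np.clip(np.round(G / scales.astype(np.float32)), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    out_dim = q.shape[0]
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(out_dim, -1)
```

The finer granularity bounds each weight's rounding error by half of its *group's* scale rather than its row's, which matters when a row mixes large and small weights; the cost is one extra fp16 scale per 128 weights (about 0.125 extra bits per weight).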
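Hessian-weighted scale selection could look like the sketch below, under stated assumptions: the candidate grid (fractions of the max-abs scale) and its size are invented for illustration, and the PR's actual search strategy may differ. The objective matches the description above: minimize the reconstruction error weighted by the Hessian diagonal, so weights the loss is sensitive to are rounded more carefully.

```python
import numpy as np

def hessian_weighted_scale(w, h_diag, bits=6, n_candidates=40):
    """Choose a quantization scale for one weight group by minimizing
    sum_i h_i * (w_i - q(w_i; s) * s)^2, where h_i is the Hessian diagonal."""
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(w).max() / qmax              # plain max-abs scale
    if base == 0:
        return 1.0                             # all-zero group: any scale works
    best_s, best_err = base, np.inf
    # search shrunken scales; f = 1.0 recovers the max-abs baseline
    for f in np.linspace(0.5, 1.0, n_candidates):
        s = base * f
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = np.sum(h_diag * (w - q * s) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s
```

Shrinking the scale clips outlier weights but tightens rounding steps for the rest; the Hessian weighting decides when that trade is worth it for the sensitive weights.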

Novel Contributions

  • Sequential cross-layer GPTQ propagation
  • Groupwise int6 scales with group_size=128
  • Hessian-weighted scale selection
  • Post-training quantization improvements with zero training-time cost