PR #1805 (open)

SP8192 + Compression-Aware QAT on PR #1493, 3-seed val_bpb 1.10314

val_bpb: 1.1031
Architecture: Transformer
Optimizer:
Artifact Size: 15,999,417 B (≈16 MB)

Training Techniques

  • Quantization: QAT
      ◦ bits: 6
      ◦ scope: large 2D linear matrices
  • Regularization: entropy penalty
      ◦ parameters: {"target": "soft int6 histogram", "lambda": 0.001, "beta": 10, "warmup": 200}
  • Compression: zstd
      ◦ level: null
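The 6-bit QAT setting above can be sketched as a fake-quant round trip over the large weight matrices. This is a minimal illustration, not the PR's code: the symmetric per-tensor absmax scaling and the `fake_quant_int6` helper are assumptions, since the PR does not state the exact quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant_int6(w):
    """Quantize-dequantize to a signed 6-bit grid (illustrative sketch).

    Symmetric per-tensor absmax scaling is an assumption; the PR only
    specifies bits=6 and the scope (large 2D linear matrices)."""
    qmax = 31                                  # signed int6 range: [-32, 31]
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, qmax)
    return q * scale                           # dequantized fake-quant weights

w = rng.standard_normal((64, 64))              # stand-in for a large 2D matrix
wq = fake_quant_int6(w)
```

In training-time QAT, the forward pass uses `wq` while gradients flow to `w` via a straight-through estimator; only the round-trip math is shown here.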

Novel Contributions

  • Compression-aware QAT with a differentiable entropy surrogate over soft int6 histograms
  • Applying the surrogate only after a 200-step warmup, and only to large 2D linear matrices
  • Demonstrating stable behavior across 3 seeds for compression-aware training
  • Research pivot from 3DCF-style compression ideas to a scoreable CompQAT branch on top of PR #1493
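One plausible reading of the entropy surrogate described above is the entropy of a softmax-based soft histogram over the 64 int6 levels, gated off during warmup. The sketch below is a reconstruction under assumptions: only lambda=0.001, beta=10, and warmup=200 come from the PR; `soft_int6_entropy_penalty`, the use of `beta` as a bin-assignment temperature, and the gating logic are illustrative.

```python
import numpy as np

def soft_int6_entropy_penalty(w, beta=10.0, lam=1e-3, step=0, warmup=200):
    """Entropy of a soft (differentiable) histogram over 64 int6 levels.

    Reconstruction under assumptions; the PR gives only the
    hyperparameters, not the exact formulation."""
    if step < warmup:
        return 0.0                             # surrogate off during warmup
    scale = np.abs(w).max() / 31 + 1e-12       # map weights onto the int6 grid
    centers = np.arange(-32, 32) * scale       # the 64 quantization levels
    d2 = (w.reshape(-1, 1) - centers) ** 2
    logits = -beta * d2 / scale**2             # larger beta -> harder assignment
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)          # soft one-hot bin membership
    p = a.mean(axis=0)                         # soft histogram (sums to 1)
    entropy = -(p * np.log(p + 1e-12)).sum()
    return lam * entropy                       # term added to the training loss

w = np.random.default_rng(1).standard_normal(512)
pen = soft_int6_entropy_penalty(w, step=500)
```

Because the histogram is built from softmax assignments rather than hard binning, the penalty is differentiable in `w`, which is what lets it act as a compression-aware regularizer during QAT.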