PR #1335

open

VQ-VAE Weight Compression (non-record track)

by WeijieChen2017
val_bpb: 1.1948
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.0 MB

Training Techniques

Quantization
  • VQ — bits: 5, scope: weights
Weight Averaging
  • SWA — parameters: {"start_step":null,"window":"last 3% of warmdown"}
  • EMA — decay: 0.95
Optimizer
  • Muon — momentum: 0.9, weight_decay: null, other_params: {"lr":0.08}
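Muon applies momentum and then orthogonalizes each matrix update with a Newton–Schulz iteration before taking the step. A sketch using the quintic coefficients commonly used for Muon; the PR's exact variant (Nesterov momentum, handling of tall matrices by transposition, etc.) is not specified, and this assumes matrices with rows ≤ columns:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately replace g's singular values with 1 (g -> U V^T).

    Quintic Newton-Schulz iteration; coefficients are the ones commonly
    used for Muon, assumed here rather than taken from the PR.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # Frobenius-normalize first
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x

def muon_step(w, grad, buf, lr=0.08, momentum=0.9):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

The orthogonalization makes the step size roughly uniform across all directions of the weight matrix, which is why Muon tolerates the fairly large lr of 0.08.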
Architecture
  • GQA — grouped-query attention with 4 KV heads; parameters: {"heads":8,"kv_heads":4}
  • Weight tying — tied input and output embeddings; parameters: {"vocab":1024}
  • Partial RoPE — rotary position embeddings applied to a subset of dimensions; parameters: {"dimensions":"16/64"}
  • SmearGate — token blending gate
  • U-Net skip connections — skip connections between encoder and decoder halves
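With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal NumPy sketch of the attention core (no masking or RoPE, purely illustrative shapes):

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: `heads` query heads share `kv_heads` KV heads.

    q: (heads, T, d); k, v: (kv_heads, T, d). Masking omitted for brevity.
    """
    group = heads // kv_heads                 # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)           # expand KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For a weight-compression submission the main benefit is parameter count: the K and V projection matrices are half the size they would be with full multi-head attention.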
Initialization
  • OrthoInit — orthogonal initialization
Regularization
  • Logit softcap — parameters: {"value":8}
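Logit softcapping with value 8 is conventionally cap · tanh(logits / cap), which bounds the logits to (−8, 8) while staying near-identity for small values; a sketch under that assumption:

```python
import numpy as np

def softcap(logits, cap=8.0):
    """Soft-cap logits to (-cap, cap): cap * tanh(logits / cap).

    Near-identity for |logits| << cap, saturating smoothly at +/-cap.
    """
    return cap * np.tanh(np.asarray(logits) / cap)
```

Capping the logits limits how confident the model can become, which also keeps the tied output embedding's effective dynamic range small, a property that interacts nicely with weight quantization.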
LR Schedule
  • Warmdown — parameters: {"warmdown_steps":null,"fraction":0.7}
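A warmdown schedule with fraction 0.7 plausibly means a constant learning rate followed by a linear decay to zero over the final 70% of training; the exact shape is not specified in the PR, so the linear decay here is an assumption:

```python
def lr_at(step, total_steps, base_lr=0.08, warmdown_fraction=0.7):
    """Constant LR, then linear decay to 0 over the final fraction of training.

    Sketch only: the PR leaves warmdown_steps null, so the decay shape and
    boundaries are assumptions.
    """
    warmdown_steps = int(total_steps * warmdown_fraction)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

The warmdown window matters here because both the SWA averaging ("last 3% of warmdown") and the codebook training described under Novel Contributions are anchored to it.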
Compression
  • zstd — level: 22
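Before the zstd pass, the 10-bit codebook indices (see Novel Contributions) need to be bit-packed, since bytes hold 8 bits. A sketch of little-endian bit packing; the PR's actual layout is an assumption, and zstd at level 22 would then be applied to the packed bytes:

```python
def pack_indices(indices, bits=10):
    """Pack integer codebook indices into bytes, `bits` bits each (little-endian)."""
    acc, nbits, out = 0, 0, bytearray()
    for idx in indices:
        assert 0 <= idx < (1 << bits)
        acc |= idx << nbits
        nbits += bits
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)   # flush the final partial byte
    return bytes(out)

def unpack_indices(data, count, bits=10):
    """Inverse of pack_indices: recover `count` indices from packed bytes."""
    acc, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < bits:
            acc |= next(it) << nbits
            nbits += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nbits -= bits
    return out
```

At 10 bits per index the packed stream costs exactly 1.25 bytes per code vector, and zstd then squeezes out whatever statistical redundancy remains in the index distribution.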

Novel Contributions

  • Vector quantization replaces scalar INT5 weight quantization
  • Codebook is trained during warmdown with EMA updates and periodic snapping
  • Cosine-sphere codebook with 10-bit packed indices for 1024-entry VQ
  • Smaller quantization-induced val_bpb delta than INT5 at the same bitrate
  • All artifacts fit under the 16MB limit
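The pipeline described above (nearest-codeword assignment on the unit sphere, EMA codebook updates during warmdown, periodic snapping of weights to their codewords) could look like the following sketch. Assumptions not stated in the PR: code vectors of dimension 2 (consistent with 10-bit indices at 5 bits per weight), the 0.95 decay reused from the EMA entry, and norm-preserving snapping:

```python
import numpy as np

def nearest_codes(vectors, codebook):
    """Assign each weight vector to its most-similar codeword on the unit
    sphere (cosine similarity, per the cosine-sphere codebook)."""
    v = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8)
    c = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-8)
    return np.argmax(v @ c.T, axis=1)

def ema_codebook_update(codebook, vectors, codes, decay=0.95):
    """EMA-move each used codeword toward the mean of its assigned vectors."""
    new = codebook.copy()
    for k in np.unique(codes):
        new[k] = decay * codebook[k] + (1 - decay) * vectors[codes == k].mean(axis=0)
    return new

def snap_weights(vectors, codebook, codes):
    """Snap weights onto their codeword directions, preserving each vector's
    norm (norm preservation is an assumption, not stated in the PR)."""
    unit = codebook[codes]
    unit = unit / (np.linalg.norm(unit, axis=1, keepdims=True) + 1e-8)
    return np.linalg.norm(vectors, axis=1, keepdims=True) * unit
```

Training the codebook during warmdown rather than post hoc lets the network adapt to the quantization grid while the learning rate decays, and the periodic snapping keeps the deployed (quantized) weights on the trajectory the optimizer actually sees.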