PR #1335

open

VQ-VAE Weight Compression (non-record track)

by WeijieChen2017
val_bpb: 1.1948
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.0 MB

Training Techniques

Quantization
  • VQ — bits: 5, scope: weights
Weight Averaging
  • SWA — parameters: {"start_step":null,"window":"last 3% of warmdown"}
  • EMA — decay: 0.95
Optimizer
  • Muon — momentum: 0.9, weight_decay: null, other_params: {"lr":0.08}
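Muon applies momentum and then orthogonalizes each matrix update with a Newton–Schulz iteration before taking the step. A sketch using the quintic coefficients commonly used for Muon; the PR's exact variant (Nesterov momentum, handling of tall matrices by transposition, etc.) is not specified, and this assumes matrices with rows ≤ columns:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately replace g's singular values with 1 (g -> U V^T).

    Quintic Newton-Schulz iteration; coefficients are the ones commonly
    used for Muon, assumed here rather than taken from the PR.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # Frobenius-normalize first
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x

def muon_step(w, grad, buf, lr=0.08, momentum=0.9):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

The orthogonalization makes the step size roughly uniform across all directions of the weight matrix, which is why Muon tolerates the fairly large lr of 0.08.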
Architecture
  • GQA — grouped-query attention with 4 KV heads; parameters: {"heads":8,"kv_heads":4}
  • Weight tying — tied input and output embeddings; parameters: {"vocab":1024}
  • Partial RoPE — rotary position embeddings applied to a subset of dimensions; parameters: {"dimensions":"16/64"}
  • SmearGate — token blending gate
  • U-Net skip connections — skip connections between encoder and decoder halves
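With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal NumPy sketch of the attention core (no masking or RoPE, purely illustrative shapes):

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: `heads` query heads share `kv_heads` KV heads.

    q: (heads, T, d); k, v: (kv_heads, T, d). Masking omitted for brevity.
    """
    group = heads // kv_heads                 # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)           # expand KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For a weight-compression submission the main benefit is parameter count: the K and V projection matrices are half the size they would be with full multi-head attention.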
Initialization
  • OrthoInit — orthogonal initialization
Regularization
  • Logit softcap — parameters: {"value":8}
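Logit softcapping with value 8 is conventionally cap · tanh(logits / cap), which bounds the logits to (−8, 8) while staying near-identity for small values; a sketch under that assumption:

```python
import numpy as np

def softcap(logits, cap=8.0):
    """Soft-cap logits to (-cap, cap): cap * tanh(logits / cap).

    Near-identity for |logits| << cap, saturating smoothly at +/-cap.
    """
    return cap * np.tanh(np.asarray(logits) / cap)
```

Capping the logits limits how confident the model can become, which also keeps the tied output embedding's effective dynamic range small, a property that interacts nicely with weight quantization.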
LR Schedule
  • Warmdown — parameters: {"warmdown_steps":null,"fraction":0.7}
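A warmdown schedule with fraction 0.7 plausibly means a constant learning rate followed by a linear decay to zero over the final 70% of training; the exact shape is not specified in the PR, so the linear decay here is an assumption:

```python
def lr_at(step, total_steps, base_lr=0.08, warmdown_fraction=0.7):
    """Constant LR, then linear decay to 0 over the final fraction of training.

    Sketch only: the PR leaves warmdown_steps null, so the decay shape and
    boundaries are assumptions.
    """
    warmdown_steps = int(total_steps * warmdown_fraction)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

The warmdown window matters here because both the SWA averaging ("last 3% of warmdown") and the codebook training described under Novel Contributions are anchored to it.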
Compression
  • zstd — level: 22
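Before the zstd pass, the 10-bit codebook indices (see Novel Contributions) need to be bit-packed, since bytes hold 8 bits. A sketch of little-endian bit packing; the PR's actual layout is an assumption, and zstd at level 22 would then be applied to the packed bytes:

```python
def pack_indices(indices, bits=10):
    """Pack integer codebook indices into bytes, `bits` bits each (little-endian)."""
    acc, nbits, out = 0, 0, bytearray()
    for idx in indices:
        assert 0 <= idx < (1 << bits)
        acc |= idx << nbits
        nbits += bits
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)   # flush the final partial byte
    return bytes(out)

def unpack_indices(data, count, bits=10):
    """Inverse of pack_indices: recover `count` indices from packed bytes."""
    acc, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < bits:
            acc |= next(it) << nbits
            nbits += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nbits -= bits
    return out
```

At 10 bits per index the packed stream costs exactly 1.25 bytes per code vector, and zstd then squeezes out whatever statistical redundancy remains in the index distribution.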

Novel Contributions

  • Vector quantization replaces scalar INT5 weight quantization
  • Codebook is trained during warmdown with EMA updates and periodic snapping
  • Cosine-sphere codebook with 10-bit packed indices for 1024-entry VQ
  • Smaller quantization-induced val_bpb delta than INT5 at the same bitrate
  • All artifacts fit under the 16MB limit
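The pipeline described above (nearest-codeword assignment on the unit sphere, EMA codebook updates during warmdown, periodic snapping of weights to their codewords) could look like the following sketch. Assumptions not stated in the PR: code vectors of dimension 2 (consistent with 10-bit indices at 5 bits per weight), the 0.95 decay reused from the EMA entry, and norm-preserving snapping:

```python
import numpy as np

def nearest_codes(vectors, codebook):
    """Assign each weight vector to its most-similar codeword on the unit
    sphere (cosine similarity, per the cosine-sphere codebook)."""
    v = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8)
    c = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-8)
    return np.argmax(v @ c.T, axis=1)

def ema_codebook_update(codebook, vectors, codes, decay=0.95):
    """EMA-move each used codeword toward the mean of its assigned vectors."""
    new = codebook.copy()
    for k in np.unique(codes):
        new[k] = decay * codebook[k] + (1 - decay) * vectors[codes == k].mean(axis=0)
    return new

def snap_weights(vectors, codebook, codes):
    """Snap weights onto their codeword directions, preserving each vector's
    norm (norm preservation is an assumption, not stated in the PR)."""
    unit = codebook[codes]
    unit = unit / (np.linalg.norm(unit, axis=1, keepdims=True) + 1e-8)
    return np.linalg.norm(vectors, axis=1, keepdims=True) * unit
```

Training the codebook during warmdown rather than post hoc lets the network adapt to the quantization grid while the learning rate decays, and the periodic snapping keeps the deployed (quantized) weights on the trajectory the optimizer actually sees.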