PR #1385

open

Non-record: Compressor-Aware Training (CAT), differentiable compression proxies for LZ-family compressors

by korentomas
val_bpb: 1.4465
Architecture: Transformer
Optimizer:
Artifact Size: 11.48 MB

Training Techniques

Quantization
STE QAT
bits: 8
scope: all
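The STE QAT entry (8 bits, all weights) can be sketched as symmetric per-tensor fake quantization; the exact quantizer in the submission is not described, so the scale choice below is an assumption.

```python
import numpy as np

def fake_quant_int8(w, eps=1e-8):
    """Symmetric per-tensor int8 fake quantization (forward pass only).

    Under STE-based QAT, the backward pass treats round() and clip()
    as identity, so gradients flow through to w unchanged.
    """
    scale = np.abs(w).max() / 127.0 + eps     # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale, q, scale

w = np.array([0.5, -1.27, 0.003, 1.0])
w_dq, q, scale = fake_quant_int8(w)           # dequantized weights, int codes
```

The quantization error per element is bounded by half a quantization step, which is what makes an 8-bit artifact nearly lossless for well-scaled tensors.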
Compression
zlib
level: null
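Since the artifact is zlib-compressed, the size metric can be reproduced by compressing the serialized weights directly. The int8 serialization and value range below are illustrative assumptions; `level: null` is read here as zlib's default compression level.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical serialized int8 weight tensor with a narrow value range
# (low entropy), which is what the compression proxies aim to encourage.
q = rng.integers(-8, 8, size=10_000).astype(np.int8)
raw = q.tobytes()
packed = zlib.compress(raw)        # default level, per "level: null"
ratio = len(packed) / len(raw)     # < 1.0 for compressible streams
```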
Architecture
GQA
The transformer uses grouped-query attention with 14 query heads and 2 shared KV heads.
parameters: {"heads":14,"kv_heads":2}
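A minimal sketch of the 14-head / 2-KV-head grouping, assuming standard scaled dot-product attention (batch dimension and causal masking omitted):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=14, n_kv_heads=2):
    """Grouped-query attention: each KV head serves
    n_heads // n_kv_heads query heads (7 per group here).
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)   # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

T, d = 5, 8
rng = np.random.default_rng(1)
q = rng.normal(size=(14, T, d))
k = rng.normal(size=(2, T, d))
v = rng.normal(size=(2, T, d))
out = gqa_attention(q, k, v)
```

The KV cache shrinks by 7x relative to full multi-head attention, which also shrinks the stored KV projection weights in the artifact.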
depth recurrence
The 4 physical transformer layers are looped 3 times, giving 12 effective layers.
parameters: {"physical_layers":4,"loops":3,"effective_layers":12}
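The depth recurrence amounts to running the shared 4-layer stack 3 times; the stand-in layers below are purely illustrative:

```python
def recurrent_forward(x, layers, loops=3):
    """Apply the physical layer stack `loops` times: with 4 shared
    layers and 3 loops, the model gets 12 effective layers of compute
    while only storing 4 layers' worth of weights."""
    for _ in range(loops):
        for layer in layers:
            x = layer(x)
    return x

# Stand-in "layers" that just add a constant, to show the control flow.
layers = [lambda x, i=i: x + i for i in range(4)]
y = recurrent_forward(0, layers)   # (0+1+2+3) per loop, times 3 loops
```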
LoRA
Per-loop LoRA adapters are used in the recurrent transformer.
parameters: {"rank":16}
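A per-loop LoRA setup can be sketched as one rank-16 adapter pair per loop iteration on top of the shared weight, so each pass through the stack is cheaply differentiated; the shapes and the zero-init of B are conventional LoRA assumptions, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, loops = 32, 16, 3
W = rng.normal(size=(d, d))                    # shared physical weight
# One rank-16 adapter pair per loop iteration (hypothetical shapes).
A = [rng.normal(size=(d, r)) * 0.01 for _ in range(loops)]
B = [np.zeros((r, d)) for _ in range(loops)]   # zero-init: delta starts at 0

def layer_at_loop(x, i):
    """Effective weight at loop i is W + A_i @ B_i."""
    return x @ (W + A[i] @ B[i])

x = rng.normal(size=(1, d))
y0 = layer_at_loop(x, 0)   # equals x @ W until B_0 is trained
```

The adapters add only 2 * d * r parameters per loop, a small cost next to the shared d * d weight.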
LR Schedule
warmdown
parameters: {"cooldown":true}
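A warmdown schedule (constant LR followed by a linear cooldown to zero) can be sketched as follows; the cooldown length is an illustrative assumption:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, cooldown_steps=30):
    """Constant LR, then a linear 'warmdown' to zero over the final
    cooldown_steps of training (length here is an assumption)."""
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return base_lr
    return base_lr * (total_steps - step) / cooldown_steps
```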
Regularization
compression-aware regularization
parameters: {"lambda_lz":0.01,"lambda_h":0.1}
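Given the reported weights, the regularized objective is presumably a weighted sum of the task loss and the two compression proxies; a minimal sketch under that assumption:

```python
def total_loss(task_loss, lz_proxy, entropy_proxy,
               lambda_lz=0.01, lambda_h=0.1):
    """Task loss plus the two compression-proxy penalties, using the
    lambda_lz and lambda_h values reported in the parameters."""
    return task_loss + lambda_lz * lz_proxy + lambda_h * entropy_proxy
```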

Novel Contributions

  • Differentiable proxy losses for LZ-family dictionary matching and entropy coding during training
  • Multi-lag soft autocorrelation proxy for LZ77-style dictionary matching on serialized quantized weights
  • Soft histogram Shannon entropy proxy for entropy coding friendliness
  • Compression-aware training that jointly optimizes model quality and artifact compressibility
  • Reported artifact size reduction from 12.32 MB to 11.48 MB (about 6.8%) on 1xH100
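The two proxies named above can be sketched in NumPy as follows; all hyperparameters (bin count, kernel bandwidth, lag set) are illustrative assumptions, not the submission's values:

```python
import numpy as np

def soft_histogram_entropy(w, n_bins=16, bandwidth=0.1):
    """Differentiable Shannon-entropy proxy: soft-assign serialized
    weights to histogram bins with a Gaussian kernel, then take the
    entropy of the soft histogram. Lower entropy means the entropy
    coder can spend fewer bits per symbol."""
    centers = np.linspace(w.min(), w.max(), n_bins)
    logits = -((w[:, None] - centers[None, :]) ** 2) / (2 * bandwidth ** 2)
    soft = np.exp(logits - logits.max(axis=1, keepdims=True))
    soft = soft / soft.sum(axis=1, keepdims=True)
    p = soft.mean(axis=0) + 1e-12
    return -np.sum(p * np.log2(p))

def multilag_autocorr_proxy(w, lags=(1, 2, 4, 8)):
    """Soft LZ77-style proxy: reward similarity between the serialized
    weight stream and lagged copies of itself, since repeated structure
    is what dictionary matching exploits."""
    w = (w - w.mean()) / (w.std() + 1e-8)
    sims = [np.mean(w[:-lag] * w[lag:]) for lag in lags]
    return -np.mean(sims)   # lower loss = more self-similarity

rng = np.random.default_rng(3)
repetitive = np.tile(rng.normal(size=8), 64)   # exactly period-8 stream
noise = rng.normal(size=512)                    # no repeated structure
```

On these toy streams, the autocorrelation proxy is near -1 for the periodic signal and near 0 for noise, which is the gradient signal that would push weights toward LZ-matchable structure during training.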