PR #1385

open

Non-record: Compressor-Aware Training (CAT), differentiable compression proxies for LZ-family compressors

by korentomas
val_bpb: 1.4465
Architecture: Transformer
Optimizer:
Artifact Size: 11.48 MB

Training Techniques

Quantization
STE QAT
bits: 8
scope: all
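The STE QAT entry (8 bits, all weights) can be sketched as symmetric per-tensor fake quantization; the exact quantizer in the submission is not described, so the scale choice below is an assumption.

```python
import numpy as np

def fake_quant_int8(w, eps=1e-8):
    """Symmetric per-tensor int8 fake quantization (forward pass only).

    Under STE-based QAT, the backward pass treats round() and clip()
    as identity, so gradients flow through to w unchanged.
    """
    scale = np.abs(w).max() / 127.0 + eps     # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale, q, scale

w = np.array([0.5, -1.27, 0.003, 1.0])
w_dq, q, scale = fake_quant_int8(w)           # dequantized weights, int codes
```

The quantization error per element is bounded by half a quantization step, which is what makes an 8-bit artifact nearly lossless for well-scaled tensors.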
Compression
zlib
level: null
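Since the artifact is zlib-compressed, the size metric can be reproduced by compressing the serialized weights directly. The int8 serialization and value range below are illustrative assumptions; `level: null` is read here as zlib's default compression level.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical serialized int8 weight tensor with a narrow value range
# (low entropy), which is what the compression proxies aim to encourage.
q = rng.integers(-8, 8, size=10_000).astype(np.int8)
raw = q.tobytes()
packed = zlib.compress(raw)        # default level, per "level: null"
ratio = len(packed) / len(raw)     # < 1.0 for compressible streams
```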
Architecture
GQA
The transformer uses grouped-query attention with 14 query heads and 2 shared KV heads.
parameters: {"heads":14,"kv_heads":2}
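A minimal sketch of the 14-head / 2-KV-head grouping, assuming standard scaled dot-product attention (batch dimension and causal masking omitted):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=14, n_kv_heads=2):
    """Grouped-query attention: each KV head serves
    n_heads // n_kv_heads query heads (7 per group here).
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)   # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

T, d = 5, 8
rng = np.random.default_rng(1)
q = rng.normal(size=(14, T, d))
k = rng.normal(size=(2, T, d))
v = rng.normal(size=(2, T, d))
out = gqa_attention(q, k, v)
```

The KV cache shrinks by 7x relative to full multi-head attention, which also shrinks the stored KV projection weights in the artifact.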
depth recurrence
The 4 physical transformer layers are looped 3 times, giving 12 effective layers.
parameters: {"physical_layers":4,"loops":3,"effective_layers":12}
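The depth recurrence amounts to running the shared 4-layer stack 3 times; the stand-in layers below are purely illustrative:

```python
def recurrent_forward(x, layers, loops=3):
    """Apply the physical layer stack `loops` times: with 4 shared
    layers and 3 loops, the model gets 12 effective layers of compute
    while only storing 4 layers' worth of weights."""
    for _ in range(loops):
        for layer in layers:
            x = layer(x)
    return x

# Stand-in "layers" that just add a constant, to show the control flow.
layers = [lambda x, i=i: x + i for i in range(4)]
y = recurrent_forward(0, layers)   # (0+1+2+3) per loop, times 3 loops
```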
LoRA
Per-loop LoRA adapters are used in the recurrent transformer.
parameters: {"rank":16}
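A per-loop LoRA setup can be sketched as one rank-16 adapter pair per loop iteration on top of the shared weight, so each pass through the stack is cheaply differentiated; the shapes and the zero-init of B are conventional LoRA assumptions, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, loops = 32, 16, 3
W = rng.normal(size=(d, d))                    # shared physical weight
# One rank-16 adapter pair per loop iteration (hypothetical shapes).
A = [rng.normal(size=(d, r)) * 0.01 for _ in range(loops)]
B = [np.zeros((r, d)) for _ in range(loops)]   # zero-init: delta starts at 0

def layer_at_loop(x, i):
    """Effective weight at loop i is W + A_i @ B_i."""
    return x @ (W + A[i] @ B[i])

x = rng.normal(size=(1, d))
y0 = layer_at_loop(x, 0)   # equals x @ W until B_0 is trained
```

The adapters add only 2 * d * r parameters per loop, a small cost next to the shared d * d weight.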
LR Schedule
warmdown
parameters: {"cooldown":true}
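A warmdown schedule (constant LR followed by a linear cooldown to zero) can be sketched as follows; the cooldown length is an illustrative assumption:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, cooldown_steps=30):
    """Constant LR, then a linear 'warmdown' to zero over the final
    cooldown_steps of training (length here is an assumption)."""
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return base_lr
    return base_lr * (total_steps - step) / cooldown_steps
```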
Regularization
compression-aware regularization
parameters: {"lambda_lz":0.01,"lambda_h":0.1}
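Given the reported weights, the regularized objective is presumably a weighted sum of the task loss and the two compression proxies; a minimal sketch under that assumption:

```python
def total_loss(task_loss, lz_proxy, entropy_proxy,
               lambda_lz=0.01, lambda_h=0.1):
    """Task loss plus the two compression-proxy penalties, using the
    lambda_lz and lambda_h values reported in the parameters."""
    return task_loss + lambda_lz * lz_proxy + lambda_h * entropy_proxy
```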

Novel Contributions

  • Differentiable proxy losses for LZ-family dictionary matching and entropy coding during training
  • Multi-lag soft autocorrelation proxy for LZ77-style dictionary matching on serialized quantized weights
  • Soft histogram Shannon entropy proxy for entropy coding friendliness
  • Compression-aware training that jointly optimizes model quality and artifact compressibility
  • Reported artifact size reduction from 12.32 MB to 11.48 MB (about 6.8%) on 1xH100
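The two proxies named above can be sketched in NumPy as follows; all hyperparameters (bin count, kernel bandwidth, lag set) are illustrative assumptions, not the submission's values:

```python
import numpy as np

def soft_histogram_entropy(w, n_bins=16, bandwidth=0.1):
    """Differentiable Shannon-entropy proxy: soft-assign serialized
    weights to histogram bins with a Gaussian kernel, then take the
    entropy of the soft histogram. Lower entropy means the entropy
    coder can spend fewer bits per symbol."""
    centers = np.linspace(w.min(), w.max(), n_bins)
    logits = -((w[:, None] - centers[None, :]) ** 2) / (2 * bandwidth ** 2)
    soft = np.exp(logits - logits.max(axis=1, keepdims=True))
    soft = soft / soft.sum(axis=1, keepdims=True)
    p = soft.mean(axis=0) + 1e-12
    return -np.sum(p * np.log2(p))

def multilag_autocorr_proxy(w, lags=(1, 2, 4, 8)):
    """Soft LZ77-style proxy: reward similarity between the serialized
    weight stream and lagged copies of itself, since repeated structure
    is what dictionary matching exploits."""
    w = (w - w.mean()) / (w.std() + 1e-8)
    sims = [np.mean(w[:-lag] * w[lag:]) for lag in lags]
    return -np.mean(sims)   # lower loss = more self-similarity

rng = np.random.default_rng(3)
repetitive = np.tile(rng.normal(size=8), 64)   # exactly period-8 stream
noise = rng.normal(size=512)                    # no repeated structure
```

On these toy streams, the autocorrelation proxy is near -1 for the periodic signal and near 0 for noise, which is the gradient signal that would push weights toward LZ-matchable structure during training.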