val_bpb: 1.3039
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB
Training Techniques

- Quantization
  - Mixed int5/int6: 5-bit weights for the MLPs, 6-bit weights for attention
  - QAT with the straight-through estimator (STE), applied to all weights
- Compression
  - zstd at level 22
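The QAT scheme above can be sketched with fake quantization: the forward pass rounds weights onto the int5/int6 grid while keeping them in float, and under STE the backward pass treats the rounding as identity. Per-tensor symmetric scaling is an assumption here; the actual granularity is not stated in the summary.

```python
def fake_quant(w, bits):
    """Symmetric per-tensor fake quantization: quantize, then dequantize.

    The forward pass sees the quantization error while weights stay in
    float. Under STE, the backward pass treats this op as identity, so
    gradients flow through unchanged. Per-tensor scaling is an assumption.
    """
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = (max(abs(x) for x in w) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q]

mlp_w = fake_quant([0.5, -1.0, 0.25], bits=5)   # int5 grid, as for MLP weights
attn_w = fake_quant([0.5, -1.0, 0.25], bits=6)  # int6 grid, as for attention weights
```

Running this from step 1 means the network trains against the exact grid it will be deployed on, which is the point of the early-QAT contribution listed below.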
Architecture

- GQA: grouped-query attention with fewer KV heads than query heads (heads: 8, kv_heads: 4)
- U-Net skip connections: U-Net-style skip connections added to the model
- Weight tying: input and output embeddings are tied
- MLP3x: 3x MLP expansion (expansion: 3)
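With 8 query heads and 4 KV heads, each KV head is shared by a fixed group of query heads, halving the KV projections relative to full multi-head attention. A minimal sketch of the standard GQA grouping rule (the routing function is generic GQA, not taken from the submission's code):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    # Standard GQA grouping: consecutive query heads share one KV head.
    # With 8 query heads and 4 KV heads, each KV head serves a group of
    # 2 query heads, halving the K/V parameter and cache footprint.
    group_size = n_heads // n_kv_heads        # 2 in this configuration
    return q_head // group_size

# Query heads 0-1 attend through KV head 0, heads 2-3 through KV head 1, etc.
```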
Optimizer

- Muon (weight decay and momentum not reported)
- Adam used for embeddings and scalar parameters
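One common way to realize this split — Muon for matrix-shaped weights, Adam for embeddings and scalars — is to route parameters by name and dimensionality. The routing rule below is a hypothetical sketch inferred from the summary, not the submission's actual code:

```python
def optimizer_for(name, ndim):
    # Hypothetical routing rule: Muon updates 2-D weight matrices
    # (attention and MLP projections), while Adam handles embeddings,
    # norms, biases, and other scalar/vector parameters, matching the
    # "Adam for embeddings/scalars" note above.
    if ndim == 2 and "embed" not in name:
        return "muon"
    return "adam"
```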
Regularization

- Weight decay: 0.04
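If applied in decoupled (AdamW-style) form — an assumption about how the decay is applied; only the value 0.04 is reported above — each step shrinks weights toward zero independently of the gradient update:

```python
def decoupled_weight_decay(w, lr, wd=0.04):
    # Decoupled (AdamW-style) weight decay: multiply weights by
    # (1 - lr * wd) each step, separate from the gradient-based update.
    # wd=0.04 is the reported value; the decoupled form is an assumption.
    return [wi * (1.0 - lr * wd) for wi in w]
```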
Novel Contributions
- Mixed INT5/INT6 quantization with INT5 for MLP weights and INT6 for attention weights
- Quantization-aware training from step 1 using fake-quantized forward passes and STE
- Entropy-aware compression perspective showing QAT reduces weight entropy and improves compressibility
- Demonstrated that early QAT substantially outperforms late QAT for post-quantization quality
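The entropy-aware compression perspective can be illustrated with a small experiment, using stdlib zlib as a stand-in for zstd (an assumption made for portability): a byte stream drawn from a 32-symbol alphabet, like int5 weight codes, compresses far better than full-range bytes because a general-purpose entropy coder exploits the lower per-symbol entropy.

```python
import random
import zlib

random.seed(0)
n = 1 << 16
# Full-range bytes: ~8 bits/symbol of entropy, essentially incompressible.
raw = bytes(random.randrange(256) for _ in range(n))
# 32-symbol bytes (like int5 codes): ~5 bits/symbol, so an entropy coder
# can pack them well below their raw size.
codes = bytes(random.randrange(32) for _ in range(n))

raw_size = len(zlib.compress(raw, 9))
code_size = len(zlib.compress(codes, 9))
# The lower-entropy stream compresses substantially better, which is the
# mechanism behind QAT improving zstd compressibility of the artifact.
```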