PR #1357 (open)

Non-record: 12L Compression-Aware Training Orchestration with ProxQuant

by mollahasani

val_bpb: 1.2200
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13-17 MB

Training Techniques

Architecture
Transformer
12-layer transformer with 3x MLP, GQA, Partial RoPE, tied embeddings, BigramHash, LeakyReLU^2, and U-Net skip connections.
parameters: {"layers":12,"model_dim":512,"attention_heads":8,"kv_heads":4,"mlp_multiplier":3,"mlp_hidden":1536,"rope_dims":"16/64","vocab_size":1024,"bigram_buckets":1536}
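As a sanity check on the artifact size, the config above implies roughly 30M weight-matrix parameters. The back-of-envelope count below is an illustration derived from the JSON parameters only; it ignores norms, biases, and any U-Net skip projections, and assumes tied embeddings are stored once:

```python
def transformer_param_count(layers=12, d=512, head_dim=64, kv_heads=4,
                            mlp_hidden=1536, vocab=1024, bigram_buckets=1536):
    # Per layer: Q and O projections (d x d), K and V against 4 KV heads
    # (GQA), and two 3x-MLP matrices (d x mlp_hidden).
    attn = d * d + 2 * d * (kv_heads * head_dim) + d * d
    mlp = 2 * d * mlp_hidden
    emb = (vocab + bigram_buckets) * d      # tied embeddings counted once
    return layers * (attn + mlp) + emb

total = transformer_param_count()           # ~29.6M parameters
raw_bytes_6bit = total * 6 // 8             # ~22 MB at 6 bits, pre-compression
```

About 22 MB at 6 bits before sparsity and entropy coding, which is roughly consistent with the reported 13-17 MB artifact after 15-22% pruning and zstd level 22.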
Quantization
QAT
bits: 6
scope: all
STE QAT
bits: 6
scope: all
ProxQuant
bits: 6
scope: all
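ProxQuant replaces the hard straight-through rounding of STE QAT with a proximal step whose strength is annealed toward full quantization. A minimal per-tensor sketch of that idea (the grid layout, scale choice, and linear schedule are assumptions for illustration, not the PR's code):

```python
import numpy as np

def quantize_6bit(w, scale):
    # Nearest point on a symmetric 6-bit grid (63 steps across [-scale, scale]).
    step = 2 * scale / (2 ** 6 - 1)
    return np.clip(np.round(w / step) * step, -scale, scale)

def proxquant_step(w, lam):
    # Proximal step: pull each weight a fraction lam of the way toward its
    # quantization grid point. lam = 0 leaves full precision; lam = 1 lands
    # exactly on the grid.
    scale = np.max(np.abs(w)) + 1e-12
    return w + lam * (quantize_6bit(w, scale) - w)

def lam_schedule(step, total_steps):
    # Anneal the pull strength linearly over the QAT phase.
    return min(1.0, step / total_steps)
```

Early in the phase, gradients flow through near-full-precision weights; by the end, the weights sit exactly on the 6-bit grid.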
Regularization
magnitude pruning
parameters: {"sparsity":"15-22%","schedule":"cubic"}
weight decay
parameters: {"initial":0.04,"final":0.08}
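The cubic schedule presumably follows the standard gradual-pruning ramp (Zhu & Gupta): sparsity rises quickly early in the phase and flattens near the final target, 15-22% in this run. A sketch taking 20% as an example target (function names and the mask heuristic are illustrative):

```python
import numpy as np

def cubic_sparsity(step, total_steps, s_init=0.0, s_final=0.20):
    # Cubic ramp: most of the sparsity is introduced early, giving the
    # network more steps to recover from the aggressive initial pruning.
    t = min(max(step / total_steps, 0.0), 1.0)
    return s_final + (s_init - s_final) * (1.0 - t) ** 3

def magnitude_mask(w, sparsity):
    # Keep the largest-magnitude entries; zero the smallest `sparsity` fraction.
    k = int(sparsity * w.size)
    if k == 0:
        return np.ones_like(w, dtype=bool)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) > thresh
```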
Weight Averaging
EMA
parameters: {"decay":0.997}
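With decay 0.997 the EMA effectively averages over the last ~1/(1-0.997) ≈ 333 steps; the averaged copy, not the raw training weights, is what gets quantized and shipped. A one-line sketch (the dict-of-tensors layout is illustrative):

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of the model parameters.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}
```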
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
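Muon pairs momentum with a Newton-Schulz orthogonalization of each 2D update before applying it. A rough sketch of that orthogonalization step (the quintic coefficients follow the commonly published Muon reference implementation; treat the rest as illustrative):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Approximately orthogonalize G: drive all singular values toward 1
    # without an explicit (expensive) SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After a handful of iterations the singular values oscillate in a narrow band around 1, which is close enough for an optimizer update.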
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
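Sliding-window evaluation advances the context window 64 tokens at a time and scores only the tokens not already covered, so almost every scored token sees near-full left context. A sketch of the span bookkeeping (the window length of 256 is an assumption; only the stride is given):

```python
def sliding_eval_spans(n_tokens, window=256, stride=64):
    # Each window covers tokens [start, end); only [score_from, end) --
    # the tokens not scored by a previous window -- contribute to the loss.
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, and tokens after the first window each get at least window - stride tokens of context.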
Other
Multi-phase training orchestration combining clean training, gradual pruning, QAT, PERP recovery, and serialization-aware neuron reordering.
parameters: {"phases":5}
PERP recovery by retraining biases and layer norms after compression.
parameters: {"steps":200}
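PERP recovery freezes every weight matrix and briefly retrains only the cheap-to-store parameters, biases and layer norms, to absorb the damage from pruning and quantization. A sketch of the parameter selection (the name-matching heuristic and the names themselves are illustrative, not the PR's logic):

```python
def perp_trainable(param_names):
    # Select only biases and normalization parameters for the short
    # (200-step) recovery phase; everything else stays frozen.
    def is_recoverable(name):
        return name.endswith(".bias") or ".norm" in name or ".ln" in name
    return [n for n in param_names if is_recoverable(n)]
```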
Neuron reordering by sorting MLP hidden neurons by L1 norm before serialization to improve lossless compression.
parameters: null
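The reordering is a pure permutation of hidden units: it rearranges the serialized byte stream into a more compressible order without changing the function the network computes. A NumPy sketch (shapes are illustrative, and plain ReLU stands in for the run's LeakyReLU^2 activation):

```python
import numpy as np

def reorder_mlp(w_in, b_in, w_out):
    # Sort hidden neurons by the L1 norm of their input weights. Applying
    # the same permutation to the rows of w_in/b_in and the columns of
    # w_out leaves the MLP's input-output map unchanged.
    order = np.argsort(np.abs(w_in).sum(axis=1))
    return w_in[order], b_in[order], w_out[:, order]

def mlp(x, w_in, b_in, w_out):
    h = np.maximum(x @ w_in.T + b_in, 0.0)   # ReLU stand-in
    return h @ w_out.T
```

Placing similar-magnitude rows next to each other tends to make the quantized bytes more repetitive, which is what zstd exploits.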

Novel Contributions

  • Multi-phase training orchestration coordinating pruning, quantization, PERP recovery, and serialization
  • ProxQuant progressive QAT with gradual grid annealing
  • Prune-before-quantize scheduling based on the Progressive Intensity Hypothesis
  • PERP post-compression recovery of biases and layer norms
  • Neuron reordering to improve lossless compression
  • Fitting a 12-layer transformer under the 16 MB artifact budget