PR #1357 (open)

Non-record: 12L Compression-Aware Training Orchestration with ProxQuant

by mollahasani

val_bpb: 1.2200
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13-17 MB

Training Techniques

Architecture
Transformer
12-layer transformer with 3x MLP, GQA, Partial RoPE, tied embeddings, BigramHash, LeakyReLU^2, and U-Net skip connections.
parameters: {"layers":12,"model_dim":512,"attention_heads":8,"kv_heads":4,"mlp_multiplier":3,"mlp_hidden":1536,"rope_dims":"16/64","vocab_size":1024,"bigram_buckets":1536}
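As a sanity check on the artifact size, the config above implies roughly 30M weight-matrix parameters. The back-of-envelope count below is an illustration derived from the JSON parameters only; it ignores norms, biases, and any U-Net skip projections, and assumes tied embeddings are stored once:

```python
def transformer_param_count(layers=12, d=512, head_dim=64, kv_heads=4,
                            mlp_hidden=1536, vocab=1024, bigram_buckets=1536):
    # Per layer: Q and O projections (d x d), K and V against 4 KV heads
    # (GQA), and two 3x-MLP matrices (d x mlp_hidden).
    attn = d * d + 2 * d * (kv_heads * head_dim) + d * d
    mlp = 2 * d * mlp_hidden
    emb = (vocab + bigram_buckets) * d      # tied embeddings counted once
    return layers * (attn + mlp) + emb

total = transformer_param_count()           # ~29.6M parameters
raw_bytes_6bit = total * 6 // 8             # ~22 MB at 6 bits, pre-compression
```

About 22 MB at 6 bits before sparsity and entropy coding, which is roughly consistent with the reported 13-17 MB artifact after 15-22% pruning and zstd level 22.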
Quantization
QAT
bits: 6
scope: all
STE QAT
bits: 6
scope: all
ProxQuant
bits: 6
scope: all
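ProxQuant replaces the hard straight-through rounding of STE QAT with a proximal step whose strength is annealed toward full quantization. A minimal per-tensor sketch of that idea (the grid layout, scale choice, and linear schedule are assumptions for illustration, not the PR's code):

```python
import numpy as np

def quantize_6bit(w, scale):
    # Nearest point on a symmetric 6-bit grid (63 steps across [-scale, scale]).
    step = 2 * scale / (2 ** 6 - 1)
    return np.clip(np.round(w / step) * step, -scale, scale)

def proxquant_step(w, lam):
    # Proximal step: pull each weight a fraction lam of the way toward its
    # quantization grid point. lam = 0 leaves full precision; lam = 1 lands
    # exactly on the grid.
    scale = np.max(np.abs(w)) + 1e-12
    return w + lam * (quantize_6bit(w, scale) - w)

def lam_schedule(step, total_steps):
    # Anneal the pull strength linearly over the QAT phase.
    return min(1.0, step / total_steps)
```

Early in the phase, gradients flow through near-full-precision weights; by the end, the weights sit exactly on the 6-bit grid.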
Regularization
magnitude pruning
parameters: {"sparsity":"15-22%","schedule":"cubic"}
weight decay
parameters: {"initial":0.04,"final":0.08}
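The cubic schedule presumably follows the standard gradual-pruning ramp (Zhu & Gupta): sparsity rises quickly early in the phase and flattens near the final target, 15-22% in this run. A sketch taking 20% as an example target (function names and the mask heuristic are illustrative):

```python
import numpy as np

def cubic_sparsity(step, total_steps, s_init=0.0, s_final=0.20):
    # Cubic ramp: most of the sparsity is introduced early, giving the
    # network more steps to recover from the aggressive initial pruning.
    t = min(max(step / total_steps, 0.0), 1.0)
    return s_final + (s_init - s_final) * (1.0 - t) ** 3

def magnitude_mask(w, sparsity):
    # Keep the largest-magnitude entries; zero the smallest `sparsity` fraction.
    k = int(sparsity * w.size)
    if k == 0:
        return np.ones_like(w, dtype=bool)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) > thresh
```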
Weight Averaging
EMA
parameters: {"decay":0.997}
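With decay 0.997 the EMA effectively averages over the last ~1/(1-0.997) ≈ 333 steps; the averaged copy, not the raw training weights, is what gets quantized and shipped. A one-line sketch (the dict-of-tensors layout is illustrative):

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of the model parameters.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}
```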
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
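Muon pairs momentum with a Newton-Schulz orthogonalization of each 2D update before applying it. A rough sketch of that orthogonalization step (the quintic coefficients follow the commonly published Muon reference implementation; treat the rest as illustrative):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Approximately orthogonalize G: drive all singular values toward 1
    # without an explicit (expensive) SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After a handful of iterations the singular values oscillate in a narrow band around 1, which is close enough for an optimizer update.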
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
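Sliding-window evaluation advances the context window 64 tokens at a time and scores only the tokens not already covered, so almost every scored token sees near-full left context. A sketch of the span bookkeeping (the window length of 256 is an assumption; only the stride is given):

```python
def sliding_eval_spans(n_tokens, window=256, stride=64):
    # Each window covers tokens [start, end); only [score_from, end) --
    # the tokens not scored by a previous window -- contribute to the loss.
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, and tokens after the first window each get at least window - stride tokens of context.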
Other
Multi-phase training orchestration combining clean training, gradual pruning, QAT, PERP recovery, and serialization-aware neuron reordering.
parameters: {"phases":5}
PERP recovery by retraining biases and layer norms after compression.
parameters: {"steps":200}
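PERP recovery freezes every weight matrix and briefly retrains only the cheap-to-store parameters, biases and layer norms, to absorb the damage from pruning and quantization. A sketch of the parameter selection (the name-matching heuristic and the names themselves are illustrative, not the PR's logic):

```python
def perp_trainable(param_names):
    # Select only biases and normalization parameters for the short
    # (200-step) recovery phase; everything else stays frozen.
    def is_recoverable(name):
        return name.endswith(".bias") or ".norm" in name or ".ln" in name
    return [n for n in param_names if is_recoverable(n)]
```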
Neuron reordering by sorting MLP hidden neurons by L1 norm before serialization to improve lossless compression.
parameters: null
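The reordering is a pure permutation of hidden units: it rearranges the serialized byte stream into a more compressible order without changing the function the network computes. A NumPy sketch (shapes are illustrative, and plain ReLU stands in for the run's LeakyReLU^2 activation):

```python
import numpy as np

def reorder_mlp(w_in, b_in, w_out):
    # Sort hidden neurons by the L1 norm of their input weights. Applying
    # the same permutation to the rows of w_in/b_in and the columns of
    # w_out leaves the MLP's input-output map unchanged.
    order = np.argsort(np.abs(w_in).sum(axis=1))
    return w_in[order], b_in[order], w_out[:, order]

def mlp(x, w_in, b_in, w_out):
    h = np.maximum(x @ w_in.T + b_in, 0.0)   # ReLU stand-in
    return h @ w_out.T
```

Placing similar-magnitude rows next to each other tends to make the quantized bytes more repetitive, which is what zstd exploits.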

Novel Contributions

  • Multi-phase training orchestration coordinating pruning, quantization, PERP recovery, and serialization
  • ProxQuant progressive QAT with gradual grid annealing
  • Prune-before-quantize scheduling based on the Progressive Intensity Hypothesis
  • PERP post-compression recovery of biases and layer norms
  • Neuron reordering to improve lossless compression
  • Fitting a 12-layer transformer under the 16 MB artifact budget