PR #1543

open

PTQ int6-attn + int5-mlp, 20L×256d, mlp=5 — val_bpb 1.3286

by PavelPaha
val_bpb: 1.3286
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
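The GQA configuration above (8 query heads sharing 4 KV heads, so two query heads per KV head) can be sketched as follows. This is an illustrative NumPy implementation, not the PR's actual code; `gqa_attention` is a hypothetical name.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Causal grouped-query attention: groups of query heads share one KV head.
    q: (T, n_heads, d); k, v: (T, n_kv_heads, d)."""
    T, n_heads, d = q.shape
    n_kv_heads = k.shape[1]
    group = n_heads // n_kv_heads            # 2 query heads per KV head here
    k = np.repeat(k, group, axis=1)          # expand KV heads to (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # block future positions
    scores = np.where(mask[None], -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)   # (T, n_heads, d)
```

The KV cache shrinks by the group factor (2x here) relative to full multi-head attention, which matters for the artifact-size budget.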
weight tying
Tied input embeddings and output head embeddings.
parameters: null
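Weight tying reuses the input embedding matrix as the output projection, removing one vocab-by-dim matrix from the parameter count. A minimal sketch; the `TiedLM` class and its dimensions are illustrative, not from the PR:

```python
import numpy as np

class TiedLM:
    """Input embedding and output head share a single weight matrix."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = rng.normal(0.0, 0.02, (vocab, dim)).astype(np.float32)

    def embed(self, ids):
        return self.emb[ids]          # token ids (T,) -> hidden states (T, dim)

    def logits(self, h):
        return h @ self.emb.T         # tied output head: (T, dim) -> (T, vocab)
```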
Quantization
mixed int6/int5
bits: null
scope: attention weights and MLP weights
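Post-training quantization to 6-bit attention weights and 5-bit MLP weights could look like the following symmetric scheme. The granularity (per-row here) and the rounding rule are assumptions; the PR does not state them.

```python
import numpy as np

def quantize(w, bits):
    # Symmetric per-row PTQ: map each row's max |w| to the top positive code.
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6, 15 for int5
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight error by half a quantization step, i.e. `scale / 2` for each row.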
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam":true}
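Muon's distinguishing step is orthogonalizing each matrix update with a Newton-Schulz iteration before applying it; the `{"adam":true}` field suggests Adam handles the non-matrix parameters, as in the reference Muon setup. A sketch of the orthogonalization step, with the quintic coefficients taken from the public Muon implementation (the function name is illustrative):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration that pushes the update's singular
    # values toward 1, approximating the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```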
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":1200,"shape":"trapezoid"}
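With `warmup_steps=20` and `warmdown_iterations=1200`, the trapezoid schedule ramps linearly up, holds a constant plateau, then decays linearly to zero. A sketch; the total step count and base LR are placeholders, not values from the PR:

```python
def trapezoid_lr(step, total_steps, base_lr=1.0,
                 warmup_steps=20, warmdown_iterations=1200):
    # Linear warmup -> constant plateau -> linear warmdown to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iterations:
        return base_lr * (total_steps - step) / warmdown_iterations
    return base_lr
```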
Compression
zlib
level: null
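Since int6/int5 codes stored one per byte leave unused bits, zlib's entropy coding can reclaim much of that slack in the serialized artifact. A sketch of compressing int8-packed quantized weights; the helper names and the compression level are assumptions (the listing leaves `level` null):

```python
import zlib
import numpy as np

def compress_weights(q, level=9):
    # q: int8 array of quantized weight codes, serialized then deflated.
    return zlib.compress(q.astype(np.int8).tobytes(), level)

def decompress_weights(blob, shape):
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
```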
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Post-training quantization with int6 attention weights and int5 MLP weights
  • Muon+Adam optimization with trapezoid learning-rate schedule
  • Grouped query attention with tied embeddings in a 20-layer GPT-like model
  • Int8 plus zlib compression to fit the final artifact within the size limit