PR #1258 (open)

11L INT7 + MuonWD + SWA (preliminary)

by jorge-asenjo
val_bpb: 1.3874
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10.5 MB

Training Techniques

Quantization
  • int7 — bits: 7, scope: all
Architecture
  • weight tying — tied embeddings used in the transformer
  • KV head count — uses fewer KV heads than attention heads (heads: 8, kv_heads: 4)
Optimizer
  • Muon — weight_decay: 0.04
Weight Averaging
  • SWA — start_pct: 2
Regularization
  • weight decay — value: 0.04, applied to matrix parameters
Compression
  • zlib
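The per-row INT7 quantization with percentile clipping and zlib compression listed above can be sketched as follows. This is a minimal illustration, not the PR's exact implementation: the function names, the symmetric [-63, 63] code range, and the percentile default are assumptions.

```python
import zlib
import numpy as np

def quantize_int7_per_row(w: np.ndarray, pct: float = 99.9):
    """Per-row symmetric INT7 quantization with percentile clipping.

    Each row gets its own clipping threshold (the `pct` percentile of its
    absolute values) and scale; codes land in the signed range [-63, 63].
    """
    clip = np.percentile(np.abs(w), pct, axis=1, keepdims=True)
    clip = np.maximum(clip, 1e-8)        # guard against all-zero rows
    scale = clip / 63.0                  # one float scale per row
    q = np.clip(np.round(w / scale), -63, 63).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

def compress_codes(q: np.ndarray, level: int = 9) -> bytes:
    # zlib entropy-codes the INT7 values (stored one per byte);
    # the narrow code range makes them compress well.
    return zlib.compress(q.tobytes(), level)
```

Percentile clipping trades a little error on outlier weights for a finer quantization step on the bulk of each row, which is typically a net win for transformer matrices.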

Novel Contributions

  • INT7 quantization as a sweet spot between INT8 and INT6
  • Muon weight decay applied to matrix parameters to improve quantization
  • SWA checkpoint averaging
  • Per-row INT7 quantization with percentile clipping and zlib compression
  • Architecture search across depth, width, quantization, QAT, recurrence, weight decay, and SWA
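SWA checkpoint averaging (presumably governed by the `start_pct: 2` parameter above) reduces to a uniform mean over saved weights. A minimal sketch, assuming checkpoints are stored as dicts of NumPy arrays; a running mean avoids holding all checkpoints in memory at once:

```python
import numpy as np

def swa_average(checkpoints):
    """Uniformly average a sequence of checkpoints (name -> np.ndarray dicts).

    Uses the incremental mean update avg += (w - avg) / (i + 1), so only
    one checkpoint needs to be resident alongside the running average.
    """
    avg = {}
    for i, ckpt in enumerate(checkpoints):
        for name, w in ckpt.items():
            if i == 0:
                avg[name] = w.astype(np.float64).copy()
            else:
                avg[name] += (w - avg[name]) / (i + 1)
    # cast back to the storage dtype after accumulating in float64
    return {k: v.astype(np.float32) for k, v in avg.items()}
```

In practice the averaged weights are evaluated as a separate model; since averaging happens after training, it adds no training cost.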