PR #975

open

Non-record: QNA + SQWA compression thesis (8xH100 SXM)

by Abhishek8108
val_bpb: 1.1216
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.15 MB

Training Techniques

Quantization
  late QAT (bits: 6, scope: all)
  QAT (bits: 6, scope: all)
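The PR's code is not shown here, so as a minimal sketch: QAT with 6-bit weights typically means "fake quantization" in the forward pass, i.e. rounding weights to a signed int6 grid and dequantizing back, with gradients passed straight through ("late QAT" presumably enables this only for the final stretch of training). Assuming symmetric per-tensor absmax scaling:

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: round weights onto a signed
    int grid, then dequantize. In QAT the forward pass sees these values;
    gradients flow via the straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6-bit signed
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

With 6 bits the round-trip error per weight is at most half a quantization bin, which is what the training loop must learn to tolerate.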
Weight Averaging
  EMA (decay: 0.997)
  SWA (start_step: null, every_steps: 50)
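As a sketch of the two listed averagers (the actual training loop is not shown in this card; `start_step` is null in the listing, so it is omitted here): an exponential moving average with decay 0.997 updated every step, plus a plain SWA running mean snapshotted every 50 steps:

```python
import numpy as np

class WeightAverager:
    """EMA (decay 0.997) plus SWA snapshots every 50 steps, per the listed
    parameters. Averages a single flat weight array for simplicity."""
    def __init__(self, w, decay=0.997, every_steps=50):
        self.ema = w.copy()
        self.decay = decay
        self.every = every_steps
        self.swa_sum = np.zeros_like(w)
        self.swa_count = 0

    def update(self, w, step):
        # EMA: blend a small fraction of the current weights in every step.
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        # SWA: accumulate an unweighted mean of periodic snapshots.
        if step % self.every == 0:
            self.swa_sum += w
            self.swa_count += 1

    def swa(self):
        return self.swa_sum / max(self.swa_count, 1)
```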
Compression
  lzma (level: null)
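The 16.15 MB artifact size suggests the quantized weights are serialized and lzma-compressed; low-bit weights compress well because their byte distribution is narrow. A sketch using Python's standard `lzma` module (the level is null in the listing, so the default preset is used):

```python
import lzma
import numpy as np

def compress_weights(w_int, preset=None):
    """Compress quantized integer weights with lzma. The listing gives no
    compression level, so we fall back to lzma's default preset."""
    raw = w_int.astype(np.int8).tobytes()
    if preset is None:
        return lzma.compress(raw)
    return lzma.compress(raw, preset=preset)
```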
Evaluation
  sliding window eval (stride: 64)
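Sliding-window evaluation with a small stride usually means sliding a full-length context window forward 64 tokens at a time and scoring only the newly exposed tokens, so every token is predicted with near-maximal context. A sketch of the window placement (the exact scoring convention in the PR is not shown; this follows the common one):

```python
def window_starts(n_tokens, window=2048, stride=64):
    """Start offsets for sliding-window evaluation with stride 64.
    The first window scores all its tokens; each later window re-scores
    only its final `stride` tokens."""
    starts = [0]
    while starts[-1] + window < n_tokens:
        starts.append(starts[-1] + stride)
    return starts
```

Note the cost: stride 64 against a 2048-token window means roughly 32 forward passes per window-length of text, which is why this setting is used for evaluation only.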
Architecture
  GQA: grouped query attention with fewer KV heads than query heads (query_heads: 8, kv_heads: 4)
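With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. The mechanical core is just repeating the K/V tensors along the head axis before attention; a minimal sketch:

```python
import numpy as np

def gqa_expand_kv(kv, query_heads=8, kv_heads=4):
    """In GQA each group of query heads shares one KV head: with 8 query
    and 4 KV heads, every KV head serves 2 query heads. `kv` has shape
    (kv_heads, seq, head_dim); returns shape (query_heads, seq, head_dim)."""
    group = query_heads // kv_heads
    return np.repeat(kv, group, axis=0)
```

The payoff is a KV cache (and KV projection weight) half the size of full multi-head attention, which matters for a submission judged on artifact size.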
  ReLU²: squared ReLU activation
  LeakyReLU: leaky ReLU activation (slope: 0.5)
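Both listed activations are one-liners; for concreteness (where in the network each is used is not specified in this card):

```python
import numpy as np

def relu_squared(x):
    """Squared ReLU: max(x, 0)^2, a common drop-in for GELU in MLP blocks."""
    return np.maximum(x, 0.0) ** 2

def leaky_relu(x, slope=0.5):
    """Leaky ReLU with the listed slope of 0.5 on the negative side."""
    return np.where(x > 0, x, slope * x)
```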
  XSA: applied to the last layers (layers: 4)
  Partial RoPE: rotary position embeddings applied to a subset of dimensions (dimensions: 16, base_dimensions: 64)
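Partial RoPE rotates only the first 16 of the 64 head dimensions and leaves the rest position-independent. A sketch for a single head vector (the pairing convention and base frequency are assumptions, not taken from the PR):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` entries of x
    (16 of the 64 head dims per the listing); the rest pass through.
    Pairs dimension i with dimension i + rot_dims//2."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # geometric frequency ladder
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2, rest = x[:half], x[half:rot_dims], x[rot_dims:]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rot, rest])
```

Because the rotation is orthogonal, it changes direction but not norm, and at position 0 it is the identity.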
  LN Scale: LayerNorm scale modification
  BigramHash: bigram hash embedding feature (vocab_size: 2048, dim: 128)
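A bigram hash embedding maps each (previous token, current token) pair into a fixed table of 2048 buckets, whose 128-dim learned embeddings are added to the input. The hash function below is an illustrative choice (an odd multiplier plus XOR), not the PR's:

```python
import numpy as np

def bigram_hash_ids(tokens, table_size=2048):
    """Hash each (prev, cur) token bigram into one of `table_size` buckets.
    The resulting ids index a learned embedding table of shape
    (table_size, 128) per the listing. Multiplier is an arbitrary
    mixing constant, not taken from the PR."""
    ids = []
    prev = 0  # assumed padding id for the first position
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % table_size)
        prev = t
    return np.array(ids)
```

Collisions are expected and harmless at this scale; the table is a cheap way to give the model direct access to local bigram statistics.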
  SmearGate: gating mechanism
Sequence Length
  train_length: 2048, eval_length: 2048
Optimizer
  Muon (weight_decay: 0.04, momentum: null, other_params: null)
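Muon's core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it as an update. As a sketch of that core (using the plain cubic iteration; Muon itself uses tuned higher-order coefficients, and the PR's momentum/other parameters are null in the listing):

```python
import numpy as np

def newton_schulz_orth(G, steps=10):
    """Newton-Schulz iteration approximating the orthogonal polar factor
    of G, the core of Muon's update. Normalizing by the Frobenius norm
    keeps all singular values in the iteration's convergence region."""
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        # Cubic iteration: drives every singular value toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```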
LR Schedule
  warmdown (warmdown_steps: null)
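A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final stretch of training. The step count is null in the listing, so it is left as a required argument here:

```python
def warmdown_lr(step, total_steps, warmdown_steps, base_lr=1.0):
    """Warmdown: constant base_lr, then a linear ramp to zero over the
    last `warmdown_steps` steps (count unspecified in the listing)."""
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```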
Regularization
  LN scale

Novel Contributions

  • Quantization Noise Annealing (QNA) to inject int6-like noise during training
  • Stochastic Quantized Weight Averaging (SQWA) using quantize-dequantize EMA snapshots
  • Controlled 3-run ablation showing reduced quantization gap without improving final val_bpb
  • Analysis showing that float-model quality, not quantization error, is the main bottleneck
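The two named contributions can be sketched from their one-line descriptions; everything beyond those descriptions (the noise schedule, the snapshot selection) is an assumption here, not the PR's implementation:

```python
import numpy as np

def qna_noise(w, step, total_steps, bits=6):
    """Quantization Noise Annealing (sketch): add uniform noise whose width
    ramps from zero to one int6 quantization bin over training, so the model
    gradually adapts to int6-like perturbations. A linear ramp is assumed."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(w).max()
    bin_width = absmax / qmax if absmax > 0 else 0.0
    t = min(step / total_steps, 1.0)
    noise = np.random.uniform(-0.5, 0.5, size=w.shape) * bin_width * t
    return w + noise

def sqwa(snapshots, bits=6):
    """Stochastic Quantized Weight Averaging (sketch): quantize-dequantize
    each weight snapshot to int6, then average the dequantized copies, so
    the averaged model sits near the int6 grid it will be stored on."""
    qmax = 2 ** (bits - 1) - 1
    deq = []
    for w in snapshots:
        scale = np.abs(w).max() / qmax
        deq.append(np.round(w / scale) * scale if scale > 0 else w)
    return np.mean(deq, axis=0)
```

The ablation's conclusion fits this picture: both tricks shrink the float-to-int6 gap, but if the float model's val_bpb is the binding constraint, closing that gap cannot improve the final score.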