val_bpb: 1.1216
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.15 MB

Training Techniques

  Quantization
    late QAT (bits: 6, scope: all)
    QAT (bits: 6, scope: all)
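As a sketch of the 6-bit quantize-dequantize step that QAT inserts into the forward pass (symmetric per-tensor scaling is an assumption, and `fake_quant` is an illustrative name, not the run's code):

```python
def fake_quant(weights, bits=6):
    """Quantize-dequantize ("fake quant"): round each weight onto one of
    2**bits uniform levels, then map back to float, so training sees
    int6-like rounding error while gradients still flow in float."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid zero scale
    return [round(w / scale) * scale for w in weights]

quantized = fake_quant([0.8, -0.31, 0.02, -1.0])
```

"late QAT" would enable this only for the final portion of training, after the float model has mostly converged.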

  Weight Averaging
    EMA (decay: 0.997)
    SWA (start_step: null, every_steps: 50)
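A minimal sketch of the two averaging schemes, assuming their standard definitions (EMA with decay 0.997 per the card; SWA as a plain mean over snapshots taken every 50 steps); class and function names are illustrative:

```python
class EMA:
    """Exponential moving average of model weights."""
    def __init__(self, weights, decay=0.997):
        self.decay = decay
        self.shadow = list(weights)  # averaged copy, used for eval

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * w
                       for s, w in zip(self.shadow, weights)]

def swa_average(snapshots):
    """SWA: arithmetic mean over periodic weight snapshots."""
    n = len(snapshots)
    return [sum(ws) / n for ws in zip(*snapshots)]
```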

  Compression
    lzma (level: null)
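The 16.15 MB artifact size suggests the checkpoint is measured after lzma compression. A sketch using Python's stdlib `lzma` (the `preset` value is an assumption, since the card leaves the level null, and `compress_artifact` is an illustrative name):

```python
import lzma
import pickle

def compress_artifact(obj, preset=9):
    """Serialize a checkpoint-like object and LZMA-compress it."""
    return lzma.compress(pickle.dumps(obj), preset=preset)

def load_artifact(blob):
    """Inverse: decompress and deserialize."""
    return pickle.loads(lzma.decompress(blob))
```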

  Evaluation
    sliding window eval (stride: 64)
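A sketch of sliding-window evaluation with stride 64, assuming the common convention that the window advances by the stride and only the newly covered tokens are scored, so each token is scored exactly once with up to window - stride tokens of left context (window size 2048 would match eval_length):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (window_start, score_from): the window covers
    [window_start, window_start + window), and only tokens from
    score_from onward are scored, so every token is scored exactly once."""
    start = 0
    while True:
        score_from = start if start == 0 else start + window - stride
        yield start, score_from
        if start + window >= n_tokens:
            break
        start += stride
```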

Architecture

  GQA: grouped query attention with fewer KV heads than query heads (query_heads: 8, kv_heads: 4)
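With 8 query heads sharing 4 KV heads, query heads pair off in groups of two, halving the KV cache relative to full multi-head attention. A sketch of the head mapping (function name illustrative):

```python
def gqa_kv_map(query_heads=8, kv_heads=4):
    """Map each query head to the KV head it shares. With 8 query heads
    over 4 KV heads, heads pair up in groups of 2, so the KV cache holds
    kv_heads / query_heads = half the usual keys and values."""
    assert query_heads % kv_heads == 0
    group_size = query_heads // kv_heads
    return [q // group_size for q in range(query_heads)]
```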
  ReLU²: squared ReLU activation
  LeakyReLU: leaky ReLU activation (slope: 0.5)
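Both activations have simple closed forms; slope 0.5 for LeakyReLU is unusually large but is taken directly from the card. A scalar sketch:

```python
def relu_squared(x):
    """ReLU²: max(x, 0) squared; zero for negatives, quadratic for positives."""
    return max(x, 0.0) ** 2

def leaky_relu(x, slope=0.5):
    """Leaky ReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0.0 else slope * x
```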
  XSA: XSA applied to the last layers (layers: 4)
  Partial RoPE: rotary position embeddings applied to a subset of dimensions (dimensions: 16, base_dimensions: 64)
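A sketch of partial RoPE: only the first 16 of 64 head dimensions are rotated, and the rest pass through unchanged. The per-pair frequency formula (base 10000, normalized by the rotated width) is an assumption; the run may use different frequencies:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate pairs (2i, 2i+1) in the first rot_dims dimensions by angle
    pos * base**(-2i / rot_dims); leave dimensions rot_dims..end untouched."""
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```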
  LN Scale: LayerNorm scale modification
  BigramHash: bigram hash embedding feature (vocab_size: 2048, dim: 128)
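A sketch of the bigram-hash feature: each (previous, current) token pair is hashed into a 2048-entry auxiliary vocabulary, whose 128-dim embedding would be added to the regular token embedding. The mixing constant and the BOS placeholder id are illustrative assumptions:

```python
def bigram_hash_ids(token_ids, vocab_size=2048):
    """Hash consecutive token pairs into ids for an auxiliary
    [vocab_size, 128] embedding table."""
    ids = []
    prev = 0  # assumed BOS placeholder id
    for cur in token_ids:
        ids.append((prev * 1000003 + cur) % vocab_size)  # illustrative hash
        prev = cur
    return ids
```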
  SmearGate: SmearGate gating mechanism

Sequence Length
  train_length: 2048
  eval_length: 2048

Optimizer
  Muon (weight_decay: 0.04, momentum: null)
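Muon's defining step is orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying it as the update. A sketch of that step only, assuming the coefficients of the public Muon reference implementation; momentum accumulation and the weight_decay: 0.04 term would be applied outside this function:

```python
import numpy as np

def orthogonalize(G, steps=5):
    """Quintic Newton-Schulz iteration that pushes the singular values
    of G toward 1, yielding a near-orthogonal update direction."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```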

LR Schedule
  warmdown (warmdown_steps: null)
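A sketch of a warmdown schedule, assuming the common form: constant LR, then linear decay to zero over the final warmdown_steps (left unspecified in the card, so it must be passed explicitly):

```python
def warmdown_lr(step, total_steps, warmdown_steps, base_lr=1.0):
    """Constant base_lr until total_steps - warmdown_steps, then
    linear decay to 0 at total_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```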

Regularization
  LN scale

Novel Contributions
- Quantization Noise Annealing (QNA): injects int6-like quantization noise into the weights during training
- Stochastic Quantized Weight Averaging (SQWA): averages quantize-dequantize EMA snapshots of the weights
- Controlled 3-run ablation showing a reduced quantization gap but no improvement in final val_bpb
- Analysis indicating that float-model quality, not quantization error, is the main bottleneck
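The contribution list is all the card states about QNA; as a loose sketch of what its noise injection could look like, where the uniform noise form, per-tensor scale, and linear anneal are all assumptions rather than the authors' code:

```python
import random

def qna_noise(weights, step, total_steps, bits=6, seed=None):
    """Add uniform noise with magnitude up to half an int6 quantization
    step, scaled down linearly ("annealed") as training progresses."""
    rng = random.Random(seed)
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax  # per-tensor step size
    half = 0.5 * scale * (1.0 - step / total_steps)  # annealed noise bound
    return [w + rng.uniform(-half, half) for w in weights]
```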