PR #1773
Non-record: SDClip-matched FakeQuantize — reduces quant degradation from +0.17 to +0.044 BPB
Status: open · by Amanbig
val_bpb
1.1872
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Quantization
QAT
bits: mixed int5/int6/int8
scope: MLP/attention/embeddings
Architecture
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
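A minimal sketch of the GQA shape above (8 query heads sharing 4 KV heads); the projection names, dimensions, and causal masking are illustrative assumptions, not the PR's actual module.

```python
import torch
import torch.nn.functional as F

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    # x: (batch, seq, dim); 8 query heads share 4 KV heads (2 queries per KV head).
    B, T, D = x.shape
    hd = D // n_heads  # per-head dimension
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Repeat each KV head to cover its group of query heads.
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)
```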
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
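A sketch of partial RoPE with the listed split, rotating the first 16 of 64 head dimensions and passing the rest through unchanged; the half-split rotation layout and the base of 10000 are assumptions.

```python
import torch

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    # x: (..., seq, head_dim); rotate only the first `rot_dims` dims (16 of 64 here).
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```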
BigramHash
Bigram hash-based embedding/feature mechanism.
parameters: {"size":4096}
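One plausible reading of BigramHash, sketched under assumptions: hash each (previous, current) token pair into a 4096-slot auxiliary embedding table and add the result to the regular token embedding; the hash constant and embedding dimension are arbitrary.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    # Hash each (prev, cur) token pair into a fixed-size auxiliary table.
    def __init__(self, table_size=4096, dim=768):
        super().__init__()
        self.table_size = table_size
        self.emb = nn.Embedding(table_size, dim)

    def forward(self, tokens):
        # tokens: (batch, seq) ids; prepend 0 so position 0 has a "previous" token.
        prev = torch.cat([torch.zeros_like(tokens[:, :1]), tokens[:, :-1]], dim=1)
        # Simple multiplicative hash of the bigram (constant is illustrative).
        h = (prev * 1000003 + tokens) % self.table_size
        return self.emb(h)
```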
SmearGate
Gating mechanism used in the model.
parameters: null
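The entry lists no parameters; one common reading of a smear gate (an assumption here, not confirmed by the PR) blends a learned, gated fraction of the previous position into each token:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    # Assumed form: mix in a gated fraction of the previous position's activation.
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, x):
        # x: (batch, seq, dim); shift right by one position along seq.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev
```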
Depth Recurrence
Recurrent depth reuse across selected layers.
parameters: {"layers":[3,4,5],"activation":"35%"}
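A sketch of depth recurrence as listed, re-running layers 3-5 on their own output; treating "activation: 35%" as the probability of taking the extra pass is an assumption.

```python
import random

def forward_with_depth_recurrence(blocks, x, recur_layers=(3, 4, 5), p_recur=0.35):
    # Run the stack once; with probability p_recur, re-run the listed layers
    # a second time on their own output (one reading of "activation: 35%").
    for i, block in enumerate(blocks):
        x = block(x)
        if i == recur_layers[-1] and random.random() < p_recur:
            for j in recur_layers:
                x = blocks[j](x)
    return x
```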
Weight Tying
Tied embedding-related weights are implied by the model lineage.
parameters: null
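Weight tying in its usual form, as a sketch; which matrices are actually tied here is not stated, and the vocab size and dimension are illustrative.

```python
import torch.nn as nn

embed = nn.Embedding(50304, 768)            # vocab and dim are illustrative
lm_head = nn.Linear(768, 50304, bias=False)
lm_head.weight = embed.weight               # one tensor serves both roles
```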
Weight Averaging
EMA
parameters: {"decay":0.9965,"start":"50%"}
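A minimal EMA tracker with the listed settings (decay 0.9965, averaging switched on at 50% of training); the total-step count and the handling of non-float buffers are assumptions.

```python
class EMA:
    # Exponential moving average of model weights, started partway through training.
    def __init__(self, model, decay=0.9965, start_frac=0.5, total_steps=10000):
        self.decay = decay
        self.start_step = int(start_frac * total_steps)
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def update(self, model, step):
        if step < self.start_step:
            # Before the start point, just track the raw weights.
            for k, v in model.state_dict().items():
                self.shadow[k].copy_(v)
            return
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                self.shadow[k].copy_(v)  # int buffers are copied, not averaged
```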
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","matrix_lr":0.022}
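The MuonEq-R variant is not documented here; for reference, a sketch of the standard Muon step (momentum plus Newton-Schulz orthogonalization of matrix updates) with the listed matrix_lr and weight_decay. The momentum value and decoupled weight-decay placement are assumptions, since momentum is listed as null.

```python
import torch

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (standard Muon quintic coefficients).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(w, grad, buf, lr=0.022, momentum=0.95, weight_decay=0.095):
    # Momentum buffer, then orthogonalized update (momentum value is assumed).
    buf.mul_(momentum).add_(grad)
    update = newton_schulz(buf)
    w.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
```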
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005}
LR Schedule
warmdown
parameters: {"warmdown_start":"72%"}
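A sketch of the warmdown schedule as listed: constant LR for the first 72% of steps, then decay to zero; the linear shape of the decay is an assumption.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_start=0.72):
    # Constant until warmdown_start, then linear decay to 0 at the end of training.
    start = int(warmdown_start * total_steps)
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```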
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.005,"momentum":0.9,"epochs":3,"schedule":"cosine"}
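A sketch of test-time training with the listed settings (SGD, lr 0.005, momentum 0.9, 3 epochs, cosine schedule); reading "score-first" as scoring each batch before updating on it is an assumption, and model.loss is a hypothetical helper.

```python
import math
import torch

def score_first_ttt(model, batches, epochs=3, lr=0.005, momentum=0.9):
    # Score each batch with the current weights first, then take a gradient step
    # on that same batch, so the reported loss never sees weights tuned on it.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total = epochs * len(batches)
    losses, step = [], 0
    for _ in range(epochs):
        for x, y in batches:
            loss = model.loss(x, y)     # hypothetical loss helper
            losses.append(loss.item())  # score before updating
            # Cosine LR decay over all TTT steps.
            for g in opt.param_groups:
                g["lr"] = lr * 0.5 * (1 + math.cos(math.pi * step / total))
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
    return losses
```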
Compression
Brotli
level: 11
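Artifact compression at the listed quality, sketched with the brotli Python bindings; serializing the checkpoint via torch.save is an assumption.

```python
import io
import brotli
import torch

def save_compressed(model, path):
    # Serialize the state dict, then Brotli-compress at maximum quality (11).
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue(), quality=11))
```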
Novel Contributions
- SDClip-matched FakeQuantize to align QAT clipping with save-time quantization
- Reduced quantization degradation from about +0.17 BPB to about +0.044 BPB
- Demonstrated that QAT/save-time quantizer mismatch can cause post-quant collapse
- Applied the same SDClip formula during FakeQuantize as used at save time (see the sketch below)
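A sketch of the core idea, under an assumed SDClip formula (clipping at k standard deviations before scaling; the PR's actual formula and constant may differ): the training-time fake-quantizer shares the exact clip/scale/round path with the save-time quantizer, so QAT optimizes against the quantizer the artifact will actually see.

```python
import torch

def sdclip_scale(w, bits, k=3.0):
    # Shared SDClip formula: clip range is k standard deviations (assumed form).
    clip = k * w.std()
    qmax = 2 ** (bits - 1) - 1
    return clip / qmax, clip

def quantize_int(w, bits, k=3.0):
    # Save-time quantizer: clip, scale, round to signed integers.
    scale, clip = sdclip_scale(w, bits, k)
    return torch.clamp(w, -clip, clip).div(scale).round(), scale

def fake_quantize(w, bits, k=3.0):
    # QAT fake-quantizer: the SAME clip/scale/round as save time, with a
    # straight-through estimator so gradients flow to the full-precision weights.
    q, scale = quantize_int(w, bits, k)
    wq = q * scale
    return w + (wq - w).detach()
```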