PR #1773
Non-record: SDClip-matched FakeQuantize — reduces quant degradation from +0.17 to +0.044 BPB
Status: open · by Amanbig
val_bpb
1.1872
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Quantization
QAT
bits: mixed int5/int6/int8
scope: MLP/attention/embeddings
Architecture
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
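A minimal sketch of the GQA shape above (8 query heads sharing 4 KV heads); the projection names, dimensions, and causal masking are illustrative assumptions, not the PR's actual module.

```python
import torch
import torch.nn.functional as F

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    # x: (batch, seq, dim); 8 query heads share 4 KV heads (2 queries per KV head).
    B, T, D = x.shape
    hd = D // n_heads  # per-head dimension
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Repeat each KV head to cover its group of query heads.
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)
```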
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
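A sketch of partial RoPE with the listed split, rotating the first 16 of 64 head dimensions and passing the rest through unchanged; the half-split rotation layout and the base of 10000 are assumptions.

```python
import torch

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    # x: (..., seq, head_dim); rotate only the first `rot_dims` dims (16 of 64 here).
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```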
BigramHash
Bigram hash-based embedding/feature mechanism.
parameters: {"size":4096}
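One plausible reading of BigramHash, sketched under assumptions: hash each (previous, current) token pair into a 4096-slot auxiliary embedding table and add the result to the regular token embedding; the hash constant and embedding dimension are arbitrary.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    # Hash each (prev, cur) token pair into a fixed-size auxiliary table.
    def __init__(self, table_size=4096, dim=768):
        super().__init__()
        self.table_size = table_size
        self.emb = nn.Embedding(table_size, dim)

    def forward(self, tokens):
        # tokens: (batch, seq) ids; prepend 0 so position 0 has a "previous" token.
        prev = torch.cat([torch.zeros_like(tokens[:, :1]), tokens[:, :-1]], dim=1)
        # Simple multiplicative hash of the bigram (constant is illustrative).
        h = (prev * 1000003 + tokens) % self.table_size
        return self.emb(h)
```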
SmearGate
Gating mechanism used in the model.
parameters: null
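The entry lists no parameters; one common reading of a smear gate (an assumption here, not confirmed by the PR) blends a learned, gated fraction of the previous position into each token:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    # Assumed form: mix in a gated fraction of the previous position's activation.
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, x):
        # x: (batch, seq, dim); shift right by one position along seq.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev
```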
Depth Recurrence
Recurrent depth reuse across selected layers.
parameters: {"layers":[3,4,5],"activation":"35%"}
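A sketch of depth recurrence as listed, re-running layers 3-5 on their own output; treating "activation: 35%" as the probability of taking the extra pass is an assumption.

```python
import random

def forward_with_depth_recurrence(blocks, x, recur_layers=(3, 4, 5), p_recur=0.35):
    # Run the stack once; with probability p_recur, re-run the listed layers
    # a second time on their own output (one reading of "activation: 35%").
    for i, block in enumerate(blocks):
        x = block(x)
        if i == recur_layers[-1] and random.random() < p_recur:
            for j in recur_layers:
                x = blocks[j](x)
    return x
```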
Weight Tying
Tied embedding-related weights are implied by the model lineage.
parameters: null
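Weight tying in its usual form, as a sketch; which matrices are actually tied here is not stated, and the vocab size and dimension are illustrative.

```python
import torch.nn as nn

embed = nn.Embedding(50304, 768)            # vocab and dim are illustrative
lm_head = nn.Linear(768, 50304, bias=False)
lm_head.weight = embed.weight               # one tensor serves both roles
```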
Weight Averaging
EMA
parameters: {"decay":0.9965,"start":"50%"}
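A minimal EMA tracker with the listed settings (decay 0.9965, averaging switched on at 50% of training); the total-step count and the handling of non-float buffers are assumptions.

```python
class EMA:
    # Exponential moving average of model weights, started partway through training.
    def __init__(self, model, decay=0.9965, start_frac=0.5, total_steps=10000):
        self.decay = decay
        self.start_step = int(start_frac * total_steps)
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def update(self, model, step):
        if step < self.start_step:
            # Before the start point, just track the raw weights.
            for k, v in model.state_dict().items():
                self.shadow[k].copy_(v)
            return
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                self.shadow[k].copy_(v)  # int buffers are copied, not averaged
```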
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","matrix_lr":0.022}
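The MuonEq-R variant is not documented here; for reference, a sketch of the standard Muon step (momentum plus Newton-Schulz orthogonalization of matrix updates) with the listed matrix_lr and weight_decay. The momentum value and decoupled weight-decay placement are assumptions, since momentum is listed as null.

```python
import torch

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (standard Muon quintic coefficients).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(w, grad, buf, lr=0.022, momentum=0.95, weight_decay=0.095):
    # Momentum buffer, then orthogonalized update (momentum value is assumed).
    buf.mul_(momentum).add_(grad)
    update = newton_schulz(buf)
    w.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
```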
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005}
LR Schedule
warmdown
parameters: {"warmdown_start":"72%"}
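A sketch of the warmdown schedule as listed: constant LR for the first 72% of steps, then decay to zero; the linear shape of the decay is an assumption.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_start=0.72):
    # Constant until warmdown_start, then linear decay to 0 at the end of training.
    start = int(warmdown_start * total_steps)
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```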
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.005,"momentum":0.9,"epochs":3,"schedule":"cosine"}
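A sketch of test-time training with the listed settings (SGD, lr 0.005, momentum 0.9, 3 epochs, cosine schedule); reading "score-first" as scoring each batch before updating on it is an assumption, and model.loss is a hypothetical helper.

```python
import math
import torch

def score_first_ttt(model, batches, epochs=3, lr=0.005, momentum=0.9):
    # Score each batch with the current weights first, then take a gradient step
    # on that same batch, so the reported loss never sees weights tuned on it.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total = epochs * len(batches)
    losses, step = [], 0
    for _ in range(epochs):
        for x, y in batches:
            loss = model.loss(x, y)     # hypothetical loss helper
            losses.append(loss.item())  # score before updating
            # Cosine LR decay over all TTT steps.
            for g in opt.param_groups:
                g["lr"] = lr * 0.5 * (1 + math.cos(math.pi * step / total))
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
    return losses
```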
Compression
Brotli
level: 11
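Artifact compression at the listed quality, sketched with the brotli Python bindings; serializing the checkpoint via torch.save is an assumption.

```python
import io
import brotli
import torch

def save_compressed(model, path):
    # Serialize the state dict, then Brotli-compress at maximum quality (11).
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue(), quality=11))
```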
Novel Contributions
- SDClip-matched FakeQuantize to align QAT clipping with save-time quantization
- Reduced quantization degradation from about +0.17 BPB to about +0.044 BPB
- Demonstrated that QAT/save-time quantizer mismatch can cause post-quant collapse
- Applied the same SDClip formula during FakeQuantize as used at save time (see the sketch below)
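A sketch of the core idea, under an assumed SDClip formula (clipping at k standard deviations before scaling; the PR's actual formula and constant may differ): the training-time fake-quantizer shares the exact clip/scale/round path with the save-time quantizer, so QAT optimizes against the quantizer the artifact will actually see.

```python
import torch

def sdclip_scale(w, bits, k=3.0):
    # Shared SDClip formula: clip range is k standard deviations (assumed form).
    clip = k * w.std()
    qmax = 2 ** (bits - 1) - 1
    return clip / qmax, clip

def quantize_int(w, bits, k=3.0):
    # Save-time quantizer: clip, scale, round to signed integers.
    scale, clip = sdclip_scale(w, bits, k)
    return torch.clamp(w, -clip, clip).div(scale).round(), scale

def fake_quantize(w, bits, k=3.0):
    # QAT fake-quantizer: the SAME clip/scale/round as save time, with a
    # straight-through estimator so gradients flow to the full-precision weights.
    q, scale = quantize_int(w, bits, k)
    wq = q * scale
    return w + (wq - w).detach()
```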