PR #1515
open
Non-Record: SP8192 + LeanICQ Compose at Int3 — val_bpb 1.08720 / 15.88 MB
by dexhunter
val_bpb
1.0872
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.88 MB
Training Techniques
Quantization
GPTQ
bits: 8
scope: embeddings
LeanICQ int3
bits: 3
scope: matrix weights
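A 3-bit code can address 2^3 = 8 centroids per group of weights. The following is a minimal sketch of importance-weighted 1-D k-means (Lloyd iterations), not the PR's actual code. The function name `weighted_kmeans_1d`, the evenly spaced initialization, and the use of a plain list of per-weight importance scores (standing in for Hessian-derived weights) are all assumptions made for illustration.

```python
def weighted_kmeans_1d(values, weights, k=8, iters=20):
    """Fit k scalar centroids to `values`, weighting each value by `weights`.

    Hypothetical helper: `weights` stands in for per-weight importance
    scores (for example, derived from a Hessian diagonal).
    """
    lo, hi = min(values), max(values)
    # Assumed initialization: k points spread evenly across the value range.
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance.
        assign = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values
        ]
        # Update step: importance-weighted mean of each cluster.
        for j in range(k):
            num = sum(w * v for v, w, a in zip(values, weights, assign) if a == j)
            den = sum(w for w, a in zip(weights, assign) if a == j)
            if den > 0:  # leave empty clusters where they are
                centroids[j] = num / den
    return centroids, assign
```

With `k=8`, every entry of `assign` fits in 3 bits, and a value with a larger importance score pulls its centroid closer to itself.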
ICQuant
bits: 8
scope: outliers
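The outlier path can be illustrated with a short sketch: pick the largest-magnitude weights, store them separately at 8 bits, and leave the rest for the low-bit quantizer. This is a sketch under assumptions, not the PR's implementation. The helper names, the choice to zero out extracted positions, and the single shared symmetric int8 scale are hypothetical; the 2% default follows the top-2% figure listed under Novel Contributions.

```python
def split_outliers(weights, fraction=0.02):
    """Return (inliers, outlier_indices) for a flat list of weights."""
    k = max(1, round(len(weights) * fraction))
    # Rank positions by magnitude, largest first, and keep the top k.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outlier_idx = sorted(order[:k])
    outlier_set = set(outlier_idx)
    # Assumption: extracted positions are zeroed in the inlier tensor.
    inliers = [0.0 if i in outlier_set else w for i, w in enumerate(weights)]
    return inliers, outlier_idx


def quantize_int8(values):
    """Symmetric int8 quantization with one shared scale (an assumption)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale
```

Removing the few extreme values narrows the range that the remaining weights span, which is what makes a coarse 8-level grid usable for them.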
Architecture
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Partial rotary position embeddings are used.
parameters: {"dimensions":"16/64"}
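Reading "16/64" as 16 rotated dimensions out of a 64-dimensional head, a partial rotary embedding can be sketched as follows. Which 16 dimensions are rotated, the half-split pairing convention, and the frequency base of 10000 are assumptions, and `partial_rope` is a hypothetical name.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

The remaining 48 dimensions carry no positional rotation, so they behave as position-independent content channels.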
GQA
Grouped-query attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
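With 8 query heads and 4 KV heads, each KV head serves two query heads. A minimal sketch of the head mapping, assuming the common contiguous grouping (the PR does not state which grouping it uses, and `kv_head_for` is a hypothetical name):

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head here
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
```

Under this grouping, `mapping` is `[0, 0, 1, 1, 2, 2, 3, 3]`, so the K and V projections are half the size they would be with one KV head per query head.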
parallel residuals
Parallel residual connections enabled from layer 7 onward.
parameters: {"start_layer":7}
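The difference between sequential and parallel residual blocks can be sketched as below. This is an illustrative toy that works on scalars. Whether `start_layer` is 0-based, and whether the attention and MLP branches share one normalization, are assumptions; `block_forward` and its arguments are hypothetical.

```python
def block_forward(x, layer_index, attn, mlp, norm, parallel_start=7):
    """Apply one transformer block in sequential or parallel residual form."""
    if layer_index >= parallel_start:
        # Parallel: both branches read the same normalized input.
        h = norm(x)
        return x + attn(h) + mlp(h)
    # Sequential: the MLP sees the output of the attention residual.
    x = x + attn(norm(x))
    return x + mlp(norm(x))


# Toy branches to make the two paths visibly different.
toy = dict(attn=lambda h: 2 * h, mlp=lambda h: h + 1, norm=lambda h: h)
sequential_out = block_forward(1.0, 0, **toy)
parallel_out = block_forward(1.0, 7, **toy)
```

In the parallel form the two branches have no data dependency on each other, which is the usual motivation for the design.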
Weight Averaging
EMA
parameters: {"decay":0.997}
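An exponential moving average with decay 0.997 keeps 99.7% of the running average and mixes in 0.3% of the current weights at each update. A minimal sketch on plain lists (the function name and the per-step update cadence are assumptions):

```python
def ema_update(ema_params, params, decay=0.997):
    """Return the new EMA of the parameters after one update."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Starting from zeros and tracking constant weights of 1.0:
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
```

After `n` updates from zero toward a constant target, the average equals `1 - 0.997 ** n` times the target, so the effective horizon is on the order of a few hundred steps.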
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings and scalars"}
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005}
Evaluation
sliding window eval
parameters: null
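Sliding-window evaluation scores each token once while giving later tokens as much left context as the window allows. The card lists no window or stride, so the values below are purely hypothetical, as is the helper name. The sketch assumes `0 < stride <= window`.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (ctx_start, end, score_start) spans.

    Tokens in [score_start, end) are scored using context from ctx_start.
    """
    assert 0 < stride <= window
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end  # only tokens not yet scored count next time
        start += stride
    return spans


spans = sliding_windows(num_tokens=10, window=4, stride=2)
```

Here the first window scores tokens 0 to 3, and each later window scores only the 2 new tokens at its end while reusing the previous 2 as context.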
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
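In score-first test-time training, each chunk is scored with the current weights before the model adapts on it, so a reported loss never benefits from training on the same tokens. The sketch below uses a toy one-parameter model with squared error in place of a language model. The momentum formulation (`v = momentum * v + grad`), the carry-over of the velocity between chunks, and all names are assumptions; the hyperparameters match the card.

```python
def score_first_ttt(chunks, w=0.0, lr=0.005, momentum=0.9, epochs_per_chunk=3):
    """Score each chunk, then adapt on it with SGD plus momentum."""
    velocity = 0.0
    losses = []
    for chunk in chunks:
        # 1) Score first, using weights that have not seen this chunk.
        losses.append(sum((w - y) ** 2 for y in chunk) / len(chunk))
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs_per_chunk):
            grad = sum(2.0 * (w - y) for y in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            w -= lr * velocity
    return losses, w


losses, final_w = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

The first chunk is scored at the initial weights, and the second chunk benefits from the three adaptation epochs run on the first.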
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
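A logit softcap with value 30 bounds logits to the open interval (-30, 30) while staying close to the identity near zero. The tanh form below is the commonly used one; that this stack uses exactly this form is an assumption.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly squash a logit into (-cap, cap)."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate near the cap instead of growing without bound.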
Novel Contributions
- First reported composition of LeanQuant centroids with ICQuant outlier extraction on this stack
- Aggressive int3 matrix quantization with Hessian-weighted k-means centroids
- Top-2% magnitude outlier extraction stored separately as int8
- Packed 3-bit centroid index bitstream storage
- Measured Pareto frontier showing the int3 configuration is the only one fitting under 16 MB
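The packed 3-bit index storage mentioned above can be sketched as a simple bitstream: 8 indices fit in 3 bytes instead of 8. The little-endian bit order and the function names are assumptions, not the PR's actual layout.

```python
def pack3(indices):
    """Pack centroid indices in [0, 8) into bytes, 3 bits each."""
    bits = 0
    nbits = 0
    out = bytearray()
    for idx in indices:
        assert 0 <= idx < 8
        bits |= idx << nbits  # append 3 bits above the pending ones
        nbits += 3
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack3(data, count):
    """Recover `count` 3-bit indices from a stream produced by pack3."""
    bits = 0
    nbits = 0
    out = []
    for byte in data:
        bits |= byte << nbits
        nbits += 8
        while nbits >= 3 and len(out) < count:
            out.append(bits & 0b111)
            bits >>= 3
            nbits -= 3
    return out
```

The caller has to store `count` alongside the bytes, because padding bits in the last byte are otherwise indistinguishable from real indices.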