PR #1515

open

Non-Record: SP8192 + LeanICQ Compose at Int3 — val_bpb 1.08720 / 15.88 MB

by dexhunter

val_bpb

1.0872

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.88 MB

Training Techniques

Quantization

GPTQ

bits: 8

scope: embeddings

LeanICQ int3

bits: 3

scope: matrix weights

ICQuant

bits: 8

scope: outliers

Architecture

weight tying

Tied embeddings are used.

parameters: null

LeakyReLU

LeakyReLU activation used in the MLP.

parameters: {"slope":0.5}
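
For reference, a minimal sketch of a LeakyReLU with the listed slope of 0.5. The function name is illustrative, and the PR's actual MLP code may combine this with other operations that the summary does not show.

```python
def leaky_relu(x, slope=0.5):
    """Pass positive inputs through; scale negative inputs by `slope`."""
    return x if x >= 0.0 else slope * x

activations = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
```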

Partial RoPE

Partial rotary position embeddings are used.

parameters: {"dimensions":"16/64"}
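
As an illustration of what "16/64" can mean, the sketch below rotates only the first 16 of 64 head dimensions and leaves the remaining 48 untouched. The half-split pairing, the base of 10000, and the function name are assumptions; the PR may use a different pairing convention or frequency base.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rotary_dims` entries of one head vector."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

head = [float(i + 1) for i in range(64)]
rotated = partial_rope(head, position=5)
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved, and the unrotated dimensions carry no positional signal.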

GQA

Grouped-query attention with fewer KV heads than query heads.

parameters: {"heads":8,"kv_heads":4}
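
With 8 query heads and 4 KV heads, each KV head serves 2 query heads. The sketch below shows one common way to map query heads to KV heads; the contiguous grouping and the helper name are assumptions, not taken from the PR.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the index of the KV head shared by the given query head."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
```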

parallel residuals

Parallel residual connections enabled from layer 7 onward.

parameters: {"start_layer":7}
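
A rough sketch of the difference between sequential and parallel residual blocks, assuming the common formulation in which attention and MLP both read the same input and their outputs are summed. The toy sublayers, the zero-based layer index, and all function names are hypothetical; the PR's exact wiring may differ.

```python
def sequential_block(x, attn, mlp):
    """Standard block: the MLP sees the output of the attention residual."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    """Parallel block: attention and MLP both read the same input."""
    return x + attn(x) + mlp(x)

def run_layers(x, layers, start_layer=7):
    """Use parallel residuals for layers with index >= start_layer."""
    for index, (attn, mlp) in enumerate(layers):
        block = parallel_block if index >= start_layer else sequential_block
        x = block(x, attn, mlp)
    return x

def toy_attn(x):
    return 2.0 * x  # stand-in for an attention sublayer

def toy_mlp(x):
    return 3.0 * x  # stand-in for an MLP sublayer
```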

Weight Averaging

EMA

parameters: {"decay":0.997}
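
The EMA update with decay 0.997 can be sketched as below. Weights are represented as a flat list of floats for simplicity, and the function name is illustrative; the summary does not say how often the PR applies the update.

```python
def ema_update(ema, weights, decay=0.997):
    """Blend the running average toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0, -1.0])
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997), or about 333 updates.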

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"variant":"MuonEq-R"}

AdamW

weight_decay: null

momentum: null

other_params: {"scope":"embeddings and scalars"}

SGD

weight_decay: null

momentum: 0.9

other_params: {"learning_rate":0.005}

Evaluation

sliding window eval

parameters: null

Test-Time Training

score-first TTT

parameters: {"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
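
The listed parameters suggest a loop in which each chunk is scored before the model adapts to it. The sketch below illustrates that ordering on a toy one-parameter regression model with SGD plus momentum. The model, the loss, the carry-over of momentum between chunks, and the function name are all assumptions made for illustration; only the three hyperparameters come from the summary.

```python
def score_first_ttt(chunks, weight=0.0, epochs_per_chunk=3,
                    learning_rate=0.005, momentum=0.9):
    """Score each chunk with the current weight, then train on that chunk."""
    velocity = 0.0
    total_loss = 0.0
    count = 0
    for chunk in chunks:
        # 1) Score first: the loss is recorded before any update on this chunk.
        for x, y in chunk:
            total_loss += (weight * x - y) ** 2
            count += 1
        # 2) Then adapt: a few SGD-with-momentum passes over the same chunk.
        for _ in range(epochs_per_chunk):
            grad = sum(2.0 * (weight * x - y) * x for x, y in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            weight -= learning_rate * velocity
    return total_loss / count, weight

chunks = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (1.0, 2.0)]]
mean_loss, adapted = score_first_ttt(chunks)
```

Under this ordering the reported loss for a chunk never benefits from training on that same chunk; only later chunks see the adapted weights.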

Regularization

logit softcap

parameters: {"value":30}
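
A logit softcap of 30 is commonly implemented as `cap * tanh(x / cap)`. The sketch below assumes that form; the summary gives only the value, not the formula.

```python
import math

def softcap(logits, cap=30.0):
    """Squash logits smoothly into [-cap, cap] using cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]

capped = softcap([-1000.0, -3.0, 0.0, 3.0, 1000.0])
```

Small logits pass through almost unchanged, while very large ones saturate near the cap.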

layerwise LN scale

parameters: null

Novel Contributions

First reported composition of LeanQuant centroids with ICQuant outlier extraction on this stack
Aggressive int3 matrix quantization with Hessian-weighted k-means centroids
Extraction of the top 2% of weights by magnitude as outliers, stored separately as int8
Packed 3-bit centroid index bitstream storage
Measured Pareto frontier showing the int3 configuration is the only one fitting under 16 MB
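
The contributions listed above can be sketched end to end as follows: split off the largest-magnitude weights, fit 8 centroids (3 bits) to the rest with an importance-weighted 1-D k-means, and pack the centroid indices into a bitstream. This is a simplified illustration, not the PR's implementation. The function names, the quantile initialization, the per-tensor int8 scale, and the synthetic `importance` values standing in for Hessian-derived weights are all assumptions.

```python
def split_outliers(weights, fraction=0.02):
    """Return (inlier_ids, outlier_ids); outliers are the top-|w| fraction."""
    count = max(1, round(len(weights) * fraction))
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    outliers = sorted(by_magnitude[:count])
    chosen = set(outliers)
    inliers = [i for i in range(len(weights)) if i not in chosen]
    return inliers, outliers

def quantize_int8(values):
    """Symmetric int8 quantization with a single scale (assumed scheme)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return scale, [round(v / scale) for v in values]

def weighted_kmeans_1d(values, importance, k=8, iterations=20):
    """Lloyd iterations in 1-D where each value counts with its importance."""
    ordered = sorted(values)
    # Initialize centroids at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        sums, mass = [0.0] * k, [0.0] * k
        for v, h in zip(values, importance):
            j = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            sums[j] += h * v
            mass[j] += h
        centroids = [sums[j] / mass[j] if mass[j] > 0 else centroids[j]
                     for j in range(k)]
    indices = [min(range(k), key=lambda c: (v - centroids[c]) ** 2)
               for v in values]
    return centroids, indices

def pack_3bit(indices):
    """Pack 3-bit indices back to back into little-endian bytes."""
    bits = 0
    for position, index in enumerate(indices):
        bits |= (index & 0b111) << (3 * position)
    return bits.to_bytes((3 * len(indices) + 7) // 8, "little")

def unpack_3bit(blob, count):
    """Inverse of pack_3bit for `count` indices."""
    bits = int.from_bytes(blob, "little")
    return [(bits >> (3 * position)) & 0b111 for position in range(count)]

# Synthetic example: 200 weights in [-1, 1] plus two planted outliers.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(200)]
weights[5], weights[17] = 25.0, -30.0
importance = [1.0 + (i % 3) for i in range(200)]  # stand-in for Hessian weights

inlier_ids, outlier_ids = split_outliers(weights)
scale, outlier_q = quantize_int8([weights[i] for i in outlier_ids])
centroids, indices = weighted_kmeans_1d([weights[i] for i in inlier_ids],
                                        [importance[i] for i in inlier_ids])
packed = pack_3bit(indices)
```

Removing the outliers first keeps the 8 centroids concentrated on the bulk of the distribution, and packing at 3 bits per index stores 8 indices in 3 bytes instead of 8.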