PR #606 (open)

Record: int5 GPTQ + Soft-Round QAT (3-seed mean 1.1162)

by EthanYangTW
val_bpb
1.1162
Architecture
Transformer
Optimizer
AdamW
Artifact Size
under 16MB

Training Techniques

Quantization
int5 GPTQ
bits: 5
scope: all
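The int5 grid is 31 symmetric levels (-15..15) stored in int8. A minimal round-to-nearest sketch of that grid (the record's actual quantizer is GPTQ; the per-tensor max-abs scale here is an assumption):

```python
import numpy as np

def quantize_int5(w, n_levels=31):
    """Symmetric quantization to 31 levels (-15..15), stored as int8.

    Naive round-to-nearest sketch of the grid; the record uses GPTQ
    (Hessian-aware) on the same grid. Per-tensor scale is an assumption.
    """
    qmax = (n_levels - 1) // 2           # 15: int5 with 31 symmetric levels
    scale = np.abs(w).max() / qmax       # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Storing the 31 values in int8 wastes 3 bits per weight in memory, but entropy coding (here, zstd at level 22) recovers most of that on disk, which is what matters for the 16MB artifact limit.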
Architecture
XSA
Cross Self-Attention on all 11 layers
parameters: {"layers":11}
SmearGate
Gating mechanism added to architecture
parameters: null
BigramHash
Bigram hashing with 8192 buckets
parameters: {"buckets":8192}
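BigramHash maps each (previous, current) token pair into one of 8192 buckets, typically to index an auxiliary embedding table added to the token embedding. The mixing hash and BOS handling below are hypothetical; the PR does not specify them:

```python
N_BUCKETS = 8192  # from the record's parameters

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    # Hypothetical mixing hash; the exact function is not given in the PR.
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % n_buckets

def bigram_ids(tokens, n_buckets=N_BUCKETS):
    """Bucket id per position for the (previous, current) bigram."""
    ids = [bigram_bucket(0, tokens[0], n_buckets)]  # assume id 0 before position 0
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, n_buckets))
    return ids
```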
Partial RoPE
Rotary Positional Embeddings applied to 16 of the 64 head dimensions
parameters: {"partial_rope":"16/64"}
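Partial RoPE at 16/64 means the rotation touches only the first 16 of 64 head dimensions; the remaining 48 pass through unrotated. A sketch for a single head, assuming the interleaved-pair layout (the PR's pairing convention and frequency base are not stated):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` of the head dim, leave the rest.

    x: (seq_len, head_dim) for one head; head_dim=64 with rot_dims=16
    matches the record's 16/64 setting. Uses (even, odd) interleaved pairs.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                     # (seq, 1)
    freqs = base ** (-np.arange(half) / half)         # (half,)
    angles = pos * freqs                              # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0:rot_dims:2], x[:, 1:rot_dims:2]   # interleaved pairs
    out = x.copy()
    out[:, 0:rot_dims:2] = x1 * cos - x2 * sin        # 2D rotation per pair
    out[:, 1:rot_dims:2] = x1 * sin + x2 * cos
    return out
```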
MLP
MLP scaled 3.5x with relu² activation
parameters: {"scale":3.5,"activation":"relu²"}
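The MLP entry (3.5x expansion, relu² activation) can be sketched as below; the absence of biases is an assumption:

```python
import numpy as np

def relu2(x):
    """relu squared: max(x, 0)^2, as named in the record."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    """MLP block with relu^2; hidden width is 3.5x the model dim,
    so w_in: (d, 3.5*d) and w_out: (3.5*d, d). Bias-free is an assumption."""
    return relu2(x @ w_in) @ w_out
```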
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"lr":0.0001}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50}
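A minimal sketch of the two weight-averaging schemes as parameterized in the record (EMA decay 0.997, SWA snapshot every 50 steps); how the two averages are combined or selected for the final checkpoint is not specified:

```python
import numpy as np

class WeightAverager:
    """EMA (decay 0.997) plus SWA (snapshot every 50 steps), per the record."""

    def __init__(self, params, ema_decay=0.997, swa_freq=50):
        self.ema = {k: v.copy() for k, v in params.items()}
        self.swa_sum = {k: np.zeros_like(v) for k, v in params.items()}
        self.swa_count = 0
        self.decay, self.freq, self.step = ema_decay, swa_freq, 0

    def update(self, params):
        self.step += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.freq == 0:        # SWA: snapshot every `freq` steps
            for k, v in params.items():
                self.swa_sum[k] += v
            self.swa_count += 1

    def swa(self):
        return {k: s / max(self.swa_count, 1) for k, s in self.swa_sum.items()}
```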
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":131072,"epochs_per_chunk":3,"optimizer":"AdamW","learning_rate":0.0001,"weight_decay":0,"unfrozen_params":"last 2 blocks + norms + lm_head (~5.8M / 33.6M)","cosine_lr_decay":true,"every_token_scored_before_update":true}
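The score-first discipline is the legality-critical part: every token in a chunk is scored with the current weights before any update uses that chunk, so the reported loss never sees weights trained on the tokens being scored. A toy skeleton with a from-scratch AdamW step; `loss_and_grad` is a hypothetical stand-in for the model forward/backward, and the 131072-token chunking, cosine decay across chunks, and the unfrozen-parameter subset are omitted:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW step in place (weight_decay=0 per the record's TTT config)."""
    m[:] = betas[0] * m + (1 - betas[0]) * g
    v[:] = betas[1] * v + (1 - betas[1]) * g * g
    mhat = m / (1 - betas[0] ** t)            # bias correction
    vhat = v / (1 - betas[1] ** t)
    p -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * p)

def score_first_ttt(chunks, w, loss_and_grad, epochs_per_chunk=3, lr=1e-4):
    """Score-first TTT skeleton: score each chunk, THEN adapt on it.

    `loss_and_grad(w, chunk) -> (loss, grad)` stands in for the model;
    the record unfreezes only the last 2 blocks + norms + lm_head.
    """
    m, v, t = np.zeros_like(w), np.zeros_like(w), 0
    scores = []
    for chunk in chunks:
        loss, _ = loss_and_grad(w, chunk)     # 1) score with current weights
        scores.append(loss)
        for _ in range(epochs_per_chunk):     # 2) then adapt on the chunk
            t += 1
            _, g = loss_and_grad(w, chunk)
            adamw_step(w, g, m, v, t, lr=lr)
    return scores
```

The returned per-chunk scores are what the aggregate bpb would be computed from; updates only ever help subsequent chunks.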
Initialization
OrthoInit
Orthogonal initialization of weights
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
cosine decay
parameters: null
Other
other
Soft-Round QAT: differentiable tanh-based rounding replacing the straight-through estimator (STE), with alpha annealed from 1 to 16
parameters: null
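One common tanh-based soft-rounding formulation (cf. Agustsson & Theis) matches this description: near-identity at alpha=1 and near hard rounding at alpha=16, with usable gradients throughout. The PR's exact formula and annealing curve are not given, so both are assumptions here:

```python
import numpy as np

def soft_round(x, alpha):
    """Differentiable soft rounding; approaches round(x) as alpha grows.

    Assumed formulation: floor(x) + 0.5 + 0.5*tanh(alpha*r)/tanh(alpha/2),
    with r = x - floor(x) - 0.5. At small alpha it is close to identity.
    """
    floor = np.floor(x)
    r = x - floor - 0.5
    return floor + 0.5 + 0.5 * np.tanh(alpha * r) / np.tanh(alpha / 2)

def alpha_schedule(step, total_steps, a0=1.0, a1=16.0):
    """Hypothetical linear annealing of alpha from 1 to 16 over QAT."""
    frac = min(step / max(total_steps, 1), 1.0)
    return a0 + frac * (a1 - a0)
```

Because `soft_round` is smooth in `x`, gradients flow through the rounding during QAT instead of being pasted through as with STE.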
other
GPTQ error compensation with Hessian-aware column reordering and Cholesky error redistribution, calibrated on 256 samples
parameters: null
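The GPTQ description can be sketched as: build a Hessian proxy from calibration activations, reorder columns by descending Hessian diagonal (act-order), then quantize column by column while redistributing each column's rounding error through the upper Cholesky factor of H⁻¹. The symmetric 31-level grid and per-tensor scale are assumptions; the record used 256 calibration samples:

```python
import numpy as np

def gptq_quantize(W, X, qmax=15, damp=0.01):
    """Minimal GPTQ sketch: act-order + Cholesky error redistribution.

    W: (out_features, in_features) weights; X: (in_features, n_samples)
    calibration activations. Grid and dampening factor are assumptions.
    """
    cols = W.shape[1]
    H = 2.0 * (X @ X.T) / X.shape[1]                 # Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # dampening for stability

    perm = np.argsort(-np.diag(H))                   # Hessian-aware reordering
    W = W[:, perm].copy()
    H = H[np.ix_(perm, perm)]

    scale = np.abs(W).max() / qmax                   # int5-style 31-level grid
    quant = lambda w: np.clip(np.round(w / scale), -qmax, qmax) * scale

    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky of H^-1
    Q = np.zeros_like(W)
    for i in range(cols):
        q = quant(W[:, i])
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])  # push error onto later cols

    return Q[:, np.argsort(perm)]                    # undo the reordering
```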
other
Early QAT clipping at threshold 0.5 matched to int5 range
parameters: {"threshold":0.5,"QAT_steps":1750}

Novel Contributions

  • int5 quantization with 31 unique values stored as int8, enabling a 33.6M-parameter model under 16MB
  • Hessian-aware GPTQ error compensation with column reordering and Cholesky error redistribution
  • Soft-Round QAT using differentiable tanh-based rounding in place of the straight-through estimator, with alpha annealed from 1 to 16
  • Legal score-first test-time training (TTT) with AdamW optimizer and cosine LR decay across chunks
  • Combination of early QAT clipping at 0.5 threshold and EMA with decay 0.997
  • Use of BigramHash 8192 and Partial RoPE 16/64 in architecture
  • Fitting 33.6M parameters under 16MB via int5 quantization and 2% magnitude pruning