PR #606 (open)

Record: int5 GPTQ + Soft-Round QAT (3-seed mean 1.1162)

by EthanYangTW
val_bpb
1.1162
Architecture
Transformer
Optimizer
AdamW
Artifact Size
under 16MB

Training Techniques

Quantization
int5 GPTQ
bits: 5
scope: all
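The int5 grid is 31 symmetric levels (-15..15) stored in int8. A minimal round-to-nearest sketch of that grid (the record's actual quantizer is GPTQ; the per-tensor max-abs scale here is an assumption):

```python
import numpy as np

def quantize_int5(w, n_levels=31):
    """Symmetric quantization to 31 levels (-15..15), stored as int8.

    Naive round-to-nearest sketch of the grid; the record uses GPTQ
    (Hessian-aware) on the same grid. Per-tensor scale is an assumption.
    """
    qmax = (n_levels - 1) // 2           # 15: int5 with 31 symmetric levels
    scale = np.abs(w).max() / qmax       # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Storing the 31 values in int8 wastes 3 bits per weight in memory, but entropy coding (here, zstd at level 22) recovers most of that on disk, which is what matters for the 16MB artifact limit.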
Architecture
XSA
Cross Self-Attention on all 11 layers
parameters: {"layers":11}
SmearGate
Gating mechanism added to architecture
parameters: null
BigramHash
Bigram hashing with 8192 buckets
parameters: {"buckets":8192}
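BigramHash maps each (previous, current) token pair into one of 8192 buckets, typically to index an auxiliary embedding table added to the token embedding. The mixing hash and BOS handling below are hypothetical; the PR does not specify them:

```python
N_BUCKETS = 8192  # from the record's parameters

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    # Hypothetical mixing hash; the exact function is not given in the PR.
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % n_buckets

def bigram_ids(tokens, n_buckets=N_BUCKETS):
    """Bucket id per position for the (previous, current) bigram."""
    ids = [bigram_bucket(0, tokens[0], n_buckets)]  # assume id 0 before position 0
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, n_buckets))
    return ids
```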
Partial RoPE
Rotary Positional Embeddings applied to 16 of the 64 head dimensions
parameters: {"partial_rope":"16/64"}
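Partial RoPE at 16/64 means the rotation touches only the first 16 of 64 head dimensions; the remaining 48 pass through unrotated. A sketch for a single head, assuming the interleaved-pair layout (the PR's pairing convention and frequency base are not stated):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` of the head dim, leave the rest.

    x: (seq_len, head_dim) for one head; head_dim=64 with rot_dims=16
    matches the record's 16/64 setting. Uses (even, odd) interleaved pairs.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                     # (seq, 1)
    freqs = base ** (-np.arange(half) / half)         # (half,)
    angles = pos * freqs                              # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0:rot_dims:2], x[:, 1:rot_dims:2]   # interleaved pairs
    out = x.copy()
    out[:, 0:rot_dims:2] = x1 * cos - x2 * sin        # 2D rotation per pair
    out[:, 1:rot_dims:2] = x1 * sin + x2 * cos
    return out
```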
MLP
MLP scaled 3.5x with relu² activation
parameters: {"scale":3.5,"activation":"relu²"}
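The MLP entry (3.5x expansion, relu² activation) can be sketched as below; the absence of biases is an assumption:

```python
import numpy as np

def relu2(x):
    """relu squared: max(x, 0)^2, as named in the record."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    """MLP block with relu^2; hidden width is 3.5x the model dim,
    so w_in: (d, 3.5*d) and w_out: (3.5*d, d). Bias-free is an assumption."""
    return relu2(x @ w_in) @ w_out
```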
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"lr":0.0001}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50}
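A minimal sketch of the two weight-averaging schemes as parameterized in the record (EMA decay 0.997, SWA snapshot every 50 steps); how the two averages are combined or selected for the final checkpoint is not specified:

```python
import numpy as np

class WeightAverager:
    """EMA (decay 0.997) plus SWA (snapshot every 50 steps), per the record."""

    def __init__(self, params, ema_decay=0.997, swa_freq=50):
        self.ema = {k: v.copy() for k, v in params.items()}
        self.swa_sum = {k: np.zeros_like(v) for k, v in params.items()}
        self.swa_count = 0
        self.decay, self.freq, self.step = ema_decay, swa_freq, 0

    def update(self, params):
        self.step += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.freq == 0:        # SWA: snapshot every `freq` steps
            for k, v in params.items():
                self.swa_sum[k] += v
            self.swa_count += 1

    def swa(self):
        return {k: s / max(self.swa_count, 1) for k, s in self.swa_sum.items()}
```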
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":131072,"epochs_per_chunk":3,"optimizer":"AdamW","learning_rate":0.0001,"weight_decay":0,"unfrozen_params":"last 2 blocks + norms + lm_head (~5.8M / 33.6M)","cosine_lr_decay":true,"every_token_scored_before_update":true}
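The score-first discipline is the legality-critical part: every token in a chunk is scored with the current weights before any update uses that chunk, so the reported loss never sees weights trained on the tokens being scored. A toy skeleton with a from-scratch AdamW step; `loss_and_grad` is a hypothetical stand-in for the model forward/backward, and the 131072-token chunking, cosine decay across chunks, and the unfrozen-parameter subset are omitted:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW step in place (weight_decay=0 per the record's TTT config)."""
    m[:] = betas[0] * m + (1 - betas[0]) * g
    v[:] = betas[1] * v + (1 - betas[1]) * g * g
    mhat = m / (1 - betas[0] ** t)            # bias correction
    vhat = v / (1 - betas[1] ** t)
    p -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * p)

def score_first_ttt(chunks, w, loss_and_grad, epochs_per_chunk=3, lr=1e-4):
    """Score-first TTT skeleton: score each chunk, THEN adapt on it.

    `loss_and_grad(w, chunk) -> (loss, grad)` stands in for the model;
    the record unfreezes only the last 2 blocks + norms + lm_head.
    """
    m, v, t = np.zeros_like(w), np.zeros_like(w), 0
    scores = []
    for chunk in chunks:
        loss, _ = loss_and_grad(w, chunk)     # 1) score with current weights
        scores.append(loss)
        for _ in range(epochs_per_chunk):     # 2) then adapt on the chunk
            t += 1
            _, g = loss_and_grad(w, chunk)
            adamw_step(w, g, m, v, t, lr=lr)
    return scores
```

The returned per-chunk scores are what the aggregate bpb would be computed from; updates only ever help subsequent chunks.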
Initialization
OrthoInit
Orthogonal initialization of weights
Regularization
weight decay
parameters: {"value":0.04}
LR Schedule
cosine decay
parameters: null
Other
other
Soft-Round QAT: differentiable tanh-based rounding replacing the straight-through estimator (STE), with alpha annealed from 1 to 16
parameters: null
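One common tanh-based soft-rounding formulation (cf. Agustsson & Theis) matches this description: near-identity at alpha=1 and near hard rounding at alpha=16, with usable gradients throughout. The PR's exact formula and annealing curve are not given, so both are assumptions here:

```python
import numpy as np

def soft_round(x, alpha):
    """Differentiable soft rounding; approaches round(x) as alpha grows.

    Assumed formulation: floor(x) + 0.5 + 0.5*tanh(alpha*r)/tanh(alpha/2),
    with r = x - floor(x) - 0.5. At small alpha it is close to identity.
    """
    floor = np.floor(x)
    r = x - floor - 0.5
    return floor + 0.5 + 0.5 * np.tanh(alpha * r) / np.tanh(alpha / 2)

def alpha_schedule(step, total_steps, a0=1.0, a1=16.0):
    """Hypothetical linear annealing of alpha from 1 to 16 over QAT."""
    frac = min(step / max(total_steps, 1), 1.0)
    return a0 + frac * (a1 - a0)
```

Because `soft_round` is smooth in `x`, gradients flow through the rounding during QAT instead of being pasted through as with STE.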
other
GPTQ error compensation with Hessian-aware column reordering and Cholesky error redistribution, calibrated on 256 samples
parameters: null
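The GPTQ description can be sketched as: build a Hessian proxy from calibration activations, reorder columns by descending Hessian diagonal (act-order), then quantize column by column while redistributing each column's rounding error through the upper Cholesky factor of H⁻¹. The symmetric 31-level grid and per-tensor scale are assumptions; the record used 256 calibration samples:

```python
import numpy as np

def gptq_quantize(W, X, qmax=15, damp=0.01):
    """Minimal GPTQ sketch: act-order + Cholesky error redistribution.

    W: (out_features, in_features) weights; X: (in_features, n_samples)
    calibration activations. Grid and dampening factor are assumptions.
    """
    cols = W.shape[1]
    H = 2.0 * (X @ X.T) / X.shape[1]                 # Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # dampening for stability

    perm = np.argsort(-np.diag(H))                   # Hessian-aware reordering
    W = W[:, perm].copy()
    H = H[np.ix_(perm, perm)]

    scale = np.abs(W).max() / qmax                   # int5-style 31-level grid
    quant = lambda w: np.clip(np.round(w / scale), -qmax, qmax) * scale

    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky of H^-1
    Q = np.zeros_like(W)
    for i in range(cols):
        q = quant(W[:, i])
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])  # push error onto later cols

    return Q[:, np.argsort(perm)]                    # undo the reordering
```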
other
Early QAT clipping at threshold 0.5 matched to int5 range
parameters: {"threshold":0.5,"QAT_steps":1750}

Novel Contributions

  • int5 quantization with 31 unique values stored as int8, enabling a 33.6M-parameter model under 16MB
  • Hessian-aware GPTQ error compensation with column reordering and Cholesky error redistribution
  • Soft-Round QAT using differentiable tanh-based rounding in place of the straight-through estimator, with alpha annealed from 1 to 16
  • Legal score-first test-time training (TTT) with AdamW optimizer and cosine LR decay across chunks
  • Combination of early QAT clipping at 0.5 threshold and EMA with decay 0.997
  • Use of BigramHash 8192 and Partial RoPE 16/64 in architecture
  • Fitting 33.6M parameters under 16MB via int5 quantization and 2% magnitude pruning