PR #585 (closed)

Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179)

by EthanYangTW
val_bpb: 1.1179
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.53 MB, 15.36 MB, 15.28 MB

Training Techniques

Quantization
GPTQ
bits: 5
scope: all weights
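
The PR does not include the quantization code; below is a minimal numpy sketch of GPTQ-style int5 quantization as described (per-row symmetric scales, Hessian from a calibration batch, error pushed onto not-yet-quantized columns). Function names, shapes, and the dampening value are illustrative, not taken from the submission.

```python
import numpy as np

def int5_quantize(col, scale):
    """Round one weight column to the signed 5-bit grid [-16, 15]."""
    return np.clip(np.round(col / scale), -16, 15) * scale

def gptq_int5(W, X, damp=0.01):
    """Simplified GPTQ: quantize columns left to right, compensating each
    column's quantization error on the remaining columns via the inverse
    Hessian H = X^T X built from calibration activations X."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n)        # dampening for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T         # upper factor: Hinv = U^T U
    scale = np.abs(W).max(axis=1, keepdims=True) / 15  # fixed per-row int5 scale
    for j in range(n):
        q = int5_quantize(W[:, j], scale[:, 0])
        err = (W[:, j] - q) / U[j, j]
        W[:, j] = q
        if j + 1 < n:
            W[:, j + 1:] -= np.outer(err, U[j, j + 1:])  # error compensation
    return W, scale
```

Every output weight lands exactly on the per-row 5-bit grid, since each column is frozen at its quantized value once visited.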
Architecture
BigramHash
Uses BigramHash with size 8192 as part of the model architecture.
parameters: {"size":8192}
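
The record lists only the table size (8192). A hypothetical sketch of what a hashed-bigram embedding can look like: each (previous, current) token pair is hashed into one of 8192 buckets, and that bucket's embedding becomes an extra input feature. The mixing constant and the prev=0 convention at sequence start are assumptions.

```python
import numpy as np

SIZE = 8192  # table size from the record

def bigram_bucket(prev_tok: int, tok: int, size: int = SIZE) -> int:
    """Hash the (previous, current) token pair into one of `size` buckets."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # illustrative multiplicative mix
    return h % size

def bigram_features(tokens, table):
    """Look up one hashed-bigram embedding per position (prev=0 at sequence start)."""
    prev = [0] + list(tokens[:-1])
    return np.stack([table[bigram_bucket(p, t)] for p, t in zip(prev, tokens)])
```

The hash collision rate is governed entirely by the 8192-bucket budget; collisions simply share an embedding row.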
KV head count
Uses full attention with 8 attention heads and 8 KV heads (MHA 8/8).
parameters: {"heads":8,"kv_heads":8}
MLP3x
Expanded MLP hidden width to 3.5x (hidden_dim: 1792).
parameters: {"multiplier":3.5,"hidden_dim":1792}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"ratio":"16/64"}
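
A ratio of 16/64 means rotary embeddings are applied to only the first 16 of each 64-dim head; the remaining 48 dims pass through unrotated. A numpy sketch for a single head vector (the pairing convention and frequency base are assumptions):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of a head vector by position-dependent
    angles (RoPE); leave the remaining dims untouched."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)  # illustrative frequency spacing
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2, rest = x[:half], x[half:rot_dims], x[rot_dims:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos, rest])
```

Because each rotated pair undergoes a pure 2D rotation, the norm of the rotated block is preserved, and position 0 is the identity.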
XSA
XSA is applied to all 11 layers.
parameters: {"layers":11}
SmearGate
Uses SmearGate as part of the model design.
parameters: null
weight tying
Shared VE128 in layers 9 and 10.
parameters: {"layers":[9,10]}
layerwise LN scale
Uses LN scale of 1/sqrt(layer+1).
parameters: null
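
Scaling each layer's normalized output by 1/sqrt(layer+1) damps deeper layers' contributions, a common trick to keep residual-stream variance from growing with depth. As a sketch (0-indexed layers, matching "layer+1"):

```python
import math

def ln_scale(layer: int) -> float:
    """LayerNorm output multiplier for a 0-indexed layer: 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer + 1)

# For the record's 11 layers, the scale decays from 1.0 down to 1/sqrt(11).
scales = [ln_scale(i) for i in range(11)]
```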
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025}
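
Muon is listed with lr=0.025 and weight_decay=0.04 but no momentum value. A minimal numpy sketch of a Muon-style step, assuming the usual Newton-Schulz orthogonalization with the commonly published quintic coefficients and a guessed momentum of 0.95:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with the odd quintic Newton-Schulz
    iteration (coefficients from public Muon write-ups)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.025, momentum=0.95, wd=0.04):
    """One Muon-style update: momentum buffer, orthogonalized update,
    decoupled weight decay. lr/wd are from the record; momentum is a guess."""
    buf = momentum * buf + grad
    W = W * (1 - lr * wd) - lr * newton_schulz(buf)
    return W, buf
```

The iteration drives singular values of the update into a band around 1 rather than exactly to 1, which is sufficient for the optimizer's purposes.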
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50}
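
EMA (decay 0.997) and SWA (a snapshot every 50 steps) can be maintained side by side over the same weights. A sketch, simplifying the model's parameters to a single array:

```python
import numpy as np

def ema_update(avg, w, decay=0.997):
    """Exponential moving average of weights (decay from the record)."""
    return decay * avg + (1 - decay) * w

class SWA:
    """Equal-weight running average of snapshots taken every `frequency` steps."""
    def __init__(self, frequency=50):
        self.frequency, self.n, self.avg = frequency, 0, None

    def maybe_update(self, step, w):
        if step % self.frequency == 0:
            self.n += 1
            self.avg = w.copy() if self.avg is None else self.avg + (w - self.avg) / self.n
```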
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
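
With stride 32, each evaluation window scores only its last 32 tokens while the earlier tokens serve as context, so every token is scored exactly once with long left context. A sketch of the span bookkeeping (the window size is a free parameter; the record lists only the stride):

```python
def sliding_windows(n_tokens, window, stride=32):
    """Return (ctx_start, score_start, score_end) spans: tokens in
    [score_start, score_end) are scored, tokens in [ctx_start, score_start)
    are context only. Every token is scored exactly once."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
    return spans
```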
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0001,"weight_decay":0,"epochs_per_chunk":"2-3","chunk_tokens":131072}
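
"Score-first" is the legality condition: each chunk's tokens contribute to the reported loss before the model takes any gradient step on them. The record uses 131072-token chunks with 2-3 epochs per chunk; the toy count model below only illustrates the ordering, not the actual training:

```python
import math
from collections import Counter

def score_first_ttt(tokens, vocab, chunk=4, epochs=2):
    """Score-first TTT on a toy count model: each chunk is scored under the
    CURRENT model before any update, then the model adapts on that chunk.
    (chunk/epochs stand in for chunk_tokens=131072, epochs_per_chunk=2-3.)"""
    counts, total, nll = Counter(), 0, 0.0
    for i in range(0, len(tokens), chunk):
        c = tokens[i:i + chunk]
        for t in c:                                   # 1) score first
            p = (counts[t] + 1) / (total + vocab)     # add-one smoothing
            nll -= math.log(p)
        for _ in range(epochs):                       # 2) only then update
            for t in c:
                counts[t] += 1
                total += 1
    return nll / len(tokens)
```

Later chunks benefit from adaptation on earlier chunks, so the averaged loss beats the non-adaptive baseline while every token's score predates its own updates.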
Initialization
OrthoInit
Orthogonal initialization used.
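
Orthogonal initialization is typically implemented by QR-decomposing a Gaussian matrix; a numpy sketch (the gain and RNG choice are illustrative):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal weight init via QR of a Gaussian matrix. The sign fix on
    R's diagonal makes the result uniformly distributed over orthogonal frames."""
    rng = rng or np.random.default_rng(0)
    tall = shape[0] >= shape[1]
    a = rng.normal(size=shape if tall else (shape[1], shape[0]))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))
    return gain * (q if tall else q.T)
```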
Sequence Length
sequence_length
train_length: 131072
eval_length: null
LR Schedule
cosine decay
parameters: {"across_chunks":true}
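
across_chunks=true means one cosine horizon spanning the whole run rather than restarting per chunk. A sketch (peak lr 0.025 taken from the Muon entry; min_lr=0 is an assumption):

```python
import math

def cosine_lr(step, total_steps, peak_lr=0.025, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over one horizon covering all chunks."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```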
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Early QAT with int5 clipping and GPTQ Hessian-aware error compensation; legal score-first test-time training where tokens are scored before any gradient update.
parameters: {"qat_threshold":0.5,"calibration_samples":256,"prune_pct":0.02}
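
A sketch of the int5 fake-quantization forward pass that early QAT implies: weights are clipped and snapped to the 5-bit grid during the forward computation (in training this is paired with a straight-through gradient, omitted here). Interpreting qat_threshold=0.5 as the clipping range is an assumption; the record does not define it.

```python
import numpy as np

def fake_quant_int5(w, clip=0.5):
    """QAT forward pass: clip weights to +/-clip, then snap to the signed
    5-bit grid. clip=0.5 mirrors the record's qat_threshold (assumed meaning)."""
    scale = clip / 15
    return np.clip(np.round(np.clip(w, -clip, clip) / scale), -16, 15) * scale
```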

Novel Contributions

  • int5 quantization with GPTQ error compensation to fit a 33.6M parameter model under 16MB
  • Legal score-first TTT where every token is scored before any gradient update
  • Early QAT tuned to int5 clipping range
  • Use of a larger 33.6M model enabled by improved compression efficiency
  • Combination of GPTQ, pruning, and zstd compression to achieve all artifacts under 16MB