PR #585 (closed)

Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179)

by EthanYangTW
val_bpb: 1.1179
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.53 MB, 15.36 MB, 15.28 MB

Training Techniques

Quantization
GPTQ
bits: 5
scope: all weights
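
The PR does not include the quantization code; below is a minimal numpy sketch of GPTQ-style int5 quantization as described (per-row symmetric scales, Hessian from a calibration batch, error pushed onto not-yet-quantized columns). Function names, shapes, and the dampening value are illustrative, not taken from the submission.

```python
import numpy as np

def int5_quantize(col, scale):
    """Round one weight column to the signed 5-bit grid [-16, 15]."""
    return np.clip(np.round(col / scale), -16, 15) * scale

def gptq_int5(W, X, damp=0.01):
    """Simplified GPTQ: quantize columns left to right, compensating each
    column's quantization error on the remaining columns via the inverse
    Hessian H = X^T X built from calibration activations X."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n)        # dampening for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T         # upper factor: Hinv = U^T U
    scale = np.abs(W).max(axis=1, keepdims=True) / 15  # fixed per-row int5 scale
    for j in range(n):
        q = int5_quantize(W[:, j], scale[:, 0])
        err = (W[:, j] - q) / U[j, j]
        W[:, j] = q
        if j + 1 < n:
            W[:, j + 1:] -= np.outer(err, U[j, j + 1:])  # error compensation
    return W, scale
```

Every output weight lands exactly on the per-row 5-bit grid, since each column is frozen at its quantized value once visited.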
Architecture
BigramHash
Uses BigramHash with size 8192 as part of the model architecture.
parameters: {"size":8192}
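
The record lists only the table size (8192). A hypothetical sketch of what a hashed-bigram embedding can look like: each (previous, current) token pair is hashed into one of 8192 buckets, and that bucket's embedding becomes an extra input feature. The mixing constant and the prev=0 convention at sequence start are assumptions.

```python
import numpy as np

SIZE = 8192  # table size from the record

def bigram_bucket(prev_tok: int, tok: int, size: int = SIZE) -> int:
    """Hash the (previous, current) token pair into one of `size` buckets."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # illustrative multiplicative mix
    return h % size

def bigram_features(tokens, table):
    """Look up one hashed-bigram embedding per position (prev=0 at sequence start)."""
    prev = [0] + list(tokens[:-1])
    return np.stack([table[bigram_bucket(p, t)] for p, t in zip(prev, tokens)])
```

The hash collision rate is governed entirely by the 8192-bucket budget; collisions simply share an embedding row.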
KV head count
Uses full attention with 8 attention heads and 8 KV heads (MHA 8/8).
parameters: {"heads":8,"kv_heads":8}
MLP3x
Expanded MLP hidden width to 3.5x (hidden_dim: 1792).
parameters: {"multiplier":3.5,"hidden_dim":1792}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"ratio":"16/64"}
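
A ratio of 16/64 means rotary embeddings are applied to only the first 16 of each 64-dim head; the remaining 48 dims pass through unrotated. A numpy sketch for a single head vector (the pairing convention and frequency base are assumptions):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of a head vector by position-dependent
    angles (RoPE); leave the remaining dims untouched."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)  # illustrative frequency spacing
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2, rest = x[:half], x[half:rot_dims], x[rot_dims:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos, rest])
```

Because each rotated pair undergoes a pure 2D rotation, the norm of the rotated block is preserved, and position 0 is the identity.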
XSA
XSA is applied to all 11 layers.
parameters: {"layers":11}
SmearGate
Uses SmearGate as part of the model design.
parameters: null
weight tying
Shared VE128 in layers 9 and 10.
parameters: {"layers":[9,10]}
layerwise LN scale
Uses LN scale of 1/sqrt(layer+1).
parameters: null
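
Scaling each layer's normalized output by 1/sqrt(layer+1) damps deeper layers' contributions, a common trick to keep residual-stream variance from growing with depth. As a sketch (0-indexed layers, matching "layer+1"):

```python
import math

def ln_scale(layer: int) -> float:
    """LayerNorm output multiplier for a 0-indexed layer: 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer + 1)

# For the record's 11 layers, the scale decays from 1.0 down to 1/sqrt(11).
scales = [ln_scale(i) for i in range(11)]
```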
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025}
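
Muon is listed with lr=0.025 and weight_decay=0.04 but no momentum value. A minimal numpy sketch of a Muon-style step, assuming the usual Newton-Schulz orthogonalization with the commonly published quintic coefficients and a guessed momentum of 0.95:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with the odd quintic Newton-Schulz
    iteration (coefficients from public Muon write-ups)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.025, momentum=0.95, wd=0.04):
    """One Muon-style update: momentum buffer, orthogonalized update,
    decoupled weight decay. lr/wd are from the record; momentum is a guess."""
    buf = momentum * buf + grad
    W = W * (1 - lr * wd) - lr * newton_schulz(buf)
    return W, buf
```

The iteration drives singular values of the update into a band around 1 rather than exactly to 1, which is sufficient for the optimizer's purposes.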
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50}
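
EMA (decay 0.997) and SWA (a snapshot every 50 steps) can be maintained side by side over the same weights. A sketch, simplifying the model's parameters to a single array:

```python
import numpy as np

def ema_update(avg, w, decay=0.997):
    """Exponential moving average of weights (decay from the record)."""
    return decay * avg + (1 - decay) * w

class SWA:
    """Equal-weight running average of snapshots taken every `frequency` steps."""
    def __init__(self, frequency=50):
        self.frequency, self.n, self.avg = frequency, 0, None

    def maybe_update(self, step, w):
        if step % self.frequency == 0:
            self.n += 1
            self.avg = w.copy() if self.avg is None else self.avg + (w - self.avg) / self.n
```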
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
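
With stride 32, each evaluation window scores only its last 32 tokens while the earlier tokens serve as context, so every token is scored exactly once with long left context. A sketch of the span bookkeeping (the window size is a free parameter; the record lists only the stride):

```python
def sliding_windows(n_tokens, window, stride=32):
    """Return (ctx_start, score_start, score_end) spans: tokens in
    [score_start, score_end) are scored, tokens in [ctx_start, score_start)
    are context only. Every token is scored exactly once."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
    return spans
```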
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0001,"weight_decay":0,"epochs_per_chunk":"2-3","chunk_tokens":131072}
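
"Score-first" is the legality condition: each chunk's tokens contribute to the reported loss before the model takes any gradient step on them. The record uses 131072-token chunks with 2-3 epochs per chunk; the toy count model below only illustrates the ordering, not the actual training:

```python
import math
from collections import Counter

def score_first_ttt(tokens, vocab, chunk=4, epochs=2):
    """Score-first TTT on a toy count model: each chunk is scored under the
    CURRENT model before any update, then the model adapts on that chunk.
    (chunk/epochs stand in for chunk_tokens=131072, epochs_per_chunk=2-3.)"""
    counts, total, nll = Counter(), 0, 0.0
    for i in range(0, len(tokens), chunk):
        c = tokens[i:i + chunk]
        for t in c:                                   # 1) score first
            p = (counts[t] + 1) / (total + vocab)     # add-one smoothing
            nll -= math.log(p)
        for _ in range(epochs):                       # 2) only then update
            for t in c:
                counts[t] += 1
                total += 1
    return nll / len(tokens)
```

Later chunks benefit from adaptation on earlier chunks, so the averaged loss beats the non-adaptive baseline while every token's score predates its own updates.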
Initialization
OrthoInit
Orthogonal initialization used.
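
Orthogonal initialization is typically implemented by QR-decomposing a Gaussian matrix; a numpy sketch (the gain and RNG choice are illustrative):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal weight init via QR of a Gaussian matrix. The sign fix on
    R's diagonal makes the result uniformly distributed over orthogonal frames."""
    rng = rng or np.random.default_rng(0)
    tall = shape[0] >= shape[1]
    a = rng.normal(size=shape if tall else (shape[1], shape[0]))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))
    return gain * (q if tall else q.T)
```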
Sequence Length
sequence_length
train_length: 131072
eval_length: null
LR Schedule
cosine decay
parameters: {"across_chunks":true}
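
across_chunks=true means one cosine horizon spanning the whole run rather than restarting per chunk. A sketch (peak lr 0.025 taken from the Muon entry; min_lr=0 is an assumption):

```python
import math

def cosine_lr(step, total_steps, peak_lr=0.025, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over one horizon covering all chunks."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```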
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Early QAT with int5 clipping and GPTQ Hessian-aware error compensation; legal score-first test-time training where tokens are scored before any gradient update.
parameters: {"qat_threshold":0.5,"calibration_samples":256,"prune_pct":0.02}
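
A sketch of the int5 fake-quantization forward pass that early QAT implies: weights are clipped and snapped to the 5-bit grid during the forward computation (in training this is paired with a straight-through gradient, omitted here). Interpreting qat_threshold=0.5 as the clipping range is an assumption; the record does not define it.

```python
import numpy as np

def fake_quant_int5(w, clip=0.5):
    """QAT forward pass: clip weights to +/-clip, then snap to the signed
    5-bit grid. clip=0.5 mirrors the record's qat_threshold (assumed meaning)."""
    scale = clip / 15
    return np.clip(np.round(np.clip(w, -clip, clip) / scale), -16, 15) * scale
```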

Novel Contributions

  • int5 quantization with GPTQ error compensation to fit a 33.6M parameter model under 16MB
  • Legal score-first TTT where every token is scored before any gradient update
  • Early QAT tuned to int5 clipping range
  • Use of a larger 33.6M model enabled by improved compression efficiency
  • Combination of GPTQ, pruning, and zstd compression to achieve all artifacts under 16MB