PR #544

closed

int5 GPTQ + 33.6M model: 1.1179 BPB (3-seed mean)

by EthanYangTW
val_bpb: 1.1179
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.53 MB

Training Techniques

Quantization
  • int5 GPTQ (bits: 5; scope: per-row, all weights)
  • Early QAT (bits: null; scope: all)
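As a rough illustration of the per-row scheme above, here is a minimal symmetric per-row int5 quantizer in NumPy. This is a hypothetical sketch, not the submission's code: GPTQ additionally compensates rounding error column by column using second-order statistics of the layer inputs, which is omitted here.

```python
import numpy as np

def quantize_int5_per_row(w: np.ndarray):
    """Symmetric per-row int5 quantization: one scale per weight row.

    Sketch only; real GPTQ adds Hessian-based error compensation.
    """
    qmax = 2 ** (5 - 1) - 1                # int5 range: [-16, 15]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Per-row scaling keeps the reconstruction error of each weight within half a quantization step of that row's scale.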
Architecture
  • XSA: applied to all layers; parameters: {"layers": 11}
  • BigramHash: embedding/hash component; parameters: {"dimensions": 8192}
  • MLP3.5x: expanded MLP width to 3.5x; parameters: {"hidden_size": 1792}
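One plausible reading of the BigramHash component is an auxiliary embedding table indexed by a hash of each (previous token, current token) bigram into 8192 buckets, added to the regular token embedding. The PR does not show the implementation, so the hash mix constant and the assumed BOS id below are illustrative only.

```python
def bigram_hash_ids(tokens, n_buckets=8192):
    """Map each (prev, cur) token bigram to a bucket id in [0, n_buckets).

    Bucket ids would index an auxiliary embedding table; the mix
    constant 1000003 and BOS id 0 are assumptions, not the PR's values.
    """
    ids = []
    prev = 0  # assumed BOS id for the first position
    for t in tokens:
        ids.append((prev * 1000003 + t) % n_buckets)
        prev = t
    return ids
```

Because the mapping is a pure function of the bigram, repeated bigrams share a bucket and the table costs only 8192 rows regardless of vocabulary size.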
Optimizer
  • AdamW (weight_decay: null; momentum: null; other_params: {"score_first": true})
Weight Averaging
  • EMA: parameters: {"decay": 0.997}
  • SWA: parameters: {"start_step": 5450}
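The two averaging schemes above combine differently with training: EMA is an exponentially decayed shadow copy updated every step, while SWA is a uniform mean over checkpoints collected after a start step (5450 here). A minimal sketch of both updates, with parameters held in a dict:

```python
def ema_update(shadow, params, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    Decay 0.997 is the value reported in the PR."""
    for k in params:
        shadow[k] = decay * shadow[k] + (1.0 - decay) * params[k]
    return shadow

def swa_update(avg, params, n_models):
    """One SWA step: uniform running mean over checkpoints gathered
    after start_step (5450 per the PR). n_models counts prior checkpoints."""
    for k in params:
        avg[k] += (params[k] - avg[k]) / (n_models + 1)
    return avg, n_models + 1
```

The incremental SWA form avoids storing all checkpoints: after n updates, `avg` equals the plain mean of the n parameter snapshots seen so far.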
Compression
  • zstd (level: null)
Evaluation
  • sliding window eval: parameters: {"stride": 32}
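Stride-32 sliding-window evaluation typically means the context window slides forward 32 tokens at a time and each window scores only the tokens not covered by the previous one, so every token is scored exactly once with near-full left context. The PR does not show its eval loop; this is a sketch of the common scheme, with the window size as a free parameter:

```python
def sliding_windows(n_tokens, window, stride=32):
    """Plan (begin, end, n_scored) windows for sliding-window eval.

    Each token is scored exactly once; after the first window, each
    window contributes only its last `stride` (or fewer) tokens.
    """
    plan = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        plan.append((begin, end, end - prev_end))  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return plan
```

The trade-off is compute: a small stride means many overlapping forward passes, but each scored token sees almost the full window of context.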
Test-Time Training
  • score-first AdamW TTT: parameters: {"chunk_tokens": 131072, "epochs": 3, "learning_rate": 0.0001, "freeze_blocks": 2, "stride": 32}
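"Score-first" TTT means each chunk of evaluation tokens is scored with the current weights before any gradient step is taken on it, so the reported loss is never contaminated by training on the same tokens. A structural sketch of that loop, with `score_fn` and `train_fn` standing in for the submission's actual eval pass and AdamW update (all but the last 2 blocks frozen, per the PR):

```python
def score_first_ttt(score_fn, train_fn, tokens, chunk_tokens=131072, epochs=3):
    """Score-first test-time training: evaluate each chunk first, then
    adapt on it for `epochs` passes before moving to the next chunk.

    score_fn(chunk) -> mean loss; train_fn(chunk) -> one training pass.
    Both are placeholders for the submission's real model code.
    """
    total_loss, total_tokens = 0.0, 0
    for i in range(0, len(tokens), chunk_tokens):
        chunk = tokens[i:i + chunk_tokens]
        total_loss += score_fn(chunk) * len(chunk)  # evaluate first
        total_tokens += len(chunk)
        for _ in range(epochs):                     # then adapt on the chunk
            train_fn(chunk)
    return total_loss / total_tokens
```

Adaptation on chunk k only ever benefits the scoring of chunks k+1 onward, which is what makes the procedure legal for evaluation.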
Regularization
  • magnitude pruning: parameters: {"pct": 0.02}
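Magnitude pruning at pct 0.02 zeroes the smallest 2% of weights by absolute value, which helps both regularization and post-quantization compressibility. The PR does not say whether the threshold is global or per-tensor; the sketch below assumes a global threshold over one tensor.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, pct=0.02) -> np.ndarray:
    """Zero out the smallest-magnitude `pct` fraction of weights.

    Assumes a single global threshold; per-tensor vs. global scope
    is not specified in the PR.
    """
    flat = np.abs(w).ravel()
    k = int(len(flat) * pct)
    if k == 0:
        return w.copy()
    thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

Ties at the threshold can zero slightly more than pct of the weights; a strict top-k mask would avoid that at a small bookkeeping cost.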

Novel Contributions

  • First submission to achieve int5 quantization of a 33.6M-parameter model within the artifact size limit
  • GPTQ error compensation for int5 per-row quantization
  • Early QAT with threshold 0.5 and EMA decay 0.997
  • Legal (rule-compliant) score-first AdamW test-time training with the last 2 blocks unfrozen
  • XSA applied across all layers, plus a BigramHash-8192 embedding component