PR #544

closed

int5 GPTQ + 33.6M model: 1.1179 BPB (3-seed mean)

by EthanYangTW
val_bpb: 1.1179
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.53 MB

Training Techniques

Quantization
  • int5 GPTQ (bits: 5; scope: per-row, all weights)
  • Early QAT (bits: null; scope: all)
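As a rough illustration of the per-row scheme above, here is a minimal symmetric per-row int5 quantizer in NumPy. This is a hypothetical sketch, not the submission's code: GPTQ additionally compensates rounding error column by column using second-order statistics of the layer inputs, which is omitted here.

```python
import numpy as np

def quantize_int5_per_row(w: np.ndarray):
    """Symmetric per-row int5 quantization: one scale per weight row.

    Sketch only; real GPTQ adds Hessian-based error compensation.
    """
    qmax = 2 ** (5 - 1) - 1                # int5 range: [-16, 15]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Per-row scaling keeps the reconstruction error of each weight within half a quantization step of that row's scale.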
Architecture
  • XSA: applied to all layers; parameters: {"layers": 11}
  • BigramHash: embedding/hash component; parameters: {"dimensions": 8192}
  • MLP3.5x: expanded MLP width to 3.5x; parameters: {"hidden_size": 1792}
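One plausible reading of the BigramHash component is an auxiliary embedding table indexed by a hash of each (previous token, current token) bigram into 8192 buckets, added to the regular token embedding. The PR does not show the implementation, so the hash mix constant and the assumed BOS id below are illustrative only.

```python
def bigram_hash_ids(tokens, n_buckets=8192):
    """Map each (prev, cur) token bigram to a bucket id in [0, n_buckets).

    Bucket ids would index an auxiliary embedding table; the mix
    constant 1000003 and BOS id 0 are assumptions, not the PR's values.
    """
    ids = []
    prev = 0  # assumed BOS id for the first position
    for t in tokens:
        ids.append((prev * 1000003 + t) % n_buckets)
        prev = t
    return ids
```

Because the mapping is a pure function of the bigram, repeated bigrams share a bucket and the table costs only 8192 rows regardless of vocabulary size.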
Optimizer
  • AdamW (weight_decay: null; momentum: null; other_params: {"score_first": true})
Weight Averaging
  • EMA: parameters: {"decay": 0.997}
  • SWA: parameters: {"start_step": 5450}
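The two averaging schemes above combine differently with training: EMA is an exponentially decayed shadow copy updated every step, while SWA is a uniform mean over checkpoints collected after a start step (5450 here). A minimal sketch of both updates, with parameters held in a dict:

```python
def ema_update(shadow, params, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    Decay 0.997 is the value reported in the PR."""
    for k in params:
        shadow[k] = decay * shadow[k] + (1.0 - decay) * params[k]
    return shadow

def swa_update(avg, params, n_models):
    """One SWA step: uniform running mean over checkpoints gathered
    after start_step (5450 per the PR). n_models counts prior checkpoints."""
    for k in params:
        avg[k] += (params[k] - avg[k]) / (n_models + 1)
    return avg, n_models + 1
```

The incremental SWA form avoids storing all checkpoints: after n updates, `avg` equals the plain mean of the n parameter snapshots seen so far.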
Compression
  • zstd (level: null)
Evaluation
  • sliding window eval: parameters: {"stride": 32}
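Stride-32 sliding-window evaluation typically means the context window slides forward 32 tokens at a time and each window scores only the tokens not covered by the previous one, so every token is scored exactly once with near-full left context. The PR does not show its eval loop; this is a sketch of the common scheme, with the window size as a free parameter:

```python
def sliding_windows(n_tokens, window, stride=32):
    """Plan (begin, end, n_scored) windows for sliding-window eval.

    Each token is scored exactly once; after the first window, each
    window contributes only its last `stride` (or fewer) tokens.
    """
    plan = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        plan.append((begin, end, end - prev_end))  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return plan
```

The trade-off is compute: a small stride means many overlapping forward passes, but each scored token sees almost the full window of context.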
Test-Time Training
  • score-first AdamW TTT: parameters: {"chunk_tokens": 131072, "epochs": 3, "learning_rate": 0.0001, "freeze_blocks": 2, "stride": 32}
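"Score-first" TTT means each chunk of evaluation tokens is scored with the current weights before any gradient step is taken on it, so the reported loss is never contaminated by training on the same tokens. A structural sketch of that loop, with `score_fn` and `train_fn` standing in for the submission's actual eval pass and AdamW update (all but the last 2 blocks frozen, per the PR):

```python
def score_first_ttt(score_fn, train_fn, tokens, chunk_tokens=131072, epochs=3):
    """Score-first test-time training: evaluate each chunk first, then
    adapt on it for `epochs` passes before moving to the next chunk.

    score_fn(chunk) -> mean loss; train_fn(chunk) -> one training pass.
    Both are placeholders for the submission's real model code.
    """
    total_loss, total_tokens = 0.0, 0
    for i in range(0, len(tokens), chunk_tokens):
        chunk = tokens[i:i + chunk_tokens]
        total_loss += score_fn(chunk) * len(chunk)  # evaluate first
        total_tokens += len(chunk)
        for _ in range(epochs):                     # then adapt on the chunk
            train_fn(chunk)
    return total_loss / total_tokens
```

Adaptation on chunk k only ever benefits the scoring of chunks k+1 onward, which is what makes the procedure legal for evaluation.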
Regularization
  • magnitude pruning: parameters: {"pct": 0.02}
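Magnitude pruning at pct 0.02 zeroes the smallest 2% of weights by absolute value, which helps both regularization and post-quantization compressibility. The PR does not say whether the threshold is global or per-tensor; the sketch below assumes a global threshold over one tensor.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, pct=0.02) -> np.ndarray:
    """Zero out the smallest-magnitude `pct` fraction of weights.

    Assumes a single global threshold; per-tensor vs. global scope
    is not specified in the PR.
    """
    flat = np.abs(w).ravel()
    k = int(len(flat) * pct)
    if k == 0:
        return w.copy()
    thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

Ties at the threshold can zero slightly more than pct of the weights; a strict top-k mask would avoid that at a small bookkeeping cost.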

Novel Contributions

  • First submission to achieve int5 quantization of a 33.6M-parameter model within the artifact size limit
  • GPTQ error compensation for int5 per-row quantization
  • Early QAT with threshold 0.5 and EMA decay 0.997
  • Legal (rule-compliant) score-first AdamW test-time training with the last 2 blocks unfrozen
  • XSA applied across all layers, plus a BigramHash-8192 embedding component