PR #545
Closed
Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179)
by EthanYangTW
val_bpb: 1.1179
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.53 MB
Training Techniques
Quantization
- GPTQ: 5 bits, all weights
- QAT: 5 bits, all weights
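The QAT entry above can be illustrated with a minimal fake-quantization sketch. This is an assumption-laden illustration, not the submission's code: it shows naive symmetric per-row int5 rounding, whereas the actual pipeline additionally uses GPTQ error compensation (see Novel Contributions).

```python
import numpy as np

def fake_quant_int5(w: np.ndarray) -> np.ndarray:
    """Symmetric per-row int5 fake quantization (illustrative only; the
    submission layers GPTQ error compensation on top of plain rounding)."""
    qmax = 2 ** (5 - 1) - 1                                   # int5 -> [-15, 15]
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)             # snap to integer grid
    return q * scale                                          # dequantize for QAT pass
```

In QAT, a function like this runs in the forward pass while gradients flow through to the full-precision weights (straight-through estimator).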
Architecture
- XSA: applied to all layers; parameters: {"layers":11}
- BigramHash: hashed bigram token feature module; parameters: {"dimensions":8192}
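A hypothetical sketch of a BigramHash feature: hash each (previous, current) token pair into one of 8192 buckets (the listed "dimensions") and look up a learned embedding row per bucket. The exact hash function and padding convention are assumptions, not taken from the PR.

```python
import numpy as np

TABLE_SIZE = 8192  # matches {"dimensions": 8192} above

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative mixing hash; the submission's hash is unspecified.
    return (prev_tok * 1_000_003 + cur_tok) % TABLE_SIZE

def bigram_features(tokens: list, table: np.ndarray) -> np.ndarray:
    # table: (TABLE_SIZE, d). Position 0 uses a padding predecessor of 0.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]
```

The hashed table keeps the bigram feature's parameter count fixed at 8192 rows regardless of vocabulary size, which matters under a 16 MB artifact budget.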
- Partial RoPE: partial rotary positional embeddings; parameters: {"train_length":null,"eval_length":null}
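Partial RoPE rotates only a prefix of each head's dimensions and leaves the rest position-independent. The rotated fraction is not stated in the PR, so `rot_dims` below is a free parameter in this sketch.

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first rot_dims (even) dims of x: (seq, head_dim);
    the remaining dims pass through unrotated."""
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]
    freqs = pos / base ** (np.arange(half) / half)     # (seq, half) angles
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:, :half], x[:, half:rot_dims]          # paired channels
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```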
- KV head count: 8 attention heads / 8 KV heads; parameters: {"heads":8,"kv_heads":8}
- MLP3.5x: MLP width expanded to 3.5x the hidden size; parameters: {"hidden_size":512,"mlp_size":1792}
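The listed widths check out: 1792 = 3.5 × 512. A minimal sketch of such an MLP block follows; a plain tanh-approximated GELU is assumed, since the PR does not state the activation or any gating.

```python
import numpy as np

D_MODEL, D_MLP = 512, 1792  # 1792 = 3.5 * 512, per the listed parameters

def mlp(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    h = x @ w_in                                                  # (n, 512) -> (n, 1792)
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU
    return h @ w_out                                              # (n, 1792) -> (n, 512)
```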
Optimizer
- AdamW: weight_decay: 0, momentum: null, other_params: {"learning_rate":0.0001}
Weight Averaging
- EMA: parameters: {"decay":0.997}
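EMA weight averaging with the listed decay of 0.997 maintains a shadow copy of the weights that is blended after every optimizer step and used for evaluation. A minimal sketch (the dict-of-arrays representation is illustrative):

```python
def ema_update(shadow: dict, model: dict, decay: float = 0.997) -> None:
    """Blend live model weights into the EMA shadow copy in place."""
    for name, w in model.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * w
```

With decay 0.997, the shadow averages over an effective window of roughly 1/(1-0.997) ≈ 333 recent steps.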
Compression
- zstd: level 22
Evaluation
- sliding window eval: parameters: {"stride":32}
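Sliding-window evaluation scores a long sequence with a fixed context window advanced by the listed stride of 32, counting loss only on the newly exposed tokens so each token is scored once with near-maximal left context. A sketch of the window arithmetic (the window length is an assumed example, not from the PR):

```python
def sliding_windows(n_tokens: int, window: int = 512, stride: int = 32):
    """Yield (start, end, n_scored) triples covering every token exactly once."""
    pos = 0
    while pos < n_tokens:
        # First window scores all its tokens; later windows score only `stride` new ones.
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        start = max(0, end - window)   # slide the left edge to keep full context
        yield start, end, end - pos
        pos = end
```

Smaller strides give each token more context at the cost of proportionally more forward passes.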
Test-Time Training
- score-first TTT: parameters: {"learning_rate":0.0001,"chunk_tokens":131072,"freeze_blocks":2,"optimizer":"AdamW"}
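The "score-first" ordering is what makes this TTT legal per the Novel Contributions: each 131072-token chunk is scored with the current weights before any gradient step touches it, so no token is ever predicted by a model that already trained on it. A sketch of the control flow (`score_fn`/`update_fn` are illustrative names; freezing the listed 2 blocks would live inside `update_fn`):

```python
def score_first_ttt(chunks, score_fn, update_fn, lr=1e-4):
    """Score each chunk BEFORE adapting on it; return mean per-token loss."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:                  # e.g. 131072 tokens per chunk
        loss, n = score_fn(chunk)         # 1) evaluate with current weights
        total_loss += loss * n
        total_tokens += n
        update_fn(chunk, lr)              # 2) only then take gradient steps
    return total_loss / total_tokens
```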
Initialization
- OrthoInit: used for model initialization
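"OrthoInit" presumably means orthogonal weight initialization; the standard recipe is QR decomposition of a Gaussian matrix. A sketch under that assumption:

```python
import numpy as np

def orthogonal_init(shape, rng=None):
    """Return a (rows, cols) matrix with orthonormal columns (or rows if rows < cols)."""
    if rng is None:
        rng = np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # fix QR sign ambiguity for a uniform draw
    return q[:rows, :cols] if rows >= cols else q[:cols, :rows].T
```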
Regularization
- layerwise LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
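The listed scale 1/sqrt(layer+1) shrinks each layer's LayerNorm gain with depth, damping the contribution of deeper blocks. A one-line sketch, assuming 0-indexed layers:

```python
import math

def ln_scale(layer: int) -> float:
    """LayerNorm gain multiplier for 0-indexed layer: 1/sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer + 1)
```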
LR Schedule
- cosine decay: parameters: {"across_chunks":true}
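With {"across_chunks": true}, one cosine schedule spans the entire token stream rather than restarting for each TTT chunk. A sketch, taking the peak rate from the 1e-4 listed above:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-4) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps, with total_steps
    counted across all chunks (one global schedule, no per-chunk restart)."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))
```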
Novel Contributions
- First submission to achieve int5 quantization on a 33.6M model within the artifact size limit
- GPTQ error compensation enabling near-lossless int5 quantization
- Legal score-first test-time training, in which tokens are scored before any gradient update on them
- 33.6M-parameter architecture with full attention and BigramHash fitting under the 16 MB artifact limit