PR #528 (closed)

Record: GPTQ + Legal TTT (3-seed mean val_bpb=1.1195)

by EthanYangTW
val_bpb: 1.1195
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.96 MB

Training Techniques

Quantization
  • GPTQ: bits 6, scope all
  • QAT: bits 6, scope all
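As a rough illustration of how GPTQ differs from plain round-to-nearest, here is a minimal single-row sketch of the OBQ-style update: each column is quantized in turn and its rounding error is spread over the not-yet-quantized columns via the inverse Hessian of the layer inputs. This is not the record's implementation (real GPTQ works blockwise with a Cholesky factorization, and may reorder columns by Hessian diagonal); the function name, dampening value, and symmetric 6-bit grid are assumptions for the sketch.

```python
import numpy as np

def gptq_row(w, X, bits=6, damp=0.01):
    """Quantize one weight row column-by-column, compensating each
    column's rounding error on the remaining columns (OBQ-style).
    w: (d,) weight row; X: (n, d) calibration inputs."""
    d = w.size
    H = X.T @ X / len(X)                           # proxy Hessian from calibration data
    H += damp * np.mean(np.diag(H)) * np.eye(d)    # dampening for invertibility
    Hinv = np.linalg.inv(H)
    w = w.astype(float).copy()
    q = np.zeros_like(w)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / hi                 # symmetric 6-bit grid
    for i in range(d):                             # column by column
        q[i] = np.clip(np.round(w[i] / scale), lo, hi) * scale
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i:] -= err * Hinv[i, i:]                 # spread the error over later columns
    return q, scale
```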
Architecture
  • XSA: applied to all layers in the model (layers: 11)
  • Partial RoPE: partial rotary positional embeddings (dimensions: 16/64)
  • SmearGate: SmearGate with OrthoInit
  • BigramHash: BigramHash feature with shared VE128 in later layers (dimensions: 2048)
  • KV head count: grouped-query attention with 8 attention heads and 4 KV heads
  • MLP3x: MLP with 3x relu²
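The 8-head / 4-KV-head entry is standard grouped-query attention: each pair of query heads shares one KV head. A minimal NumPy sketch (causal, single sequence; head dimension and function name are assumptions, not the record's code):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q is (T, n_heads, hd);
    k and v are (T, n_kv_heads, hd). Each group of
    n_heads // n_kv_heads query heads shares one KV head."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                # expand KV heads to match query heads
    v = np.repeat(v, group, axis=1)
    T, hd = q.shape[0], q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), 1) # causal mask: no attending forward
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax over key positions
    return np.einsum('hqk,khd->qhd', w, v)
```

The memory win is that the KV cache stores 4 heads instead of 8, halving its size at the same query-head count.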
Optimizer
  • AdamW: learning_rate 0.0001, weight_decay 0
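For reference, one AdamW step with decoupled weight decay (with weight_decay 0, as configured here, it reduces to plain Adam). A generic sketch, not the record's training loop; default betas and eps are assumptions:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW update: Adam moments plus decoupled weight decay.
    p: params; g: gradient; m, v: first/second moment state; t: step (1-based)."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)                # bias correction
    v_hat = v / (1 - betas[1] ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * p
    return p, m, v
```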
Weight Averaging
  • EMA: decay 0.997
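The EMA entry is the usual exponential moving average of weights, with the decay tuned to 0.997 per the contributions list. A minimal sketch (dict-of-arrays parameter layout is an assumption):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * current.
    Evaluation then typically uses the shadow weights, not the live ones."""
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}
```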
Compression
  • zstd: level 22
Evaluation
  • sliding window eval: stride 32
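Sliding-window evaluation with a small stride gives every token near-maximal left context at the cost of more forward passes. The stride of 32 is from the record; the window size, function names, and `score_fn` interface (returning per-token NLLs for all but the first token of its input) are assumptions for this sketch:

```python
import numpy as np

def sliding_window_nll(score_fn, tokens, window=128, stride=32):
    """Mean per-token NLL over `tokens`: each window advances by `stride`,
    and only the newest `stride` tokens of each window are scored, so every
    scored token sees up to `window` tokens of left context."""
    nlls = list(score_fn(tokens[:window]))         # first window: score everything
    pos = window
    while pos < len(tokens):
        ctx = tokens[max(0, pos + stride - window): pos + stride]
        out = score_fn(ctx)
        nlls.extend(out[-min(stride, len(tokens) - pos):])  # only the new tokens
        pos += stride
    return float(np.mean(nlls))
```

Converting to bits per byte divides the natural-log NLL by ln 2 and by the mean bytes per token of the tokenizer.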
Test-Time Training
  • score-first TTT: epochs_per_chunk 3, learning_rate 0.0001, weight_decay 0
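What makes the TTT "legal" (score-first) is the ordering: every chunk is scored with the current weights before any gradient step uses its tokens, so no token's loss ever benefits from having trained on itself. A structural sketch of that loop; the `model.score` / `model.train_step` interface is an assumption:

```python
def score_first_ttt(model, chunks, epochs_per_chunk=3):
    """'Legal' test-time training: score each chunk with pre-update
    weights, then adapt on the already-scored chunk before moving on."""
    losses = []
    for chunk in chunks:
        losses.append(model.score(chunk))      # evaluation sees only past updates
        for _ in range(epochs_per_chunk):
            model.train_step(chunk)            # adaptation happens after scoring
    return losses
```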
Initialization
  • OrthoInit: orthogonal initialization used with SmearGate
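Orthogonal initialization is commonly done via QR decomposition of a Gaussian matrix. A generic sketch (the record's exact OrthoInit scheme and gain are not specified, so the details below are assumptions):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix: the shorter
    dimension's vectors come out orthonormal."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))       # fix signs so the factorization is unique
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```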
Sequence Length
  • train_length: 131072; eval_length: not specified
LR Schedule
  • cosine decay
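Cosine decay takes the learning rate from its peak (1e-4 here) down along a half cosine. A standard sketch; the floor of 0 and the absence of warmup are assumptions, since the record lists no schedule parameters:

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```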
Regularization
  • layerwise LN scale
Other
  • Early QAT with threshold 0.5 and 0.9995-percentile clipping before GPTQ
  • 2% magnitude pruning (sparsity: 0.02)
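The percentile clipping and magnitude pruning listed above are both simple tensor-level operations. A sketch of the two (the per-tensor granularity and function names are assumptions; only the 0.9995 percentile and 2% sparsity come from the record):

```python
import numpy as np

def percentile_clip(w, pct=0.9995):
    """Clip weights to the given magnitude percentile, taming
    outliers before quantization."""
    lim = np.quantile(np.abs(w), pct)
    return np.clip(w, -lim, lim)

def magnitude_prune(w, sparsity=0.02):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    thresh = np.quantile(np.abs(w), sparsity)
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out
```

Pruning small weights to exact zeros also makes the artifact more compressible, which pairs with the zstd-22 step.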

Novel Contributions

  • GPTQ quantization with Hessian-aware error compensation and column reordering
  • Early QAT with threshold 0.5 and longer adaptation to quantization noise
  • EMA decay tuned to 0.997
  • Legal score-first TTT where each token is scored before any gradient update using it
  • Sliding-window evaluation with stride 32
  • 2% magnitude pruning and zstd-22 compression