PR #508

closed

GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215

by newjordan
val_bpb: 1.1215
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.56 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: weights
QAT
bits: 6
scope: weights
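The card exports weights at 6 bits with per-row scales. A minimal sketch of plain per-row symmetric int6 quantization (GPTQ proper additionally applies Hessian-aware, column-by-column error compensation, per the Novel Contributions; that step is omitted here, and the function names are illustrative, not the submission's code):

```python
import numpy as np

def quantize_per_row_int6(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # One scale per output row, chosen so max |weight| maps to 31
    # (symmetric 6-bit range -31..31; assumption, the card only says bits: 6).
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The per-element rounding error of this baseline is bounded by half a quantization step per row; GPTQ's error compensation is what closes the remaining gap.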
Architecture
Partial RoPE
Applies rotary positional embeddings to only 16 of the 64 head dimensions.
parameters: {"dimensions":16,"base_dimensions":64}
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
SmearGate
Adds SmearGate to the MLP/activation path.
parameters: null
BigramHash
Adds a bigram hashing component with 2048 buckets.
parameters: {"buckets":2048}
MLP3x
Uses 3x MLP expansion with relu².
parameters: {"expansion":3}
Tied Embeddings
Input and output embeddings are tied.
parameters: null
KV Head Count
Uses grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4,"heads":8}
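With grouped-query attention at `kv_heads: 4, heads: 8`, every two query heads share one KV head. A sketch of the head-expansion step (the helper name and array layout are assumptions for illustration, not the submission's code):

```python
import numpy as np

def expand_kv_heads(kv: np.ndarray, n_heads: int) -> np.ndarray:
    # kv: (kv_heads, seq, head_dim). Repeat each KV head so that
    # n_heads // kv_heads consecutive query heads attend to the same KV head.
    kv_heads = kv.shape[0]
    assert n_heads % kv_heads == 0
    group = n_heads // kv_heads  # 8 // 4 = 2 queries per KV head here
    return np.repeat(kv, group, axis=0)  # -> (n_heads, seq, head_dim)
```

The KV cache (and the quantized artifact) only stores the 4 real KV heads; the expansion is a view-level duplication at attention time.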
Weight Averaging
EMA
parameters: {"decay":0.995}
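The EMA step with `decay: 0.995` is the standard exponential moving average over parameters; a minimal sketch (dict-of-floats interface is an assumption):

```python
def ema_update(avg: dict, params: dict, decay: float = 0.995) -> None:
    # avg <- decay * avg + (1 - decay) * params, applied in place each step.
    for k, p in params.items():
        avg[k] = decay * avg[k] + (1.0 - decay) * p
```

At decay 0.995 the average has an effective horizon of roughly 1/(1-0.995) = 200 steps.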
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"epochs_per_chunk":3,"grad_clip":1}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
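Sliding-window evaluation with `stride: 32` scores each window only on its final stride tokens (the first window is scored in full), so every token contributes to val_bpb exactly once while still getting long context. A sketch of the window bookkeeping (the window size of 128 is an assumption; the card only fixes the stride):

```python
def eval_windows(n_tokens: int, window: int = 128, stride: int = 32):
    # Returns (context_start, end, score_from) triples: tokens in
    # [score_from, end) are scored using context [context_start, end).
    spans = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans
```

Smaller strides give each scored token more preceding context at the cost of proportionally more forward passes.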
Test-Time Training
score-first TTT
parameters: {"epochs":8,"learning_rate":0.002,"momentum":0.9}
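"Score-first" TTT is what keeps the adaptation legal: each chunk is scored with the current weights before any gradient step on it, so the model never evaluates tokens it has already trained on. A sketch of the loop with the card's `epochs: 8, learning_rate: 0.002, momentum: 0.9` (`model.score` and `model.train_step` are hypothetical interfaces, not the submission's API):

```python
def score_first_ttt(model, chunks, epochs=8, lr=2e-3, momentum=0.9):
    total_bits, total_tokens = 0.0, 0
    velocity = None  # SGD momentum buffer, carried across chunks
    for chunk in chunks:
        bits, n = model.score(chunk)          # evaluate first (legal TTT)
        total_bits += bits
        total_tokens += n
        for _ in range(epochs):               # then adapt on the same chunk
            velocity = model.train_step(chunk, lr, momentum, velocity)
    return total_bits / total_tokens          # bits per byte over the stream
```

The EMA scoring and embedding freeze from the card plug into `score` and `train_step` respectively: scoring uses the averaged weights, and frozen components are excluded from the update.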
LR Schedule
cosine decay
parameters: {"over_actual_training_window":true,"chunks":200}
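`over_actual_training_window: true` (the "cosine LR fix" from the Novel Contributions) means the cosine schedule spans exactly the 200 chunks that are really trained, rather than a longer nominal horizon, so the LR actually reaches ~0 on the final chunk. A sketch (`base_lr` reuses the TTT learning rate from the card as an assumption):

```python
import math

def cosine_lr(chunk: int, n_chunks: int = 200, base_lr: float = 2e-3) -> float:
    # Decay from base_lr at chunk 0 to 0 at the last trained chunk.
    progress = min(chunk / max(n_chunks - 1, 1), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Scheduling over a too-long horizon leaves the LR well above zero at the end of training, which hurts the final EMA'd weights.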
Regularization
embedding freeze
parameters: {"frozen_components":["tok_emb","bigram","ve_shared"]}
Initialization
OrthoInit
Orthogonal initialization.

Novel Contributions

  • GPTQ quantization with Hessian-aware error compensation for int6 per-row quantization
  • Early QAT with matched clipping to the GPTQ export quantizer
  • Legal score-first TTT with EMA scoring and cosine LR fix
  • Embedding freezing during TTT
  • Reduced the quantization tax (BPB lost to quantizing the weights) from 0.0082 to 0.0058