PR #508

closed

GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215

by newjordan
val_bpb: 1.1215
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.56 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: weights
QAT
bits: 6
scope: weights
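The card exports weights at 6 bits with per-row scales. A minimal sketch of plain per-row symmetric int6 quantization (GPTQ proper additionally applies Hessian-aware, column-by-column error compensation, per the Novel Contributions; that step is omitted here, and the function names are illustrative, not the submission's code):

```python
import numpy as np

def quantize_per_row_int6(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # One scale per output row, chosen so max |weight| maps to 31
    # (symmetric 6-bit range -31..31; assumption, the card only says bits: 6).
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The per-element rounding error of this baseline is bounded by half a quantization step per row; GPTQ's error compensation is what closes the remaining gap.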
Architecture
Partial RoPE
Applies rotary positional embeddings to only 16 of the 64 head dimensions.
parameters: {"dimensions":16,"base_dimensions":64}
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
SmearGate
Adds SmearGate to the MLP/activation path.
parameters: null
BigramHash
Adds a bigram hashing component with 2048 buckets.
parameters: {"buckets":2048}
MLP3x
Uses 3x MLP expansion with relu².
parameters: {"expansion":3}
Tied Embeddings
Input and output embeddings are tied.
parameters: null
KV Head Count
Uses grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4,"heads":8}
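With grouped-query attention at `kv_heads: 4, heads: 8`, every two query heads share one KV head. A sketch of the head-expansion step (the helper name and array layout are assumptions for illustration, not the submission's code):

```python
import numpy as np

def expand_kv_heads(kv: np.ndarray, n_heads: int) -> np.ndarray:
    # kv: (kv_heads, seq, head_dim). Repeat each KV head so that
    # n_heads // kv_heads consecutive query heads attend to the same KV head.
    kv_heads = kv.shape[0]
    assert n_heads % kv_heads == 0
    group = n_heads // kv_heads  # 8 // 4 = 2 queries per KV head here
    return np.repeat(kv, group, axis=0)  # -> (n_heads, seq, head_dim)
```

The KV cache (and the quantized artifact) only stores the 4 real KV heads; the expansion is a view-level duplication at attention time.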
Weight Averaging
EMA
parameters: {"decay":0.995}
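The EMA step with `decay: 0.995` is the standard exponential moving average over parameters; a minimal sketch (dict-of-floats interface is an assumption):

```python
def ema_update(avg: dict, params: dict, decay: float = 0.995) -> None:
    # avg <- decay * avg + (1 - decay) * params, applied in place each step.
    for k, p in params.items():
        avg[k] = decay * avg[k] + (1.0 - decay) * p
```

At decay 0.995 the average has an effective horizon of roughly 1/(1-0.995) = 200 steps.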
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"epochs_per_chunk":3,"grad_clip":1}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
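Sliding-window evaluation with `stride: 32` scores each window only on its final stride tokens (the first window is scored in full), so every token contributes to val_bpb exactly once while still getting long context. A sketch of the window bookkeeping (the window size of 128 is an assumption; the card only fixes the stride):

```python
def eval_windows(n_tokens: int, window: int = 128, stride: int = 32):
    # Returns (context_start, end, score_from) triples: tokens in
    # [score_from, end) are scored using context [context_start, end).
    spans = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans
```

Smaller strides give each scored token more preceding context at the cost of proportionally more forward passes.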
Test-Time Training
score-first TTT
parameters: {"epochs":8,"learning_rate":0.002,"momentum":0.9}
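"Score-first" TTT is what keeps the adaptation legal: each chunk is scored with the current weights before any gradient step on it, so the model never evaluates tokens it has already trained on. A sketch of the loop with the card's `epochs: 8, learning_rate: 0.002, momentum: 0.9` (`model.score` and `model.train_step` are hypothetical interfaces, not the submission's API):

```python
def score_first_ttt(model, chunks, epochs=8, lr=2e-3, momentum=0.9):
    total_bits, total_tokens = 0.0, 0
    velocity = None  # SGD momentum buffer, carried across chunks
    for chunk in chunks:
        bits, n = model.score(chunk)          # evaluate first (legal TTT)
        total_bits += bits
        total_tokens += n
        for _ in range(epochs):               # then adapt on the same chunk
            velocity = model.train_step(chunk, lr, momentum, velocity)
    return total_bits / total_tokens          # bits per byte over the stream
```

The EMA scoring and embedding freeze from the card plug into `score` and `train_step` respectively: scoring uses the averaged weights, and frozen components are excluded from the update.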
LR Schedule
cosine decay
parameters: {"over_actual_training_window":true,"chunks":200}
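`over_actual_training_window: true` (the "cosine LR fix" from the Novel Contributions) means the cosine schedule spans exactly the 200 chunks that are really trained, rather than a longer nominal horizon, so the LR actually reaches ~0 on the final chunk. A sketch (`base_lr` reuses the TTT learning rate from the card as an assumption):

```python
import math

def cosine_lr(chunk: int, n_chunks: int = 200, base_lr: float = 2e-3) -> float:
    # Decay from base_lr at chunk 0 to 0 at the last trained chunk.
    progress = min(chunk / max(n_chunks - 1, 1), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Scheduling over a too-long horizon leaves the LR well above zero at the end of training, which hurts the final EMA'd weights.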
Regularization
embedding freeze
parameters: {"frozen_components":["tok_emb","bigram","ve_shared"]}
Initialization
OrthoInit
Orthogonal initialization.

Novel Contributions

  • GPTQ quantization with Hessian-aware error compensation for int6 per-row quantization
  • Early QAT with matched clipping to the GPTQ export quantizer
  • Legal score-first TTT with EMA scoring and cosine LR fix
  • Embedding freezing during TTT
  • Reduced the quantization tax (BPB lost to quantizing the weights) from 0.0082 to 0.0058