PR #528 (closed)

Record: GPTQ + Legal TTT (3-seed mean val_bpb=1.1195)

by EthanYangTW
val_bpb: 1.1195
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.96 MB

Training Techniques

Quantization
  • GPTQ: bits 6, scope all
  • QAT: bits 6, scope all
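As a rough illustration of how GPTQ differs from plain round-to-nearest, here is a minimal single-row sketch of the OBQ-style update: each column is quantized in turn and its rounding error is spread over the not-yet-quantized columns via the inverse Hessian of the layer inputs. This is not the record's implementation (real GPTQ works blockwise with a Cholesky factorization, and may reorder columns by Hessian diagonal); the function name, dampening value, and symmetric 6-bit grid are assumptions for the sketch.

```python
import numpy as np

def gptq_row(w, X, bits=6, damp=0.01):
    """Quantize one weight row column-by-column, compensating each
    column's rounding error on the remaining columns (OBQ-style).
    w: (d,) weight row; X: (n, d) calibration inputs."""
    d = w.size
    H = X.T @ X / len(X)                           # proxy Hessian from calibration data
    H += damp * np.mean(np.diag(H)) * np.eye(d)    # dampening for invertibility
    Hinv = np.linalg.inv(H)
    w = w.astype(float).copy()
    q = np.zeros_like(w)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / hi                 # symmetric 6-bit grid
    for i in range(d):                             # column by column
        q[i] = np.clip(np.round(w[i] / scale), lo, hi) * scale
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i:] -= err * Hinv[i, i:]                 # spread the error over later columns
    return q, scale
```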
Architecture
  • XSA: applied to all layers in the model (layers: 11)
  • Partial RoPE: partial rotary positional embeddings (dimensions: 16/64)
  • SmearGate: SmearGate with OrthoInit
  • BigramHash: BigramHash feature with shared VE128 in later layers (dimensions: 2048)
  • KV head count: grouped-query attention with 8 attention heads and 4 KV heads
  • MLP3x: MLP with 3x relu²
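The 8-head / 4-KV-head entry is standard grouped-query attention: each pair of query heads shares one KV head. A minimal NumPy sketch (causal, single sequence; head dimension and function name are assumptions, not the record's code):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q is (T, n_heads, hd);
    k and v are (T, n_kv_heads, hd). Each group of
    n_heads // n_kv_heads query heads shares one KV head."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                # expand KV heads to match query heads
    v = np.repeat(v, group, axis=1)
    T, hd = q.shape[0], q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), 1) # causal mask: no attending forward
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax over key positions
    return np.einsum('hqk,khd->qhd', w, v)
```

The memory win is that the KV cache stores 4 heads instead of 8, halving its size at the same query-head count.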
Optimizer
  • AdamW: learning_rate 0.0001, weight_decay 0
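For reference, one AdamW step with decoupled weight decay (with weight_decay 0, as configured here, it reduces to plain Adam). A generic sketch, not the record's training loop; default betas and eps are assumptions:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.0):
    """One AdamW update: Adam moments plus decoupled weight decay.
    p: params; g: gradient; m, v: first/second moment state; t: step (1-based)."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)                # bias correction
    v_hat = v / (1 - betas[1] ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * p
    return p, m, v
```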
Weight Averaging
  • EMA: decay 0.997
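The EMA entry is the usual exponential moving average of weights, with the decay tuned to 0.997 per the contributions list. A minimal sketch (dict-of-arrays parameter layout is an assumption):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * current.
    Evaluation then typically uses the shadow weights, not the live ones."""
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}
```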
Compression
  • zstd: level 22
Evaluation
  • sliding window eval: stride 32
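Sliding-window evaluation with a small stride gives every token near-maximal left context at the cost of more forward passes. The stride of 32 is from the record; the window size, function names, and `score_fn` interface (returning per-token NLLs for all but the first token of its input) are assumptions for this sketch:

```python
import numpy as np

def sliding_window_nll(score_fn, tokens, window=128, stride=32):
    """Mean per-token NLL over `tokens`: each window advances by `stride`,
    and only the newest `stride` tokens of each window are scored, so every
    scored token sees up to `window` tokens of left context."""
    nlls = list(score_fn(tokens[:window]))         # first window: score everything
    pos = window
    while pos < len(tokens):
        ctx = tokens[max(0, pos + stride - window): pos + stride]
        out = score_fn(ctx)
        nlls.extend(out[-min(stride, len(tokens) - pos):])  # only the new tokens
        pos += stride
    return float(np.mean(nlls))
```

Converting to bits per byte divides the natural-log NLL by ln 2 and by the mean bytes per token of the tokenizer.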
Test-Time Training
  • score-first TTT: epochs_per_chunk 3, learning_rate 0.0001, weight_decay 0
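What makes the TTT "legal" (score-first) is the ordering: every chunk is scored with the current weights before any gradient step uses its tokens, so no token's loss ever benefits from having trained on itself. A structural sketch of that loop; the `model.score` / `model.train_step` interface is an assumption:

```python
def score_first_ttt(model, chunks, epochs_per_chunk=3):
    """'Legal' test-time training: score each chunk with pre-update
    weights, then adapt on the already-scored chunk before moving on."""
    losses = []
    for chunk in chunks:
        losses.append(model.score(chunk))      # evaluation sees only past updates
        for _ in range(epochs_per_chunk):
            model.train_step(chunk)            # adaptation happens after scoring
    return losses
```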
Initialization
  • OrthoInit: orthogonal initialization used with SmearGate
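Orthogonal initialization is commonly done via QR decomposition of a Gaussian matrix. A generic sketch (the record's exact OrthoInit scheme and gain are not specified, so the details below are assumptions):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix: the shorter
    dimension's vectors come out orthonormal."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))       # fix signs so the factorization is unique
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```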
Sequence Length
  • train_length: 131072; eval_length: not specified
LR Schedule
  • cosine decay
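Cosine decay takes the learning rate from its peak (1e-4 here) down along a half cosine. A standard sketch; the floor of 0 and the absence of warmup are assumptions, since the record lists no schedule parameters:

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```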
Regularization
  • layerwise LN scale
Other
  • Early QAT with threshold 0.5 and 0.9995-percentile clipping before GPTQ
  • 2% magnitude pruning (sparsity: 0.02)
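The percentile clipping and magnitude pruning listed above are both simple tensor-level operations. A sketch of the two (the per-tensor granularity and function names are assumptions; only the 0.9995 percentile and 2% sparsity come from the record):

```python
import numpy as np

def percentile_clip(w, pct=0.9995):
    """Clip weights to the given magnitude percentile, taming
    outliers before quantization."""
    lim = np.quantile(np.abs(w), pct)
    return np.clip(w, -lim, lim)

def magnitude_prune(w, sparsity=0.02):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    thresh = np.quantile(np.abs(w), sparsity)
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out
```

Pruning small weights to exact zeros also makes the artifact more compressible, which pairs with the zstd-22 step.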

Novel Contributions

  • GPTQ quantization with Hessian-aware error compensation and column reordering
  • Early QAT with threshold 0.5 and longer adaptation to quantization noise
  • EMA decay tuned to 0.997
  • Legal score-first TTT where each token is scored before any gradient update using it
  • Sliding-window evaluation with stride 32
  • 2% magnitude pruning and zstd-22 compression