PR #1341

open

Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation

by himanshudongre
val_bpb: 1.1000
Architecture: Transformer
Optimizer: SGD
Artifact Size:

Training Techniques

Quantization: GPTQ (bits: 6, scope: all)
Test-Time Training: LoRA TTT (rank: 8)
Optimizer: SGD (weight_decay: null, momentum: 0.9, epochs: 3, freeze_blocks: 2, grad_clip: 1)
LR Schedule: cosine decay (chunk_size: 32768)
Architecture: LoRA (rank: 8, targets: Q, V). Rank-8 adapters applied to Q and V projections during test-time training.
Sequence Length: train_length: 32768, eval_length: null
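The LoRA TTT setup listed above (rank-8 adapters, frozen base weights, SGD with momentum 0.9, 3 epochs) can be sketched as a toy regression in numpy. Everything concrete here is invented for illustration: the dimensions, the crudely quantized stand-in for a GPTQ weight, and the synthetic target are all assumptions, and this is not the PR's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 8                      # toy projection width; rank 8 as in the config

# Frozen base projection: a crude uniform grid stands in for a quantized
# weight matrix (hypothetical stand-in, not real GPTQ output).
W_q = np.round(rng.standard_normal((d, d)) * 4) / 4

# LoRA adapters: only A and B are trained at test time; W_q stays frozen.
A = rng.standard_normal((r, d)) * 0.1
B = np.zeros((d, r))              # zero init so the adapter starts as a no-op

def forward(x):
    return x @ (W_q + B @ A).T

# Synthetic test-time objective: recover capacity lost to quantization.
W_true = W_q + rng.standard_normal((d, d)) * 0.05
X = rng.standard_normal((64, d))
Y = X @ W_true.T

lr, momentum = 0.02, 0.9          # SGD with momentum, matching the card
vA, vB = np.zeros_like(A), np.zeros_like(B)
loss0 = ((forward(X) - Y) ** 2).mean()
for epoch in range(3):            # epochs: 3, as in other_params
    err = forward(X) - Y
    gW = 2 * err.T @ X / err.size       # grad of MSE w.r.t. the effective weight
    gA, gB = B.T @ gW, gW @ A.T         # chain rule through W_q + B @ A
    vA = momentum * vA - lr * gA
    vB = momentum * vB - lr * gB
    A += vA
    B += vB
final_loss = ((forward(X) - Y) ** 2).mean()
```

Note the design point the card implies: gradients flow only into the small `A` and `B` factors, so the quantized base matrix is never touched directly by SGD.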

Novel Contributions

  • Argues that GPTQ and test-time training are fundamentally incompatible.
  • Aggregates evidence across multiple PRs showing TTT helps on simple int6 quantization but not on GPTQ-quantized models.
  • Provides a root-cause explanation based on GPTQ's compensatory weight structure being disrupted by SGD updates.
  • Reports a rank-8 LoRA TTT experiment on GPTQ weights with negligible BPB improvement.
  • Proposes possible fixes such as quantization-aware TTT, structured TTT, and higher-rank LoRA.
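The "compensatory weight structure" in the root-cause bullet refers to how GPTQ quantizes weights sequentially, letting the still-unquantized coordinates absorb each rounding error in output space. A toy numpy sketch of that mechanism follows (single-row OBQ-style updates; the grid, sizes, and calibration data are invented for illustration, and this is not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 8

# Calibration activations with correlated features: GPTQ works against the
# Hessian H = X^T X, and correlation is what makes compensation pay off.
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
H = X.T @ X / n
Hinv = np.linalg.inv(H + 1e-4 * np.eye(d))

step = 0.5                                  # coarse grid standing in for low-bit
quant = lambda v: np.round(v / step) * step

def gptq_row(w):
    """Quantize one weight row left to right; after fixing each coordinate
    to the grid, the remaining free coordinates absorb its error."""
    w, Hi = w.copy(), Hinv.copy()
    for i in range(d - 1):
        q = quant(w[i])
        e = (w[i] - q) / Hi[i, i]
        w[i] = q
        w[i + 1:] -= e * Hi[i + 1:, i]      # OBQ/GPTQ compensation step
        # Drop coordinate i from the inverse Hessian (Gaussian elimination),
        # freezing the just-quantized weight for later steps.
        Hi[i + 1:, i + 1:] -= np.outer(Hi[i + 1:, i], Hi[i, i + 1:]) / Hi[i, i]
    w[-1] = quant(w[-1])
    return w

W = rng.standard_normal((16, d))            # 16 toy output neurons
W_rtn = quant(W)                            # round-to-nearest baseline
W_gptq = np.vstack([gptq_row(w) for w in W])

out_err = lambda Wq: np.mean((X @ (W - Wq).T) ** 2)
rtn_err, gptq_err = out_err(W_rtn), out_err(W_gptq)
# With correlated inputs, gptq_err lands below rtn_err even though individual
# GPTQ weights may drift farther from the originals: the low output error
# relies on the quantized coordinates cancelling each other's error jointly.
```

This is what makes the PR's argument concrete: round-to-nearest weights carry independent per-weight errors, whereas GPTQ's accuracy depends on joint cancellation across coordinates, and the PR's claim is that independent SGD updates during TTT perturb exactly that cancellation.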