PR #1341
Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation
by himanshudongre
val_bpb: 1.1000
Architecture: Transformer
Optimizer: SGD
Artifact Size: —
Training Techniques
- Quantization: GPTQ (bits: 6, scope: all)
- Test-Time Training: LoRA TTT (parameters: {"rank":8})
- Optimizer: SGD (weight_decay: null, momentum: 0.9, other_params: {"epochs":3,"freeze_blocks":2,"grad_clip":1})
- LR Schedule: cosine decay (parameters: {"chunk_size":32768})
- Architecture: LoRA, rank-8 adapters applied to Q and V projections during test-time training (parameters: {"rank":8,"targets":["Q","V"]})
- Sequence Length: train_length: 32768, eval_length: null
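The configuration above (rank-8 LoRA adapters on a frozen quantized base, trained with momentum-0.9 SGD) can be sketched in a few lines. This is a minimal toy illustration, not the PR's code: the layer size, batch, learning rate, step count, and the coarse rounding grid standing in for GPTQ int6 are all assumptions; only the rank, the frozen quantized base, and the momentum value come from the record.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 8          # toy hidden size (assumption); rank 8 from the record

# Frozen base weight, coarsely rounded as a stand-in for a GPTQ int6 layer.
W = rng.standard_normal((d, d))
step = np.abs(W).max() / 31                  # 6-bit signed grid (illustrative)
W_q = np.round(W / step) * step              # frozen: never updated below

# The LoRA factors are the only trainable parameters; B starts at zero so
# the adapted weight equals W_q before the first test-time step.
A = rng.standard_normal((r, d)) / np.sqrt(d)
B = np.zeros((d, r))
vA, vB = np.zeros_like(A), np.zeros_like(B)
lr, momentum = 1e-2, 0.9                     # momentum 0.9 from the record

x = rng.standard_normal((32, d))             # one "test-time" chunk
y = x @ W.T                                  # target: full-precision outputs
loss0 = float(np.mean((x @ W_q.T - y) ** 2)) # output MSE before adaptation

for _ in range(50):
    err = x @ (W_q + B @ A).T - y            # adapted weight is W_q + B @ A
    gW = err.T @ x / len(x)                  # dL/d(adapted weight)
    gA, gB = B.T @ gW, gW @ A.T              # chain rule through B @ A
    vA = momentum * vA - lr * gA             # SGD with momentum, adapters only
    vB = momentum * vB - lr * gB
    A, B = A + vA, B + vB

loss1 = float(np.mean((x @ (W_q + B @ A).T - y) ** 2))
print(loss0, loss1)
```

The base weights stay untouched throughout; only the low-rank factors move, which is what limits how much of the quantization error a rank-8 update can repair.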
Novel Contributions
- Argues that GPTQ and test-time training are fundamentally incompatible.
- Aggregates evidence across multiple PRs showing TTT helps on simple int6 quantization but not on GPTQ-quantized models.
- Provides a root-cause explanation based on GPTQ's compensatory weight structure being disrupted by SGD updates.
- Reports a rank-8 LoRA TTT experiment on GPTQ weights with negligible BPB improvement.
- Proposes possible fixes such as quantization-aware TTT, structured TTT, and higher-rank LoRA.
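The root-cause claim above — that SGD updates disrupt GPTQ's compensatory weight structure — can be made concrete with a minimal OBQ/GPTQ-style quantizer. Everything here is a toy sketch: the correlated calibration data, the grid, and the damping are assumptions; only the per-row sequential quantization with inverse-Hessian error compensation follows the published GPTQ update rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 32
# Correlated calibration activations: error compensation only pays off when
# input features are correlated (toy assumption, not the PR's data).
X = rng.standard_normal((n, 8)) @ rng.standard_normal((8, d))
X += 0.1 * rng.standard_normal((n, d))
w = rng.standard_normal(d)                    # one output row of a layer

def grid(v):
    return np.round(v / 0.25) * 0.25          # coarse uniform grid (illustrative)

H = X.T @ X + 0.01 * np.trace(X.T @ X) / d * np.eye(d)  # damped Hessian

def gptq_row(w, Hinv):
    """Quantize one row sequentially, pushing each weight's rounding error
    onto the not-yet-quantized weights via the inverse Hessian (GPTQ rule)."""
    w, Hinv = w.copy(), Hinv.copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = grid(w[j])
        e = (w[j] - q[j]) / Hinv[j, j]
        w -= e * Hinv[j]                      # later weights absorb the error
        Hinv -= np.outer(Hinv[:, j], Hinv[j]) / Hinv[j, j]
    return q

q_gptq = gptq_row(w, np.linalg.inv(H))
q_round = grid(w)                             # naive nearest rounding

def out_err(q):
    return float(np.mean((X @ (w - q)) ** 2)) # layer-output MSE vs dense w

# Compensation makes the *outputs* accurate even though individual weights
# sit far from their dense values; a gradient step that moves any single
# weight independently (as TTT's SGD does) breaks exactly this coupling,
# whereas naively rounded weights have no such cross-weight dependence.
print(out_err(q_gptq), out_err(q_round))
```

The contrast with simple rounding is the point: under nearest rounding each weight's error is its own, so a gradient step toward the dense solution can only help, which is consistent with the PR's observation that TTT improves simple int6 models but not GPTQ ones.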