PR #1341

open

Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation

by himanshudongre
val_bpb: 1.1000
Architecture: Transformer
Optimizer: SGD
Artifact Size:

Training Techniques

Quantization: GPTQ (bits: 6, scope: all)
Test-Time Training: LoRA TTT (rank: 8)
Optimizer: SGD (weight_decay: null, momentum: 0.9, epochs: 3, freeze_blocks: 2, grad_clip: 1)
LR Schedule: cosine decay (chunk_size: 32768)
Architecture: LoRA (rank: 8, targets: Q, V). Rank-8 adapters applied to Q and V projections during test-time training.
Sequence Length: train_length: 32768, eval_length: null
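The LoRA TTT setup listed above (rank-8 adapters, frozen base weights, SGD with momentum 0.9, 3 epochs) can be sketched as a toy regression in numpy. Everything concrete here is invented for illustration: the dimensions, the crudely quantized stand-in for a GPTQ weight, and the synthetic target are all assumptions, and this is not the PR's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 8                      # toy projection width; rank 8 as in the config

# Frozen base projection: a crude uniform grid stands in for a quantized
# weight matrix (hypothetical stand-in, not real GPTQ output).
W_q = np.round(rng.standard_normal((d, d)) * 4) / 4

# LoRA adapters: only A and B are trained at test time; W_q stays frozen.
A = rng.standard_normal((r, d)) * 0.1
B = np.zeros((d, r))              # zero init so the adapter starts as a no-op

def forward(x):
    return x @ (W_q + B @ A).T

# Synthetic test-time objective: recover capacity lost to quantization.
W_true = W_q + rng.standard_normal((d, d)) * 0.05
X = rng.standard_normal((64, d))
Y = X @ W_true.T

lr, momentum = 0.02, 0.9          # SGD with momentum, matching the card
vA, vB = np.zeros_like(A), np.zeros_like(B)
loss0 = ((forward(X) - Y) ** 2).mean()
for epoch in range(3):            # epochs: 3, as in other_params
    err = forward(X) - Y
    gW = 2 * err.T @ X / err.size       # grad of MSE w.r.t. the effective weight
    gA, gB = B.T @ gW, gW @ A.T         # chain rule through W_q + B @ A
    vA = momentum * vA - lr * gA
    vB = momentum * vB - lr * gB
    A += vA
    B += vB
final_loss = ((forward(X) - Y) ** 2).mean()
```

Note the design point the card implies: gradients flow only into the small `A` and `B` factors, so the quantized base matrix is never touched directly by SGD.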

Novel Contributions

  • Argues that GPTQ and test-time training are fundamentally incompatible.
  • Aggregates evidence across multiple PRs showing TTT helps on simple int6 quantization but not on GPTQ-quantized models.
  • Provides a root-cause explanation based on GPTQ's compensatory weight structure being disrupted by SGD updates.
  • Reports a rank-8 LoRA TTT experiment on GPTQ weights with negligible BPB improvement.
  • Proposes possible fixes such as quantization-aware TTT, structured TTT, and higher-rank LoRA.
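The "compensatory weight structure" in the root-cause bullet refers to how GPTQ quantizes weights sequentially, letting the still-unquantized coordinates absorb each rounding error in output space. A toy numpy sketch of that mechanism follows (single-row OBQ-style updates; the grid, sizes, and calibration data are invented for illustration, and this is not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 8

# Calibration activations with correlated features: GPTQ works against the
# Hessian H = X^T X, and correlation is what makes compensation pay off.
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
H = X.T @ X / n
Hinv = np.linalg.inv(H + 1e-4 * np.eye(d))

step = 0.5                                  # coarse grid standing in for low-bit
quant = lambda v: np.round(v / step) * step

def gptq_row(w):
    """Quantize one weight row left to right; after fixing each coordinate
    to the grid, the remaining free coordinates absorb its error."""
    w, Hi = w.copy(), Hinv.copy()
    for i in range(d - 1):
        q = quant(w[i])
        e = (w[i] - q) / Hi[i, i]
        w[i] = q
        w[i + 1:] -= e * Hi[i + 1:, i]      # OBQ/GPTQ compensation step
        # Drop coordinate i from the inverse Hessian (Gaussian elimination),
        # freezing the just-quantized weight for later steps.
        Hi[i + 1:, i + 1:] -= np.outer(Hi[i + 1:, i], Hi[i, i + 1:]) / Hi[i, i]
    w[-1] = quant(w[-1])
    return w

W = rng.standard_normal((16, d))            # 16 toy output neurons
W_rtn = quant(W)                            # round-to-nearest baseline
W_gptq = np.vstack([gptq_row(w) for w in W])

out_err = lambda Wq: np.mean((X @ (W - Wq).T) ** 2)
rtn_err, gptq_err = out_err(W_rtn), out_err(W_gptq)
# With correlated inputs, gptq_err lands below rtn_err even though individual
# GPTQ weights may drift farther from the originals: the low output error
# relies on the quantized coordinates cancelling each other's error jointly.
```

This is what makes the PR's argument concrete: round-to-nearest weights carry independent per-weight errors, whereas GPTQ's accuracy depends on joint cancellation across coordinates, and the PR's claim is that independent SGD updates during TTT perturb exactly that cancellation.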