PR #756

open

Non-record: Negative results — quantization algorithms & TTT on val-GPTQ stack

by abaybektursun on GitHub
val_bpb
1.1142

Training Techniques

Quantization
GPTQ
GPTQ weight quantization used in the stack.
parameters: {"bits":6,"scope":"all"}
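GPTQ quantizes weights onto a low-bit grid and corrects rounding error using second-order statistics from calibration data. As a minimal point of reference (plain round-to-nearest, not the GPTQ algorithm itself), symmetric 6-bit fake quantization looks like:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Symmetric per-tensor round-to-nearest fake quantization:
    map weights onto a signed grid of 2**bits levels, then
    dequantize back to float. GPTQ improves on this baseline by
    correcting rounding error column-by-column with inverse-Hessian
    information from calibration data."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
wq, scale = fake_quantize(w, bits=6)
```

The maximum per-weight error of round-to-nearest is half a quantization step; GPTQ's gain comes from trading some of that per-weight error for lower layer-output error.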
Architecture
XSA-all
Architecture modification used in the stack; all XSA components enabled.
parameters: null
BigramHash
BigramHash 3072 component used in the stack.
parameters: {"size":3072}
Evaluation
sliding window eval
parameters: {"stride":64}
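The stride-64 sliding-window evaluation computes logits over overlapping windows but counts each token toward the loss exactly once, so scored tokens keep long left context. A sketch of the span bookkeeping, mirroring the common strided-perplexity recipe (the window size here is a hypothetical placeholder, not a value from this PR):

```python
def sliding_windows(seq_len, window=512, stride=64):
    """Enumerate (begin, end, n_scored): the model processes tokens
    [begin, end) but only the last n_scored tokens count toward the
    loss. Every scored token after the first window therefore has at
    least window - stride tokens of left context, and each token is
    scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.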
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
MLP-down-only TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
MLP-all TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
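All four TTT variants share the same loop shape: walk the evaluation stream in chunks, score, and adapt. "Score-first" scores each chunk with the current weights before training on it, so no chunk is evaluated by a model that has already seen it. A toy sketch on a linear least-squares model (a stand-in for the LM; the variants differ only in which parameter subset is updated: all weights, MLP down-projections only, or all MLP weights):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=2e-3, epochs=3):
    """Score-first test-time training on a toy linear model y = x @ w.
    Each chunk is scored with the *current* weights, then the model
    adapts to it with a few gradient steps (lr and epochs match the
    PR's TTT parameters; the model itself is illustrative)."""
    losses = []
    for x, y in chunks:
        losses.append(float(np.mean((x @ w - y) ** 2)))   # score first
        for _ in range(epochs):                           # then adapt
            grad = 2 * x.T @ (x @ w - y) / len(x)
            w = w - lr * grad
    return losses, w

rng = np.random.default_rng(0)
w_true = np.ones(4)
chunks = [(x, x @ w_true)
          for x in (rng.normal(size=(64, 4)) for _ in range(20))]
losses, w_adapted = score_first_ttt(np.zeros(4), chunks)
```

On this toy stationary stream the per-chunk loss falls as the weights adapt; the PR's negative result is that the same loop does not help the quantized LM stack.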
Other
Qronos
Qronos iterative Hessian quantization refinement with 3 iterations.
parameters: {"iterations":3}
CDQuant
CDQuant coordinate-descent rounding refinement with 3 passes.
parameters: {"passes":3}
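CDQuant-style refinement keeps the same objective as GPTQ (layer-output error on calibration inputs) but revisits rounding decisions by coordinate descent. A toy single-column sketch of the idea, not the paper's exact algorithm: start from round-to-nearest, then sweep coordinates, letting each pick floor or ceil and keeping whichever lowers the calibration-output error:

```python
import numpy as np

def cd_round(w, x, scale, passes=3):
    """Coordinate-descent rounding in the spirit of CDQuant: start
    from round-to-nearest, then for `passes` sweeps let each weight
    choose floor(w_i/scale) or ceil(w_i/scale), keeping the choice
    that lowers the layer-output error ||x @ w - x @ wq||^2 on
    calibration inputs x. Toy per-column version for illustration."""
    q = np.round(w / scale)
    target = x @ w
    for _ in range(passes):
        for i in range(len(w)):
            best, best_err = q[i], np.sum((target - x @ (q * scale)) ** 2)
            for cand in (np.floor(w[i] / scale), np.ceil(w[i] / scale)):
                q[i] = cand
                err = np.sum((target - x @ (q * scale)) ** 2)
                if err < best_err:
                    best, best_err = cand, err
            q[i] = best
    return q * scale
```

Because the sweep starts from the round-to-nearest solution and only accepts strict improvements, the refined rounding can never have higher calibration-output error than nearest rounding; the PR's finding is that on this stack the refinement nonetheless did not beat baseline GPTQ on val_bpb.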

Novel Contributions

  • Benchmarked Qronos iterative Hessian refinement on the val-calibrated GPTQ int6 stack and found it worse than baseline GPTQ.
  • Benchmarked CDQuant coordinate descent rounding refinement on the same stack and found it worse than baseline GPTQ.
  • Evaluated score-first test-time training on the val-GPTQ stack with full, MLP-down-only, and MLP-all variants, finding no improvement.
  • Reported that GPTQ at int6 is near-optimal on this stack, with only a small remaining quantization gap.
  • Documented 25 failed TTT attempts in total across two stacks, concluding that TTT is ineffective in this setting.