PR #756

open

Non-record: Negative results — quantization algorithms & TTT on val-GPTQ stack

by abaybektursun on GitHub
val_bpb
1.1142

Training Techniques

Quantization
GPTQ
GPTQ weight quantization used in the stack.
parameters: {"bits":6,"scope":"all"}
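GPTQ quantizes weights onto a low-bit grid and corrects rounding error using second-order statistics from calibration data. As a minimal point of reference (plain round-to-nearest, not the GPTQ algorithm itself), symmetric 6-bit fake quantization looks like:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Symmetric per-tensor round-to-nearest fake quantization:
    map weights onto a signed grid of 2**bits levels, then
    dequantize back to float. GPTQ improves on this baseline by
    correcting rounding error column-by-column with inverse-Hessian
    information from calibration data."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
wq, scale = fake_quantize(w, bits=6)
```

The maximum per-weight error of round-to-nearest is half a quantization step; GPTQ's gain comes from trading some of that per-weight error for lower layer-output error.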
Architecture
XSA-all
Architecture modification used in the stack; all XSA components enabled.
parameters: null
BigramHash
BigramHash 3072 component used in the stack.
parameters: {"size":3072}
Evaluation
sliding window eval
parameters: {"stride":64}
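The stride-64 sliding-window evaluation computes logits over overlapping windows but counts each token toward the loss exactly once, so scored tokens keep long left context. A sketch of the span bookkeeping, mirroring the common strided-perplexity recipe (the window size here is a hypothetical placeholder, not a value from this PR):

```python
def sliding_windows(seq_len, window=512, stride=64):
    """Enumerate (begin, end, n_scored): the model processes tokens
    [begin, end) but only the last n_scored tokens count toward the
    loss. Every scored token after the first window therefore has at
    least window - stride tokens of left context, and each token is
    scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.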
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
MLP-down-only TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
MLP-all TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
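All four TTT variants share the same loop shape: walk the evaluation stream in chunks, score, and adapt. "Score-first" scores each chunk with the current weights before training on it, so no chunk is evaluated by a model that has already seen it. A toy sketch on a linear least-squares model (a stand-in for the LM; the variants differ only in which parameter subset is updated: all weights, MLP down-projections only, or all MLP weights):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=2e-3, epochs=3):
    """Score-first test-time training on a toy linear model y = x @ w.
    Each chunk is scored with the *current* weights, then the model
    adapts to it with a few gradient steps (lr and epochs match the
    PR's TTT parameters; the model itself is illustrative)."""
    losses = []
    for x, y in chunks:
        losses.append(float(np.mean((x @ w - y) ** 2)))   # score first
        for _ in range(epochs):                           # then adapt
            grad = 2 * x.T @ (x @ w - y) / len(x)
            w = w - lr * grad
    return losses, w

rng = np.random.default_rng(0)
w_true = np.ones(4)
chunks = [(x, x @ w_true)
          for x in (rng.normal(size=(64, 4)) for _ in range(20))]
losses, w_adapted = score_first_ttt(np.zeros(4), chunks)
```

On this toy stationary stream the per-chunk loss falls as the weights adapt; the PR's negative result is that the same loop does not help the quantized LM stack.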
Other
Qronos
Qronos iterative Hessian quantization refinement with 3 iterations.
parameters: {"iterations":3}
CDQuant
CDQuant coordinate-descent rounding refinement with 3 passes.
parameters: {"passes":3}
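CDQuant-style refinement keeps the same objective as GPTQ (layer-output error on calibration inputs) but revisits rounding decisions by coordinate descent. A toy single-column sketch of the idea, not the paper's exact algorithm: start from round-to-nearest, then sweep coordinates, letting each pick floor or ceil and keeping whichever lowers the calibration-output error:

```python
import numpy as np

def cd_round(w, x, scale, passes=3):
    """Coordinate-descent rounding in the spirit of CDQuant: start
    from round-to-nearest, then for `passes` sweeps let each weight
    choose floor(w_i/scale) or ceil(w_i/scale), keeping the choice
    that lowers the layer-output error ||x @ w - x @ wq||^2 on
    calibration inputs x. Toy per-column version for illustration."""
    q = np.round(w / scale)
    target = x @ w
    for _ in range(passes):
        for i in range(len(w)):
            best, best_err = q[i], np.sum((target - x @ (q * scale)) ** 2)
            for cand in (np.floor(w[i] / scale), np.ceil(w[i] / scale)):
                q[i] = cand
                err = np.sum((target - x @ (q * scale)) ** 2)
                if err < best_err:
                    best, best_err = cand, err
            q[i] = best
    return q * scale
```

Because the sweep starts from the round-to-nearest solution and only accepts strict improvements, the refined rounding can never have higher calibration-output error than nearest rounding; the PR's finding is that on this stack the refinement nonetheless did not beat baseline GPTQ on val_bpb.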

Novel Contributions

  • Benchmarked Qronos iterative Hessian refinement on the val-calibrated GPTQ int6 stack and found it worse than baseline GPTQ.
  • Benchmarked CDQuant coordinate descent rounding refinement on the same stack and found it worse than baseline GPTQ.
  • Evaluated score-first test-time training on the val-GPTQ stack with full, MLP-down-only, and MLP-all variants, finding no improvement.
  • Reported that GPTQ at int6 is near-optimal on this stack, with only a small remaining quantization gap.
  • Documented 25 failed TTT attempts in total across two stacks, concluding that TTT is ineffective in this setting.