PR #639
Full GPTQ + XSA-all + SWA/EMA (val_bpb=1.1158, 3-seed mean=1.1163)
by Robby955
val_bpb: 1.1158
Architecture: 11L GEPA Transformer
Optimizer: —
Artifact Size: 15.92 MB
Training Techniques
- Quantization: GPTQ (bits=6; scope=all 11 layers)
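The 6-bit, all-layer GPTQ setting can be sketched as below: a minimal numpy reimplementation of the published GPTQ recipe (Cholesky factor of the inverse Hessian, act-order column permutation, error compensation on remaining columns; the PR's block-wise compensation is simplified to column-wise here). The function names and the symmetric per-row quantizer are assumptions, not the PR's code.

```python
import numpy as np

def quantize_sym(w, scale):
    """Symmetric uniform round-to-nearest at a fixed per-row scale."""
    return np.round(w / scale) * scale

def gptq_quantize(W, X, bits=6, damp=0.01):
    """Sketch of Cholesky-based GPTQ with act-order permutation.
    W: (rows, cols) weights; X: (samples, cols) calibration inputs.
    Illustrative only, not the PR's implementation."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scale

    H = X.T @ X                                           # Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(cols)        # damping
    perm = np.argsort(-np.diag(H))                        # act-order: large diag first
    W, H = W[:, perm], H[np.ix_(perm, perm)]

    Hc = np.linalg.cholesky(np.linalg.inv(H)).T           # upper Cholesky of H^-1
    Q = np.zeros_like(W)
    for i in range(cols):
        q = quantize_sym(W[:, i], scale[:, 0])
        Q[:, i] = q
        err = (W[:, i] - q) / Hc[i, i]
        W[:, i:] -= np.outer(err, Hc[i, i:])              # compensate later columns

    return Q[:, np.argsort(perm)]                         # undo act-order
```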
- Architecture: XSA, cross-layer self-attention applied on all 11 layers (layers=11)
- Weight Averaging: SWA+EMA blend (blend_ratio=50/50, snapshots=16)
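A minimal sketch of the 50/50 SWA+EMA blend, assuming a plain dict of parameters stands in for the model state_dict; the helper names and the EMA decay value are assumptions, not the PR's code.

```python
def ema_update(ema, weights, decay=0.999):
    """One exponential-moving-average step over a dict of parameters."""
    for k, w in weights.items():
        ema[k] = decay * ema[k] + (1 - decay) * w

def blend_swa_ema(snapshots, ema, blend_ratio=0.5):
    """Average the SWA snapshots uniformly (16 in this PR), then blend
    50/50 with the EMA weights."""
    swa = {k: sum(s[k] for s in snapshots) / len(snapshots)
           for k in snapshots[0]}
    return {k: blend_ratio * swa[k] + (1 - blend_ratio) * ema[k] for k in swa}
```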
- Compression: LZMA (level=9)
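The artifact-compression step amounts to standard-library LZMA at its maximum preset; a sketch (the function name is an assumption):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized checkpoint at LZMA preset 9 (maximum),
    matching the technique listed above."""
    return lzma.compress(raw, preset=9)
```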
- Evaluation: sliding window eval
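A sketch of sliding-window evaluation, assuming an `nll_fn` that returns per-token negative log-likelihoods in nats for a chunk (the interface, window, and stride are assumptions; bpb = total NLL / (ln 2 * byte count)):

```python
import math

def sliding_window_bpb(nll_fn, tokens, n_bytes, window=1024, stride=512):
    """Score each token with up to `window` tokens of left context,
    advancing by `stride` (stride <= window) and counting only the
    positions not already scored by a previous window."""
    total_nll = 0.0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        nlls = nll_fn(tokens[begin:end])      # per-token NLL in nats
        new = end - prev_end                  # tokens not yet scored
        total_nll += sum(nlls[-new:])
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / (math.log(2) * n_bytes)
```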
- Test-Time Training: full TTT (optimizers tested: AdamW, SGD; learning rates: 0.0005, 0.002, 0.001; epochs: 3, 5, 10; effect: neutral-to-harmful on GPTQ weights)
- LR Schedule: warmdown (warmdown_steps=4000)
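A sketch of the warmdown schedule, assuming linear decay to zero over the final 4,000 steps; the PR lists only warmdown_steps, so the decay shape is an assumption.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Hold base_lr, then decay linearly to zero over the last
    `warmdown_steps` steps (4000 in this PR)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```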
Novel Contributions
- Full GPTQ quantization halves the quantization gap from 0.008 to 0.004 BPB using Cholesky-based GPTQ with act-order column permutation and block-wise error compensation.
- The AdamW optimizer catastrophically harms GPTQ-calibrated weights during test-time training; across every optimizer and learning rate tested, TTT is at best neutral and at worst harmful.
- The GPTQ damping factor has negligible impact on performance, indicating the Cholesky solve is robust.
- Applying XSA on all 11 layers (instead of last 4) improves training quality and sliding window BPB by 0.0013.
- EB-TTT with Born-rule scaling (SNR²) is a novel per-layer TTT gradient scaling inspired by quantum probability amplitudes but provides no measurable BPB improvement on GPTQ-quantized models.
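One plausible reading of the EB-TTT Born-rule scaling is sketched below: each layer's TTT gradient is multiplied by its squared signal-to-noise ratio, by analogy with probability = amplitude^2. The SNR definition (|mean| / std of the gradient) and the helper name are assumptions; per the PR, this yields no measurable BPB improvement on GPTQ-quantized models.

```python
import numpy as np

def born_rule_scale(grads, eps=1e-8):
    """Hypothetical per-layer SNR^2 gradient scaling for TTT: rescale each
    layer's gradient by its squared signal-to-noise ratio."""
    scaled = {}
    for name, g in grads.items():
        snr = abs(g.mean()) / (g.std() + eps)
        scaled[name] = (snr ** 2) * g
    return scaled
```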