PR #639
Full GPTQ + XSA-all + SWA/EMA (val_bpb=1.1158, 3-seed mean=1.1163)
by Robby955
val_bpb: 1.1158
Architecture: 11L GEPA Transformer
Optimizer: —
Artifact Size: 15.92 MB
Training Techniques
- Quantization: GPTQ (bits=6; scope=all 11 layers)
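The 6-bit, all-layer GPTQ setting can be sketched as below: a minimal numpy reimplementation of the published GPTQ recipe (Cholesky factor of the inverse Hessian, act-order column permutation, error compensation on remaining columns; the PR's block-wise compensation is simplified to column-wise here). The function names and the symmetric per-row quantizer are assumptions, not the PR's code.

```python
import numpy as np

def quantize_sym(w, scale):
    """Symmetric uniform round-to-nearest at a fixed per-row scale."""
    return np.round(w / scale) * scale

def gptq_quantize(W, X, bits=6, damp=0.01):
    """Sketch of Cholesky-based GPTQ with act-order permutation.
    W: (rows, cols) weights; X: (samples, cols) calibration inputs.
    Illustrative only, not the PR's implementation."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scale

    H = X.T @ X                                           # Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(cols)        # damping
    perm = np.argsort(-np.diag(H))                        # act-order: large diag first
    W, H = W[:, perm], H[np.ix_(perm, perm)]

    Hc = np.linalg.cholesky(np.linalg.inv(H)).T           # upper Cholesky of H^-1
    Q = np.zeros_like(W)
    for i in range(cols):
        q = quantize_sym(W[:, i], scale[:, 0])
        Q[:, i] = q
        err = (W[:, i] - q) / Hc[i, i]
        W[:, i:] -= np.outer(err, Hc[i, i:])              # compensate later columns

    return Q[:, np.argsort(perm)]                         # undo act-order
```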
- Architecture: XSA, cross-layer self-attention applied on all 11 layers (layers=11)
- Weight Averaging: SWA+EMA blend (blend_ratio=50/50, snapshots=16)
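A minimal sketch of the 50/50 SWA+EMA blend, assuming a plain dict of parameters stands in for the model state_dict; the helper names and the EMA decay value are assumptions, not the PR's code.

```python
def ema_update(ema, weights, decay=0.999):
    """One exponential-moving-average step over a dict of parameters."""
    for k, w in weights.items():
        ema[k] = decay * ema[k] + (1 - decay) * w

def blend_swa_ema(snapshots, ema, blend_ratio=0.5):
    """Average the SWA snapshots uniformly (16 in this PR), then blend
    50/50 with the EMA weights."""
    swa = {k: sum(s[k] for s in snapshots) / len(snapshots)
           for k in snapshots[0]}
    return {k: blend_ratio * swa[k] + (1 - blend_ratio) * ema[k] for k in swa}
```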
- Compression: LZMA (level=9)
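The artifact-compression step amounts to standard-library LZMA at its maximum preset; a sketch (the function name is an assumption):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized checkpoint at LZMA preset 9 (maximum),
    matching the technique listed above."""
    return lzma.compress(raw, preset=9)
```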
- Evaluation: sliding window eval
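A sketch of sliding-window evaluation, assuming an `nll_fn` that returns per-token negative log-likelihoods in nats for a chunk (the interface, window, and stride are assumptions; bpb = total NLL / (ln 2 * byte count)):

```python
import math

def sliding_window_bpb(nll_fn, tokens, n_bytes, window=1024, stride=512):
    """Score each token with up to `window` tokens of left context,
    advancing by `stride` (stride <= window) and counting only the
    positions not already scored by a previous window."""
    total_nll = 0.0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        nlls = nll_fn(tokens[begin:end])      # per-token NLL in nats
        new = end - prev_end                  # tokens not yet scored
        total_nll += sum(nlls[-new:])
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / (math.log(2) * n_bytes)
```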
- Test-Time Training: full TTT (optimizers tested: AdamW, SGD; learning rates: 0.0005, 0.002, 0.001; epochs: 3, 5, 10; effect: neutral-to-harmful on GPTQ weights)
- LR Schedule: warmdown (warmdown_steps=4000)
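A sketch of the warmdown schedule, assuming linear decay to zero over the final 4,000 steps; the PR lists only warmdown_steps, so the decay shape is an assumption.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Hold base_lr, then decay linearly to zero over the last
    `warmdown_steps` steps (4000 in this PR)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```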
Novel Contributions
- Full GPTQ quantization halves the quantization gap from 0.008 to 0.004 BPB using Cholesky-based GPTQ with act-order column permutation and block-wise error compensation.
- The AdamW optimizer catastrophically harms GPTQ-calibrated weights during test-time training; across every optimizer and learning rate tested, TTT is at best neutral and at worst harmful.
- The GPTQ damping factor has negligible impact on performance, indicating the Cholesky solve is robust.
- Applying XSA on all 11 layers (instead of last 4) improves training quality and sliding window BPB by 0.0013.
- EB-TTT with Born-rule scaling (SNR²) is a novel per-layer TTT gradient scaling inspired by quantum probability amplitudes but provides no measurable BPB improvement on GPTQ-quantized models.
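One plausible reading of the EB-TTT Born-rule scaling is sketched below: each layer's TTT gradient is multiplied by its squared signal-to-noise ratio, by analogy with probability = amplitude^2. The SNR definition (|mean| / std of the gradient) and the helper name are assumptions; per the PR, this yields no measurable BPB improvement on GPTQ-quantized models.

```python
import numpy as np

def born_rule_scale(grads, eps=1e-8):
    """Hypothetical per-layer SNR^2 gradient scaling for TTT: rescale each
    layer's gradient by its squared signal-to-noise ratio."""
    scaled = {}
    for name, g in grads.items():
        snr = abs(g.mean()) / (g.std() + eps)
        scaled[name] = (snr ** 2) * g
    return scaled
```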