PR #690 (closed)
Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)
by EthanYangTW
val_bpb
1.1186
Architecture
Transformer
Optimizer
—
Artifact Size
15,947,742 bytes
Training Techniques
Quantization
QAT + GPTQ
bits: 6
scope: all weights
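The PR does not show its QAT kernel here; as a minimal sketch of what 6-bit quantization-aware training over all weights typically looks like, the following applies symmetric per-tensor fake quantization (quantize, then dequantize back to float for the forward pass). The per-tensor max-abs scale is an assumption, not necessarily the PR's scheme.

```python
import numpy as np

def fake_quant_6bit(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization to 6 bits (illustrative sketch).

    Quantizes to signed 6-bit integer levels and immediately dequantizes,
    so the forward pass sees the quantization error during training.
    """
    qmax = 2 ** (6 - 1) - 1            # 31: largest positive signed 6-bit level
    scale = np.abs(w).max() / qmax     # per-tensor max-abs scale (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                   # dequantize back to float
```

In a real QAT loop the round would be paired with a straight-through estimator so gradients flow past the non-differentiable rounding; GPTQ would then refine the final integer weights post-hoc against calibration activations.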
Architecture
XSA
XSA applied to all 11 layers
parameters: {"layers":11}
BigramHash
BigramHash feature/module with size 3072
parameters: {"dimensions":3072}
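The record gives only the table size (3072); one plausible reading of a BigramHash feature is to hash each (previous token, current token) pair into a fixed-size bucket table whose entries index learned embeddings. The mixing constant below is illustrative, not the PR's actual hash.

```python
TABLE_SIZE = 3072  # matches the record's "dimensions" parameter; its exact role is an assumption

def bigram_hash_ids(tokens: list[int], table_size: int = TABLE_SIZE) -> list[int]:
    """Map each (prev, cur) token bigram to a bucket in a hashed feature table.

    The multiplier/XOR mixing scheme here is illustrative only.
    """
    ids = []
    prev = 0  # assumed BOS placeholder for the first bigram
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % table_size)
        prev = t
    return ids
```

Each returned id would index a row of a learned (3072, d_model) embedding table added to the token stream.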
Partial RoPE
Partial rotary positional embeddings
parameters: {"train_length":16,"eval_length":64}
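"Partial RoPE" is usually read as applying rotary embeddings to only a fraction of each head's dimensions and passing the rest through unrotated; the train_length/eval_length parameters suggest positions are trained short (16) and evaluated longer (64). The rotated fraction and frequency base below are assumptions.

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: np.ndarray, rot_frac: float = 0.5,
                 base: float = 10000.0) -> np.ndarray:
    """Rotate the first rot_frac of the head dims; leave the rest untouched.

    x: (seq, head_dim), pos: (seq,). rot_frac and base are assumptions.
    """
    d = x.shape[-1]
    d_rot = int(d * rot_frac) // 2 * 2           # even count of rotated dims
    x_rot, x_pass = x[:, :d_rot], x[:, d_rot:]
    half = d_rot // 2
    freqs = base ** (-np.arange(half) / half)    # (half,) inverse frequencies
    ang = pos[:, None] * freqs[None, :]          # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Because rotation is norm-preserving per dimension pair, the unrotated tail carries position-free content while the rotated head encodes relative offsets.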
MLP3x
Three-layer MLP using LeakyReLU activations
parameters: {"layers":3}
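A three-layer MLP block with LeakyReLU between the linear maps can be sketched as below; the hidden widths, negative slope, and the absence of an activation on the output projection are assumptions.

```python
import numpy as np

def leaky_relu(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    """LeakyReLU: identity for positives, small linear slope for negatives."""
    return np.where(x > 0, x, slope * x)

def mlp3x(x: np.ndarray, ws: list[np.ndarray]) -> np.ndarray:
    """Three linear layers with LeakyReLU between them (widths assumed)."""
    h = leaky_relu(x @ ws[0])
    h = leaky_relu(h @ ws[1])
    return h @ ws[2]  # no activation on the output projection (assumption)
```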
Weight Averaging
SWA + EMA
parameters: {"blend_ratio":"50/50","ema_decay":0.997,"swa_interval_steps":50}
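The weight-averaging recipe is fully parameterized in the record (EMA decay 0.997, SWA snapshot every 50 steps, 50/50 final blend); a minimal sketch on scalar "weights", assuming the SWA average is a plain running mean of the snapshots:

```python
def train_averages(weights_per_step: list[float],
                   ema_decay: float = 0.997, swa_interval: int = 50) -> float:
    """Maintain an EMA every step and an SWA running mean every swa_interval
    steps, then return their 50/50 blend (scalars stand in for tensors)."""
    ema = weights_per_step[0]
    swa_sum, swa_n = 0.0, 0
    for step, w in enumerate(weights_per_step):
        ema = ema_decay * ema + (1 - ema_decay) * w   # EMA shadow update
        if step % swa_interval == 0:                  # SWA snapshot
            swa_sum += w
            swa_n += 1
    swa = swa_sum / swa_n
    return 0.5 * swa + 0.5 * ema                      # the record's 50/50 blend
```

In practice each parameter tensor gets this treatment independently, and BatchNorm-free transformers need no post-averaging statistics fix-up.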
Evaluation
sliding window eval
parameters: {"stride":64}
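Sliding-window evaluation with stride 64 typically means each forward pass sees up to a full context window, but only the last 64 tokens of each window are scored, so every token is scored exactly once with near-maximal context. A sketch of the span bookkeeping (the window length is an assumption; only the stride is given):

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int = 64):
    """Context/score spans for pure-inference sliding-window eval.

    Returns (ctx_from, end, score_from) triples: tokens [score_from, end)
    are scored with context [ctx_from, end); spans tile [0, n_tokens).
    """
    spans, scored = [], 0
    while scored < n_tokens:
        # First window scores everything it sees; later windows score `stride` new tokens.
        score_to = min(scored + stride, n_tokens) if scored else min(window, n_tokens)
        ctx_from = max(0, score_to - window)
        spans.append((ctx_from, score_to, scored))
        scored = score_to
    return spans
```

Summing per-token bits over the scored region of each span and dividing by total bytes yields the val_bpb figure.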
LR Schedule
warmdown
parameters: {"warmdown_iters":4000,"late_qat_threshold":0.15}
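The warmdown schedule with warmdown_iters 4000 presumably holds the learning rate flat and then decays it linearly to zero over the final 4000 steps; the flat early phase is an assumption. The late_qat_threshold of 0.15 likely gates when late QAT engages relative to the remaining schedule and is not modeled here.

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_iters: int = 4000) -> float:
    """Constant LR, then linear warmdown to zero over the last warmdown_iters
    steps (the constant early phase is an assumption)."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```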
Regularization
CROWN-Q penalty
parameters: {"lambda":0.01,"warmdown_only":true}
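CROWN-Q is this PR's own technique, so the exact form is not public; one plausible reading of a "curvature-weighted quantization variance penalty" is a regularizer lambda * sum_i h_i * (w_i - q(w_i))^2, where h_i is a per-weight curvature estimate (e.g. diagonal Fisher) and q is the 6-bit quantizer, added to the loss only during warmdown. Everything below the record's lambda = 0.01 is an assumption.

```python
import numpy as np

def crownq_penalty(w: np.ndarray, curvature: np.ndarray,
                   quantize, lam: float = 0.01) -> float:
    """Curvature-weighted quantization error penalty (hypothetical reading
    of CROWN-Q): lam * sum(h_i * (w_i - q(w_i))^2), applied during warmdown.

    `quantize` is any callable mapping weights to their quantized values.
    """
    err = w - quantize(w)
    return lam * float(np.sum(curvature * err ** 2))
```

The intuition such a penalty would capture: weights sitting on high-curvature loss directions are pushed toward their quantization grid points before GPTQ runs, so the post-training quantization step moves the loss less.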
Novel Contributions
- CROWN-Q curvature-weighted quantization variance penalty during warmdown
- Full Cholesky GPTQ with act-order and calibration within training budget
- SWA/EMA 50/50 blend with EMA decay 0.997
- Pure inference sliding-window evaluation with stride 64
- 11-layer architecture with XSA, VRL, BigramHash 3072, and partial RoPE