PR #692 (closed)
Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)
by EthanYangTW
val_bpb: 1.1186
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,945,134 bytes
Training Techniques

Quantization
- GPTQ (bits: 6, scope: all weights)
- QAT (bits: 6, scope: all weights)
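The 6-bit QAT entry can be illustrated with a minimal symmetric fake-quantizer. This is a sketch under assumptions: the record does not specify per-tensor vs. per-channel scales (per-tensor is assumed), and in training the rounding would be paired with a straight-through gradient estimator.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization for QAT (sketch).

    Weights are snapped to the 6-bit grid in the forward pass; an actual
    QAT loop would pass gradients straight through the rounding.
    """
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max() / qmax                  # per-tensor scale
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```

With this scheme the rounding error of any weight is bounded by half a quantization step (`scale / 2`).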
Architecture
- BigramHash: bigram-hash feature of size 3072 (parameters: {"dimensions": 3072})
- Partial RoPE: partial rotary positional embeddings (parameters: {"train_length": 16, "eval_length": 64})
- XSA: applied to all 11 layers (parameters: {"layers": 11})
- GQA: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads": 8, "kv_heads": 4})
- MLP3x: 3x MLP with LeakyReLU(0.5)^2 activation (parameters: {"layers": 3})
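Of the blocks above, GQA has the most standard formulation: with 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A minimal sketch (causal masking and projections omitted; shapes and head counts follow the record's parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v):
    """Grouped-query attention (sketch, no masking or projections).

    q: (T, 8, d) query heads; k, v: (T, 4, d) KV heads. Each KV head is
    repeated to serve 8 // 4 = 2 query heads.
    """
    n_heads, n_kv = q.shape[1], k.shape[1]
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=1)                 # (T, 8, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    att = softmax(np.einsum('thd,shd->hts', q, k) / np.sqrt(d))
    return np.einsum('hts,shd->thd', att, v)        # (T, 8, d)
```

Sharing KV heads shrinks the KV cache by the group factor (2x here) at little quality cost, which is the usual motivation for GQA.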
Weight Averaging
- SWA (parameters: {"every_steps": 50, "blend": 0.5})
- EMA (parameters: {"decay": 0.997, "blend": 0.5})
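The SWA/EMA 50/50 blend can be sketched as two running averages combined at the end. Assumptions: SWA is a uniform running mean over snapshots taken every 50 steps, and the EMA is initialized from the starting weights; the record only gives the hyperparameters, not these details.

```python
import numpy as np

class AveragedWeights:
    """SWA snapshot every `every_steps` steps plus EMA with the given decay;
    final weights are a 50/50 blend of the two averages (sketch)."""

    def __init__(self, w0, every_steps=50, decay=0.997):
        self.every, self.decay = every_steps, decay
        self.swa, self.n_swa = np.zeros_like(w0), 0
        self.ema = w0.copy()                        # assumed EMA init

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step % self.every == 0:                  # SWA snapshot
            self.n_swa += 1
            self.swa += (w - self.swa) / self.n_swa # uniform running mean

    def final(self, blend=0.5):
        return blend * self.swa + (1 - blend) * self.ema
```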
Evaluation
- Sliding-window eval (parameters: {"stride": 64})
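Sliding-window evaluation with stride 64 can be sketched as follows: each token is scored exactly once, conditioned on at most `window - 1` preceding tokens. Assumptions: the window size matches the architecture's eval_length of 64, `logprob_fn` is a placeholder returning per-token log-probabilities in nats, and tokens are bytes (so bits per token equals bits per byte).

```python
import math

def sliding_window_bpb(logprob_fn, tokens, window=64, stride=64):
    """Pure-inference sliding-window eval (sketch).

    Advances `stride` tokens at a time and scores only the new tokens of
    each window, so every token is counted exactly once.
    """
    total_nll = 0.0
    for start in range(0, len(tokens), stride):
        new = tokens[start:start + stride]          # tokens to score
        ctx_begin = max(0, start + len(new) - window)
        chunk = tokens[ctx_begin:start + len(new)]  # context + new tokens
        lps = logprob_fn(chunk)                     # log p(t_i | t_<i), nats
        total_nll -= sum(lps[-len(new):])
    return total_nll / (len(tokens) * math.log(2))  # nats -> bits per byte
```

With stride equal to the window, as here, the windows are disjoint and no test-time context is reused across windows.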
LR Schedule
- Warmdown (parameters: {"warmdown_iters": 4000, "late_qat_threshold": 0.15})
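A plausible reading of these parameters, sketched below: the learning rate is constant, then decays linearly to zero over the final 4000 iterations, and the late-QAT phase activates once the warmdown scale drops below 0.15. Both the schedule shape and the threshold semantics are assumptions; the record only lists the two values.

```python
def lr_scale(step, total_steps, warmdown_iters=4000):
    """Constant LR followed by a linear warmdown to zero (assumed shape)."""
    start = total_steps - warmdown_iters
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)

def late_qat_active(step, total_steps, warmdown_iters=4000, threshold=0.15):
    """Assumption: 6-bit QAT switches on late in the warmdown, once the
    LR scale falls below late_qat_threshold."""
    return lr_scale(step, total_steps, warmdown_iters) < threshold
```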
Regularization
- CROWN-Q penalty (parameters: {"lambda": 0.01, "warmdown_only": true})
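The CROWN-Q internals are not spelled out in the record; the sketch below only illustrates the stated idea of a curvature-weighted quantization variance penalty. Assumptions: the penalty is lambda times the squared distance of each weight to its nearest 6-bit grid point, weighted by a per-weight curvature estimate (e.g. Adam's second-moment statistics), applied only during warmdown.

```python
import numpy as np

def crownq_penalty(w, curvature, bits=6, lam=0.01):
    """Hypothetical curvature-weighted quantization variance penalty.

    Penalizes each weight's squared distance to its 6-bit grid point,
    weighted by a per-weight curvature estimate (sketch; not the record's
    actual implementation).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale) * scale                 # nearest grid point
    return lam * np.sum(curvature * (w - q) ** 2)
```

A penalty of this form pushes high-curvature weights toward representable values before GPTQ runs, which would explain pairing it with the warmdown phase.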
Other
- Full Cholesky GPTQ calibration with act-order and a 256-sample calibration set drawn from the training data, performed within the training budget (parameters: {"block_size": 128, "calibration_samples": 256})
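The full-Cholesky, act-order GPTQ step can be sketched with the standard algorithm: build a damped Hessian from calibration activations, process columns in decreasing order of Hessian diagonal, and propagate each column's quantization error to the remaining columns via the upper Cholesky factor of the inverse Hessian. The record's block_size=128 blocking (a throughput optimization) is omitted here for clarity, and per-row symmetric scales are an assumption.

```python
import numpy as np

def gptq(W, X, bits=6, damp=0.01):
    """Simplified full-Cholesky GPTQ with act-order (sketch).

    W: (out, in) weight matrix; X: (samples, in) calibration activations.
    Returns the quantized weights in the original column order.
    """
    n_in = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)  # damping for stability
    order = np.argsort(-np.diag(H))                 # act-order: big diag first
    W = W[:, order].copy()
    H = H[np.ix_(order, order)]
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T   # upper Cholesky factor
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scales
    Q = np.zeros_like(W)
    for i in range(n_in):
        q = np.clip(np.round(W[:, i] / scale[:, 0]), -qmax - 1, qmax) * scale[:, 0]
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])    # compensate rest
    return Q[:, np.argsort(order)]                  # undo act-order permutation
```

The error-compensation step is what lets GPTQ beat plain round-to-nearest on the calibration distribution.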
Novel Contributions
- CROWN-Q curvature-weighted quantization variance penalty during warmdown
- Full Cholesky GPTQ with act-order within the training budget
- SWA/EMA 50/50 blend for final weights
- Pure inference sliding-window evaluation with no test-time training
- Architecture using XSA, BigramHash, GQA, and partial RoPE