PR #693 (open)

Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)

by EthanYangTW
val_bpb: 1.1186
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,947,742 bytes

Training Techniques

Quantization
QAT + GPTQ
bits: 6
scope: all
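The QAT half of this pairing is typically implemented as fake quantization in the forward pass. A minimal sketch of a symmetric per-tensor 6-bit quantizer, assuming the PR uses a standard signed integer grid (the exact scheme is not shown):

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Symmetric per-tensor fake quantization, as commonly used in QAT
    (a sketch; the PR's exact quantizer is an assumption).
    bits=6 gives signed integer levels in [-32, 31]."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    scale = max(np.abs(w).max(), 1e-8) / qmax      # per-tensor step size
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized weights
```

In a training loop this would be applied to weights on the forward pass, with gradients flowing to the underlying full-precision weights via a straight-through estimator.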
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
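With 8 query heads and 4 KV heads, every pair of query heads shares one key/value head. A minimal NumPy sketch of that sharing pattern (shapes and naming are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch matching heads=8, kv_heads=4:
    each KV head serves n_heads // n_kv_heads query heads.
    q: (8, T, d); k, v: (4, T, d)."""
    n_heads, n_kv = q.shape[0], k.shape[0]
    group = n_heads // n_kv                        # 2 query heads per KV head
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]      # shared KV head
        scores = q[h] @ kh.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ vh
    return out
```

The point of the grouping is a halved KV cache at inference while keeping 8-way query diversity.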
MLP3x
Three-layer MLP using squared LeakyReLU (negative slope 0.5) activations
parameters: {"layers":3,"activation":"LeakyReLU(0.5)^2"}
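One plausible reading of `LeakyReLU(0.5)^2` is the LeakyReLU output squared, analogous to the ReLU² activation used in other speedrun entries; the PR does not define it further. A sketch under that assumption (weight names illustrative, biases omitted):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """'LeakyReLU(0.5)^2' read as squaring the LeakyReLU output
    (an assumption; note squaring makes negative-side outputs positive)."""
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, W1, W2, W3):
    """Three-layer MLP with the squared activation after the first two
    matmuls (illustrative sketch of the listed MLP3x block)."""
    h = leaky_relu_sq(x @ W1)
    h = leaky_relu_sq(h @ W2)
    return h @ W3
```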
XSA
XSA applied to the last 4 layers
parameters: {"layers":[7,8,9,10]}
BigramHash
BigramHash feature with size 3072
parameters: {"dimensions":3072}
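A hashed bigram feature typically maps each (previous, current) token pair into a fixed-size bucket table indexing an extra embedding. A sketch with size 3072; the hash function and BOS handling here are assumptions:

```python
def bigram_hash_ids(tokens, table_size=3072, bos=0):
    """Map each (prev, cur) token bigram to one of table_size buckets
    (sketch; the PR's hash and BOS convention are assumptions). The ids
    would index a 3072-row embedding table added to the input features."""
    ids = []
    prev = bos
    for t in tokens:
        ids.append((prev * 1000003 + t) % table_size)  # arbitrary mixing prime
        prev = t
    return ids
```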
Partial RoPE
Partial rotary positional embedding (RoPE applied to 16 of 64 head dimensions)
parameters: {"dimensions":"16/64"}
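"16/64" reads as rotating only the first 16 of 64 head dimensions and passing the rest through unrotated. A sketch assuming the split-halves pairing convention (interleaved pairing is equally possible; the PR does not say):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Partial RoPE sketch matching '16/64': rotate only the first
    rot_dims of the head dimension, leave the rest unchanged.
    x: (T, d) activations for one head; d = 64 here."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)       # per-pair frequencies
    ang = np.outer(np.arange(T), inv_freq)             # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]          # paired coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```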
Weight Averaging
SWA + EMA
parameters: {"blend":"50/50","ema_decay":0.997,"swa_interval_steps":50}
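The listed parameters suggest maintaining an EMA (decay 0.997) alongside an SWA average sampled every 50 steps, then blending the two 50/50 for the final weights. A sketch under that reading (where the averager hooks into the training loop is not shown):

```python
import numpy as np

class BlendedAverager:
    """SWA + EMA weight averaging with a 50/50 final blend, matching the
    listed ema_decay=0.997 and swa_interval_steps=50 (a sketch; the PR's
    exact integration is an assumption)."""
    def __init__(self, w0, ema_decay=0.997, swa_interval=50):
        self.ema = w0.copy()
        self.swa = np.zeros_like(w0)
        self.n_swa = 0
        self.decay = ema_decay
        self.interval = swa_interval

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step % self.interval == 0:               # SWA snapshot every 50 steps
            self.swa = (self.swa * self.n_swa + w) / (self.n_swa + 1)
            self.n_swa += 1

    def blended(self):
        return 0.5 * self.swa + 0.5 * self.ema      # the 50/50 blend
```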
Evaluation
sliding window eval
parameters: {"stride":64}
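Sliding-window evaluation with stride 64 scores each token with fresh left context while only charging each token once. A sketch of the window bookkeeping; context=256 is an illustrative value, since the PR only specifies the stride:

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Enumerate (begin, end, n_scored) spans for sliding-window eval:
    each window sees up to `context` tokens, but only tokens not covered
    by a previous window are scored (context=256 is an assumption)."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, and after the first window each scored token has at least context minus stride tokens of history.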
Test-Time Training
disabled
parameters: {"TTT_ENABLED":0}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000,"late_qat_threshold":0.15}
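A warmdown schedule is typically constant LR followed by a linear decay to zero over the final iterations. A sketch matching warmdown_iters=4000; late_qat_threshold=0.15 presumably gates when late-stage QAT switches on, but its exact use is not shown, so it is omitted here:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=4000):
    """Warmdown schedule sketch: hold base_lr, then decay linearly to 0
    over the last warmdown_iters steps (the shape is an assumption;
    only warmdown_iters=4000 is given in the PR)."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```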
Regularization
CROWN-Q penalty
parameters: {"lambda":0.01,"warmdown_only":true}
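The PR does not spell out the CROWN-Q penalty. One plausible reading of "curvature-weighted quantization variance penalty" is a curvature-weighted squared quantization error added to the loss during warmdown, sketched here with the listed lambda=0.01 (the functional form and the curvature estimate are assumptions):

```python
import numpy as np

def crown_q_penalty(w, curvature, bits=6, lam=0.01):
    """Illustrative CROWN-Q reading: lam * sum(c_i * (w_i - Q(w_i))**2),
    pushing high-curvature weights toward the 6-bit grid during warmdown
    (the exact form is not given in the PR; this is a sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max(), 1e-8) / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return lam * np.sum(curvature * (w - w_q) ** 2)
```

The curvature weights would come from something like a diagonal Hessian or squared-gradient estimate, so that weights the loss is most sensitive to incur the largest penalty for sitting between grid points.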
Other
other
Full Cholesky GPTQ with act-order and block_size=128 using 256-sample calibration from training data
parameters: {"block_size":128,"calibration_samples":256,"act_order":true}
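The core GPTQ loop can be sketched compactly. This minimal version omits the PR's act-order permutation and block_size=128 batching, and uses a per-tensor grid for brevity: columns are quantized left to right, and each column's error is pushed onto the not-yet-quantized columns via the upper Cholesky factor of the inverse Hessian:

```python
import numpy as np

def gptq_quantize(W, X, bits=6, damp=0.01):
    """Minimal full-Cholesky GPTQ sketch (no act-order or blocking,
    unlike the PR's variant). W: (rows, cols) weights;
    X: (cols, n_samples) calibration inputs."""
    W = W.copy()
    cols = W.shape[1]
    H = 2.0 * X @ X.T                                   # layer Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(cols)      # dampening
    U = np.linalg.cholesky(np.linalg.inv(H)).T          # upper factor of H^-1

    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(W).max(), 1e-8) / qmax           # per-tensor grid

    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        err = (w - q) / U[j, j]
        W[:, j] = q
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])     # error propagation
    return W
```

Act-order would first permute columns by descending Hessian diagonal, and blocking would batch the update for groups of 128 columns; both change efficiency and accuracy but not the basic update above.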

Novel Contributions

  • CROWN-Q curvature-weighted quantization variance penalty during warmdown
  • Full Cholesky GPTQ with act-order after training as part of model export
  • 50/50 blend of SWA and EMA
  • Architecture with GQA, XSA, BigramHash, and partial RoPE
  • Pure inference sliding-window evaluation with TTT disabled