PR #692 (closed)
Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)
by EthanYangTW
val_bpb: 1.1186
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,945,134 bytes
Training Techniques

Quantization
- GPTQ (bits: 6, scope: all weights)
- QAT (bits: 6, scope: all weights)
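The 6-bit QAT entry can be illustrated with a minimal symmetric fake-quantizer. This is a sketch under assumptions: the record does not specify per-tensor vs. per-channel scales (per-tensor is assumed), and in training the rounding would be paired with a straight-through gradient estimator.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization for QAT (sketch).

    Weights are snapped to the 6-bit grid in the forward pass; an actual
    QAT loop would pass gradients straight through the rounding.
    """
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max() / qmax                  # per-tensor scale
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```

With this scheme the rounding error of any weight is bounded by half a quantization step (`scale / 2`).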
Architecture
- BigramHash: bigram-hash feature of size 3072 (parameters: {"dimensions": 3072})
- Partial RoPE: partial rotary positional embeddings (parameters: {"train_length": 16, "eval_length": 64})
- XSA: applied to all 11 layers (parameters: {"layers": 11})
- GQA: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads": 8, "kv_heads": 4})
- MLP3x: 3x MLP with LeakyReLU(0.5)^2 activation (parameters: {"layers": 3})
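Of the blocks above, GQA has the most standard formulation: with 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A minimal sketch (causal masking and projections omitted; shapes and head counts follow the record's parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v):
    """Grouped-query attention (sketch, no masking or projections).

    q: (T, 8, d) query heads; k, v: (T, 4, d) KV heads. Each KV head is
    repeated to serve 8 // 4 = 2 query heads.
    """
    n_heads, n_kv = q.shape[1], k.shape[1]
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=1)                 # (T, 8, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    att = softmax(np.einsum('thd,shd->hts', q, k) / np.sqrt(d))
    return np.einsum('hts,shd->thd', att, v)        # (T, 8, d)
```

Sharing KV heads shrinks the KV cache by the group factor (2x here) at little quality cost, which is the usual motivation for GQA.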
Weight Averaging
- SWA (parameters: {"every_steps": 50, "blend": 0.5})
- EMA (parameters: {"decay": 0.997, "blend": 0.5})
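The SWA/EMA 50/50 blend can be sketched as two running averages combined at the end. Assumptions: SWA is a uniform running mean over snapshots taken every 50 steps, and the EMA is initialized from the starting weights; the record only gives the hyperparameters, not these details.

```python
import numpy as np

class AveragedWeights:
    """SWA snapshot every `every_steps` steps plus EMA with the given decay;
    final weights are a 50/50 blend of the two averages (sketch)."""

    def __init__(self, w0, every_steps=50, decay=0.997):
        self.every, self.decay = every_steps, decay
        self.swa, self.n_swa = np.zeros_like(w0), 0
        self.ema = w0.copy()                        # assumed EMA init

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step % self.every == 0:                  # SWA snapshot
            self.n_swa += 1
            self.swa += (w - self.swa) / self.n_swa # uniform running mean

    def final(self, blend=0.5):
        return blend * self.swa + (1 - blend) * self.ema
```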
Evaluation
- Sliding-window eval (parameters: {"stride": 64})
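Sliding-window evaluation with stride 64 can be sketched as follows: each token is scored exactly once, conditioned on at most `window - 1` preceding tokens. Assumptions: the window size matches the architecture's eval_length of 64, `logprob_fn` is a placeholder returning per-token log-probabilities in nats, and tokens are bytes (so bits per token equals bits per byte).

```python
import math

def sliding_window_bpb(logprob_fn, tokens, window=64, stride=64):
    """Pure-inference sliding-window eval (sketch).

    Advances `stride` tokens at a time and scores only the new tokens of
    each window, so every token is counted exactly once.
    """
    total_nll = 0.0
    for start in range(0, len(tokens), stride):
        new = tokens[start:start + stride]          # tokens to score
        ctx_begin = max(0, start + len(new) - window)
        chunk = tokens[ctx_begin:start + len(new)]  # context + new tokens
        lps = logprob_fn(chunk)                     # log p(t_i | t_<i), nats
        total_nll -= sum(lps[-len(new):])
    return total_nll / (len(tokens) * math.log(2))  # nats -> bits per byte
```

With stride equal to the window, as here, the windows are disjoint and no test-time context is reused across windows.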
LR Schedule
- Warmdown (parameters: {"warmdown_iters": 4000, "late_qat_threshold": 0.15})
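A plausible reading of these parameters, sketched below: the learning rate is constant, then decays linearly to zero over the final 4000 iterations, and the late-QAT phase activates once the warmdown scale drops below 0.15. Both the schedule shape and the threshold semantics are assumptions; the record only lists the two values.

```python
def lr_scale(step, total_steps, warmdown_iters=4000):
    """Constant LR followed by a linear warmdown to zero (assumed shape)."""
    start = total_steps - warmdown_iters
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)

def late_qat_active(step, total_steps, warmdown_iters=4000, threshold=0.15):
    """Assumption: 6-bit QAT switches on late in the warmdown, once the
    LR scale falls below late_qat_threshold."""
    return lr_scale(step, total_steps, warmdown_iters) < threshold
```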
Regularization
- CROWN-Q penalty (parameters: {"lambda": 0.01, "warmdown_only": true})
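The CROWN-Q internals are not spelled out in the record; the sketch below only illustrates the stated idea of a curvature-weighted quantization variance penalty. Assumptions: the penalty is lambda times the squared distance of each weight to its nearest 6-bit grid point, weighted by a per-weight curvature estimate (e.g. Adam's second-moment statistics), applied only during warmdown.

```python
import numpy as np

def crownq_penalty(w, curvature, bits=6, lam=0.01):
    """Hypothetical curvature-weighted quantization variance penalty.

    Penalizes each weight's squared distance to its 6-bit grid point,
    weighted by a per-weight curvature estimate (sketch; not the record's
    actual implementation).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale) * scale                 # nearest grid point
    return lam * np.sum(curvature * (w - q) ** 2)
```

A penalty of this form pushes high-curvature weights toward representable values before GPTQ runs, which would explain pairing it with the warmdown phase.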
Other
- Full Cholesky GPTQ calibration with act-order and a 256-sample calibration set drawn from the training data, performed within the training budget (parameters: {"block_size": 128, "calibration_samples": 256})
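The full-Cholesky, act-order GPTQ step can be sketched with the standard algorithm: build a damped Hessian from calibration activations, process columns in decreasing order of Hessian diagonal, and propagate each column's quantization error to the remaining columns via the upper Cholesky factor of the inverse Hessian. The record's block_size=128 blocking (a throughput optimization) is omitted here for clarity, and per-row symmetric scales are an assumption.

```python
import numpy as np

def gptq(W, X, bits=6, damp=0.01):
    """Simplified full-Cholesky GPTQ with act-order (sketch).

    W: (out, in) weight matrix; X: (samples, in) calibration activations.
    Returns the quantized weights in the original column order.
    """
    n_in = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)  # damping for stability
    order = np.argsort(-np.diag(H))                 # act-order: big diag first
    W = W[:, order].copy()
    H = H[np.ix_(order, order)]
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T   # upper Cholesky factor
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scales
    Q = np.zeros_like(W)
    for i in range(n_in):
        q = np.clip(np.round(W[:, i] / scale[:, 0]), -qmax - 1, qmax) * scale[:, 0]
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])    # compensate rest
    return Q[:, np.argsort(order)]                  # undo act-order permutation
```

The error-compensation step is what lets GPTQ beat plain round-to-nearest on the calibration distribution.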
Novel Contributions
- CROWN-Q curvature-weighted quantization variance penalty during warmdown
- Full Cholesky GPTQ with act-order within the training budget
- SWA/EMA 50/50 blend for final weights
- Pure inference sliding-window evaluation with no test-time training
- Architecture using XSA, BigramHash, GQA, and partial RoPE