PR #693 (open)

Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)

by EthanYangTW
val_bpb: 1.1186
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,947,742 bytes

Training Techniques

Quantization
QAT + GPTQ
bits: 6
scope: all
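The QAT half of this pairing is typically implemented as fake quantization in the forward pass. A minimal sketch of a symmetric per-tensor 6-bit quantizer, assuming the PR uses a standard signed integer grid (the exact scheme is not shown):

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Symmetric per-tensor fake quantization, as commonly used in QAT
    (a sketch; the PR's exact quantizer is an assumption).
    bits=6 gives signed integer levels in [-32, 31]."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    scale = max(np.abs(w).max(), 1e-8) / qmax      # per-tensor step size
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized weights
```

In a training loop this would be applied to weights on the forward pass, with gradients flowing to the underlying full-precision weights via a straight-through estimator.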
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
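With 8 query heads and 4 KV heads, every pair of query heads shares one key/value head. A minimal NumPy sketch of that sharing pattern (shapes and naming are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch matching heads=8, kv_heads=4:
    each KV head serves n_heads // n_kv_heads query heads.
    q: (8, T, d); k, v: (4, T, d)."""
    n_heads, n_kv = q.shape[0], k.shape[0]
    group = n_heads // n_kv                        # 2 query heads per KV head
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]      # shared KV head
        scores = q[h] @ kh.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ vh
    return out
```

The point of the grouping is a halved KV cache at inference while keeping 8-way query diversity.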
MLP3x
Three-layer MLP using squared LeakyReLU (negative slope 0.5) activations
parameters: {"layers":3,"activation":"LeakyReLU(0.5)^2"}
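One plausible reading of `LeakyReLU(0.5)^2` is the LeakyReLU output squared, analogous to the ReLU² activation used in other speedrun entries; the PR does not define it further. A sketch under that assumption (weight names illustrative, biases omitted):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """'LeakyReLU(0.5)^2' read as squaring the LeakyReLU output
    (an assumption; note squaring makes negative-side outputs positive)."""
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, W1, W2, W3):
    """Three-layer MLP with the squared activation after the first two
    matmuls (illustrative sketch of the listed MLP3x block)."""
    h = leaky_relu_sq(x @ W1)
    h = leaky_relu_sq(h @ W2)
    return h @ W3
```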
XSA
XSA applied to the last 4 layers
parameters: {"layers":[7,8,9,10]}
BigramHash
BigramHash feature with size 3072
parameters: {"dimensions":3072}
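A hashed bigram feature typically maps each (previous, current) token pair into a fixed-size bucket table indexing an extra embedding. A sketch with size 3072; the hash function and BOS handling here are assumptions:

```python
def bigram_hash_ids(tokens, table_size=3072, bos=0):
    """Map each (prev, cur) token bigram to one of table_size buckets
    (sketch; the PR's hash and BOS convention are assumptions). The ids
    would index a 3072-row embedding table added to the input features."""
    ids = []
    prev = bos
    for t in tokens:
        ids.append((prev * 1000003 + t) % table_size)  # arbitrary mixing prime
        prev = t
    return ids
```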
Partial RoPE
Partial rotary positional embedding (RoPE applied to 16 of 64 head dimensions)
parameters: {"dimensions":"16/64"}
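"16/64" reads as rotating only the first 16 of 64 head dimensions and passing the rest through unrotated. A sketch assuming the split-halves pairing convention (interleaved pairing is equally possible; the PR does not say):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Partial RoPE sketch matching '16/64': rotate only the first
    rot_dims of the head dimension, leave the rest unchanged.
    x: (T, d) activations for one head; d = 64 here."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)       # per-pair frequencies
    ang = np.outer(np.arange(T), inv_freq)             # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]          # paired coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```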
Weight Averaging
SWA + EMA
parameters: {"blend":"50/50","ema_decay":0.997,"swa_interval_steps":50}
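The listed parameters suggest maintaining an EMA (decay 0.997) alongside an SWA average sampled every 50 steps, then blending the two 50/50 for the final weights. A sketch under that reading (where the averager hooks into the training loop is not shown):

```python
import numpy as np

class BlendedAverager:
    """SWA + EMA weight averaging with a 50/50 final blend, matching the
    listed ema_decay=0.997 and swa_interval_steps=50 (a sketch; the PR's
    exact integration is an assumption)."""
    def __init__(self, w0, ema_decay=0.997, swa_interval=50):
        self.ema = w0.copy()
        self.swa = np.zeros_like(w0)
        self.n_swa = 0
        self.decay = ema_decay
        self.interval = swa_interval

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step % self.interval == 0:               # SWA snapshot every 50 steps
            self.swa = (self.swa * self.n_swa + w) / (self.n_swa + 1)
            self.n_swa += 1

    def blended(self):
        return 0.5 * self.swa + 0.5 * self.ema      # the 50/50 blend
```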
Evaluation
sliding window eval
parameters: {"stride":64}
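Sliding-window evaluation with stride 64 scores each token with fresh left context while only charging each token once. A sketch of the window bookkeeping; context=256 is an illustrative value, since the PR only specifies the stride:

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Enumerate (begin, end, n_scored) spans for sliding-window eval:
    each window sees up to `context` tokens, but only tokens not covered
    by a previous window are scored (context=256 is an assumption)."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, and after the first window each scored token has at least context minus stride tokens of history.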
Test-Time Training
disabled
parameters: {"TTT_ENABLED":0}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000,"late_qat_threshold":0.15}
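A warmdown schedule is typically constant LR followed by a linear decay to zero over the final iterations. A sketch matching warmdown_iters=4000; late_qat_threshold=0.15 presumably gates when late-stage QAT switches on, but its exact use is not shown, so it is omitted here:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=4000):
    """Warmdown schedule sketch: hold base_lr, then decay linearly to 0
    over the last warmdown_iters steps (the shape is an assumption;
    only warmdown_iters=4000 is given in the PR)."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```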
Regularization
CROWN-Q penalty
parameters: {"lambda":0.01,"warmdown_only":true}
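The PR does not spell out the CROWN-Q penalty. One plausible reading of "curvature-weighted quantization variance penalty" is a curvature-weighted squared quantization error added to the loss during warmdown, sketched here with the listed lambda=0.01 (the functional form and the curvature estimate are assumptions):

```python
import numpy as np

def crown_q_penalty(w, curvature, bits=6, lam=0.01):
    """Illustrative CROWN-Q reading: lam * sum(c_i * (w_i - Q(w_i))**2),
    pushing high-curvature weights toward the 6-bit grid during warmdown
    (the exact form is not given in the PR; this is a sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max(), 1e-8) / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return lam * np.sum(curvature * (w - w_q) ** 2)
```

The curvature weights would come from something like a diagonal Hessian or squared-gradient estimate, so that weights the loss is most sensitive to incur the largest penalty for sitting between grid points.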
Other
other
Full Cholesky GPTQ with act-order and block_size=128 using 256-sample calibration from training data
parameters: {"block_size":128,"calibration_samples":256,"act_order":true}
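The core GPTQ loop can be sketched compactly. This minimal version omits the PR's act-order permutation and block_size=128 batching, and uses a per-tensor grid for brevity: columns are quantized left to right, and each column's error is pushed onto the not-yet-quantized columns via the upper Cholesky factor of the inverse Hessian:

```python
import numpy as np

def gptq_quantize(W, X, bits=6, damp=0.01):
    """Minimal full-Cholesky GPTQ sketch (no act-order or blocking,
    unlike the PR's variant). W: (rows, cols) weights;
    X: (cols, n_samples) calibration inputs."""
    W = W.copy()
    cols = W.shape[1]
    H = 2.0 * X @ X.T                                   # layer Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(cols)      # dampening
    U = np.linalg.cholesky(np.linalg.inv(H)).T          # upper factor of H^-1

    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(W).max(), 1e-8) / qmax           # per-tensor grid

    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        err = (w - q) / U[j, j]
        W[:, j] = q
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])     # error propagation
    return W
```

Act-order would first permute columns by descending Hessian diagonal, and blocking would batch the update for groups of 128 columns; both change efficiency and accuracy but not the basic update above.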

Novel Contributions

  • CROWN-Q curvature-weighted quantization variance penalty during warmdown
  • Full Cholesky GPTQ with act-order after training as part of model export
  • 50/50 blend of SWA and EMA
  • Architecture with GQA, XSA, BigramHash, and partial RoPE
  • Pure inference sliding-window evaluation with TTT disabled