PR #690 (closed)
Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)
by EthanYangTW
val_bpb
1.1186
Architecture
Transformer
Optimizer
—
Artifact Size
15,947,742 bytes
Training Techniques
Quantization
QAT + GPTQ
bits: 6
scope: all weights
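The PR does not show its QAT kernel here; as a minimal sketch of what 6-bit quantization-aware training over all weights typically looks like, the following applies symmetric per-tensor fake quantization (quantize, then dequantize back to float for the forward pass). The per-tensor max-abs scale is an assumption, not necessarily the PR's scheme.

```python
import numpy as np

def fake_quant_6bit(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization to 6 bits (illustrative sketch).

    Quantizes to signed 6-bit integer levels and immediately dequantizes,
    so the forward pass sees the quantization error during training.
    """
    qmax = 2 ** (6 - 1) - 1            # 31: largest positive signed 6-bit level
    scale = np.abs(w).max() / qmax     # per-tensor max-abs scale (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                   # dequantize back to float
```

In a real QAT loop the round would be paired with a straight-through estimator so gradients flow past the non-differentiable rounding; GPTQ would then refine the final integer weights post-hoc against calibration activations.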
Architecture
XSA
XSA applied to all 11 layers
parameters: {"layers":11}
BigramHash
BigramHash feature/module with size 3072
parameters: {"dimensions":3072}
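The record gives only the table size (3072); one plausible reading of a BigramHash feature is to hash each (previous token, current token) pair into a fixed-size bucket table whose entries index learned embeddings. The mixing constant below is illustrative, not the PR's actual hash.

```python
TABLE_SIZE = 3072  # matches the record's "dimensions" parameter; its exact role is an assumption

def bigram_hash_ids(tokens: list[int], table_size: int = TABLE_SIZE) -> list[int]:
    """Map each (prev, cur) token bigram to a bucket in a hashed feature table.

    The multiplier/XOR mixing scheme here is illustrative only.
    """
    ids = []
    prev = 0  # assumed BOS placeholder for the first bigram
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % table_size)
        prev = t
    return ids
```

Each returned id would index a row of a learned (3072, d_model) embedding table added to the token stream.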
Partial RoPE
Partial rotary positional embeddings
parameters: {"train_length":16,"eval_length":64}
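"Partial RoPE" is usually read as applying rotary embeddings to only a fraction of each head's dimensions and passing the rest through unrotated; the train_length/eval_length parameters suggest positions are trained short (16) and evaluated longer (64). The rotated fraction and frequency base below are assumptions.

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: np.ndarray, rot_frac: float = 0.5,
                 base: float = 10000.0) -> np.ndarray:
    """Rotate the first rot_frac of the head dims; leave the rest untouched.

    x: (seq, head_dim), pos: (seq,). rot_frac and base are assumptions.
    """
    d = x.shape[-1]
    d_rot = int(d * rot_frac) // 2 * 2           # even count of rotated dims
    x_rot, x_pass = x[:, :d_rot], x[:, d_rot:]
    half = d_rot // 2
    freqs = base ** (-np.arange(half) / half)    # (half,) inverse frequencies
    ang = pos[:, None] * freqs[None, :]          # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Because rotation is norm-preserving per dimension pair, the unrotated tail carries position-free content while the rotated head encodes relative offsets.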
MLP3x
Three-layer MLP using LeakyReLU activations
parameters: {"layers":3}
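A three-layer MLP block with LeakyReLU between the linear maps can be sketched as below; the hidden widths, negative slope, and the absence of an activation on the output projection are assumptions.

```python
import numpy as np

def leaky_relu(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    """LeakyReLU: identity for positives, small linear slope for negatives."""
    return np.where(x > 0, x, slope * x)

def mlp3x(x: np.ndarray, ws: list[np.ndarray]) -> np.ndarray:
    """Three linear layers with LeakyReLU between them (widths assumed)."""
    h = leaky_relu(x @ ws[0])
    h = leaky_relu(h @ ws[1])
    return h @ ws[2]  # no activation on the output projection (assumption)
```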
Weight Averaging
SWA + EMA
parameters: {"blend_ratio":"50/50","ema_decay":0.997,"swa_interval_steps":50}
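The weight-averaging recipe is fully parameterized in the record (EMA decay 0.997, SWA snapshot every 50 steps, 50/50 final blend); a minimal sketch on scalar "weights", assuming the SWA average is a plain running mean of the snapshots:

```python
def train_averages(weights_per_step: list[float],
                   ema_decay: float = 0.997, swa_interval: int = 50) -> float:
    """Maintain an EMA every step and an SWA running mean every swa_interval
    steps, then return their 50/50 blend (scalars stand in for tensors)."""
    ema = weights_per_step[0]
    swa_sum, swa_n = 0.0, 0
    for step, w in enumerate(weights_per_step):
        ema = ema_decay * ema + (1 - ema_decay) * w   # EMA shadow update
        if step % swa_interval == 0:                  # SWA snapshot
            swa_sum += w
            swa_n += 1
    swa = swa_sum / swa_n
    return 0.5 * swa + 0.5 * ema                      # the record's 50/50 blend
```

In practice each parameter tensor gets this treatment independently, and BatchNorm-free transformers need no post-averaging statistics fix-up.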
Evaluation
sliding window eval
parameters: {"stride":64}
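Sliding-window evaluation with stride 64 typically means each forward pass sees up to a full context window, but only the last 64 tokens of each window are scored, so every token is scored exactly once with near-maximal context. A sketch of the span bookkeeping (the window length is an assumption; only the stride is given):

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int = 64):
    """Context/score spans for pure-inference sliding-window eval.

    Returns (ctx_from, end, score_from) triples: tokens [score_from, end)
    are scored with context [ctx_from, end); spans tile [0, n_tokens).
    """
    spans, scored = [], 0
    while scored < n_tokens:
        # First window scores everything it sees; later windows score `stride` new tokens.
        score_to = min(scored + stride, n_tokens) if scored else min(window, n_tokens)
        ctx_from = max(0, score_to - window)
        spans.append((ctx_from, score_to, scored))
        scored = score_to
    return spans
```

Summing per-token bits over the scored region of each span and dividing by total bytes yields the val_bpb figure.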
LR Schedule
warmdown
parameters: {"warmdown_iters":4000,"late_qat_threshold":0.15}
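The warmdown schedule with warmdown_iters 4000 presumably holds the learning rate flat and then decays it linearly to zero over the final 4000 steps; the flat early phase is an assumption. The late_qat_threshold of 0.15 likely gates when late QAT engages relative to the remaining schedule and is not modeled here.

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_iters: int = 4000) -> float:
    """Constant LR, then linear warmdown to zero over the last warmdown_iters
    steps (the constant early phase is an assumption)."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```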
Regularization
CROWN-Q penalty
parameters: {"lambda":0.01,"warmdown_only":true}
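CROWN-Q is this PR's own technique, so the exact form is not public; one plausible reading of a "curvature-weighted quantization variance penalty" is a regularizer lambda * sum_i h_i * (w_i - q(w_i))^2, where h_i is a per-weight curvature estimate (e.g. diagonal Fisher) and q is the 6-bit quantizer, added to the loss only during warmdown. Everything below the record's lambda = 0.01 is an assumption.

```python
import numpy as np

def crownq_penalty(w: np.ndarray, curvature: np.ndarray,
                   quantize, lam: float = 0.01) -> float:
    """Curvature-weighted quantization error penalty (hypothetical reading
    of CROWN-Q): lam * sum(h_i * (w_i - q(w_i))^2), applied during warmdown.

    `quantize` is any callable mapping weights to their quantized values.
    """
    err = w - quantize(w)
    return lam * float(np.sum(curvature * err ** 2))
```

The intuition such a penalty would capture: weights sitting on high-curvature loss directions are pushed toward their quantization grid points before GPTQ runs, so the post-training quantization step moves the loss less.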
Novel Contributions
- CROWN-Q curvature-weighted quantization variance penalty during warmdown
- Full Cholesky GPTQ with act-order and calibration within training budget
- SWA/EMA 50/50 blend with EMA decay 0.997
- Pure inference sliding-window evaluation with stride 64
- 11-layer architecture with XSA, VRL, BigramHash 3072, and partial RoPE