PR #1019
Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)
by abaybektursun
val_bpb: 1.1147
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.91 MB
Training Techniques
Quantization
- GPTQ (bits: 6, scope: all)
- late QAT (bits: null, scope: all)
Architecture
- BigramHash: bigram-hash embedding with a wider vocabulary/dimension setting (vocab_size: 3072, dim: 112)
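A minimal sketch of a bigram-hash embedding at the record's 3072×112 size: each (previous, current) token pair is hashed into a small auxiliary table. The multiplicative hash below is a placeholder; the record does not specify the exact hash function.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    """Extra embedding looked up by hashing each (prev, cur) token
    bigram into a small table (hash function is illustrative only)."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))          # pad first position
    idx = (prev * 1000003 + tokens) % table.shape[0]   # bigram -> table row
    return table[idx]                                  # (seq_len, 112)

rng = np.random.default_rng(0)
table = rng.standard_normal((3072, 112)).astype(np.float32)  # 3072 x 112
emb = bigram_hash_embed([5, 17, 42, 17, 42], table)
```

Identical bigrams map to identical rows, so the table captures local pair statistics the main token embedding cannot.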
- XSA: cross-position attention applied to all layers (layers: 11)
- RoPE: partial rotary position embeddings (16 of 64 dimensions rotated)
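Partial RoPE as listed above rotates only 16 of the 64 per-head dimensions and passes the rest through unchanged. A sketch, assuming the rotated slice is the leading dimensions (the record does not say which slice):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` of each
    head's dims; the remaining dims are left untouched."""
    seq = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)

q = np.random.default_rng(1).standard_normal((8, 64))
q_rot = partial_rope(q)
```

Position 0 gets zero rotation, and dims 16..63 are identical before and after, which is the point of the "partial" variant.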
- VE128: applied to later layers (layers: [9, 10])
- SmearGate: position-mixing gate
- U-Net skip connections: encoder-decoder skip connections
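One plausible shape for U-Net-style skips in a layer stack: the first half of layers save their outputs, and each mirrored layer in the second half adds the matching skip back in before running. The exact wiring is an assumption; the record lists no parameters.

```python
def unet_stack(x, layers):
    """Run a layer stack with U-Net skips: encoder-half outputs are
    stored and added back to the decoder-half inputs in reverse order."""
    half = len(layers) // 2
    skips = []
    for f in layers[:half]:          # "encoder" half
        x = f(x)
        skips.append(x)
    for f in layers[half:]:          # "decoder" half, reversed skips
        x = f(x + skips.pop())
    return x

# Toy layers (v -> v + 1) just to trace the data flow.
out = unet_stack(0.0, [lambda v: v + 1] * 4)
```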
- LeakyReLU: squared LeakyReLU MLP activation (squared: true)
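A sketch of the squared LeakyReLU activation listed above: LeakyReLU first, then square. Keeping the sign on the negative branch is an assumption; plain relu(x)**2 variants drop it entirely.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.01):
    """Squared LeakyReLU: apply LeakyReLU, then square while
    preserving the sign (sign preservation is assumed, not confirmed)."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * y * y

a = leaky_relu_squared(np.array([-2.0, 0.0, 3.0]))
```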
Optimizer
- Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Weight Averaging
- EMA + SWA (ema_decay: 0.997, swa_every: 50)
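One plausible composition of the two averaging schemes at the listed settings: maintain an EMA of the weights every step and fold an SWA snapshot of that EMA every 50 steps. How the record actually combines EMA and SWA is not specified.

```python
import numpy as np

def averaged_weights(weight_stream, ema_decay=0.997, swa_every=50):
    """EMA of the weights each step; every `swa_every` steps the EMA
    is folded into a running SWA mean, which is returned at the end."""
    it = iter(weight_stream)
    ema = next(it).astype(np.float64).copy()
    swa, n = np.zeros_like(ema), 0
    for step, w in enumerate(it, start=2):
        ema = ema_decay * ema + (1.0 - ema_decay) * w
        if step % swa_every == 0:
            n += 1
            swa += (ema - swa) / n     # running mean of EMA snapshots
    return swa if n else ema

stream = (np.full(4, 2.0) for _ in range(200))   # toy constant weights
w_avg = averaged_weights(stream)
```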
Compression
- lzma (level: 9)
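The artifact compression step maps directly onto Python's standard-library `lzma` module at the record's preset 9 (maximum compression):

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    """Compress the serialized submission artifact with LZMA, preset 9."""
    return lzma.compress(raw, preset=9)

blob = b"layer.0.weight\x00" * 1024     # repetitive stand-in payload
packed = pack_artifact(blob)
```

Serialized weight blobs with repeated structure compress well, which is what makes the ~15.91 MB artifact size budget workable.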
LR Schedule
- warmdown (warmdown_steps: 4000)
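A sketch of a warmdown schedule with the listed 4000-step window: hold the base LR flat, then decay to zero over the final steps. The linear decay shape is an assumption; the record only names "warmdown".

```python
def lr_multiplier(step, total_steps, warmdown_steps=4000):
    """LR multiplier: 1.0 until the warmdown window starts, then a
    linear ramp down to 0.0 at the final step."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```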
Regularization
- LN scale (scale: 1/sqrt(layer+1))
- structured pruning (type: ±1 by reconstruction error)
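The LN scale rule above is a simple per-layer multiplier; a direct transcription of the listed formula, applied to each of the 11 layers:

```python
import math

def ln_output_scale(layer_idx):
    """Depth-dependent LayerNorm scale 1/sqrt(layer + 1), damping the
    contribution of deeper layers as listed in the record."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_output_scale(i) for i in range(11)]   # one per layer
```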
Novel Contributions
- AR self-generated calibration data for GPTQ with no val or train data access during quantization
- Full Hessian GPTQ with Cholesky error compensation and column reordering
- BigramHash widened to 3072 × 112
- XSA applied to all 11 layers
- Removal of TTT while still improving over prior SOTA
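The first contribution, self-generated calibration data, can be sketched as follows: the model samples sequences from itself autoregressively, and those sequences serve as the GPTQ calibration batch, so quantization touches neither train nor val data. The `sample_next(prefix) -> token` interface below is a hypothetical stand-in for the model's sampling step.

```python
import numpy as np

def self_generated_calibration(sample_next, n_seqs=8, seq_len=32, bos=0):
    """Build GPTQ calibration sequences by autoregressive sampling
    from the model itself (sample_next is a stand-in interface)."""
    batch = []
    for _ in range(n_seqs):
        seq = [bos]
        while len(seq) < seq_len:
            seq.append(sample_next(seq))
        batch.append(seq)
    return np.array(batch)

# Toy stand-in model: uniform random next token.
rng = np.random.default_rng(3)
calib = self_generated_calibration(lambda prefix: int(rng.integers(1, 256)))
```

Because the calibration distribution is the model's own output distribution, it approximates the activations seen at inference time without any dataset access.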