PR #728
openRecord: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112
by abaybektursun on GitHub
val_bpb: 1.1142
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.86 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
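As a rough sketch of the int6 grid only (round-to-nearest; GPTQ's Hessian-driven error compensation is not shown, and the symmetric per-row scaling here is an assumption):

```python
import numpy as np

def quantize_int6(w):
    # symmetric 6-bit codes in [-31, 31]; one scale per output row (an assumed layout)
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```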
Architecture
XSA
XSA attention applied to all layers

parameters: {"layers":11}
BigramHash
Wider bigram hash embedding table, used to improve quality while staying under the artifact size budget
parameters: {"vocab_size":3072,"dimension":112}
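A minimal sketch of the lookup: each (previous, current) token pair is hashed into the 3072-row, 112-wide table. The multiplicative hash constants and the position-0 padding choice below are illustrative, not from the PR:

```python
import numpy as np

TABLE, DIM = 3072, 112  # table size and embedding width from this record

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, (TABLE, DIM))

def bigram_embed(tokens):
    # hash each (prev, cur) pair to a table row; position 0 pairs with itself here
    prev = np.concatenate([tokens[:1], tokens[:-1]])
    idx = (prev * 1000003 + tokens * 10007) % TABLE
    return bigram_table[idx]
```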
MLP3x
Three-times-widened MLP with a squared-LeakyReLU activation
parameters: {"layers":11}
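A sketch of the widened block, reading "LeakyReLU squared" as squaring the LeakyReLU output (the PR may handle the sign of the negative branch differently):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp3x(x, w_in, w_out):
    # hidden width is 3x d_model; activation is LeakyReLU squared
    h = leaky_relu(x @ w_in)
    return (h * h) @ w_out
```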
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":16,"base_dimensions":64}
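A sketch of rotating only 16 of the 64 head dimensions and passing the rest through; pairing dim i with dim i + 8 is one common convention, assumed here:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # x: (T, head_dim); rotate only the first rot_dims dims (16 of 64 per this record)
    half = rot_dims // 2
    freqs = 1.0 / base ** (np.arange(half) / half)
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[..., rot_dims:]], axis=-1)
```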
SmearGate
Position-mixing gate
parameters: null
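One plausible reading of "position-mixing gate" is a smear of each position toward its predecessor; the gate would be learned in practice, and the PR does not spell out the exact form:

```python
import numpy as np

def smear_gate(x, gate):
    # y_t = x_t + gate * x_{t-1}; position 0 has no predecessor (zero-padded)
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    return x + gate * prev
```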
U-Net skips
Encoder-decoder skip connections
parameters: null
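A skeleton of the skip pattern, assuming the first half of the stack saves activations that the mirrored second half consumes in reverse order (the mixing here is plain addition; a learned skip weight is another common choice):

```python
def unet_forward(x, layers):
    # encoder half records activations; decoder half adds them back, innermost first
    n = len(layers) // 2
    skips = []
    for f in layers[:n]:
        x = f(x)
        skips.append(x)
    for f in layers[n:]:
        x = f(x + skips.pop())
    return x
```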
KV head count
Attention uses 8 query heads with 4 KV heads (GQA)
parameters: {"gqa_heads":8,"kv_heads":4}
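With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A minimal causal-attention sketch of the grouping:

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (Hq, T, d); k, v: (Hkv, T, d); each KV head serves Hq // Hkv query heads
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    T = q.shape[1]
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), 1), -1e9, scores)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```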
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
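A sketch of maintaining both averages with the listed parameters (EMA decay 0.997 every step, SWA snapshot every 50 steps); how the two are combined at eval time is not stated in the record:

```python
class AveragedWeights:
    def __init__(self, w, ema_decay=0.997, swa_every=50):
        self.ema = dict(w)
        self.swa_sum = {k: 0.0 for k in w}
        self.swa_n = 0
        self.decay = ema_decay
        self.every = swa_every

    def update(self, w, step):
        # EMA every step; SWA accumulates a plain average of periodic snapshots
        for k, v in w.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if step % self.every == 0:
            for k, v in w.items():
                self.swa_sum[k] += v
            self.swa_n += 1

    def swa(self):
        return {k: s / self.swa_n for k, s in self.swa_sum.items()}
```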
Compression
lzma
level: 9
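The artifact is lzma-compressed at the maximum preset; with Python's stdlib that is a one-liner (the payload below is a stand-in, not the real packed weights):

```python
import lzma
import numpy as np

payload = np.zeros(4096, dtype=np.int8)  # stand-in for the packed int6 weight codes
blob = lzma.compress(payload.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```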
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
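The warmdown schedule holds the base LR constant and decays it linearly to zero over the final 4000 iterations; a sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=4000):
    # constant base_lr until the last warmdown_iters steps, then linear decay to 0
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```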
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
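The listed rule scales each layer's norm gain by 1/sqrt(layer+1), damping deeper layers' contributions:

```python
import math

def ln_gain(layer_idx):
    # per-layer norm gain: layer 0 -> 1.0, layer 3 -> 0.5, ...
    return 1.0 / math.sqrt(layer_idx + 1)
```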
Evaluation
sliding window eval
parameters: {"stride":64}
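Sliding-window evaluation scores every token exactly once while giving later tokens fresh left context: the first window scores all its tokens, and each later window slides by the stride and scores only its last `stride` tokens. A sketch of the span bookkeeping (the window length is not given in the record):

```python
def eval_spans(n_tokens, window, stride=64):
    # (context_start, score_start, score_end) triples covering every token once
    spans = [(0, 0, min(window, n_tokens))]
    start = min(window, n_tokens)
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        spans.append((end - window, start, end))
        start = end
    return spans
```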
Other
other
Validation-data GPTQ calibration using forward-only Hessian collection on validation tokens instead of training tokens
parameters: {"calib_batches":64}
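GPTQ's second-order statistic for a linear layer is H = 2·XᵀX over calibration activations, which needs forward passes only; here X comes from validation batches rather than training data. A sketch of the accumulation (the per-token normalization is a convention choice):

```python
import numpy as np

def collect_hessian(batches):
    # accumulate H = 2 * X^T X over forward-only calibration batches;
    # each X holds the activations feeding the weight matrix being quantized
    d = batches[0].shape[1]
    H = np.zeros((d, d))
    n = 0
    for X in batches:
        H += 2.0 * X.T @ X
        n += X.shape[0]
    return H / n
```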
other
Selective ±1 pruning by reconstruction error
parameters: null
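One reading of "selective ±1 pruning" is zeroing quantized codes of ±1 whose contribution to the layer output is negligible, which also yields more zeros for lzma to compress; the PR does not spell out the procedure, so this is a guess:

```python
import numpy as np

def prune_pm1(q, scale, X, tol=0.02):
    # zero out +/-1 int codes whose column contributes less than `tol` of the
    # output energy ||X @ (q * scale)||^2 (greedy; no re-check after each zero)
    q = q.copy()
    base = X @ (q * scale)
    budget = tol * np.sum(base ** 2)
    for i in np.flatnonzero(np.abs(q) == 1):
        delta = X[:, i] * (q[i] * scale)
        if np.sum(delta ** 2) <= budget:
            q[i] = 0
    return q
```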
other
Parallel Muon optimizer with parameter banking and overlapped communication
parameters: {"parameter_banks":4}
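The core of Muon is approximately orthogonalizing each gradient matrix with a quintic Newton-Schulz iteration (coefficients below are from the public Muon implementation); the parameter banking and overlapped communication are distributed-training details not sketched here:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # push the singular values of G toward 1 without computing an SVD
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```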
Novel Contributions
- Validation-data GPTQ calibration to avoid eval-time training-data access
- BigramHash widened to 3072 × 112
- Full Hessian GPTQ int6 quantization with val calibration
- XSA-all stack combined with selective pruning and artifact-budget tuning
- Parallel Muon optimizer context enabling ~6.95k steps in 600s