PR #1405

open

Record: Scylla + GPTQ + BH3072 — val_bpb 1.0856 (3-seed mean)

by anthony-maio
val_bpb
1.0856
Architecture
Transformer
Optimizer
Artifact Size
15.3-15.8 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding with a 3072-bucket hash vocabulary and 112-dimensional representations.
parameters: {"vocab_size":3072,"dimensions":112}
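A minimal sketch of the bucket lookup this entry describes: each (previous token, current token) pair is hashed into one of 3072 buckets, which indexes a 112-dimensional embedding table. The hash function and table initialization below are illustrative assumptions, not the PR's exact scheme.

```python
import numpy as np

VOCAB_HASH = 3072   # bucket count from the record
DIM = 112           # embedding width from the record

rng = np.random.default_rng(0)
# hypothetical bigram embedding table; the real one is learned during training
bigram_table = rng.standard_normal((VOCAB_HASH, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # simple multiplicative mixing hash (illustrative, not the PR's exact hash)
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h & 0xFFFFFFFF) % VOCAB_HASH

def bigram_embed(token_ids):
    # pair each token with its predecessor (bucket 0's partner at the start)
    prev = np.concatenate([[0], token_ids[:-1]])
    buckets = [bigram_bucket(int(p), int(t)) for p, t in zip(prev, token_ids)]
    return bigram_table[buckets]  # (seq_len, DIM)

emb = bigram_embed(np.array([5, 17, 17, 901]))
```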
XSA
Applied XSA across all layers.
parameters: {"layers":11}
VE128
Uses the VE128 architectural component.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
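With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A sketch of that sharing pattern (head dimension and inputs are illustrative):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 16  # head counts from the record; HEAD_DIM is assumed
GROUP = HEADS // KV_HEADS             # 2 query heads share each KV head

def gqa(q, k, v):
    # q: (HEADS, T, HEAD_DIM); k, v: (KV_HEADS, T, HEAD_DIM)
    kv_idx = np.repeat(np.arange(KV_HEADS), GROUP)   # query head -> KV head map
    k_exp, v_exp = k[kv_idx], v[kv_idx]              # broadcast KV to all 8 heads
    scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    scores -= scores.max(axis=-1, keepdims=True)     # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v_exp                             # (HEADS, T, HEAD_DIM)

rng = np.random.default_rng(0)
T = 5
out = gqa(rng.standard_normal((HEADS, T, HEAD_DIM)),
          rng.standard_normal((KV_HEADS, T, HEAD_DIM)),
          rng.standard_normal((KV_HEADS, T, HEAD_DIM)))
```

Halving the KV heads halves the KV cache and the K/V projection weights, which matters under a 16MB artifact budget.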
Partial RoPE
Partial rotary positional embeddings applied to a 16/64 fraction of head dimensions.
parameters: {"numerator":16,"denominator":64}
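One plausible reading of the 16/64 ratio, sketched below: rotate only the first 16 of each 64-dimensional head and pass the remaining dimensions through unchanged. The base frequency and dimension layout are assumptions.

```python
import numpy as np

HEAD_DIM, ROT_DIMS = 64, 16  # 16/64 from the record: rotate 16 dims per 64-dim head

def partial_rope(x, positions):
    # x: (T, HEAD_DIM); rotate the first ROT_DIMS dims, leave the rest as-is
    half = ROT_DIMS // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))  # assumed base 10000
    ang = positions[:, None] * freqs[None, :]            # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIMS:]], axis=-1)

x = np.ones((3, HEAD_DIM))
y = partial_rope(x, np.arange(3))
```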
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
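A minimal sketch of one reading of this activation: a LeakyReLU with negative slope 0.5, followed by squaring. The PR may instead use a sign-preserving square; this is an assumption.

```python
import numpy as np

SLOPE = 0.5  # negative-side slope from the record

def leaky_relu_sq(x):
    # LeakyReLU then square (one plausible reading of "LeakyReLU squared";
    # note plain squaring makes the negative branch non-negative)
    y = np.where(x >= 0, x, SLOPE * x)
    return y * y
```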
SmearGate
Uses the SmearGate architectural component.
parameters: null
U-Net skip connections
U-Net style skip connections in the architecture.
parameters: null
Quantization
GPTQ
bits: 6
scope: all
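For reference, the 6-bit grid alone can be sketched as per-channel round-to-nearest; this is only the quantization grid, not the Hessian-aware GPTQ update described under Other, which additionally compensates each column's rounding error into not-yet-quantized columns.

```python
import numpy as np

BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # symmetric signed grid: [-32, 31]

def quantize_6bit(w):
    # per-output-channel symmetric round-to-nearest onto the 6-bit grid
    # (illustrative baseline; GPTQ improves on this with Hessian information)
    scale = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX)
    return q.astype(np.int8), scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, s = quantize_6bit(w)
w_hat = q * s  # dequantized reconstruction
```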
late QAT
bits: null
scope: all
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
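A sketch of the EMA half of this entry with the recorded decay of 0.997; SWA would instead keep a plain running mean over checkpoints sampled late in training. The dict-of-arrays parameter layout is an assumption.

```python
import numpy as np

EMA_DECAY = 0.997  # decay from the record

def ema_update(shadow, params, decay=EMA_DECAY):
    # exponential moving average of the weights, updated once per step
    return {k: decay * shadow[k] + (1 - decay) * params[k] for k in shadow}

shadow = {"w": np.zeros(3)}
params = {"w": np.ones(3)}
for _ in range(10):
    shadow = ema_update(shadow, params)
# after n steps toward a fixed target of 1: shadow = 1 - decay**n
```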
Compression
lzma
level: 9
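The final artifact is packed with LZMA at its highest preset; a minimal stdlib sketch of that step (the payload is a placeholder):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # LZMA at preset 9, matching the record's level-9 setting
    return lzma.compress(raw, preset=9)

blob = b"weights " * 1000  # placeholder payload
packed = compress_artifact(blob)
restored = lzma.decompress(packed)
```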
Regularization
LN scale
parameters: null
Other
other
Self-generated calibration data used for full-Hessian GPTQ with Cholesky error compensation.
parameters: {"self_gen_seqs":64}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
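A sketch of a warmdown schedule with the recorded 4000 steps: hold the base learning rate, then decay linearly to zero over the final steps. Linear decay and the total-step count are assumptions.

```python
WARMDOWN_STEPS = 4000  # from the record

def lr_at(step, total_steps, base_lr=1.0):
    # hold base_lr, then decay linearly to 0 over the last WARMDOWN_STEPS
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / WARMDOWN_STEPS)
```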

Novel Contributions

  • Scylla tokenizer with 998-vocab TokenMonster, reducing tokens per byte
  • AR self-generated full-Hessian GPTQ with Cholesky error compensation
  • BigramHash 3072x112 combined with VRL and XSA across all 11 layers
  • EMA + SWA, late QAT, and LZMA-9 compression to fit under 16MB
  • No SLOT and no TTT while achieving 1.0856 val_bpb mean over 3 seeds