PR #1405

open

Record: Scylla + GPTQ + BH3072 — val_bpb 1.0856 (3-seed mean)

by anthony-maio
val_bpb
1.0856
Architecture
Transformer
Optimizer
Artifact Size
15.3-15.8 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding with a 3072-bucket hash vocabulary and 112-dimensional representations.
parameters: {"vocab_size":3072,"dimensions":112}
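A minimal sketch of the bucket lookup this entry describes: each (previous token, current token) pair is hashed into one of 3072 buckets, which indexes a 112-dimensional embedding table. The hash function and table initialization below are illustrative assumptions, not the PR's exact scheme.

```python
import numpy as np

VOCAB_HASH = 3072   # bucket count from the record
DIM = 112           # embedding width from the record

rng = np.random.default_rng(0)
# hypothetical bigram embedding table; the real one is learned during training
bigram_table = rng.standard_normal((VOCAB_HASH, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # simple multiplicative mixing hash (illustrative, not the PR's exact hash)
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h & 0xFFFFFFFF) % VOCAB_HASH

def bigram_embed(token_ids):
    # pair each token with its predecessor (bucket 0's partner at the start)
    prev = np.concatenate([[0], token_ids[:-1]])
    buckets = [bigram_bucket(int(p), int(t)) for p, t in zip(prev, token_ids)]
    return bigram_table[buckets]  # (seq_len, DIM)

emb = bigram_embed(np.array([5, 17, 17, 901]))
```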
XSA
Applied XSA across all layers.
parameters: {"layers":11}
VE128
Uses the VE128 architectural component.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
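With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A sketch of that sharing pattern (head dimension and inputs are illustrative):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 16  # head counts from the record; HEAD_DIM is assumed
GROUP = HEADS // KV_HEADS             # 2 query heads share each KV head

def gqa(q, k, v):
    # q: (HEADS, T, HEAD_DIM); k, v: (KV_HEADS, T, HEAD_DIM)
    kv_idx = np.repeat(np.arange(KV_HEADS), GROUP)   # query head -> KV head map
    k_exp, v_exp = k[kv_idx], v[kv_idx]              # broadcast KV to all 8 heads
    scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    scores -= scores.max(axis=-1, keepdims=True)     # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v_exp                             # (HEADS, T, HEAD_DIM)

rng = np.random.default_rng(0)
T = 5
out = gqa(rng.standard_normal((HEADS, T, HEAD_DIM)),
          rng.standard_normal((KV_HEADS, T, HEAD_DIM)),
          rng.standard_normal((KV_HEADS, T, HEAD_DIM)))
```

Halving the KV heads halves the KV cache and the K/V projection weights, which matters under a 16MB artifact budget.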
Partial RoPE
Partial rotary positional embeddings applied to a 16/64 fraction of head dimensions.
parameters: {"numerator":16,"denominator":64}
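One plausible reading of the 16/64 ratio, sketched below: rotate only the first 16 of each 64-dimensional head and pass the remaining dimensions through unchanged. The base frequency and dimension layout are assumptions.

```python
import numpy as np

HEAD_DIM, ROT_DIMS = 64, 16  # 16/64 from the record: rotate 16 dims per 64-dim head

def partial_rope(x, positions):
    # x: (T, HEAD_DIM); rotate the first ROT_DIMS dims, leave the rest as-is
    half = ROT_DIMS // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))  # assumed base 10000
    ang = positions[:, None] * freqs[None, :]            # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIMS:]], axis=-1)

x = np.ones((3, HEAD_DIM))
y = partial_rope(x, np.arange(3))
```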
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
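A minimal sketch of one reading of this activation: a LeakyReLU with negative slope 0.5, followed by squaring. The PR may instead use a sign-preserving square; this is an assumption.

```python
import numpy as np

SLOPE = 0.5  # negative-side slope from the record

def leaky_relu_sq(x):
    # LeakyReLU then square (one plausible reading of "LeakyReLU squared";
    # note plain squaring makes the negative branch non-negative)
    y = np.where(x >= 0, x, SLOPE * x)
    return y * y
```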
SmearGate
Uses the SmearGate architectural component.
parameters: null
U-Net skip connections
U-Net style skip connections in the architecture.
parameters: null
Quantization
GPTQ
bits: 6
scope: all
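For reference, the 6-bit grid alone can be sketched as per-channel round-to-nearest; this is only the quantization grid, not the Hessian-aware GPTQ update described under Other, which additionally compensates each column's rounding error into not-yet-quantized columns.

```python
import numpy as np

BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # symmetric signed grid: [-32, 31]

def quantize_6bit(w):
    # per-output-channel symmetric round-to-nearest onto the 6-bit grid
    # (illustrative baseline; GPTQ improves on this with Hessian information)
    scale = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX)
    return q.astype(np.int8), scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, s = quantize_6bit(w)
w_hat = q * s  # dequantized reconstruction
```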
late QAT
bits: null
scope: all
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
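A sketch of the EMA half of this entry with the recorded decay of 0.997; SWA would instead keep a plain running mean over checkpoints sampled late in training. The dict-of-arrays parameter layout is an assumption.

```python
import numpy as np

EMA_DECAY = 0.997  # decay from the record

def ema_update(shadow, params, decay=EMA_DECAY):
    # exponential moving average of the weights, updated once per step
    return {k: decay * shadow[k] + (1 - decay) * params[k] for k in shadow}

shadow = {"w": np.zeros(3)}
params = {"w": np.ones(3)}
for _ in range(10):
    shadow = ema_update(shadow, params)
# after n steps toward a fixed target of 1: shadow = 1 - decay**n
```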
Compression
lzma
level: 9
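The final artifact is packed with LZMA at its highest preset; a minimal stdlib sketch of that step (the payload is a placeholder):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # LZMA at preset 9, matching the record's level-9 setting
    return lzma.compress(raw, preset=9)

blob = b"weights " * 1000  # placeholder payload
packed = compress_artifact(blob)
restored = lzma.decompress(packed)
```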
Regularization
LN scale
parameters: null
Other
other
Self-generated calibration data used for full-Hessian GPTQ with Cholesky error compensation.
parameters: {"self_gen_seqs":64}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
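A sketch of a warmdown schedule with the recorded 4000 steps: hold the base learning rate, then decay linearly to zero over the final steps. Linear decay and the total-step count are assumptions.

```python
WARMDOWN_STEPS = 4000  # from the record

def lr_at(step, total_steps, base_lr=1.0):
    # hold base_lr, then decay linearly to 0 over the last WARMDOWN_STEPS
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / WARMDOWN_STEPS)
```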

Novel Contributions

  • Scylla tokenizer with 998-vocab TokenMonster, reducing tokens per byte
  • AR self-generated full-Hessian GPTQ with Cholesky error compensation
  • BigramHash 3072x112 combined with VRL and XSA across all 11 layers
  • EMA + SWA, late QAT, and LZMA-9 compression to fit under 16MB
  • No SLOT and no TTT while achieving 1.0856 val_bpb mean over 3 seeds