PR #1184 (open)

Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean)

  • val_bpb: 0.9485
  • Architecture: Transformer
  • Optimizer: Parallel Muon
  • Artifact Size: ~15.6 MB

Training Techniques

Quantization
  • GPTQ: 6-bit, applied to all weights (scope: all)
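A minimal NumPy sketch of the quantization step as described: full-Hessian GPTQ, quantizing columns left to right and pushing each column's error onto the not-yet-quantized columns through the upper Cholesky factor of the inverse Hessian. Only bits=6 and the all-weights scope come from the PR; the damping value, calibration shapes, and symmetric per-column grid are assumptions.

```python
import numpy as np

def quantize_column(w, bits=6):
    # Symmetric uniform quantizer: snap to a per-column grid with 2**bits
    # levels, then map back to float ("fake quantization" for clarity).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq(W, X, bits=6, damp=0.01):
    """Toy full-Hessian GPTQ with Cholesky error compensation.
    W: (rows, cols) layer weights. X: (cols, nsamples) calibration inputs.
    Returns the quantized weights, dequantized back to float."""
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    n = W.shape[1]
    H = X @ X.T                                      # layer Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(n)      # dampening for stability
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2                       # re-symmetrize numerically
    U = np.linalg.cholesky(Hinv).T                   # upper factor: Hinv = U.T @ U
    for j in range(n):
        Q[:, j] = quantize_column(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # Compensate: spread this column's error over the remaining columns.
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```

The Cholesky factorization is done once up front, which is the trick that makes the per-column error propagation cheap.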
Architecture
  • XSA: exclusive self-attention applied to all layers (layers: 11)
  • BigramHash: bigram hash embedding component (vocab_size: 2816, dim: 112)
  • SmearGate: SmearGate gating mechanism (no parameters reported)
  • Partial RoPE: partial rotary positional embeddings (dimensions: 16)
  • LeakyReLU: leaky ReLU squared MLP activation (no parameters reported)
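A hedged sketch of the BigramHash component: each (previous token, current token) pair is hashed into a fixed table of buckets and the looked-up vector is added to the ordinary token embedding. vocab_size=2816 and dim=112 come from the PR's parameter dump; the hash constants and the pad id at position 0 are illustrative, not the PR's.

```python
import numpy as np

BUCKETS, DIM = 2816, 112   # from the PR's parameter dump

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Cheap multiplicative hash of the token pair (illustrative constants).
    return (prev_tok * 1000003 + tok * 8191) % buckets

def bigram_embed(tokens, table):
    """tokens: (seq,) int array; table: (BUCKETS, DIM) float array.
    Returns (seq, DIM) bigram features to add to the token embeddings."""
    prev = np.concatenate([[0], tokens[:-1]])   # pad position 0 with id 0
    return table[bigram_bucket(prev, tokens)]
```

The point of hashing is that the table stays tiny (2816 rows) regardless of the tokenizer's true vocabulary, at the cost of bucket collisions.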
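A sketch of partial RoPE: rotary position embeddings are applied to only the first 16 channels of each head (16 per the parameter dump), leaving the remaining channels position-free. The base of 10000 and the rotate-half pairing are the common convention, assumed here rather than taken from the PR.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """x: (seq, head_dim). Rotates the first rot_dims channels pairwise;
    channels beyond rot_dims pass through untouched."""
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                  # (seq, 1)
    freqs = base ** (-np.arange(half) / half)      # (half,)
    angles = pos * freqs                           # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]      # rotate-half pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```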
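One plausible reading of the "leaky ReLU squared" activation: the squared-ReLU used in several speedrun stacks, with a small linear leak on the negative branch. Both the interpretation and the slope of 0.01 are assumptions; the PR reports no parameters for this component.

```python
import numpy as np

def leaky_relu_squared(x, neg_slope=0.01):
    # Square the positive branch (ReLU^2); keep a small linear leak below zero.
    return np.where(x > 0, x * x, neg_slope * x)
```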
Optimizer
  • Parallel Muon (weight decay, momentum, and other hyperparameters not reported)
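A hedged sketch of the Muon update: momentum SGD whose 2-D update matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied. The quintic coefficients follow the public Muon reference implementation; the learning rate and momentum below are illustrative (the PR reports none), and the "parallel" part (sharding the orthogonalization across devices) is omitted.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # Frobenius-normalize
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One (non-Nesterov) Muon step on a 2-D weight matrix."""
    buf = momentum * buf + grad              # momentum buffer
    W = W - lr * newton_schulz(buf)
    return W, buf
```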
Weight Averaging
  • EMA (decay: 0.997)
  • Tight SWA (interval: 50)
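A sketch of the two weight-averaging pieces: an exponential moving average of the parameters with decay 0.997, plus a "tight SWA" style uniform average refreshed every 50 steps. The decay and interval come from the PR's dump; the snapshot bookkeeping for tight SWA is an assumption.

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    # Standard EMA: avg <- decay * avg + (1 - decay) * params
    return decay * avg + (1.0 - decay) * params

class TightSWA:
    """Running uniform average over checkpoints taken every `interval` steps."""
    def __init__(self, interval=50):
        self.interval = interval
        self.avg = None
        self.count = 0

    def step(self, step_idx, params):
        if step_idx % self.interval != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = params.copy()
        else:
            self.avg += (params - self.avg) / self.count   # incremental mean
```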
Evaluation
  • Sliding window eval (stride: 64)
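A sketch of sliding-window evaluation with stride 64: the sequence is scored in overlapping windows, and each window contributes loss only for its new tokens, so every token after the first window is predicted with near-full left context. The stride comes from the PR; the window size, `nll_fn` stand-in for the model call, and bytes-per-token factor are assumptions.

```python
import math

def sliding_window_bpb(tokens, nll_fn, window=512, stride=64, bytes_per_token=1.0):
    """nll_fn(segment) returns per-token negative log-likelihoods in nats.
    Returns bits-per-byte over the whole token sequence."""
    total_nats, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        trg_len = end - prev_end              # only the new tokens are scored
        nlls = nll_fn(tokens[begin:end])
        total_nats += sum(nlls[-trg_len:])
        prev_end = end
        if end == len(tokens):
            break
    # bits per byte = (total nats / ln 2) / total bytes
    return total_nats / math.log(2) / (len(tokens) * bytes_per_token)
```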
Test-Time Training
  • TTT (disabled for this run)
Regularization
  • LN scale (formula: 1/sqrt(l+1))
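A sketch of the LN-scale regularizer: layer l's normalized activations are multiplied by 1/sqrt(l+1), damping the contribution of deeper layers. The formula is the PR's; treating it as a multiplier on an RMSNorm output (and the 0-based layer index) is an assumption.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm over the last axis (gain-free, for brevity).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def scaled_ln(x, layer_idx):
    # Layer l's normalized output is scaled by 1/sqrt(l + 1).
    return rms_norm(x) / np.sqrt(layer_idx + 1)
```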
Compression
  • lzma (level not reported)
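The ~15.6 MB artifact size refers to the lzma-compressed payload. A minimal round-trip with the standard-library module; the preset level is an assumption since the PR reports none.

```python
import lzma

def compress(payload: bytes, preset: int = 9) -> bytes:
    # preset=9 is the slowest/smallest level; the PR does not report one.
    return lzma.compress(payload, preset=preset)

def decompress(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```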

Novel Contributions

  • Combines the Scylla tokenizer with the modern PR #1060 training stack
  • Uses full Hessian GPTQ with Cholesky error compensation
  • Applies XSA to all 11 layers
  • Uses a coprime-stride multi-shard loader across 194 shards
  • Uses FlashAttention 3 on Hopper GPUs
  • Achieves a new record val_bpb of 0.9485 with 3-seed verification
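The coprime-stride loader mentioned above can be sketched as follows: shards are visited in the order (start + k * stride) mod num_shards with a stride coprime to the shard count, so one pass touches every shard exactly once while decorrelating neighboring shards. The count of 194 comes from the PR; the stride-selection rule is illustrative.

```python
import math

def shard_order(num_shards=194, stride=None, start=0):
    """Return a permutation of shard indices generated by a coprime stride."""
    if stride is None:
        # Illustrative choice: smallest stride > 1 coprime with the shard count.
        stride = next(s for s in range(2, num_shards)
                      if math.gcd(s, num_shards) == 1)
    assert math.gcd(stride, num_shards) == 1, "stride must be coprime"
    return [(start + k * stride) % num_shards for k in range(num_shards)]
```

Because gcd(stride, num_shards) = 1, the walk is a full cycle through the shard indices, which is exactly why every shard is read once per epoch.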