PR #1184
openRecord: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean)
by icryo · View on GitHub
val_bpb: 0.9485
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.6 MB
Training Techniques
Quantization
GPTQ
parameters: {"bits":6,"scope":"all"}
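The core idea behind GPTQ is to quantize a weight matrix one column at a time, pushing each column's rounding error onto the not-yet-quantized columns via the inverse Hessian of the layer inputs. The sketch below is a simplified illustration of that error-compensation loop, not the PR's full-Hessian Cholesky implementation; the damping constant and per-row symmetric 6-bit grid are assumptions.

```python
import numpy as np

def gptq_sketch(W, X, bits=6, damp=0.01):
    """Simplified GPTQ-style quantization: quantize column i, then spread
    its rounding error across remaining columns using the inverse Hessian.
    (Hedged sketch; the PR's version uses Cholesky-based compensation.)"""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6-bit symmetric
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    H = X @ X.T                                    # second-order proxy from calibration inputs
    H += damp * np.mean(np.diag(H)) * np.eye(cols) # damping for invertibility
    Hinv = np.linalg.inv(H)
    for i in range(cols):
        q = np.clip(np.round(W[:, i:i + 1] / scale), -qmax, qmax) * scale
        err = (W[:, i:i + 1] - q) / Hinv[i, i]
        W[:, i:i + 1] = q
        if i + 1 < cols:
            # push the error onto later columns (the OBQ/GPTQ update rule)
            W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]
    return W
```

After the loop, every entry lies on its row's 6-bit grid, while the compensation step keeps the layer's output error far below plain round-to-nearest.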
Architecture
XSA
Exclusive self-attention applied to all layers
parameters: {"layers":11}
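If "exclusive self-attention" means each token attends strictly to earlier positions, excluding itself, the mask is a standard causal mask with the diagonal removed. That reading is an assumption; the PR only states that XSA is applied to all 11 layers. A minimal sketch of such a mask:

```python
def exclusive_causal_mask(seq_len):
    """Hedged sketch: True marks allowed attention edges. Position i may
    attend to positions j < i only (self excluded). Note that position 0
    attends to nothing, so a real implementation needs a fallback (e.g.
    an attention sink or residual pass-through) for the first token."""
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]
```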
BigramHash
Bigram hash embedding component
parameters: {"vocab_size":2816,"dim":112}
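A bigram hash embedding hashes each (previous, current) token pair into a small table and adds the looked-up vector to the token's embedding, giving the model cheap local-context features. The table size (2816) and dimension (112) below mirror the PR's reported parameters; the hash mixing constant is illustrative.

```python
import random

class BigramHashEmbedding:
    """Hedged sketch of a bigram hash embedding component."""

    def __init__(self, table_size=2816, dim=112, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.dim = dim
        # small random init; a real model would learn these vectors
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(table_size)]

    def bucket(self, prev_tok, cur_tok):
        # simple multiplicative mixing; any decent pair hash works here
        return ((prev_tok * 1000003) ^ cur_tok) % self.table_size

    def lookup(self, tokens):
        out, prev = [], 0
        for t in tokens:
            out.append(self.table[self.bucket(prev, t)])
            prev = t
        return out
```

Identical bigrams map to the same bucket, so repeated local patterns share one learned vector.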
SmearGate
SmearGate gating mechanism
parameters: null
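The PR reports no parameters for SmearGate, so the exact form is unclear. One common reading of a "smear" gate is a learned sigmoid gate that blends each position's vector with the previous position's; the scalar gate below is a stand-in for whatever parameterization the PR actually uses.

```python
import math

def smear_gate(xs, gate_logit=-2.0):
    """Hedged sketch: mix each position with the previous position
    through a sigmoid gate (gate_logit is a learned parameter stand-in)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid gate in (0, 1)
    out = [list(xs[0])]                      # first position has no predecessor
    for t in range(1, len(xs)):
        out.append([(1 - g) * a + g * b for a, b in zip(xs[t], xs[t - 1])])
    return out
```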
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
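Partial RoPE rotates only the first few dimensions of each head vector and leaves the rest position-independent. The sketch below uses the PR's reported 16 rotary dimensions; the adjacent-pair layout and base of 10000 are the usual conventions, assumed here.

```python
import math

def partial_rope(vec, pos, rotary_dims=16, base=10000.0):
    """Hedged sketch: apply rotary position embedding to the first
    `rotary_dims` dimensions only, passing the rest through unchanged."""
    out = list(vec)
    for i in range(rotary_dims // 2):
        theta = pos * base ** (-2 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s      # 2D rotation of each dim pair
        out[2 * i + 1] = x * s + y * c
    return out
```

The rotation is norm-preserving on each pair, and dimensions beyond the rotary cut carry no positional signal at all.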
LeakyReLU
Leaky ReLU squared MLP activation
parameters: null
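"ReLU squared" activations square the rectified output; a leaky variant keeps a small gradient on the negative branch. The slope and the exact treatment of the negative branch's sign are assumptions here, since the PR lists no parameters.

```python
def leaky_relu_squared(x, slope=0.01):
    """Hedged sketch of a leaky-ReLU-squared MLP activation:
    square the leaky-ReLU output. Note squaring makes the negative
    branch's output positive; the PR's exact form may differ."""
    y = x if x >= 0 else slope * x
    return y * y
```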
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: {"interval":50}
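An EMA of the weights with decay 0.997 matches the reported parameters; "Tight SWA" with interval 50 is assumed here to mean the averaged weights are snapshotted every 50 steps. A minimal sketch of both:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over a flat parameter list (decay from the PR)."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]

# stand-in training loop: optimizer step, EMA update, snapshot every 50 steps
avg, params, snapshots = [0.0], [0.0], []
for step in range(1, 201):
    params = [p + 0.01 for p in params]   # placeholder optimizer step
    avg = ema_update(avg, params)
    if step % 50 == 0:                    # assumed "tight SWA" cadence
        snapshots.append(list(avg))
```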
Evaluation
sliding window eval
parameters: {"stride":64}
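Sliding-window evaluation slides a fixed context over the sequence with a short stride and scores only the last `stride` tokens of each window, so nearly every token is evaluated with close-to-full context. The sketch below records which positions each window would score (stride 64 matches the PR; here it just returns the scored positions, and any tail shorter than one stride is ignored for simplicity).

```python
def sliding_window_positions(num_tokens, ctx_len, stride=64):
    """Hedged sketch: positions scored by stride-`stride` sliding-window
    eval. The first window scores all its tokens; each later window
    scores only its last `stride` tokens."""
    scored = []
    for start in range(0, max(num_tokens - ctx_len, 0) + 1, stride):
        new = stride if start > 0 else ctx_len
        scored.extend(range(start + ctx_len - new, start + ctx_len))
    return scored
```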
Test-Time Training
TTT
parameters: {"enabled":false}
Regularization
LN scale
parameters: {"formula":"1/sqrt(l+1)"}
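The reported regularizer scales each layer's LayerNorm by 1/sqrt(l+1). Assuming l is the 0-indexed layer depth, the scale is:

```python
import math

def ln_scale(layer_index):
    """Per-layer LayerNorm scale, 1/sqrt(l+1) per the PR's formula
    (0-indexing of l is an assumption)."""
    return 1.0 / math.sqrt(layer_index + 1)
```

Deeper layers get progressively smaller scales, damping the growth of the residual stream with depth.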
Compression
lzma
level: null
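With `level` unset, the artifact would be packed with the compressor's default preset. A minimal round-trip using Python's stdlib `lzma` (the PR's actual packing code is not shown; this is just the stdlib call):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Pack raw model bytes with lzma's default preset (level unset)."""
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original bytes."""
    return lzma.decompress(blob)
```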
Novel Contributions
- Combines the Scylla tokenizer with the modern PR #1060 training stack
- Uses full Hessian GPTQ with Cholesky error compensation
- Applies XSA to all 11 layers
- Uses a coprime-stride multi-shard loader across 194 shards
- Uses FlashAttention 3 on Hopper GPUs
- Achieves a new record val_bpb of 0.9485 with 3-seed verification
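The coprime-stride loader mentioned above can be sketched as follows: stepping through shard indices with a stride coprime to the shard count visits every shard exactly once per epoch in a scrambled order, with no shuffle buffer. The shard count of 194 matches the PR; the stride selection below is illustrative.

```python
import math

def coprime_stride_order(num_shards=194, start_stride=7):
    """Hedged sketch of a coprime-stride shard visit order: because
    gcd(stride, num_shards) == 1, the walk (i * stride) mod num_shards
    is a permutation of all shard indices."""
    stride = start_stride
    while math.gcd(stride, num_shards) != 1:
        stride += 1
    return [(i * stride) % num_shards for i in range(num_shards)]
```

Varying the stride (or the starting offset) per epoch gives a different deterministic permutation each pass without any coordination between workers.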