PR #1184 (open)

Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean)

  • val_bpb: 0.9485
  • Architecture: Transformer
  • Optimizer: Parallel Muon
  • Artifact Size: ~15.6 MB

Training Techniques

Quantization
  • GPTQ: 6-bit, applied to all weights (scope: all)
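A minimal NumPy sketch of the quantization step as described: full-Hessian GPTQ, quantizing columns left to right and pushing each column's error onto the not-yet-quantized columns through the upper Cholesky factor of the inverse Hessian. Only bits=6 and the all-weights scope come from the PR; the damping value, calibration shapes, and symmetric per-column grid are assumptions.

```python
import numpy as np

def quantize_column(w, bits=6):
    # Symmetric uniform quantizer: snap to a per-column grid with 2**bits
    # levels, then map back to float ("fake quantization" for clarity).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq(W, X, bits=6, damp=0.01):
    """Toy full-Hessian GPTQ with Cholesky error compensation.
    W: (rows, cols) layer weights. X: (cols, nsamples) calibration inputs.
    Returns the quantized weights, dequantized back to float."""
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    n = W.shape[1]
    H = X @ X.T                                      # layer Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(n)      # dampening for stability
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2                       # re-symmetrize numerically
    U = np.linalg.cholesky(Hinv).T                   # upper factor: Hinv = U.T @ U
    for j in range(n):
        Q[:, j] = quantize_column(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # Compensate: spread this column's error over the remaining columns.
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```

The Cholesky factorization is done once up front, which is the trick that makes the per-column error propagation cheap.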
Architecture
  • XSA: exclusive self-attention applied to all layers (layers: 11)
  • BigramHash: bigram hash embedding component (vocab_size: 2816, dim: 112)
  • SmearGate: SmearGate gating mechanism (no parameters reported)
  • Partial RoPE: partial rotary positional embeddings (dimensions: 16)
  • LeakyReLU: leaky ReLU squared MLP activation (no parameters reported)
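A hedged sketch of the BigramHash component: each (previous token, current token) pair is hashed into a fixed table of buckets and the looked-up vector is added to the ordinary token embedding. vocab_size=2816 and dim=112 come from the PR's parameter dump; the hash constants and the pad id at position 0 are illustrative, not the PR's.

```python
import numpy as np

BUCKETS, DIM = 2816, 112   # from the PR's parameter dump

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Cheap multiplicative hash of the token pair (illustrative constants).
    return (prev_tok * 1000003 + tok * 8191) % buckets

def bigram_embed(tokens, table):
    """tokens: (seq,) int array; table: (BUCKETS, DIM) float array.
    Returns (seq, DIM) bigram features to add to the token embeddings."""
    prev = np.concatenate([[0], tokens[:-1]])   # pad position 0 with id 0
    return table[bigram_bucket(prev, tokens)]
```

The point of hashing is that the table stays tiny (2816 rows) regardless of the tokenizer's true vocabulary, at the cost of bucket collisions.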
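A sketch of partial RoPE: rotary position embeddings are applied to only the first 16 channels of each head (16 per the parameter dump), leaving the remaining channels position-free. The base of 10000 and the rotate-half pairing are the common convention, assumed here rather than taken from the PR.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """x: (seq, head_dim). Rotates the first rot_dims channels pairwise;
    channels beyond rot_dims pass through untouched."""
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                  # (seq, 1)
    freqs = base ** (-np.arange(half) / half)      # (half,)
    angles = pos * freqs                           # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]      # rotate-half pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```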
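One plausible reading of the "leaky ReLU squared" activation: the squared-ReLU used in several speedrun stacks, with a small linear leak on the negative branch. Both the interpretation and the slope of 0.01 are assumptions; the PR reports no parameters for this component.

```python
import numpy as np

def leaky_relu_squared(x, neg_slope=0.01):
    # Square the positive branch (ReLU^2); keep a small linear leak below zero.
    return np.where(x > 0, x * x, neg_slope * x)
```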
Optimizer
  • Parallel Muon (weight decay, momentum, and other hyperparameters not reported)
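A hedged sketch of the Muon update: momentum SGD whose 2-D update matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied. The quintic coefficients follow the public Muon reference implementation; the learning rate and momentum below are illustrative (the PR reports none), and the "parallel" part (sharding the orthogonalization across devices) is omitted.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # Frobenius-normalize
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One (non-Nesterov) Muon step on a 2-D weight matrix."""
    buf = momentum * buf + grad              # momentum buffer
    W = W - lr * newton_schulz(buf)
    return W, buf
```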
Weight Averaging
  • EMA (decay: 0.997)
  • Tight SWA (interval: 50)
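A sketch of the two weight-averaging pieces: an exponential moving average of the parameters with decay 0.997, plus a "tight SWA" style uniform average refreshed every 50 steps. The decay and interval come from the PR's dump; the snapshot bookkeeping for tight SWA is an assumption.

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    # Standard EMA: avg <- decay * avg + (1 - decay) * params
    return decay * avg + (1.0 - decay) * params

class TightSWA:
    """Running uniform average over checkpoints taken every `interval` steps."""
    def __init__(self, interval=50):
        self.interval = interval
        self.avg = None
        self.count = 0

    def step(self, step_idx, params):
        if step_idx % self.interval != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = params.copy()
        else:
            self.avg += (params - self.avg) / self.count   # incremental mean
```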
Evaluation
  • Sliding window eval (stride: 64)
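A sketch of sliding-window evaluation with stride 64: the sequence is scored in overlapping windows, and each window contributes loss only for its new tokens, so every token after the first window is predicted with near-full left context. The stride comes from the PR; the window size, `nll_fn` stand-in for the model call, and bytes-per-token factor are assumptions.

```python
import math

def sliding_window_bpb(tokens, nll_fn, window=512, stride=64, bytes_per_token=1.0):
    """nll_fn(segment) returns per-token negative log-likelihoods in nats.
    Returns bits-per-byte over the whole token sequence."""
    total_nats, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        trg_len = end - prev_end              # only the new tokens are scored
        nlls = nll_fn(tokens[begin:end])
        total_nats += sum(nlls[-trg_len:])
        prev_end = end
        if end == len(tokens):
            break
    # bits per byte = (total nats / ln 2) / total bytes
    return total_nats / math.log(2) / (len(tokens) * bytes_per_token)
```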
Test-Time Training
  • TTT (disabled for this run)
Regularization
  • LN scale (formula: 1/sqrt(l+1))
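A sketch of the LN-scale regularizer: layer l's normalized activations are multiplied by 1/sqrt(l+1), damping the contribution of deeper layers. The formula is the PR's; treating it as a multiplier on an RMSNorm output (and the 0-based layer index) is an assumption.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm over the last axis (gain-free, for brevity).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def scaled_ln(x, layer_idx):
    # Layer l's normalized output is scaled by 1/sqrt(l + 1).
    return rms_norm(x) / np.sqrt(layer_idx + 1)
```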
Compression
  • lzma (level not reported)
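The ~15.6 MB artifact size refers to the lzma-compressed payload. A minimal round-trip with the standard-library module; the preset level is an assumption since the PR reports none.

```python
import lzma

def compress(payload: bytes, preset: int = 9) -> bytes:
    # preset=9 is the slowest/smallest level; the PR does not report one.
    return lzma.compress(payload, preset=preset)

def decompress(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```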

Novel Contributions

  • Combines the Scylla tokenizer with the modern PR #1060 training stack
  • Uses full Hessian GPTQ with Cholesky error compensation
  • Applies XSA to all 11 layers
  • Uses a coprime-stride multi-shard loader across 194 shards
  • Uses FlashAttention 3 on Hopper GPUs
  • Achieves a new record val_bpb of 0.9485 with 3-seed verification
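The coprime-stride loader mentioned above can be sketched as follows: shards are visited in the order (start + k * stride) mod num_shards with a stride coprime to the shard count, so one pass touches every shard exactly once while decorrelating neighboring shards. The count of 194 comes from the PR; the stride-selection rule is illustrative.

```python
import math

def shard_order(num_shards=194, stride=None, start=0):
    """Return a permutation of shard indices generated by a coprime stride."""
    if stride is None:
        # Illustrative choice: smallest stride > 1 coprime with the shard count.
        stride = next(s for s in range(2, num_shards)
                      if math.gcd(s, num_shards) == 1)
    assert math.gcd(stride, num_shards) == 1, "stride must be coprime"
    return [(start + k * stride) % num_shards for k in range(num_shards)]
```

Because gcd(stride, num_shards) = 1, the walk is a full cycle through the shard indices, which is exactly why every shard is read once per epoch.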