PR #1271

open

Non-record: Scylla Tokenizer Byte Accounting Audit — Sub-1.0 Was a Measurement Error

by andrewbaggio1
val_bpb
1.1289
Architecture
Transformer
Optimizer
Artifact Size

Training Techniques

Quantization
GPTQ
bits: null
scope: weights
Architecture
BigramHash
Uses a bigram vocabulary/embedding component in the tokenizer/model stack.
parameters: {"vocab_size":2816,"dim":112}
XSA
Uses XSA as part of the model architecture.
parameters: {"last_n":11}
TTT
Test-time training is explicitly disabled.
parameters: {"enabled":0}
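The BigramHash card above lists `vocab_size: 2816` and `dim: 112`. The PR does not describe the component's internals, but a hashed bigram embedding of that shape could look like the sketch below; only the table size and embedding width come from the metadata, while the BOS convention and the hash mixing constant are illustrative assumptions:

```python
import numpy as np

VOCAB_SIZE = 2816   # from parameters: {"vocab_size": 2816, ...}
DIM = 112           # from parameters: {..., "dim": 112}

rng = np.random.default_rng(0)
table = rng.standard_normal((VOCAB_SIZE, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Hash a (prev, cur) token pair into one of VOCAB_SIZE buckets.
    The multiplicative constant is illustrative, not from the PR."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % VOCAB_SIZE

def bigram_embed(tokens):
    """One bigram embedding per position; position 0 pairs with an
    assumed BOS id of 0."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]  # shape (len(tokens), DIM)

emb = bigram_embed([5, 17, 999])
print(emb.shape)  # (3, 112)
```

A hashed table like this keeps the bigram vocabulary fixed at 2816 buckets regardless of how many distinct token pairs occur, at the cost of hash collisions between rare bigrams.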
Sequence Length
train_length: null
eval_length: null

Novel Contributions

  • Audits PR #1184's Scylla tokenizer byte accounting
  • Identifies a bug in candidate.meta.npz where 27 byte-fallback tokens use base_bytes=3 instead of 1
  • Shows the reported sub-1.0 BPB was a measurement error caused by incorrect byte denominator accounting
  • Recomputes validation with corrected meta and proper train/val split
  • Provides corrected_meta.npz and retokenize_proper.py for reproduction
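The audit's core point is arithmetic: BPB divides total loss (in bits) by total bytes, so recording `base_bytes=3` for byte-fallback tokens that actually cover 1 byte inflates the denominator and deflates the reported BPB. A toy illustration of that effect (the loss values and token counts here are made up; only the 3-vs-1 `base_bytes` discrepancy comes from the audit):

```python
import math

# Toy per-token cross-entropy losses in nats, and each token's byte length.
losses_nats = [2.0] * 100           # made-up uniform loss over 100 tokens
true_bytes  = [4] * 73 + [1] * 27   # 27 byte-fallback tokens really cover 1 byte each
buggy_bytes = [4] * 73 + [3] * 27   # the audited meta recorded base_bytes=3 for them

def bpb(losses, byte_counts):
    """Bits-per-byte: total loss converted from nats to bits,
    divided by the total number of bytes covered."""
    return sum(losses) / math.log(2) / sum(byte_counts)

print(f"buggy BPB:     {bpb(losses_nats, buggy_bytes):.4f}")
print(f"corrected BPB: {bpb(losses_nats, true_bytes):.4f}")
# The buggy denominator is larger, so the buggy BPB is artificially lower.
```

This is why fixing the meta pushes the result back above 1.0: the model's loss is unchanged, but it is amortized over fewer (correctly counted) bytes.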