PR #1271
Status: open
Non-record: Scylla Tokenizer Byte Accounting Audit — Sub-1.0 Was a Measurement Error
by andrewbaggio1
val_bpb
1.1289
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Quantization
GPTQ
bits: null
scope: weights
Architecture
BigramHash
Uses a bigram vocabulary/embedding component in the tokenizer/model stack.
parameters: {"vocab_size":2816,"dim":112}
XSA
Uses XSA as part of the model architecture.
parameters: {"last_n":11}
TTT
Test-time training is explicitly disabled.
parameters: {"enabled":0}
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- Audits PR #1184's Scylla tokenizer byte accounting
- Identifies a bug in candidate.meta.npz: 27 byte-fallback tokens are credited with base_bytes=3 instead of the correct value of 1
- Shows the reported sub-1.0 BPB was a measurement error: the inflated per-token byte counts overstated the byte denominator, deflating the BPB metric
- Recomputes validation with corrected meta and proper train/val split
- Provides corrected_meta.npz and retokenize_proper.py for reproduction
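The mechanism behind the measurement error can be illustrated with a small sketch. BPB divides summed cross-entropy (in nats) by ln(2) times the number of bytes decoded, so over-crediting byte-fallback tokens with 3 bytes instead of 1 inflates the denominator and pulls BPB down. All numbers below are made up for illustration; they are not the actual statistics from PR #1184 or this audit.

```python
import numpy as np

def bpb(total_loss_nats, total_bytes):
    """Bits per byte: cross-entropy (nats) / (ln 2 * bytes decoded)."""
    return total_loss_nats / (np.log(2) * total_bytes)

# Hypothetical validation run: 10,000 true bytes, of which 1,000
# positions were byte-fallback tokens. The buggy meta credits each
# fallback token with base_bytes=3 rather than 1, adding 2 phantom
# bytes per fallback hit to the denominator.
loss_nats = 7800.0
real_bytes = 10_000
buggy_bytes = real_bytes + 1_000 * (3 - 1)

print(f"buggy BPB:     {bpb(loss_nats, buggy_bytes):.4f}")  # below 1.0
print(f"corrected BPB: {bpb(loss_nats, real_bytes):.4f}")   # above 1.0
```

With these toy numbers the buggy denominator reports a sub-1.0 BPB while the corrected byte count puts the same loss back above 1.0, which is the shape of the discrepancy this PR describes.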