PR #1327

open

Record: BESE Tokenizer 287 vocab — 1.1276 BPB

val_bpb: 1.1276
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

Architecture
  • XSA: applied across all 11 layers of the referenced architecture. parameters: {"layers":11}
  • BigramHash: component used in the model architecture. parameters: {"dimensions":"3072x112"}
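The record does not explain BigramHash further. One plausible reading (an assumption, not confirmed by this PR) of the "3072x112" dimensions is a hashed bigram-embedding table of 3072 buckets by 112 dimensions, where each (previous token, current token) pair is hashed to a bucket and the looked-up vector is added to the ordinary token embedding:

```python
import numpy as np

NUM_BUCKETS, DIM = 3072, 112  # reading "3072x112" as buckets x dims (assumption)

rng = np.random.default_rng(0)
table = rng.normal(0.0, 0.02, size=(NUM_BUCKETS, DIM))

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Cheap mixing hash over the token pair; the actual hash is not specified.
    return ((prev_tok * 1000003) ^ cur_tok) % NUM_BUCKETS

def bigram_embeddings(token_ids: list[int]) -> np.ndarray:
    # Position 0 has no predecessor; reuse the token itself as a stand-in.
    prevs = [token_ids[0]] + token_ids[:-1]
    idx = [bigram_bucket(p, c) for p, c in zip(prevs, token_ids)]
    return table[idx]  # (seq_len, DIM), to be added to the token embeddings

emb = bigram_embeddings([5, 17, 17, 200])
```

The hash keeps the table size fixed regardless of vocabulary size, at the cost of occasional bucket collisions.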
Quantization
  • GPTQ: bits: 6, scope: all
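GPTQ proper compensates rounding error column-by-column using Hessian information; that machinery is out of scope here. As a hedged stand-in, the sketch below shows only the 6-bit bookkeeping via plain symmetric round-to-nearest quantization per output row, explicitly not the GPTQ algorithm itself:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 6):
    """Symmetric per-row round-to-nearest quantization (NOT GPTQ, which
    additionally corrects rounding error with second-order information)."""
    qmax = 2 ** (bits - 1) - 1            # 31 for signed 6-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0               # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_rtn(w, bits=6)
err = np.abs(dequantize(q, s) - w).max()  # bounded by scale / 2 per row
```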
Optimizer
  • Muon: weight_decay: 0.04, momentum: null, other_params: null
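The record lists only the weight decay. For context, Muon applies momentum and then orthogonalizes each 2-D update with a Newton-Schulz iteration; the sketch below shows that orthogonalization step using the widely circulated quintic coefficients, which are an assumption here rather than a detail of this PR:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G, pushing its singular values toward 1.
    Coefficients follow the commonly used Muon quintic iteration (assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # Frobenius-normalize so spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                         # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
U = newton_schulz_orth(rng.normal(size=(16, 32)))
sv = np.linalg.svd(U, compute_uv=False)   # singular values cluster near 1
```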
Weight Averaging
  • EMA: parameters: {"decay":0.997}
  • SWA: parameters: {"every_steps":50}
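Combining both averages is straightforward: keep an EMA copy updated every step with decay 0.997, and fold a snapshot into the SWA running mean every 50 steps. A minimal sketch (class and variable names are illustrative):

```python
import numpy as np

class AveragedWeights:
    """Track an EMA copy (decay 0.997) and an SWA mean snapshotted every 50 steps."""
    def __init__(self, w: np.ndarray, decay: float = 0.997, every_steps: int = 50):
        self.decay, self.every_steps = decay, every_steps
        self.ema = w.copy()
        self.swa_sum = np.zeros_like(w)
        self.swa_count = 0

    def update(self, w: np.ndarray, step: int) -> None:
        self.ema = self.decay * self.ema + (1.0 - self.decay) * w
        if step % self.every_steps == 0:
            self.swa_sum += w
            self.swa_count += 1

    @property
    def swa(self) -> np.ndarray:
        return self.swa_sum / max(self.swa_count, 1)

avg = AveragedWeights(np.zeros(3))
for step in range(1, 101):               # pretend the optimizer produced these weights
    avg.update(np.full(3, float(step)), step)
# SWA averages the snapshots at steps 50 and 100: (50 + 100) / 2 = 75
```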
Compression
  • lzma: level: 9
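The 15.3 MB artifact size is presumably measured after compression. A minimal sketch with Python's standard-library lzma at preset 9 (the weight buffer here is a stand-in):

```python
import lzma

import numpy as np

weights = np.zeros(100_000, dtype=np.float16)   # stand-in for real model weights
raw = weights.tobytes()
packed = lzma.compress(raw, preset=9)           # level 9, as in the record

# Lossless round trip: decompressed bytes reconstruct the original array.
restored = np.frombuffer(lzma.decompress(packed), dtype=np.float16)
```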
Evaluation
  • sliding window eval: parameters: {"stride":64}
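Sliding-window evaluation advances the context window by the stride and scores only the tokens new to each window, so every token is scored exactly once with up to a full window of left context. The index bookkeeping can be sketched as follows (the model itself is stubbed out; only the spans are computed):

```python
def sliding_window_spans(seq_len: int, window: int, stride: int = 64):
    """Yield (ctx_start, ctx_end, score_start) triples: the model sees tokens
    [ctx_start, ctx_end) but only [score_start, ctx_end) contribute to the loss."""
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, prev_end))   # score tokens [prev_end, end)
        prev_end = end
        if end == seq_len:
            break
    return spans

spans = sliding_window_spans(seq_len=300, window=128, stride=64)
scored = sum(end - score_start for _, end, score_start in spans)
```

Smaller strides give each scored token more context at the cost of more forward passes.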
Sequence Length
  • sequence_length: train_length: null, eval_length: null

Novel Contributions

  • Novel BESE tokenizer replacing SentencePiece-1024
  • 287-token vocabulary built from a 38-token structured base alphabet plus 249 BPE merges
  • Tokenizer design inspired by T9 phone keyboards, Huffman coding, Bionic Reading, and hierarchical encoding
  • Custom data preparation pipeline that decodes SentencePiece shards, trains BPE, and re-encodes shards
  • Tokenizer-agnostic BPB verification for custom-tokenizer evaluation
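Bits-per-byte is tokenizer-agnostic because it normalizes the summed negative log-likelihood by the byte count of the text rather than the token count, so a custom tokenizer cannot improve the metric merely by changing how the text is segmented. A minimal sketch, assuming the model's loss is reported in nats per token:

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Total NLL in nats over the text, normalized by its UTF-8 byte count.
    Dividing by ln(2) converts nats to bits; dividing by bytes removes any
    dependence on tokenization."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# If a model assigns a total NLL of 8 * ln(2) nats to an 8-byte string,
# that is exactly 1 bit per byte, regardless of how many tokens it used.
bpb = bits_per_byte(8 * math.log(2), "8 bytes!")
```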