val_bpb: 1.1276
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB
Training Techniques
Architecture
- XSA: applied across all 11 layers of the referenced architecture (parameters: {"layers": 11})
- BigramHash: component used in the model architecture (parameters: {"dimensions": "3072x112"})
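The submission does not spell out how BigramHash works, so the following is only one plausible reading of the "3072x112" dimensions: a hashed bigram-embedding table with 3072 buckets of 112 dimensions, where each adjacent token pair is hashed into a bucket and that row contributes a feature for the current position. Everything here (function name, hashing scheme, how the rows are consumed) is an assumption for illustration.

```python
def bigram_hash_embed(tokens, table, n_buckets=3072):
    """Hypothetical sketch of a hash-bucketed bigram feature.

    Each (previous, current) token pair is hashed into one of
    n_buckets rows of `table` (assumed 3072 x 112), and that row is
    returned as the bigram feature for the current position. A
    sentinel of -1 stands in for "no previous token".
    """
    features = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else -1
        bucket = hash((prev, tok)) % n_buckets  # deterministic for ints
        features.append(table[bucket])
    return features
```

A real implementation would add these rows to the ordinary token embeddings; the sketch only shows the bucket lookup.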
Quantization
- GPTQ: bits: 6, scope: all
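GPTQ proper picks each quantized weight via Hessian-weighted error compensation, solved layer by layer; the sketch below is not GPTQ, only the signed 6-bit round-to-nearest grid that such quantized weights end up on, as a simplified stand-in for what "bits: 6, scope: all" means for every tensor.

```python
def quantize_rtn(weights, bits=6):
    """Per-tensor symmetric round-to-nearest onto a signed 2**bits grid.

    GPTQ chooses the rounding more carefully (compensating each
    column's error using second-order information), but its outputs
    live on this same integer grid: [-32, 31] for 6 bits.
    """
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map grid integers back to floats."""
    return [v * scale for v in q]
```

With round-to-nearest, every reconstructed weight is within half a quantization step (scale / 2) of the original.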
Optimizer
- Muon: weight_decay: 0.04, momentum: null, other_params: null
Weight Averaging
- EMA: parameters: {"decay": 0.997}
- SWA: parameters: {"every_steps": 50}
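The two averaging schemes above can be sketched as follows, with parameters kept as plain floats in a dict for illustration (a real run would operate on tensors). EMA blends every step with decay 0.997; SWA keeps an equal-weight running mean of snapshots taken every 50 steps.

```python
import copy

def ema_update(avg_params, model_params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * current."""
    for k in avg_params:
        avg_params[k] = decay * avg_params[k] + (1.0 - decay) * model_params[k]
    return avg_params

class SWA:
    """Equal-weight average of snapshots taken every `every_steps` steps."""

    def __init__(self, every_steps=50):
        self.every_steps = every_steps
        self.avg = None   # running mean of snapshots
        self.n = 0        # number of snapshots averaged so far

    def maybe_update(self, step, model_params):
        if step % self.every_steps != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = copy.deepcopy(model_params)
        else:
            # incremental mean: avg += (x - avg) / n
            for k in self.avg:
                self.avg[k] += (model_params[k] - self.avg[k]) / self.n
```

How the two averages are combined (or whether one is applied on top of the other) is not specified in the submission.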
Compression
- lzma: level: 9
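A minimal sketch of the compression step, assuming the serialized weights arrive as raw bytes; `preset=9` in Python's standard-library `lzma` module corresponds to the level above.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress serialized weights with LZMA at its maximum preset (9)."""
    return lzma.compress(raw, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original serialized bytes."""
    return lzma.decompress(blob)
```

The 15.3 MB artifact size above would be measured on the compressed blob.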
Evaluation
- sliding window eval: parameters: {"stride": 64}
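A sketch of stride-64 sliding-window evaluation, assuming the common scheme where each window after the first re-scores only its final `stride` tokens, so every token is scored exactly once with up to `window - stride` tokens of context. `nll_fn` is a hypothetical stand-in for the model, returning one NLL per token of its input; the window length here is illustrative, since the submission leaves eval_length unspecified.

```python
def sliding_window_nll(nll_fn, tokens, window=256, stride=64):
    """Sum per-token NLL over `tokens`, scoring each token exactly once.

    The first window scores all of its tokens; every later window
    slides forward by `stride` and keeps only the NLLs of its last
    `stride` tokens, using the rest purely as context.
    """
    total = 0.0
    scored_upto = 0
    while scored_upto < len(tokens):
        end = min(window if scored_upto == 0 else scored_upto + stride,
                  len(tokens))
        start = max(0, end - window)
        nlls = nll_fn(tokens[start:end])
        total += sum(nlls[scored_upto - start:])  # only unscored positions
        scored_upto = end
    return total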
Sequence Length
- train_length: null
- eval_length: null
Novel Contributions
- Novel BESE tokenizer replacing SentencePiece-1024
- 287-token vocabulary built from a 38-token structured base alphabet plus 249 BPE merges
- Tokenizer design inspired by T9 phone keyboards, Huffman coding, Bionic Reading, and hierarchical encoding
- Custom data preparation pipeline that decodes SentencePiece shards, trains BPE, and re-encodes shards
- Tokenizer-agnostic BPB verification for custom-tokenizer evaluation
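The tokenizer-agnostic BPB check in the last bullet reduces to one formula: the model's total negative log-likelihood over the text (however that text was tokenized) divided by the raw UTF-8 byte count, converted to bits. Because the denominator never depends on the tokenization, scores are comparable between the BESE tokenizer and SentencePiece-1024. A minimal sketch:

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Tokenizer-agnostic bits-per-byte.

    `total_nll_nats` is the summed NLL over all tokens of the eval text
    (in nats, from any tokenizer); `total_utf8_bytes` is the byte count
    of the raw text. Dividing by ln(2) converts nats to bits.
    """
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```

A 287-token vocabulary spends fewer bytes per token than a 1024-token one, so comparing per-token losses directly would be misleading; normalizing by bytes removes that bias.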