PR #293
Non-record: Custom sp4096 BPE Tokenizer (1.2827 BPB on 1×H100)
by Nishu2000-hub
val_bpb
1.2827
Architecture
Transformer
Optimizer
—
Artifact Size
14.78 MB
Training Techniques
Architecture
tokenizer/vocabulary size
Replaced the provided 1024-vocab tokenizer with a custom 4096-vocab BPE SentencePiece tokenizer trained on FineWeb documents, reducing tokens per byte and improving BPB.
parameters: {"vocab_size":4096}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Trained a custom BPE SentencePiece tokenizer on 2 million FineWeb documents using the same normalization and byte fallback settings as the baseline.
parameters: {"training_docs":2000000}
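The PR trains its tokenizer with SentencePiece (model_type BPE, byte fallback, vocab_size 4096); the actual training script is not shown here. As a stdlib-only illustration of the underlying BPE algorithm — a toy sketch, not the SentencePiece implementation — a byte-level merge loop looks like:

```python
from collections import Counter

def train_bpe(texts, vocab_size):
    """Toy byte-level BPE: start from the 256 raw byte ids, then greedily
    merge the most frequent adjacent pair until vocab_size is reached."""
    corpus = [list(t.encode("utf-8")) for t in texts]
    merges = {}          # (left_id, right_id) -> new merged id
    next_id = 256        # ids >= 256 are learned merges
    while next_id < vocab_size:
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges[best] = next_id
        # Apply the chosen merge across the whole corpus.
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
        next_id += 1
    return merges

# Four merges learned on top of the 256 base byte tokens.
merges = train_bpe(["low lower lowest", "low low low"], vocab_size=260)
print(len(merges))  # 4
```

The real run replaces the toy corpus with 2M FineWeb documents and sets the target vocabulary to 4096.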
other
Preprocessed FineWeb training shards with the custom tokenizer and produced binary shards in the same format as the official pipeline.
parameters: {"num_shards":10}
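The official shard format is not reproduced in the PR text; the sketch below only illustrates the general shape of such a preprocessing step — packing token ids into fixed-size binary shards — under the assumption that 16-bit ids suffice (a 4096-entry vocab easily fits in uint16). File naming and layout here are hypothetical.

```python
import array
import pathlib
import tempfile

def write_shards(token_streams, shard_dir, tokens_per_shard):
    """Pack token ids into fixed-size binary shards of native-endian
    uint16 values (2 bytes per token; fine for a 4096-entry vocab)."""
    shard_dir = pathlib.Path(shard_dir)
    shard_dir.mkdir(parents=True, exist_ok=True)
    buf, shard_idx, paths = array.array("H"), 0, []

    def flush():
        nonlocal buf, shard_idx
        path = shard_dir / f"shard_{shard_idx:04d}.bin"  # hypothetical naming
        with open(path, "wb") as f:
            buf.tofile(f)
        paths.append(path)
        buf = array.array("H")
        shard_idx += 1

    for tokens in token_streams:
        for t in tokens:
            buf.append(t)
            if len(buf) == tokens_per_shard:
                flush()
    if buf:          # final partial shard
        flush()
    return paths

paths = write_shards([[1, 2, 3, 4095]], tempfile.mkdtemp(), tokens_per_shard=2)
print(len(paths), paths[0].stat().st_size)  # 2 shards, 4 bytes each
```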
other
Reduced model depth from 9 layers to 8 layers to stay under the 16 MB artifact limit after increasing the embedding table size.
parameters: {"layers":8}
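The trade-off behind this change can be sketched with rough parameter arithmetic. The model width below is a hypothetical placeholder (the PR does not state it), and the per-block count uses the standard 12·d² approximation for a transformer block; the point is only that a larger embedding table can be paid for by dropping a layer.

```python
def mb(params, bytes_per_param=4):
    """Size of a parameter count in MB, assuming fp32 storage."""
    return params * bytes_per_param / 2**20

d = 384                      # hypothetical model width (not stated in the PR)
old_vocab, new_vocab = 1024, 4096

# Growing the embedding table adds (4096 - 1024) * d parameters.
extra_embed = (new_vocab - old_vocab) * d

# One transformer block is roughly 12*d^2 parameters:
# 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x MLP.
block = 12 * d * d

print(f"embedding growth:  +{mb(extra_embed):.2f} MB (fp32)")
print(f"one block removed: -{mb(block):.2f} MB (fp32)")
```

Under these illustrative numbers, removing one block (~6.75 MB fp32) more than offsets the embedding growth (~4.50 MB fp32), keeping the artifact under budget.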
Compression
zlib
level: null
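The compression level is left unset here, which for zlib means the library default. A minimal round-trip with Python's standard `zlib` module (the blob is a toy stand-in for a serialized checkpoint):

```python
import zlib

# Toy stand-in for a serialized checkpoint; real weights are float tensors
# and compress far less than this highly regular byte pattern.
blob = b"\x00\x01" * 50_000  # 100 kB

compressed = zlib.compress(blob)   # default level (Z_DEFAULT_COMPRESSION)
restored = zlib.decompress(compressed)

assert restored == blob
print(len(blob), len(compressed))
```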
Novel Contributions
- Custom 4096-vocabulary BPE tokenizer trained on FineWeb documents
- Improved bytes per token from 2.00 to 2.75 compared with the baseline tokenizer
- Tokenizer-only approach that is orthogonal to model-side techniques
- Custom preprocessing pipeline for FineWeb using the new tokenizer
- Reduced model depth to fit within the 16 MB artifact budget
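The link between bytes per token and BPB can be made concrete: validation BPB is the mean per-token loss converted from nats to bits, then spread over the bytes each token covers, so a longer average token directly lowers BPB at equal per-token loss. The per-token losses below are hypothetical; only the 2.00 and 2.75 bytes-per-token figures come from the PR.

```python
import math

def bits_per_byte(nats_per_token, bytes_per_token):
    """BPB = (mean token loss in nats) / ln(2) / (bytes per token)."""
    return nats_per_token / math.log(2) / bytes_per_token

# Hypothetical per-token losses for illustration only.
baseline = bits_per_byte(nats_per_token=1.85, bytes_per_token=2.00)
custom   = bits_per_byte(nats_per_token=2.45, bytes_per_token=2.75)
print(f"baseline {baseline:.4f} bpb, custom {custom:.4f} bpb")
```

Note the larger vocabulary makes each token *harder* to predict (higher nats per token), but the 2.00 → 2.75 gain in bytes per token can still win on net BPB, as in this sketch.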