PR #293

open

Non-record: Custom sp4096 BPE Tokenizer (1.2827 BPB on 1×H100)

by Nishu2000-hub

val_bpb: 1.2827
Architecture: Transformer
Optimizer:
Artifact Size: 14.78 MB

Training Techniques

Architecture
tokenizer/vocabulary size
Replaced the provided 1024-vocab tokenizer with a custom 4096-vocab BPE SentencePiece tokenizer trained on FineWeb documents, reducing tokens per byte and improving BPB.
parameters: {"vocab_size":4096}
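The mechanism behind the BPB improvement is the loss-to-BPB conversion: with cross-entropy measured in nats per token, bpb = loss / (bytes_per_token * ln 2), so more bytes per token lowers BPB at equal loss. A minimal sketch; only the bytes-per-token figures (2.00 baseline, 2.75 for sp4096) come from this PR, and the loss value is hypothetical:

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert average cross-entropy (nats/token) to bits per byte."""
    return loss_nats_per_token / (bytes_per_token * math.log(2))

loss = 2.4  # nats/token, hypothetical value for illustration
print(bits_per_byte(loss, 2.00))  # baseline 1024-vocab tokenizer
print(bits_per_byte(loss, 2.75))  # sp4096 tokenizer: lower BPB at equal loss
```

In practice a larger vocabulary also raises per-token loss somewhat, so the realized BPB gain is smaller than the bytes-per-token ratio alone would suggest.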
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Trained a custom BPE SentencePiece tokenizer on 2 million FineWeb documents using the same normalization and byte fallback settings as the baseline.
parameters: {"training_docs":2000000}
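The actual run used SentencePiece's BPE trainer; the core merge loop such a trainer performs can be sketched in plain Python (illustrative only: it omits SentencePiece's normalization, byte fallback, and the priority-queue optimizations needed at 2M-document scale):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair across the corpus. Illustrative sketch, not the
    SentencePiece implementation used in this PR."""
    corpus = Counter(tuple(w) for w in words)  # each word as symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each merge adds one entry to the vocabulary, so growing from a 1024- to a 4096-entry vocabulary is, at heart, running roughly 3072 more of these merges.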
other
Preprocessed FineWeb training shards with the custom tokenizer and produced binary shards in the same format as the official pipeline.
parameters: {"num_shards":10}
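A minimal version of that preprocessing step: encode each document, append a separator token, and pack the ids into fixed-size uint16 shards (a 4096-entry vocabulary fits comfortably in 2 bytes per token). The shard size, file names, and separator id below are assumptions, not the official pipeline's exact format:

```python
from array import array
from pathlib import Path

def write_shards(docs, encode, out_dir, shard_tokens=1024, eot_id=0):
    """Tokenize documents and pack ids into fixed-size uint16 shards.
    `encode` is any callable str -> list[int]; eot_id and the shard
    layout here are hypothetical, not the official pipeline's format."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []

    def flush(ids):
        path = out_dir / f"shard_{len(paths):04d}.bin"
        path.write_bytes(ids.tobytes())
        paths.append(path)

    buf = array("H")  # uint16: enough for a 4096-entry vocabulary
    for doc in docs:
        buf.extend(encode(doc) + [eot_id])  # separator between documents
        while len(buf) >= shard_tokens:
            flush(buf[:shard_tokens])
            buf = buf[shard_tokens:]
    if buf:
        flush(buf)  # final, possibly short, shard
    return paths
```

Reading a shard back is the inverse: `a = array("H"); a.frombytes(path.read_bytes())`.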
other
Reduced model depth from 9 to 8 layers to stay under the 16 MB artifact limit after increasing the embedding table size.
parameters: {"layers":8}
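Rough arithmetic for this trade: growing the vocabulary from 1024 to 4096 adds 3072 × d_model embedding parameters, while removing one transformer block saves about 12 × d_model² (attention projections plus a 4× MLP). The model width below is a hypothetical value, not stated in the PR:

```python
def transformer_params(vocab, d_model, n_layers, tied_head=True):
    """Rough parameter count: embedding table plus standard blocks
    (qkv+out = 4*d^2, MLP with 4x hidden = 8*d^2). Assumed shapes."""
    emb = vocab * d_model * (1 if tied_head else 2)
    per_layer = 12 * d_model ** 2
    return emb + n_layers * per_layer

d = 512  # hypothetical width for illustration
grow = transformer_params(4096, d, 9) - transformer_params(1024, d, 9)
shrink = transformer_params(4096, d, 9) - transformer_params(4096, d, 8)
print(grow, shrink)  # embedding growth vs. savings from dropping one layer
```

Under these assumed shapes, one 12 × d² block saves more parameters than the 3072 × d embedding rows cost, which is consistent with the artifact landing at 14.78 MB under the 16 MB limit.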
Compression
zlib
level: null

Novel Contributions

  • Custom 4096-vocabulary BPE tokenizer trained on FineWeb documents
  • Increased average bytes per token from 2.00 (baseline tokenizer) to 2.75, so the same text is covered by fewer tokens
  • Tokenizer-only approach that is orthogonal to model-side techniques
  • Custom preprocessing pipeline for FineWeb using the new tokenizer
  • Reduced model depth to fit within the 16 MB artifact budget
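The 2.00 to 2.75 bytes-per-token figures above imply the same text is represented with roughly 27% fewer tokens, which is the source of the BPB gain:

```python
baseline_bpt, new_bpt = 2.00, 2.75  # bytes per token, figures from this PR
ratio = baseline_bpt / new_bpt      # tokens needed relative to baseline
print(f"{(1 - ratio):.1%} fewer tokens for the same text")  # prints "27.3% fewer tokens for the same text"
```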