PR #293

open

Non-record: Custom sp4096 BPE Tokenizer (1.2827 BPB on 1×H100)

by Nishu2000-hub

val_bpb: 1.2827
Architecture: Transformer
Optimizer:
Artifact Size: 14.78 MB

Training Techniques

Architecture
tokenizer/vocabulary size
Replaced the provided 1024-vocab tokenizer with a custom 4096-vocab BPE SentencePiece tokenizer trained on FineWeb documents, reducing tokens per byte and improving BPB.
parameters: {"vocab_size":4096}
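The mechanism behind the BPB improvement is the loss-to-BPB conversion: with cross-entropy measured in nats per token, bpb = loss / (bytes_per_token * ln 2), so more bytes per token lowers BPB at equal loss. A minimal sketch; only the bytes-per-token figures (2.00 baseline, 2.75 for sp4096) come from this PR, and the loss value is hypothetical:

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert average cross-entropy (nats/token) to bits per byte."""
    return loss_nats_per_token / (bytes_per_token * math.log(2))

loss = 2.4  # nats/token, hypothetical value for illustration
print(bits_per_byte(loss, 2.00))  # baseline 1024-vocab tokenizer
print(bits_per_byte(loss, 2.75))  # sp4096 tokenizer: lower BPB at equal loss
```

In practice a larger vocabulary also raises per-token loss somewhat, so the realized BPB gain is smaller than the bytes-per-token ratio alone would suggest.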
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Trained a custom BPE SentencePiece tokenizer on 2 million FineWeb documents using the same normalization and byte fallback settings as the baseline.
parameters: {"training_docs":2000000}
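The actual run used SentencePiece's BPE trainer; the core merge loop such a trainer performs can be sketched in plain Python (illustrative only: it omits SentencePiece's normalization, byte fallback, and the priority-queue optimizations needed at 2M-document scale):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair across the corpus. Illustrative sketch, not the
    SentencePiece implementation used in this PR."""
    corpus = Counter(tuple(w) for w in words)  # each word as symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each merge adds one entry to the vocabulary, so growing from a 1024- to a 4096-entry vocabulary is, at heart, running roughly 3072 more of these merges.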
other
Preprocessed FineWeb training shards with the custom tokenizer and produced binary shards in the same format as the official pipeline.
parameters: {"num_shards":10}
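A minimal version of that preprocessing step: encode each document, append a separator token, and pack the ids into fixed-size uint16 shards (a 4096-entry vocabulary fits comfortably in 2 bytes per token). The shard size, file names, and separator id below are assumptions, not the official pipeline's exact format:

```python
from array import array
from pathlib import Path

def write_shards(docs, encode, out_dir, shard_tokens=1024, eot_id=0):
    """Tokenize documents and pack ids into fixed-size uint16 shards.
    `encode` is any callable str -> list[int]; eot_id and the shard
    layout here are hypothetical, not the official pipeline's format."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []

    def flush(ids):
        path = out_dir / f"shard_{len(paths):04d}.bin"
        path.write_bytes(ids.tobytes())
        paths.append(path)

    buf = array("H")  # uint16: enough for a 4096-entry vocabulary
    for doc in docs:
        buf.extend(encode(doc) + [eot_id])  # separator between documents
        while len(buf) >= shard_tokens:
            flush(buf[:shard_tokens])
            buf = buf[shard_tokens:]
    if buf:
        flush(buf)  # final, possibly short, shard
    return paths
```

Reading a shard back is the inverse: `a = array("H"); a.frombytes(path.read_bytes())`.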
other
Reduced model depth from 9 to 8 layers to stay under the 16 MB artifact limit after increasing the embedding table size.
parameters: {"layers":8}
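Rough arithmetic for this trade: growing the vocabulary from 1024 to 4096 adds 3072 × d_model embedding parameters, while removing one transformer block saves about 12 × d_model² (attention projections plus a 4× MLP). The model width below is a hypothetical value, not stated in the PR:

```python
def transformer_params(vocab, d_model, n_layers, tied_head=True):
    """Rough parameter count: embedding table plus standard blocks
    (qkv+out = 4*d^2, MLP with 4x hidden = 8*d^2). Assumed shapes."""
    emb = vocab * d_model * (1 if tied_head else 2)
    per_layer = 12 * d_model ** 2
    return emb + n_layers * per_layer

d = 512  # hypothetical width for illustration
grow = transformer_params(4096, d, 9) - transformer_params(1024, d, 9)
shrink = transformer_params(4096, d, 9) - transformer_params(4096, d, 8)
print(grow, shrink)  # embedding growth vs. savings from dropping one layer
```

Under these assumed shapes, one 12 × d² block saves more parameters than the 3072 × d embedding rows cost, which is consistent with the artifact landing at 14.78 MB under the 16 MB limit.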
Compression
zlib
level: null

Novel Contributions

  • Custom 4096-vocabulary BPE tokenizer trained on FineWeb documents
  • Increased average bytes per token from 2.00 (baseline tokenizer) to 2.75, so the same text is covered by fewer tokens
  • Tokenizer-only approach that is orthogonal to model-side techniques
  • Custom preprocessing pipeline for FineWeb using the new tokenizer
  • Reduced model depth to fit within the 16 MB artifact budget
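The 2.00 to 2.75 bytes-per-token figures above imply the same text is represented with roughly 27% fewer tokens, which is the source of the BPB gain:

```python
baseline_bpt, new_bpt = 2.00, 2.75  # bytes per token, figures from this PR
ratio = baseline_bpt / new_bpt      # tokens needed relative to baseline
print(f"{(1 - ratio):.1%} fewer tokens for the same text")  # prints "27.3% fewer tokens for the same text"
```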