PR #755
openGravity Tokenizer: 1.0321 BPB via ablation-leverage vocabulary optimization
by dcrow85
val_bpb
1.0321
Architecture
Transformer
Optimizer
—
Artifact Size
15.6 MB
Training Techniques
Architecture
weight tying
Tied embeddings in a vanilla 12-layer transformer.
parameters: null
KV head count
Uses grouped-query attention with 6 attention heads and 2 KV heads.
parameters: {"heads":6,"kv_heads":2}
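The 6-head / 2-KV-head configuration means each KV head is shared by a group of 3 query heads. A minimal numpy sketch of that sharing (head dim of 64 is an assumption, inferred from the 384-wide model implied by the MLP parameters below; the actual implementation is not shown in this PR):

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=6, n_kv_heads=2):
    """Grouped-query attention: n_heads query heads share n_kv_heads KV heads."""
    T, d = x.shape
    hd = d // n_heads                 # per-head dim (64 for d=384)
    group = n_heads // n_kv_heads     # query heads per KV head (3)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # broadcast each KV head to its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    for h in range(n_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        scores[mask] = -np.inf        # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)
```

Note that the K/V projections are narrower than the Q projection (2 × 64 = 128 columns vs. 384), which is where GQA saves parameters and KV-cache memory.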
MLP3x
Transformer MLP uses 3x expansion (hidden size 1152).
parameters: {"mlp_mult":3,"hidden":1152}
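The listed parameters (mlp_mult 3, hidden 1152) imply a model width of 384. A sketch of the block, assuming a standard tanh-approximation GELU (the activation is not stated in the PR):

```python
import numpy as np

D_MODEL, MLP_MULT = 384, 3            # hidden = 3 * 384 = 1152, per the listed parameters
HIDDEN = D_MODEL * MLP_MULT

def mlp(x, w_in, w_out):
    """Transformer MLP with 3x expansion; GELU (tanh approximation) assumed."""
    h = x @ w_in                      # (T, 384) -> (T, 1152)
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h ** 3)))
    return h @ w_out                  # (T, 1152) -> (T, 384)
```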
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Quantization
int8
bits: 8
scope: all
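With bits 8 and scope "all", the artifact presumably stores every weight tensor in int8. A sketch of symmetric per-tensor int8 quantization (the PR does not specify the scheme, so per-tensor symmetric scaling is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    amax = np.abs(w).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```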
Compression
zlib
level: null
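A null level here corresponds to zlib's default compression level. In Python's stdlib that is level -1 (`Z_DEFAULT_COMPRESSION`):

```python
import zlib

blob = b"quantized weights..." * 1000
comp = zlib.compress(blob)            # level omitted -> default (-1)
assert zlib.decompress(comp) == blob  # lossless round trip
```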
Evaluation
sliding window eval
parameters: null
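Sliding-window eval typically scores each token with bounded left context, counting only positions not already scored. A sketch of the usual strided scheme (window/stride values are illustrative; the PR lists no parameters), with `nll_fn` standing in for the model's per-token NLL:

```python
def sliding_window_nll(nll_fn, ids, window=2048, stride=512):
    """Mean per-token NLL with a sliding window: advance by `stride`,
    rescoring old context but counting only the newly covered tokens."""
    total, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(ids), stride):
        end = min(begin + window, len(ids))
        target_len = end - prev_end           # tokens not yet scored
        nll = nll_fn(ids[begin:end])          # per-token NLL over the window
        total += sum(nll[-target_len:])
        counted += target_len
        prev_end = end
        if end == len(ids):
            break
    return total / counted
```

To get BPB from this, divide the total NLL in bits (nats / ln 2) by the byte length of the evaluated text rather than the token count.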
LR Schedule
linear warmup + warmdown
parameters: {"warmup_steps":50,"warmdown_iters":2500}
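A sketch of the schedule implied by the two listed counts, assuming a constant plateau between warmup and warmdown (the PR states only the 50-step warmup and 2500-iter warmdown):

```python
def lr_at(step, total_steps, peak_lr, warmup_steps=50, warmdown_iters=2500):
    """Linear warmup to peak_lr, hold, then linear warmdown to 0 over the
    final warmdown_iters steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return peak_lr * (total_steps - step) / warmdown_iters
    return peak_lr
```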
Other
other
Tokenizer optimization via ablation-leverage scoring, replacing BPE merge tokens with structurally important tokens.
parameters: {"beta":1,"replaced_merge_tokens":659,"total_merge_tokens":765,"vocab_size":1024}
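The selection step could look like the following hypothetical sketch: given ablation-leverage scores (e.g. loss deltas under the frozen reference model) for both the original 765 BPE merges and a pool of candidate replacements, fill the fixed merge budget with the highest-leverage entries. The function name, score inputs, and ranking rule are all assumptions; the PR only reports that 659 of 765 merges were replaced at a fixed vocab size of 1024.

```python
def select_vocab(base_tokens, merge_scores, candidate_scores, vocab_size=1024):
    """Hypothetical selection: keep the byte-level base tokens, then fill the
    merge-token budget (vocab_size - len(base)) with the highest-leverage
    entries drawn from original merges and candidate replacements alike."""
    budget = vocab_size - len(base_tokens)       # e.g. 1024 - 259 = 765 slots
    pool = {**merge_scores, **candidate_scores}  # token -> leverage score
    keep = sorted(pool, key=pool.get, reverse=True)[:budget]
    return list(base_tokens) + keep
```

Under this reading, a merge token survives only if its measured leverage beats the candidates competing for its slot, which is consistent with most merges (659/765) being displaced.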
other
Retokenization of the training corpus using a custom gravity tokenizer built from selected vocabulary.
parameters: null
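Once the vocabulary is fixed, retokenization needs a deterministic encoder over it. A greedy longest-match sketch (the actual gravity tokenizer's matching rule is not described; greedy matching and a vocabulary containing all 256 single bytes are assumptions, the latter guaranteeing encoding never fails):

```python
def encode_greedy(data, vocab):
    """Greedy longest-match encoding of a bytes string over a fixed vocabulary
    of bytes tokens that includes every single byte as a fallback."""
    lookup = set(vocab)
    max_len = max(len(t) for t in lookup)
    out, i = [], 0
    while i < len(data):
        for l in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + l]
            if piece in lookup:
                out.append(piece)
                i += l
                break
    return out
```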
Novel Contributions
- Ablation-leverage-based tokenizer/vocabulary optimization instead of standard frequency-based BPE.
- Replacement of 659/765 merge tokens while keeping vocabulary size fixed at 1024.
- Demonstration that tokenizer composition alone accounts for the full BPB improvement.
- Use of a frozen GPT-2 reference model to score token structural importance across FineWeb contexts.
- Deterministic retokenization pipeline and correctness validation using competition evaluation code.