PR #755
openGravity Tokenizer: 1.0321 BPB via ablation-leverage vocabulary optimization
by dcrow85
val_bpb
1.0321
Architecture
Transformer
Optimizer
—
Artifact Size
15.6 MB
Training Techniques
Architecture
weight tying
Tied embeddings in a vanilla 12-layer transformer.
parameters: null
KV head count
Uses grouped-query attention with 6 attention heads and 2 KV heads.
parameters: {"heads":6,"kv_heads":2}
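The 6-head / 2-KV-head configuration means each KV head is shared by a group of 3 query heads. A minimal numpy sketch of that sharing (head dim of 64 is an assumption, inferred from the 384-wide model implied by the MLP parameters below; the actual implementation is not shown in this PR):

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=6, n_kv_heads=2):
    """Grouped-query attention: n_heads query heads share n_kv_heads KV heads."""
    T, d = x.shape
    hd = d // n_heads                 # per-head dim (64 for d=384)
    group = n_heads // n_kv_heads     # query heads per KV head (3)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # broadcast each KV head to its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    for h in range(n_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        scores[mask] = -np.inf        # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)
```

Note that the K/V projections are narrower than the Q projection (2 × 64 = 128 columns vs. 384), which is where GQA saves parameters and KV-cache memory.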
MLP3x
Transformer MLP uses 3x expansion (hidden size 1152).
parameters: {"mlp_mult":3,"hidden":1152}
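The listed parameters (mlp_mult 3, hidden 1152) imply a model width of 384. A sketch of the block, assuming a standard tanh-approximation GELU (the activation is not stated in the PR):

```python
import numpy as np

D_MODEL, MLP_MULT = 384, 3            # hidden = 3 * 384 = 1152, per the listed parameters
HIDDEN = D_MODEL * MLP_MULT

def mlp(x, w_in, w_out):
    """Transformer MLP with 3x expansion; GELU (tanh approximation) assumed."""
    h = x @ w_in                      # (T, 384) -> (T, 1152)
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h ** 3)))
    return h @ w_out                  # (T, 1152) -> (T, 384)
```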
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Quantization
int8
bits: 8
scope: all
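With bits 8 and scope "all", the artifact presumably stores every weight tensor in int8. A sketch of symmetric per-tensor int8 quantization (the PR does not specify the scheme, so per-tensor symmetric scaling is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    amax = np.abs(w).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```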
Compression
zlib
level: null
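A null level here corresponds to zlib's default compression level. In Python's stdlib that is level -1 (`Z_DEFAULT_COMPRESSION`):

```python
import zlib

blob = b"quantized weights..." * 1000
comp = zlib.compress(blob)            # level omitted -> default (-1)
assert zlib.decompress(comp) == blob  # lossless round trip
```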
Evaluation
sliding window eval
parameters: null
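Sliding-window eval typically scores each token with bounded left context, counting only positions not already scored. A sketch of the usual strided scheme (window/stride values are illustrative; the PR lists no parameters), with `nll_fn` standing in for the model's per-token NLL:

```python
def sliding_window_nll(nll_fn, ids, window=2048, stride=512):
    """Mean per-token NLL with a sliding window: advance by `stride`,
    rescoring old context but counting only the newly covered tokens."""
    total, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(ids), stride):
        end = min(begin + window, len(ids))
        target_len = end - prev_end           # tokens not yet scored
        nll = nll_fn(ids[begin:end])          # per-token NLL over the window
        total += sum(nll[-target_len:])
        counted += target_len
        prev_end = end
        if end == len(ids):
            break
    return total / counted
```

To get BPB from this, divide the total NLL in bits (nats / ln 2) by the byte length of the evaluated text rather than the token count.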
LR Schedule
linear warmup + warmdown
parameters: {"warmup_steps":50,"warmdown_iters":2500}
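A sketch of the schedule implied by the two listed counts, assuming a constant plateau between warmup and warmdown (the PR states only the 50-step warmup and 2500-iter warmdown):

```python
def lr_at(step, total_steps, peak_lr, warmup_steps=50, warmdown_iters=2500):
    """Linear warmup to peak_lr, hold, then linear warmdown to 0 over the
    final warmdown_iters steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return peak_lr * (total_steps - step) / warmdown_iters
    return peak_lr
```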
Other
other
Tokenizer optimization via ablation-leverage scoring, replacing BPE merge tokens with structurally important tokens.
parameters: {"beta":1,"replaced_merge_tokens":659,"total_merge_tokens":765,"vocab_size":1024}
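The selection step could look like the following hypothetical sketch: given ablation-leverage scores (e.g. loss deltas under the frozen reference model) for both the original 765 BPE merges and a pool of candidate replacements, fill the fixed merge budget with the highest-leverage entries. The function name, score inputs, and ranking rule are all assumptions; the PR only reports that 659 of 765 merges were replaced at a fixed vocab size of 1024.

```python
def select_vocab(base_tokens, merge_scores, candidate_scores, vocab_size=1024):
    """Hypothetical selection: keep the byte-level base tokens, then fill the
    merge-token budget (vocab_size - len(base)) with the highest-leverage
    entries drawn from original merges and candidate replacements alike."""
    budget = vocab_size - len(base_tokens)       # e.g. 1024 - 259 = 765 slots
    pool = {**merge_scores, **candidate_scores}  # token -> leverage score
    keep = sorted(pool, key=pool.get, reverse=True)[:budget]
    return list(base_tokens) + keep
```

Under this reading, a merge token survives only if its measured leverage beats the candidates competing for its slot, which is consistent with most merges (659/765) being displaced.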
other
Retokenization of the training corpus using a custom gravity tokenizer built from selected vocabulary.
parameters: null
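Once the vocabulary is fixed, retokenization needs a deterministic encoder over it. A greedy longest-match sketch (the actual gravity tokenizer's matching rule is not described; greedy matching and a vocabulary containing all 256 single bytes are assumptions, the latter guaranteeing encoding never fails):

```python
def encode_greedy(data, vocab):
    """Greedy longest-match encoding of a bytes string over a fixed vocabulary
    of bytes tokens that includes every single byte as a fallback."""
    lookup = set(vocab)
    max_len = max(len(t) for t in lookup)
    out, i = [], 0
    while i < len(data):
        for l in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + l]
            if piece in lookup:
                out.append(piece)
                i += l
                break
    return out
```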
Novel Contributions
- Ablation-leverage-based tokenizer/vocabulary optimization instead of standard frequency-based BPE.
- Replacement of 659/765 merge tokens while keeping vocabulary size fixed at 1024.
- Demonstration that tokenizer composition alone accounts for the full BPB improvement.
- Use of a frozen GPT-2 reference model to score token structural importance across FineWeb contexts.
- Deterministic retokenization pipeline and correctness validation using competition evaluation code.