PR #1504

open

Non-record: DP tokenizer beats naive baseline but fails to close gap to SP — 47-run controlled study (1.2206 @ 4096)

by Stuckertks09
val_bpb
1.2206
Architecture
Transformer
Other
other
Custom boundary-aware dynamic programming tokenizer with a frozen 1024-token vocabulary overlay
parameters: {"vocab_size":1024}
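The PR does not publish the tokenizer's exact algorithm, but a boundary-aware DP tokenizer over a frozen vocabulary can be sketched as a shortest-path problem: for each prefix of the text, pick the segmentation using the fewest tokens among vocabulary matches. The `boundary_bonus` favoring cuts at whitespace is an assumption used to model "boundary-aware"; the real scoring may differ.

```python
def dp_tokenize(text, vocab, max_token_len=16, boundary_bonus=0.5):
    """Minimal sketch of DP segmentation over a frozen vocab (assumed scoring)."""
    n = len(text)
    INF = float("inf")
    cost = [INF] * (n + 1)   # cost[i] = best cost to tokenize text[:i]
    back = [-1] * (n + 1)    # back[i] = start index of the last token
    cost[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            piece = text[j:i]
            if piece in vocab and cost[j] < INF:
                # each token costs 1; a cut right after whitespace is cheaper,
                # modeling the "boundary-aware" preference (assumption)
                c = cost[j] + 1.0
                if j > 0 and text[j - 1] == " ":
                    c -= boundary_bonus
                if c < cost[i]:
                    cost[i], back[i] = c, j
    if cost[n] == INF:
        raise ValueError("text not coverable by the frozen vocabulary")
    # backtrack to recover the chosen segmentation
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]
```

With single characters in the vocabulary as a fallback (as byte-level tokenizers guarantee), every input is coverable and the DP always returns a segmentation.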
other
Controlled A/B experiments isolating tokenizer as the only variable across repeated runs
parameters: {"runs":47}
Sequence Length
sequence_length
train_length: null
eval_length: 4096
Evaluation
full validation comparison
parameters: {"num_val_docs":50000}

Novel Contributions

  • Custom DP tokenizer with boundary-aware segmentation
  • Frozen 1024-token vocabulary overlay
  • Full-validation compression comparison against SentencePiece on 50k documents
  • 47-run controlled A/B study isolating tokenizer effects
  • Demonstration that improved compression does not translate to better training performance under fixed compute
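The last point relies on the fact that val_bpb normalizes per-token loss by raw byte count, so models with different tokenizers are directly comparable. A minimal sketch of the standard bits-per-byte conversion, with hypothetical numbers illustrating how better compression can still lose:

```python
import math

def bits_per_byte(total_nats, num_bytes):
    # Convert summed token-level cross-entropy (nats) into bits per UTF-8
    # byte; this is the standard tokenizer-agnostic normalization behind val_bpb.
    return total_nats / (math.log(2) * num_bytes)

# Hypothetical example (not the PR's measurements): tokenizer A compresses
# better (fewer tokens per byte) but its per-token loss is higher, so its
# bpb ends up worse than B's despite the better compression.
nbytes = 1_000_000
a = bits_per_byte(3.1 * 210_000, nbytes)  # 210k tokens at 3.1 nats/token
b = bits_per_byte(2.4 * 260_000, nbytes)  # 260k tokens at 2.4 nats/token
# here b < a: B wins on bpb despite producing more tokens
```

Fewer tokens per byte only lowers bpb if the per-token loss does not rise proportionally, which is exactly the trade-off the 47-run study probes.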