PR #1504

open

Non-record: DP tokenizer beats naive baseline but fails to close gap to SP — 47-run controlled study (1.2206 @ 4096)

by Stuckertks09
val_bpb
1.2206
Architecture
Transformer
Other
other
Custom boundary-aware dynamic programming tokenizer with a frozen 1024-token vocabulary overlay
parameters: {"vocab_size":1024}
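The PR does not publish the tokenizer's exact algorithm, but a boundary-aware DP tokenizer over a frozen vocabulary can be sketched as a shortest-path problem: for each prefix of the text, pick the segmentation using the fewest tokens among vocabulary matches. The `boundary_bonus` favoring cuts at whitespace is an assumption used to model "boundary-aware"; the real scoring may differ.

```python
def dp_tokenize(text, vocab, max_token_len=16, boundary_bonus=0.5):
    """Minimal sketch of DP segmentation over a frozen vocab (assumed scoring)."""
    n = len(text)
    INF = float("inf")
    cost = [INF] * (n + 1)   # cost[i] = best cost to tokenize text[:i]
    back = [-1] * (n + 1)    # back[i] = start index of the last token
    cost[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            piece = text[j:i]
            if piece in vocab and cost[j] < INF:
                # each token costs 1; a cut right after whitespace is cheaper,
                # modeling the "boundary-aware" preference (assumption)
                c = cost[j] + 1.0
                if j > 0 and text[j - 1] == " ":
                    c -= boundary_bonus
                if c < cost[i]:
                    cost[i], back[i] = c, j
    if cost[n] == INF:
        raise ValueError("text not coverable by the frozen vocabulary")
    # backtrack to recover the chosen segmentation
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]
```

With single characters in the vocabulary as a fallback (as byte-level tokenizers guarantee), every input is coverable and the DP always returns a segmentation.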
other
Controlled A/B experiments isolating tokenizer as the only variable across repeated runs
parameters: {"runs":47}
Sequence Length
sequence_length
train_length: null
eval_length: 4096
Evaluation
full validation comparison
parameters: {"num_val_docs":50000}

Novel Contributions

  • Custom DP tokenizer with boundary-aware segmentation
  • Frozen 1024-token vocabulary overlay
  • Full-validation compression comparison against SentencePiece on 50k documents
  • 47-run controlled A/B study isolating tokenizer effects
  • Demonstration that improved compression does not translate to better training performance under fixed compute
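The last point relies on the fact that val_bpb normalizes per-token loss by raw byte count, so models with different tokenizers are directly comparable. A minimal sketch of the standard bits-per-byte conversion, with hypothetical numbers illustrating how better compression can still lose:

```python
import math

def bits_per_byte(total_nats, num_bytes):
    # Convert summed token-level cross-entropy (nats) into bits per UTF-8
    # byte; this is the standard tokenizer-agnostic normalization behind val_bpb.
    return total_nats / (math.log(2) * num_bytes)

# Hypothetical example (not the PR's measurements): tokenizer A compresses
# better (fewer tokens per byte) but its per-token loss is higher, so its
# bpb ends up worse than B's despite the better compression.
nbytes = 1_000_000
a = bits_per_byte(3.1 * 210_000, nbytes)  # 210k tokens at 3.1 nats/token
b = bits_per_byte(2.4 * 260_000, nbytes)  # 260k tokens at 2.4 nats/token
# here b < a: B wins on bpb despite producing more tokens
```

Fewer tokens per byte only lowers bpb if the per-token loss does not rise proportionally, which is exactly the trade-off the 47-run study probes.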