PR #1504
Status: open
Non-record: DP tokenizer beats naive baseline but fails to close gap to SP — 47-run controlled study (1.2206 @ 4096)
by Stuckertks09
val_bpb
1.2206
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
other
Custom boundary-aware dynamic programming tokenizer with a frozen 1024-token vocabulary overlay
parameters: {"vocab_size":1024}
other
Controlled A/B experiments isolating tokenizer as the only variable across repeated runs
parameters: {"runs":47}
Sequence Length
sequence_length
train_length: null
eval_length: 4096
Evaluation
full validation comparison
parameters: {"num_val_docs":50000}
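The full-validation comparison above presumably reduces to an aggregate compression figure per tokenizer over the 50k documents. A minimal sketch of such a metric, assuming a corpus of strings and a tokenizer exposed as a plain callable (both hypothetical stand-ins, not the PR's actual code):

```python
def bytes_per_token(docs, tokenize):
    """Corpus-level compression: total UTF-8 bytes divided by total tokens.

    A higher value means each token covers more text, i.e. better
    compression. `tokenize` is any callable mapping a string to a
    list of tokens.
    """
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    total_tokens = sum(len(tokenize(d)) for d in docs)
    return total_bytes / total_tokens

# Hypothetical comparison over a validation set `val_docs`:
#   ratio_dp = bytes_per_token(val_docs, dp_tokenizer_encode)
#   ratio_sp = bytes_per_token(val_docs, sp_model_encode)
```

Comparing the two ratios on the same document set is what isolates compression from any downstream training effect.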
Novel Contributions
- Custom DP tokenizer with boundary-aware segmentation
- Frozen 1024-token vocabulary overlay
- Full-validation compression comparison against SentencePiece on 50k documents
- 47-run controlled A/B study isolating tokenizer effects
- Demonstration that improved compression does not translate to better training performance under fixed compute
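A DP tokenizer of the kind listed above can be sketched as a shortest-path segmentation over a fixed vocabulary. This is a generic illustration, not the PR's implementation; in particular, the PR's "boundary-aware" scoring is unspecified, so this sketch uses the plain minimum-token-count objective:

```python
def dp_tokenize(text, vocab, max_token_len=8):
    """Segment `text` into the fewest tokens drawn from `vocab`.

    Classic dynamic program: best[i] is the minimum number of tokens
    covering text[:i]. `vocab` must contain every single character in
    `text` so that a segmentation always exists.
    """
    n = len(text)
    INF = float("inf")
    best = [0] + [INF] * n          # best[i]: min tokens for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last token
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    # Recover the segmentation by walking the back-pointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]
```

A frozen 1024-token vocabulary overlay would correspond to fixing `vocab` before training and never updating it; the DP then runs per input string.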