| Field | Value |
| --- | --- |
| val_bpb (validation bits per byte) | 1.2294 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15.1 MiB |
| Training Techniques | Other |
Superchunk BPE tokenization: a two-phase BPE scheme in which the first phase learns merges within chunks and the second phase learns cross-chunk merges, with both sets interleaved by frequency into a single merge table.
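
A minimal sketch of that two-phase scheme, assuming whitespace-split chunks and hypothetical names (`pair_counts`, `apply_merge`, `learn_merges` are illustrative, not the submission's API). It simplifies phase 2 by scanning the whole concatenated stream rather than only boundary-spanning pairs, and it ignores ordering constraints between dependent merges:

```rust
use std::collections::HashMap;

type Token = u32;

/// Count adjacent token pairs across all sequences.
fn pair_counts(seqs: &[Vec<Token>]) -> HashMap<(Token, Token), usize> {
    let mut counts = HashMap::new();
    for seq in seqs {
        for w in seq.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
    }
    counts
}

/// Replace every occurrence of `pair` in `seq` with `new_tok`.
fn apply_merge(seq: &[Token], pair: (Token, Token), new_tok: Token) -> Vec<Token> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(new_tok);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}

/// Greedily learn up to `n_merges` merges, returning (pair, new token, frequency).
fn learn_merges(
    seqs: &mut [Vec<Token>],
    n_merges: usize,
    next_tok: &mut Token,
) -> Vec<((Token, Token), Token, usize)> {
    let mut merges = Vec::new();
    for _ in 0..n_merges {
        let counts = pair_counts(seqs);
        let Some((&pair, &freq)) = counts.iter().max_by_key(|(_, &c)| c) else {
            break;
        };
        if freq < 2 {
            break; // nothing left worth merging
        }
        let tok = *next_tok;
        *next_tok += 1;
        for seq in seqs.iter_mut() {
            *seq = apply_merge(seq, pair, tok);
        }
        merges.push((pair, tok, freq));
    }
    merges
}

fn main() {
    // Toy corpus; in the real setup chunks would come from the training text.
    let text = "the cat sat on the mat the cat sat";
    let chunks: Vec<Vec<Token>> = text
        .split_whitespace()
        .map(|c| c.bytes().map(Token::from).collect())
        .collect();

    let mut next_tok: Token = 256; // 0..=255 are the raw byte tokens

    // Phase 1: merges learned strictly inside chunks.
    let mut in_chunk = chunks;
    let phase1 = learn_merges(&mut in_chunk, 8, &mut next_tok);

    // Phase 2: concatenate the merged chunks into one stream so pairs that
    // span chunk boundaries become visible, then learn further merges.
    let mut stream = vec![in_chunk.concat()];
    let phase2 = learn_merges(&mut stream, 8, &mut next_tok);

    // Single merge table: interleave both phases by descending frequency.
    // (A real implementation must keep any merge after the merges that
    // produce its input tokens.)
    let mut table: Vec<_> = phase1.into_iter().chain(phase2).collect();
    table.sort_by(|a, b| b.2.cmp(&a.2));
    for (pair, tok, freq) in &table {
        println!("{pair:?} -> {tok} (freq {freq})");
    }
}
```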
Novel Contributions
- Introduction of Superchunk BPE tokenization, combining phase-1 (within-chunk) and phase-2 (cross-chunk) merges into a single merge table.
- A Rust BPE tokenizer with superchunking at vocabulary size 1024 (see the encoding sketch after this list).
- A short training run on 8×H100 GPUs (600 s wall clock) targeting the non-record 16 MB track.
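
For completeness, applying such a table at encode time could look like the sketch below. The `encode` function is hypothetical and reuses `Token` and `apply_merge` from the earlier sketch; the submission's actual Rust tokenizer is not shown here:

```rust
// With a 1024-token budget, one plausible split is 256 raw byte tokens plus
// up to 768 learned merges (an assumption; the submission does not spell
// out the split).
fn encode(bytes: &[u8], table: &[((Token, Token), Token, usize)]) -> Vec<Token> {
    let mut seq: Vec<Token> = bytes.iter().map(|&b| Token::from(b)).collect();
    // Apply merges in table order; each pass rewrites the whole sequence.
    for &(pair, tok, _) in table {
        seq = apply_merge(&seq, pair, tok);
    }
    seq
}
```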