PR #1578

open

Record: Custom Casefold Tokenizer — 1.0668 BPB

by mikeapediaView on GitHub
val_bpb
1.0668
Architecture
Transformer
Optimizer
Muon
Artifact Size
~16.00 MB

Training Techniques

Architecture
parallel residuals
Dual-lane parallel residuals architecture inherited from PR #1529.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Test-Time Training
TTT
parameters: null
Other
other
Custom casefold v2 tokenizer built by retraining SentencePiece BPE on NFKC-normalized, lowercased text and refilling freed vocabulary slots with BPB-optimized subwords.
parameters: {"vocab_size":8192,"normalized_text":"NFKC + lowercased"}
other
Retokenized the FineWeb 10B dataset using casefold normalization for both train and validation data.
parameters: {"dataset":"FineWeb 10B"}

Novel Contributions

  • Custom casefold v2 tokenizer that removes case-duplicate tokens
  • Refilled 374 freed vocabulary slots with BPB-optimized subwords
  • Retokenized FineWeb 10B with NFKC normalization plus lowercasing
  • Verified byte counting correctness on the full FineWeb corpus with zero mismatches
  • Reported a new record validation BPB of 1.06681663