PR #1796

open

Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)

by simon-marcus
val_bpb
1.0806
Architecture
Transformer
Optimizer
Artifact Size
15,855,763 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: null
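The PR does not spell out its score-first procedure here, so the following is only a toy sketch of the general idea the name suggests: each chunk of evaluation tokens is scored with the model state that has not yet seen it, and only afterwards is the model adapted on that chunk. The adaptive unigram model, the function name `score_first_ttt`, and the chunking are illustrative assumptions, not the submission's code.

```python
import math
from collections import Counter

def score_first_ttt(chunks, vocab_size):
    """Toy score-first loop: score a chunk, then adapt on it.

    Assumption: a Laplace-smoothed unigram model stands in for the
    real network; `chunks` is a list of lists of integer token ids.
    Returns the total negative log-likelihood in nats.
    """
    counts = Counter()   # adaptive state (the "model")
    total = 0            # number of tokens adapted on so far
    nll_nats = 0.0
    for chunk in chunks:
        # 1) Score first: the state has never seen these tokens.
        for tok in chunk:
            p = (counts[tok] + 1) / (total + vocab_size)
            nll_nats += -math.log(p)
        # 2) Adapt afterwards: only now does the chunk update the state.
        counts.update(chunk)
        total += len(chunk)
    return nll_nats

if __name__ == "__main__":
    print(score_first_ttt([[0, 0], [0, 0]], vocab_size=2))
```

Because the update in step 2 always follows the scoring in step 1, no token's loss is computed by a state that was already trained on that token; later chunks still benefit from adaptation on earlier ones.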
Other
other
Custom TokenMonster-derived tokenizer ('Scylla') selected via autoresearch and proxy validation, then used to retokenize the full FineWeb bundle.
parameters: null
other
Full-data retokenized competition bundle with explicit per-token metadata for runtime byte accounting.
parameters: null
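With a custom tokenizer, bits per byte has to be normalized by the number of bytes each token represents rather than by token count. A minimal sketch of that accounting, assuming per-token losses in nats and a per-token byte length taken from the bundle's metadata (the function name `bits_per_byte` and both argument names are hypothetical, not the PR's):

```python
import math

def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Convert summed per-token loss (nats) to bits per byte.

    Assumption: token_byte_lengths[i] is the number of raw bytes
    that token i covers, as read from per-token metadata.
    """
    if len(token_nll_nats) != len(token_byte_lengths):
        raise ValueError("need one byte length per scored token")
    total_bytes = sum(token_byte_lengths)
    if total_bytes <= 0:
        raise ValueError("total byte count must be positive")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / total_bytes

if __name__ == "__main__":
    # Four tokens at exactly 1 bit each, covering 8 bytes in total.
    print(bits_per_byte([math.log(2)] * 4, [1, 1, 2, 4]))
```

Dividing by bytes rather than tokens is what makes scores comparable across tokenizers with different vocabulary sizes and average token lengths.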
Evaluation
sliding window eval
parameters: null
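The summary gives no window length or stride, so the sketch below only illustrates the usual sliding-window scheme under assumed values: each window supplies up to `context` tokens of history, but only tokens not scored by an earlier window count toward the loss, so every token is scored exactly once. The function name and the tuple layout are hypothetical.

```python
def sliding_windows(n_tokens, context, stride):
    """Plan sliding-window evaluation over n_tokens positions.

    Returns (start, end, score_from) tuples: the model reads
    tokens[start:end] and only positions score_from..end-1 are scored.
    Assumption: 0 < stride <= context.
    """
    if not 0 < stride <= context:
        raise ValueError("require 0 < stride <= context")
    windows = []
    scored = 0  # tokens [0, scored) have already been scored
    while scored < n_tokens:
        # The first window scores a full context; later ones advance by stride.
        end = min(n_tokens, context if scored == 0 else scored + stride)
        start = max(0, end - context)
        windows.append((start, end, scored))
        scored = end
    return windows

if __name__ == "__main__":
    for window in sliding_windows(n_tokens=10, context=4, stride=2):
        print(window)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.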

Novel Contributions

  • Replaced the default sp1024 tokenization with a novel TokenMonster-derived tokenizer ('Scylla').
  • Used autoresearch and proxy validation to search tokenizer families and promote a pruned TokenMonster variant.
  • Built a full-data retokenized FineWeb bundle with metadata-driven runtime byte accounting.
  • Applied legal score-first test-time training following the accepted PR #461 framework.
  • Reported a new record validation score (val_bpb) of 1.08056553 bits per byte.