PR #1143

open

Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)

by simon-marcus
val_bpb
1.0806
Architecture
Transformer
Optimizer
Artifact Size
15,855,763 bytes

Training Techniques

Other
other
Custom TokenMonster-derived tokenizer ('Scylla') selected via autoresearch and proxy validation, then used to retokenize the full FineWeb competition bundle.
parameters: {"parent_tokenizer":"english-1024-clean-v1","tokenizer_name":"Scylla"}
other
Runtime byte accounting driven by explicit per-token metadata rather than SentencePiece runtime inspection.
parameters: null
other
Full-data retokenized FineWeb bundle with preserved shard ordering and validation ordering.
parameters: {"train_shards":79,"val_shards":1}
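The metadata-driven byte accounting listed above can be illustrated with a minimal sketch. It assumes the tokenizer ships a table giving the number of raw bytes each token contributes; the names `bits_per_byte` and `BYTES_PER_TOKEN` and the toy values are hypothetical and are not taken from the PR.

```python
import math

def bits_per_byte(token_ids, token_nll_nats, bytes_per_token):
    """Convert per-token losses (in nats) into bits per byte.

    bytes_per_token is a hypothetical lookup table: token id -> number of
    raw bytes that token contributes to the decoded text.
    """
    if len(token_ids) != len(token_nll_nats):
        raise ValueError("need one loss value per token")
    total_bytes = sum(bytes_per_token[t] for t in token_ids)
    if total_bytes == 0:
        raise ValueError("no bytes scored")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Toy vocabulary (illustrative only): three tokens covering 1, 2 and 5 bytes.
BYTES_PER_TOKEN = [1, 2, 5]

# Each token costs exactly one bit, so 3 bits are spread over 8 bytes.
example_bpb = bits_per_byte([0, 1, 2], [math.log(2)] * 3, BYTES_PER_TOKEN)
```

Because the byte count comes from a static table rather than from inspecting the tokenizer at runtime, the denominator of val_bpb does not depend on which tokenizer library is installed in the evaluation environment.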
Test-Time Training
score-first TTT
parameters: {"legal":true,"backward_looking":true}
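As the name and the backward_looking flag suggest, score-first TTT scores each chunk of validation data before the model adapts on it, so no token is scored by a model that has already trained on that token. The sketch below shows only that ordering. It swaps in a toy count-based unigram model for the real network; `UnigramModel` and `score_first_ttt` are hypothetical names, and the PR's actual adaptation step is not shown here.

```python
import math

class UnigramModel:
    """Toy stand-in for the real network: add-one smoothed token counts."""

    def __init__(self, vocab_size):
        self.counts = [1] * vocab_size
        self.total = vocab_size

    def nll(self, tokens):
        # Negative log-likelihood in nats under the current counts.
        return -sum(math.log(self.counts[t] / self.total) for t in tokens)

    def adapt(self, tokens):
        # "Training" step: here, just count the tokens that were seen.
        for t in tokens:
            self.counts[t] += 1
            self.total += 1


def score_first_ttt(model, chunks):
    """Score every chunk with past-only state, then adapt on it."""
    total_nll = 0.0
    for chunk in chunks:
        total_nll += model.nll(chunk)  # 1) score before seeing the chunk
        model.adapt(chunk)             # 2) adapt, helping later chunks only
    return total_nll


chunks = [[0, 0], [0, 0]]
ttt_nll = score_first_ttt(UnigramModel(vocab_size=2), chunks)

# Baseline without any adaptation, for comparison.
static_nll = UnigramModel(vocab_size=2).nll([0, 0, 0, 0])
```

In this toy run the first chunk is scored under the untouched prior, and only the second chunk benefits from adaptation, which is why the adapted loss falls below the static one without any token ever being trained on before it is scored.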
Evaluation
sliding window eval
parameters: null
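A generic sliding-window evaluation schedule can be sketched as follows. The PR lists no window or stride parameters, so the values below are purely illustrative and `sliding_windows` is a hypothetical helper. Each window re-reads earlier tokens as context but scores only the tokens that no previous window has scored.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (ctx_start, end, score_start) triples.

    Tokens in [score_start, end) are scored; tokens in
    [ctx_start, score_start) only provide context.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    spans = []
    prev_end = 0
    while prev_end < n_tokens:
        # The first window scores everything it covers; later windows
        # advance by `stride` and score only the newly covered tokens.
        step = window if prev_end == 0 else stride
        end = min(prev_end + step, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, prev_end))
        prev_end = end
    return spans


spans = sliding_windows(n_tokens=10, window=4, stride=2)
# spans == [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

A smaller stride gives each scored token more left context at the cost of more forward passes; every token is still scored exactly once.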

Novel Contributions

  • TokenMonster-derived tokenizer ('Scylla') discovered through iterative autoresearch and proxy validation
  • Full-data retokenization of the competition bundle using the promoted tokenizer
  • Legal score-first backward-looking TTT evaluation path
  • Explicit metadata-driven runtime byte accounting for tokenizer evaluation
  • Tokenizer search as a leaderboard-relevant optimization axis