PR #1796
open
Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)
by simon-marcus
val_bpb
1.0806
Architecture
Transformer
Optimizer
—
Artifact Size
15,855,763 bytes
Training Techniques
Test-Time Training
score-first TTT
parameters: null
Other
other
Custom TokenMonster-derived tokenizer ('Scylla') selected via autoresearch and proxy validation, then used to retokenize the full FineWeb bundle.
parameters: null
other
Full-data retokenized competition bundle with explicit per-token metadata for runtime byte accounting.
parameters: null
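Why per-token byte metadata matters: with a custom tokenizer the number of tokens per byte changes, so a per-token loss is not comparable across tokenizers and has to be normalised by the bytes the tokens cover. A minimal sketch of that accounting, assuming the loss is available per token in nats and that the metadata reduces to a token-id to byte-count lookup. The names `bits_per_byte` and `bytes_per_token` are hypothetical and do not come from the PR.

```python
import math


def bits_per_byte(token_ids, token_nll_nats, bytes_per_token):
    """Convert per-token losses (nats) into bits per byte.

    bytes_per_token is a hypothetical lookup: token id -> number of bytes
    that token decodes to. The PR's actual metadata format is not shown.
    """
    if len(token_ids) != len(token_nll_nats):
        raise ValueError("need exactly one loss value per token")
    total_bytes = sum(bytes_per_token[t] for t in token_ids)
    if total_bytes == 0:
        raise ValueError("no bytes to normalise by")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / total_bytes


# Toy example: three tokens covering 5 + 1 + 2 = 8 bytes,
# with losses of 4, 1 and 3 bits expressed in nats.
byte_table = {0: 5, 1: 1, 2: 2}
ids = [0, 1, 2]
losses = [4.0 * math.log(2), 1.0 * math.log(2), 3.0 * math.log(2)]
bpb = bits_per_byte(ids, losses, byte_table)
```

Dividing by bytes rather than tokens means a tokenizer cannot lower the score simply by emitting fewer, longer tokens.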
Evaluation
sliding window eval
parameters: null
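For orientation, sliding window evaluation usually means overlapping context windows in which only the tokens not yet scored contribute to the loss, so every token is counted once with as much left context as fits. A sketch of the span bookkeeping under that assumption. The window and stride values used by this PR are not stated, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) spans covering n_tokens.

    The model reads tokens[start:end]; only tokens[score_from:end]
    are scored, so no token is counted twice.
    """
    if not (0 < stride <= window):
        raise ValueError("need 0 < stride <= window")
    spans = []
    scored_up_to = 0
    while scored_up_to < n_tokens:
        if scored_up_to == 0:
            # First window: nothing to reuse as context, score all of it.
            end = min(window, n_tokens)
        else:
            # Later windows: advance by stride, reuse earlier tokens as context.
            end = min(scored_up_to + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored_up_to))
        scored_up_to = end
    return spans


spans = sliding_windows(n_tokens=10, window=4, stride=2)
```

A smaller stride gives each scored token more context at the cost of more forward passes.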
Novel Contributions
- Replaced the default sp1024 tokenization with a novel TokenMonster-derived tokenizer ('Scylla').
- Used autoresearch and proxy validation to search tokenizer families and promote a pruned TokenMonster variant.
- Built a full-data retokenized FineWeb bundle with metadata-driven runtime byte accounting.
- Applied legal score-first test-time training following the accepted PR #461 framework.
- Reported a new record validation score of 1.08056553 bpb.
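
The score-first ordering can be illustrated with a toy loop: each chunk is scored with a model state that has not yet seen it, and only afterwards is the model adapted on that chunk. This is a sketch of the general idea only. The details of the PR #461 framework are not given here, and the stand-in model below is a Laplace-smoothed unigram counter, not the Transformer used in the submission. `score_first_ttt` is a hypothetical name.

```python
import math
from collections import Counter


def score_first_ttt(chunks, vocab_size):
    """Score each chunk before adapting on it; return total NLL in nats.

    Stand-in model (assumption): Laplace-smoothed unigram counts.
    """
    counts = Counter()
    seen = 0
    total_nll = 0.0
    for chunk in chunks:
        # 1) Score with a state that has not been updated on this chunk.
        for tok in chunk:
            p = (counts[tok] + 1) / (seen + vocab_size)
            total_nll -= math.log(p)
        # 2) Only then adapt on the chunk that was just scored.
        counts.update(chunk)
        seen += len(chunk)
    return total_nll


total = score_first_ttt([[0, 0], [0, 0]], vocab_size=2)
```

Because the update happens strictly after scoring, the reported loss for a chunk never benefits from training on that same chunk.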