val_bpb: 1.8111
Architecture: Transformer
Optimizer: —
Artifact Size: 15,705,009 bytes
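For reference, the headline metric val_bpb is bits per byte: the total code length, in bits, that the model assigns to the evaluation text, divided by the text's length in raw bytes. A minimal sketch of such a metric (the exact evaluation harness is not described in this card):

```python
def bits_per_byte(token_log2_probs, n_bytes):
    """Total code length in bits assigned by the model to the evaluated
    tokens, divided by the text's length in raw bytes. Sketch only; the
    exact evaluation harness is an assumption, not taken from the card."""
    return -sum(token_log2_probs) / n_bytes
```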
Training Techniques
- Architecture (doc_copy_ctx2): Document-local copy expert over a discounted hashed 4-gram backoff chain; the active scoring path effectively uses doc_copy_ctx2 only. Parameters: {"doc_copy_contexts": 2, "ngram_contexts": 3}
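The technique above can be sketched as follows. This is an illustrative reconstruction, not the submitted implementation: the class name, the FNV-style context hash, and the absolute-discounting scheme are all assumptions; only the idea of document-local follower tables over a hashed n-gram backoff chain, with ngram_contexts=3 (contexts up to a 4-gram), comes from the entry.

```python
from collections import defaultdict

class DocCopyBackoff:
    """Illustrative sketch of a document-local copy expert over a
    discounted hashed n-gram backoff chain. The hash, the discount
    value, and all names here are assumptions; only ngram_contexts=3
    (contexts up to a 4-gram) comes from the entry."""

    def __init__(self, ngram_contexts=3, discount=0.4, table_bits=20):
        self.orders = ngram_contexts        # context lengths 1..3
        self.discount = discount            # absolute discount per follower
        self.mask = (1 << table_bits) - 1   # hashed-context key space
        # one follower-count table per context length, document-local
        self.tables = [defaultdict(lambda: defaultdict(int))
                       for _ in range(self.orders)]

    def _hash(self, ctx):
        h = 2166136261                      # FNV-1a over token ids
        for tok in ctx:
            h = ((h ^ tok) * 16777619) & 0xFFFFFFFF
        return h & self.mask

    def reset_document(self):
        """Clear all counts so copying stays strictly document-local."""
        for table in self.tables:
            table.clear()

    def observe(self, history, next_tok):
        """Count next_tok as a follower of every matching context length."""
        for n in range(1, self.orders + 1):
            if len(history) >= n:
                self.tables[n - 1][self._hash(history[-n:])][next_tok] += 1

    def predict(self, history):
        """Discounted backoff: score followers of the longest matching
        context first, handing leftover mass down the chain. Returns
        (token -> probability, mass left over for a fallback model)."""
        probs = defaultdict(float)
        remaining = 1.0
        for n in range(self.orders, 0, -1):
            if len(history) < n:
                continue
            followers = self.tables[n - 1].get(self._hash(history[-n:]))
            if not followers:
                continue
            total = sum(followers.values())
            for tok, count in followers.items():
                probs[tok] += remaining * max(count - self.discount, 0) / total
            remaining *= self.discount * len(followers) / total
        return probs, remaining
```

Tables are cleared between documents, so a repeated span is only "copied" from earlier in the same document; leftover probability mass would back off to a base model.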
- Compression (lzma): level: null
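A minimal sketch of this compression step, assuming Python's stdlib lzma with its default preset (consistent with level: null) and reading the 16MB cap from the contributions list as 16,000,000 bytes; both the module choice and the cap interpretation are assumptions.

```python
import lzma

def compress_artifact(raw: bytes, cap: int = 16_000_000) -> bytes:
    """Sketch of the final compression step: lzma with the library's
    default preset (matching `level: null`), then a size check against
    the cap. Reading "16MB" as 16,000,000 bytes is an assumption; the
    reported artifact (15,705,009 bytes) fits either reading."""
    packed = lzma.compress(raw)
    if len(packed) > cap:
        raise ValueError(f"artifact is {len(packed)} bytes, over the {cap}-byte cap")
    return packed
```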
- Sequence Length (sequence_length): train_length: 16300000, eval_length: null
- Other: Packed 10-bit follower token storage to reduce artifact size. Parameters: {"bits": 10}
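Packing follower token ids at 10 bits each (rather than, say, a 16-bit integer per id) can be sketched like so; the little-endian bit layout and the function names are assumptions, and only bits=10 is from the entry.

```python
def pack_10bit(values):
    """Pack follower token ids into 10 bits each (vs. 16 for uint16).
    Little-endian bit order; the exact layout is an assumption."""
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        assert 0 <= v < 1 << 10, "follower ids must fit in 10 bits"
        acc |= v << nbits
        nbits += 10
        while nbits >= 8:                 # flush whole bytes
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                             # trailing partial byte
        buf.append(acc & 0xFF)
    return bytes(buf)

def unpack_10bit(data, count):
    """Inverse of pack_10bit: recover `count` 10-bit values."""
    out, acc, nbits = [], 0, 0
    for b in data:
        acc |= b << nbits
        nbits += 8
        while nbits >= 10 and len(out) < count:
            out.append(acc & 0x3FF)
            acc >>= 10
            nbits -= 10
    return out
```

Five 10-bit values occupy 50 bits, i.e. 7 bytes instead of 10, a 37.5% saving over uint16 storage.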
Novel Contributions
- Document-local copy expert over a discounted hashed 4-gram backoff chain
- Packed 10-bit follower token storage
- lzma state compression to fit under the 16MB cap
- Artifact-only evaluation on the official fineweb_val_* split
- No training-shard access during final evaluation