PR #1782 (open)

Non-record: NN + byte-level PPM adaptive-λ mixture demonstration

val_bpb: 1.4131
Architecture: Transformer
Optimizer:
Artifact Size: 15.87 MB

Training Techniques

Quantization
  • int8 (bits: 8, scope: artifact)
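
For reference, a minimal sketch of symmetric per-tensor int8 quantization at artifact scope; the helper names and the exact scheme are assumptions, not the submission's code:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: store int8 codes plus one
    float32 scale, shrinking a serialized fp32 artifact roughly 4x."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quantize every tensor in a checkpoint-like dict before writing the artifact.
weights = {"embed": np.random.randn(1024, 256).astype(np.float32)}
artifact = {name: quantize_int8(w) for name, w in weights.items()}
restored = {name: dequantize_int8(q, s) for name, (q, s) in artifact.items()}
```
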
Architecture
  • weight tying: tied embeddings in the baseline SP1024 Transformer (parameters: null)
  • GQA: grouped query attention in the baseline model (parameters: {"kv_heads": "8/4"}); see the sketch after this list
  • MLP2x: two-times MLP width in the baseline model (parameters: {"multiplier": 2})
Evaluation
  • subsampled validation eval (parameters: {"tokens": 5000000})

Test-Time Training
  • full TTT (parameters: {"online_on_validation_tokens": true})
Other
  • Adaptive-λ byte-level PPM-D order-5 mixture with the neural network in probability space during evaluation (parameters: {"order": 5, "adaptive_lambda": true, "byte_level": true})
Compression
  • zlib (level: null)

Sequence Length
  • train_length: 1024, eval_length: null

Novel Contributions

  • Adaptive-λ byte-level PPM-D order-5 mixture with the neural network
  • Demonstration that byte-level mixture gains persist across multiple NN quality tiers
  • Non-record empirical evidence that leaderboard submissions without an n-gram cache leave headroom for byte-level statistical predictors
  • Validation of a composable eval-time mixture that can be added to eval_val with minimal changes