PR #1782 (open)

Non-record: NN + byte-level PPM adaptive-λ mixture demonstration

val_bpb: 1.4131
Architecture: Transformer
Optimizer:
Artifact Size: 15.87 MB

Training Techniques

Quantization
  • int8 (bits: 8, scope: artifact)
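
For reference, a minimal sketch of symmetric per-tensor int8 quantization at artifact scope; the helper names and the exact scheme are assumptions, not the submission's code:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: store int8 codes plus one
    float32 scale, shrinking a serialized fp32 artifact roughly 4x."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quantize every tensor in a checkpoint-like dict before writing the artifact.
weights = {"embed": np.random.randn(1024, 256).astype(np.float32)}
artifact = {name: quantize_int8(w) for name, w in weights.items()}
restored = {name: dequantize_int8(q, s) for name, (q, s) in artifact.items()}
```
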
Architecture
  • weight tying: tied embeddings in the baseline SP1024 Transformer (parameters: null)
  • GQA: grouped query attention in the baseline model (parameters: {"kv_heads": "8/4"}); see the sketch after this list
  • MLP2x: two-times MLP width in the baseline model (parameters: {"multiplier": 2})
Evaluation
  • subsampled validation eval (parameters: {"tokens": 5000000})

Test-Time Training
  • full TTT (parameters: {"online_on_validation_tokens": true})
Other
  • Adaptive-λ byte-level PPM-D order-5 mixture with the neural network in probability space during evaluation (parameters: {"order": 5, "adaptive_lambda": true, "byte_level": true})
Compression
  • zlib (level: null)

Sequence Length
  • train_length: 1024, eval_length: null

Novel Contributions

  • Adaptive-λ byte-level PPM-D order-5 mixture with the neural network
  • Demonstration that byte-level mixture gains persist across multiple NN quality tiers
  • Non-record empirical evidence that leaderboard submissions without an n-gram cache leave headroom for byte-level statistical predictors
  • Validation of a composable eval-time mixture that can be added to eval_val with minimal changes