PR #1785 (open)

Record: SP4096 + byte-level PPM adaptive-λ mixture — val_bpb 1.01925 (3-seed)

val_bpb: 1.01925
Architecture: Transformer
Optimizer: (unspecified)
Artifact Size: 15.96–15.98 MB

Training Techniques

  • Quantization: GPTQ (bits: 6, scope: model weights)
  • Compression: lzma (level: null)
  • Weight Averaging: EMA (parameters: null)
  • Architecture: depth recurrence (parameters: null). Depth-recurrent Transformer stack inherited from the base submission.
  • Architecture: LeakyReLU (parameters: null). Leaky ReLU activation used in the MLP.
  • Evaluation: sliding window eval (parameters: {"context_length":4096})
  • Test-Time Training: score-first TTT (parameters: {"method":"byte-level PPM mixture on already-scored val tokens"})
  • Sequence Length: train_length: 4096, eval_length: 4096
  • Other: adaptive-λ byte-level PPM-D mixture with NN per-token logprobs in byte-probability space, gated by PPM confidence (parameters: {"order":5,"lambda_high":0.9,"lambda_low":0.05,"threshold":0.9}). A sketch follows below.

Novel Contributions

  • Adds a byte-level order-5 PPM-D predictor at evaluation time.
  • Mixes NN and PPM probabilities in byte-probability space.
  • Uses an adaptive-λ gate based on PPM confidence to route rare-repeat bytes to PPM.
  • Collects per-token logprobs across ranks and applies the mixture to the first 5M tokens.
  • Compresses the training script with an lzma+base85 exec-stub (sketched below) to fit under the 16 MB artifact cap.
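
A minimal sketch of the lzma+base85 exec-stub from the last bullet; the file names are illustrative and the real packer may differ:

```python
import base64
import lzma

# Pack step: compress the training script and emit a tiny
# self-extracting stub that rebuilds and exec()s it at runtime.
with open("train.py", "rb") as src:
    blob = base64.b85encode(lzma.compress(src.read())).decode("ascii")

stub = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())\n"
)
with open("train_stub.py", "w") as out:
    out.write(stub)
```

base85 keeps the compressed payload ASCII-safe inside the stub's source, at about 25% size overhead over the raw lzma bytes.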