PR #1785 (open)

Record: SP4096 + byte-level PPM adaptive-λ mixture — val_bpb 1.01925 (3-seed)

val_bpb: 1.01925
Architecture: Transformer
Optimizer: (unspecified)
Artifact Size: 15.96–15.98 MB

Training Techniques

  • Quantization: GPTQ (bits: 6, scope: model weights)
  • Compression: lzma (level: null)
  • Weight Averaging: EMA (parameters: null)
  • Architecture: depth recurrence (parameters: null). Depth-recurrent Transformer stack inherited from the base submission.
  • Architecture: LeakyReLU (parameters: null). Leaky ReLU activation used in the MLP.
  • Evaluation: sliding window eval (parameters: {"context_length":4096})
  • Test-Time Training: score-first TTT (parameters: {"method":"byte-level PPM mixture on already-scored val tokens"})
  • Sequence Length: train_length: 4096, eval_length: 4096
  • Other: adaptive-λ byte-level PPM-D mixture with NN per-token logprobs in byte-probability space, gated by PPM confidence (parameters: {"order":5,"lambda_high":0.9,"lambda_low":0.05,"threshold":0.9}). A sketch follows below.

Novel Contributions

  • Adds a byte-level order-5 PPM-D predictor at evaluation time.
  • Mixes NN and PPM probabilities in byte-probability space.
  • Uses an adaptive-λ gate based on PPM confidence to route rare-repeat bytes to PPM.
  • Collects per-token logprobs across ranks and applies the mixture to the first 5M tokens.
  • Compresses the training script with an lzma+base85 exec-stub (sketched below) to fit under the 16 MB artifact cap.
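
A minimal sketch of the lzma+base85 exec-stub from the last bullet; the file names are illustrative and the real packer may differ:

```python
import base64
import lzma

# Pack step: compress the training script and emit a tiny
# self-extracting stub that rebuilds and exec()s it at runtime.
with open("train.py", "rb") as src:
    blob = base64.b85encode(lzma.compress(src.read())).decode("ascii")

stub = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({blob!r})).decode())\n"
)
with open("train_stub.py", "w") as out:
    out.write(stub)
```

base85 keeps the compressed payload ASCII-safe inside the stub's source, at about 25% size overhead over the raw lzma bytes.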