PR #1991

Status: open

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean)

by joshuaswanson
val_bpb: 0.9429
Architecture: Transformer
Optimizer:
Artifact Size: 15.97 MB

Training Techniques

Architecture
weight tying
Tied token embeddings in the SP8192 model.
parameters: null
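
A minimal PyTorch sketch of the tying, assuming SP8192 denotes an 8192-entry vocabulary; the module and dimension names are illustrative, not taken from this PR:

import torch.nn as nn

class TiedHead(nn.Module):
    """Sketch of weight tying: the output projection shares the token
    embedding matrix, so the artifact stores the table only once."""
    def __init__(self, vocab_size=8192, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.embed.weight  # one shared parameter tensor

    def forward(self, hidden):  # hidden: (batch, seq, d_model)
        return self.head(hidden)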
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"train_length":null,"eval_length":null}
depth recurrence
Layer recurrence/looping across encoder and decoder layer sequences.
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
Regularization
layerwise LN scale
parameters: null
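
No parameters are listed, so the semantics are not pinned down; one plausible reading of "layerwise LN scale" is a learnable per-layer scalar applied to the LayerNorm output. A sketch under that assumption only:

import torch
import torch.nn as nn

class ScaledLN(nn.Module):
    """One possible reading of "layerwise LN scale": LayerNorm followed by
    a single learnable scalar per layer instance. Interpretation assumed,
    not confirmed by the PR."""
    def __init__(self, d_model):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.scale = nn.Parameter(torch.ones(1))  # one scalar per layer

    def forward(self, x):
        return self.scale * self.ln(x)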
Quantization
GPTQ
bits: 6
scope: attention/MLP
int8
bits: 8
scope: token embeddings
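
GPTQ itself is an offline, second-order per-layer quantizer and is not reproduced here; below is a sketch of only the simpler int8 path for the token embeddings. Per-row symmetric scaling is an assumption, since the PR states only bits=8 and scope=token embeddings:

import torch

def quantize_embedding_int8(weight):
    """Sketch: symmetric per-row int8 quantization of an embedding table.
    Returns the int8 codes and per-row float scales."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = (weight / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale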
Compression
lzma
level: null
Brotli
level: 11
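
A sketch of how the serialized artifact might be run through both codecs with the smaller result kept; Brotli quality 11 matches the listed level, while the lzma preset (listed as null) and the two-byte framing tags are assumptions:

import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Sketch: compress the serialized artifact with LZMA and Brotli,
    keep whichever is smaller. Framing is illustrative."""
    candidates = [b"xz" + lzma.compress(blob, preset=9)]  # preset assumed
    try:
        import brotli  # third-party package
        candidates.append(b"br" + brotli.compress(blob, quality=11))
    except ImportError:
        pass
    return min(candidates, key=len)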
Evaluation
sliding window eval
parameters: null
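
No parameters are listed; the sketch below scores each token with up to `window` tokens of left context, skipping tokens already scored in the overlap. Window and stride values are illustrative:

import math
import torch

@torch.no_grad()
def sliding_window_bits(model, ids, window=1024, stride=512):
    """Sketch of sliding-window evaluation over a 1-D token tensor.
    Returns total bits; val_bpb = bits / byte count of the scored text."""
    total_nll = 0.0
    for start in range(0, len(ids) - 1, stride):
        chunk = ids[start : start + window + 1]
        x, y = chunk[:-1], chunk[1:]
        first = 0 if start == 0 else window - stride  # skip scored overlap
        if first >= len(y):
            break  # tail fully covered by the previous window
        logits = model(x.unsqueeze(0)).squeeze(0)     # (T, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        total_nll += -logp[torch.arange(first, len(y)), y[first:]].sum().item()
    return total_nll / math.log(2)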
Other
byte-PPM mixer
Causal byte-PPM mixer used at evaluation time with tuned order and gate hyperparameters.
parameters: {"PPM_ORDER":5,"PPM_T":0.8,"PPM_H":0.99,"PPM_L":0.2}

Novel Contributions

  • Systematic offline sweep of byte-PPM mixer hyperparameters on the SP8192 distribution (see the sketch after this list)
  • Improved PPM order from 4 to 5
  • Tuned gate threshold and lambda parameters for the causal PPM mixer
  • Achieved 0.94290 val_bpb 3-seed mean on full FineWeb validation
  • Kept training pipeline and neural network byte-identical to PR #1959
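
Because the network and training pipeline are byte-identical to PR #1959, such a sweep only re-runs the cheap evaluation-time mixer. A hypothetical sketch of the grid search; the grids and the eval_bpb callback are illustrative, not the PR's actual sweep code:

import itertools

def sweep(eval_bpb):
    """Sketch: pick the (PPM_ORDER, PPM_T, PPM_H, PPM_L) tuple that
    minimizes held-out val_bpb. eval_bpb is a hypothetical callback that
    evaluates the frozen model with the mixer configured accordingly."""
    grid = itertools.product(
        [3, 4, 5, 6],              # PPM_ORDER (illustrative grid)
        [0.6, 0.7, 0.8, 0.9],      # PPM_T
        [0.9, 0.95, 0.99],         # PPM_H
        [0.1, 0.2, 0.3],           # PPM_L
    )
    return min(grid, key=lambda cfg: eval_bpb(*cfg))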