PR #1991

Status: open

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean)

by joshuaswanson
val_bpb: 0.9429
Architecture: Transformer
Optimizer:
Artifact Size: 15.97 MB

Training Techniques

Architecture
weight tying
Tied token embeddings in the SP8192 model.
parameters: null
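
A minimal PyTorch sketch of the tying, assuming SP8192 denotes an 8192-entry vocabulary; the module and dimension names are illustrative, not taken from this PR:

import torch.nn as nn

class TiedHead(nn.Module):
    """Sketch of weight tying: the output projection shares the token
    embedding matrix, so the artifact stores the table only once."""
    def __init__(self, vocab_size=8192, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.embed.weight  # one shared parameter tensor

    def forward(self, hidden):  # hidden: (batch, seq, d_model)
        return self.head(hidden)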
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"train_length":null,"eval_length":null}
depth recurrence
Layer recurrence/looping across encoder and decoder layer sequences.
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
Regularization
layerwise LN scale
parameters: null
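
No parameters are listed, so the semantics are not pinned down; one plausible reading of "layerwise LN scale" is a learnable per-layer scalar applied to the LayerNorm output. A sketch under that assumption only:

import torch
import torch.nn as nn

class ScaledLN(nn.Module):
    """One possible reading of "layerwise LN scale": LayerNorm followed by
    a single learnable scalar per layer instance. Interpretation assumed,
    not confirmed by the PR."""
    def __init__(self, d_model):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.scale = nn.Parameter(torch.ones(1))  # one scalar per layer

    def forward(self, x):
        return self.scale * self.ln(x)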
Quantization
GPTQ
bits: 6
scope: attention/MLP
int8
bits: 8
scope: token embeddings
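
GPTQ itself is an offline, second-order per-layer quantizer and is not reproduced here; below is a sketch of only the simpler int8 path for the token embeddings. Per-row symmetric scaling is an assumption, since the PR states only bits=8 and scope=token embeddings:

import torch

def quantize_embedding_int8(weight):
    """Sketch: symmetric per-row int8 quantization of an embedding table.
    Returns the int8 codes and per-row float scales."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = (weight / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale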
Compression
lzma
level: null
Brotli
level: 11
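
A sketch of how the serialized artifact might be run through both codecs with the smaller result kept; Brotli quality 11 matches the listed level, while the lzma preset (listed as null) and the two-byte framing tags are assumptions:

import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Sketch: compress the serialized artifact with LZMA and Brotli,
    keep whichever is smaller. Framing is illustrative."""
    candidates = [b"xz" + lzma.compress(blob, preset=9)]  # preset assumed
    try:
        import brotli  # third-party package
        candidates.append(b"br" + brotli.compress(blob, quality=11))
    except ImportError:
        pass
    return min(candidates, key=len)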
Evaluation
sliding window eval
parameters: null
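
No parameters are listed; the sketch below scores each token with up to `window` tokens of left context, skipping tokens already scored in the overlap. Window and stride values are illustrative:

import math
import torch

@torch.no_grad()
def sliding_window_bits(model, ids, window=1024, stride=512):
    """Sketch of sliding-window evaluation over a 1-D token tensor.
    Returns total bits; val_bpb = bits / byte count of the scored text."""
    total_nll = 0.0
    for start in range(0, len(ids) - 1, stride):
        chunk = ids[start : start + window + 1]
        x, y = chunk[:-1], chunk[1:]
        first = 0 if start == 0 else window - stride  # skip scored overlap
        if first >= len(y):
            break  # tail fully covered by the previous window
        logits = model(x.unsqueeze(0)).squeeze(0)     # (T, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        total_nll += -logp[torch.arange(first, len(y)), y[first:]].sum().item()
    return total_nll / math.log(2)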
Other
byte-PPM mixer
Causal byte-PPM mixer used at evaluation time with tuned order and gate hyperparameters.
parameters: {"PPM_ORDER":5,"PPM_T":0.8,"PPM_H":0.99,"PPM_L":0.2}

Novel Contributions

  • Systematic offline sweep of byte-PPM mixer hyperparameters on the SP8192 distribution (see the sketch after this list)
  • Improved PPM order from 4 to 5
  • Tuned gate threshold and lambda parameters for the causal PPM mixer
  • Achieved 0.94290 val_bpb 3-seed mean on full FineWeb validation
  • Kept training pipeline and neural network byte-identical to PR #1959
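
Because the network and training pipeline are byte-identical to PR #1959, such a sweep only re-runs the cheap evaluation-time mixer. A hypothetical sketch of the grid search; the grids and the eval_bpb callback are illustrative, not the PR's actual sweep code:

import itertools

def sweep(eval_bpb):
    """Sketch: pick the (PPM_ORDER, PPM_T, PPM_H, PPM_L) tuple that
    minimizes held-out val_bpb. eval_bpb is a hypothetical callback that
    evaluates the frozen model with the mixer configured accordingly."""
    grid = itertools.product(
        [3, 4, 5, 6],              # PPM_ORDER (illustrative grid)
        [0.6, 0.7, 0.8, 0.9],      # PPM_T
        [0.9, 0.95, 0.99],         # PPM_H
        [0.1, 0.2, 0.3],           # PPM_L
    )
    return min(grid, key=lambda cfg: eval_bpb(*cfg))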