PR #1850

open

Record: SP8192 + Strict Full-Val Byte PPM Mixture — 1.00495 BPB (3-seed mean)

by someone114514
val_bpb
1.0050
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,997,433 B

Training Techniques

Architecture
weight tying
Tied input and output embeddings in the base SP8192 stack.
parameters: null
depth recurrence
Layer recurrence over a subset of layers in the base architecture.
parameters: {"layers":[3,5]}
RoPE
Partial rotary positional embeddings used in the base stack.
parameters: null
LeakyReLU
LeakyReLU activation used in the MLP blocks.
parameters: {"slope":0.5}
MLP3x
Expanded MLP width in the base architecture.
parameters: {"multiplier":4}
Weight Averaging
EMA
parameters: {"decay":0.9965}
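As a rough illustration of the EMA entry above, the sketch below keeps a running average of the weights with the listed decay of 0.9965. The dict-of-floats layout and the name `ema_update` are assumptions made for the example, not taken from the submission.

```python
def ema_update(ema, params, decay=0.9965):
    """Blend the current parameters into the running average, in place.

    `ema` and `params` are hypothetical dicts mapping a parameter name to a
    float; a real training loop would hold tensors instead.
    """
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema


# One update starting from 0 moves the average 0.35% of the way to the new value.
ema = ema_update({"w": 0.0}, {"w": 1.0})
```

With a decay this close to 1, the average effectively spans the last few hundred optimizer steps (on the order of 1 / (1 - 0.9965), about 286).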
Optimizer
AdamW
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022}
LR Schedule
warmdown
parameters: {"warmdown":0.72}
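The card gives only `warmdown: 0.72`. One plausible reading, assumed here and not confirmed by the source, is that the learning rate stays flat and then decays linearly to zero over the final 72% of training steps. The function name and the linear shape are illustrative.

```python
def lr_scale(step, total_steps, warmdown=0.72):
    """Return the LR multiplier for `step`.

    ASSUMPTION: constant at 1.0, then linear decay to 0.0 over the last
    `warmdown` fraction of training. The source does not state the shape.
    """
    if warmdown <= 0.0:
        return 1.0
    start = (1.0 - warmdown) * total_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / (warmdown * total_steps))


# Under this reading the matrix LR at a given step would be 0.022 * lr_scale(step, total_steps).
```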
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
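For the logit softcap entry above, a minimal sketch assuming the common tanh form, which squashes each logit smoothly into (-30, 30). The source lists only the cap value, so the exact formula is an assumption.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near zero it is almost the identity."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through nearly unchanged (`softcap(1.0)` is about 0.9996), while very large ones saturate at the cap.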
Quantization
GPTQ
bits: 6
scope: attention/MLP
int8
bits: 8
scope: embeddings
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":64}
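Sliding-window evaluation with stride 64 can be sketched as follows. The window length `seq_len` and the helper name are assumptions for illustration. Each window advances by `stride` tokens, and only the tokens not yet scored by an earlier window count toward the loss, so every token is scored exactly once.

```python
def sliding_windows(n_tokens, seq_len, stride=64):
    """Yield (begin, end, score_from) triples.

    The model would see tokens[begin:end]; only positions score_from..end-1
    contribute to the loss. `seq_len` is a hypothetical context length.
    """
    assert 0 < stride <= seq_len
    begin = 0
    scored_upto = 0
    while scored_upto < n_tokens:
        end = min(begin + seq_len, n_tokens)
        yield begin, end, scored_upto
        scored_upto = end
        begin += stride


for begin, end, score_from in sliding_windows(300, 128, 64):
    print(begin, end, score_from)
```

After the first window, every scored token has at least `seq_len - stride` tokens of left context, which is the reason a small stride usually lowers the measured BPB at the cost of more forward passes.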
Other
other
Strict full-validation byte-level PPM-D mixture applied online at evaluation time with prefix-only binary gating and score-before-update byte ordering.
parameters: {"ppm_order":4,"lambda_hi":0.9,"lambda_lo":0.05,"conf_threshold":0.9}
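The gating described above might look roughly like the sketch below. Two points are assumptions, since the card does not spell them out: `lambda` is taken to be the weight on the NN probability, and the confidence is taken to be a number computed from the PPM state for the prefix alone (for example, the top PPM probability), never from the byte being scored.

```python
import math


def mix_byte_prob(p_nn, p_ppm, ppm_conf,
                  lambda_hi=0.9, lambda_lo=0.05, conf_threshold=0.9):
    """Mixture probability of the byte that actually occurred.

    ASSUMPTION: lambda weights the NN term. When the PPM context is confident
    the mixture leans on PPM (lambda_lo); otherwise it leans on the NN
    (lambda_hi). `ppm_conf` must depend only on the prefix, so the gate is
    fixed before the byte is revealed and the mixture stays a valid distribution.
    """
    lam = lambda_lo if ppm_conf >= conf_threshold else lambda_hi
    return lam * p_nn + (1.0 - lam) * p_ppm


def bits(p):
    """Code length in bits for an event of probability p."""
    return -math.log2(p)


# Confident PPM context: the mixture follows PPM and the byte costs few bits.
print(bits(mix_byte_prob(p_nn=0.5, p_ppm=0.99, ppm_conf=0.95)))
```

Summing `bits(...)` over every validation byte and dividing by the byte count gives a BPB figure under these assumptions.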

Novel Contributions

  • Strict full-validation byte-level PPM-D mixture over the sliding-window NN scores
  • Prefix-only binary gating between NN and PPM based on context confidence
  • Score-before-update online byte scoring to preserve causality
  • Native C runtime scorer with open-addressed context tables and cached logs
  • Removal of eval-time TTT from the packed artifact to fit under the 16 MB cap
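To make the score-before-update ordering concrete, here is a toy byte-level PPM with method-D escapes. It is a simplified stand-in, not the submission's scorer: it omits symbol exclusion, uses a Python dict in place of the open-addressed C tables, and all names (`TinyPPMD`, `score_stream`) are hypothetical.

```python
import math
from collections import defaultdict


class TinyPPMD:
    """Toy byte-level PPM, method-D escapes, no symbol exclusion."""

    def __init__(self, order=4):
        self.order = order
        # counts[context_bytes][next_byte] -> occurrences seen so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, history, byte):
        """Probability of `byte` given only bytes already seen."""
        p_escape = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(bytes(history[len(history) - k:]))
            if not table:
                continue  # context never seen: fall to a shorter one at no cost
            total = sum(table.values())
            if byte in table:
                return p_escape * (2 * table[byte] - 1) / (2 * total)
            p_escape *= len(table) / (2 * total)  # method-D escape mass
        return p_escape / 256.0  # order -1: uniform over all byte values

    def update(self, history, byte):
        """Record `byte` under every context length from 0 up to `order`."""
        for k in range(min(self.order, len(history)) + 1):
            self.counts[bytes(history[len(history) - k:])][byte] += 1


def score_stream(data, order=4):
    """Average bits per byte, scoring each byte before the model sees it."""
    model = TinyPPMD(order)
    total_bits = 0.0
    for i, byte in enumerate(data):
        history = data[max(0, i - order):i]
        total_bits += -math.log2(model.prob(history, byte))  # score first
        model.update(history, byte)                          # then update
    return total_bits / max(1, len(data))


print(score_stream(b"aa"))  # first byte costs 8 bits, second costs 1 bit
```

Swapping the two lines inside the loop would let the model count a byte before predicting it, which leaks the answer and invalidates the score. Keeping the update strictly after the score is what preserves causality.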