PR #1854
Record: PR #1797 base + PPM-D byte mixture — val_bpb 0.90236 (3-seed mean)
by ndokutovich
val_bpb: 0.9024
Architecture: Transformer
Optimizer: —
Artifact Size: 15.95 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
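A minimal PyTorch sketch of the tying; module names here are illustrative, not the PR's:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie input and output embeddings: one shared (vocab, d_model) tensor.
        self.lm_head.weight = self.embed.weight

    def forward(self, idx):
        h = self.embed(idx)          # (batch, seq, d_model); backbone omitted
        return self.lm_head(h)       # logits reuse the embedding matrix
```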
SmearGate
Smear gate used in the PR #1797 base stack.
parameters: {"gate_window":12}
LeakyReLU
LeakyReLU(0.5)^2 activation.
parameters: null
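Taken literally, this is a negative-slope-0.5 LeakyReLU squared elementwise:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU with negative slope 0.5, then squared; note the square
    # makes the negative branch non-negative (0.25 * x**2 for x < 0).
    return F.leaky_relu(x, negative_slope=0.5).square()
```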
depth recurrence
Looped encoder/decoder depth recurrence with parallel residual start.
parameters: {"layers":11,"parallel_residual_start":8}
Gated Attention
SparseAttnGate / PolarNS attention gating in the base stack.
parameters: null
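No parameters are recorded, so this is only a generic illustration of attention output gating, not the SparseAttnGate / PolarNS formulation itself:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        # A sigmoid gate computed from the block input modulates the
        # attention output before the residual add.
        return torch.sigmoid(self.gate(x)) * out
```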
Quantization
GPTQ
bits: 6
scope: matrix weights
GPTQ
bits: 7
scope: embeddings
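GPTQ proper chooses roundings with Hessian-aware error correction; as a stand-in, here is plain per-channel round-to-nearest quantization showing what the 6- and 7-bit codes store:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    # Per-output-channel symmetric round-to-nearest to `bits` bits.
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6-bit, 63 for 7-bit
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                 # int8 holds 7-bit codes fine

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Per the card: bits=6 for matrix weights, bits=7 for embeddings.
```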
Compression
brotli + lzma
level: null
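With no level recorded, "brotli + lzma" could mean chaining the codecs or picking the better one per artifact; a sketch of the pick-the-smaller reading, with maximum settings assumed:

```python
import lzma

import brotli  # third-party: pip install Brotli

def best_compress(data: bytes) -> tuple[str, bytes]:
    # Try both codecs and keep whichever output is smaller.
    candidates = {
        "brotli": brotli.compress(data, quality=11),
        "lzma": lzma.compress(data, preset=9 | lzma.PRESET_EXTREME),
    }
    winner = min(candidates, key=lambda name: len(candidates[name]))
    return winner, candidates[winner]
```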
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
Evaluation
single left-to-right pass
parameters: null
Other
PPM-D byte mixture
PPM-D byte-level mixture applied at evaluation time, combining neural and byte-context probabilities with a binary lambda gate.
parameters: {"order":5,"subset_tokens":8000000,"lambda_hi":0.9,"lambda_lo":0.05,"confidence_threshold":0.9}
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- Ports the PR #1835 PPM-D byte-level mixture onto the PR #1797 neural base stack.
- Uses a score-first, causal byte-level mixture that updates PPM counts only after scoring each byte.
- Achieves a 3-seed mean val_bpb of 0.90236 with low variance.
- Includes parallel CaseOps re-tokenization for faster data preparation.
- Combines the PR #1797 neural baseline with eval-time PPM-D augmentation while remaining compliant with Issue #1017.