PR #1854
Record: PR #1797 base + PPM-D byte mixture — val_bpb 0.90236 (3-seed mean)
by ndokutovich
val_bpb: 0.9024
Architecture: Transformer
Optimizer: —
Artifact Size: 15.95 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
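A minimal PyTorch sketch of the tying; module names here are illustrative, not the PR's:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie input and output embeddings: one shared (vocab, d_model) tensor.
        self.lm_head.weight = self.embed.weight

    def forward(self, idx):
        h = self.embed(idx)          # (batch, seq, d_model); backbone omitted
        return self.lm_head(h)       # logits reuse the embedding matrix
```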
SmearGate
Smear gate used in the PR #1797 base stack.
parameters: {"gate_window":12}
LeakyReLU
LeakyReLU(0.5)^2 activation.
parameters: null
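Taken literally, this is a negative-slope-0.5 LeakyReLU squared elementwise:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU with negative slope 0.5, then squared; note the square
    # makes the negative branch non-negative (0.25 * x**2 for x < 0).
    return F.leaky_relu(x, negative_slope=0.5).square()
```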
depth recurrence
Looped encoder/decoder depth recurrence with parallel residual start.
parameters: {"layers":11,"parallel_residual_start":8}
Gated Attention
SparseAttnGate / PolarNS attention gating in the base stack.
parameters: null
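No parameters are recorded, so this is only a generic illustration of attention output gating, not the SparseAttnGate / PolarNS formulation itself:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        # A sigmoid gate computed from the block input modulates the
        # attention output before the residual add.
        return torch.sigmoid(self.gate(x)) * out
```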
Quantization
GPTQ
bits: 6
scope: matrix weights
GPTQ
bits: 7
scope: embeddings
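GPTQ proper chooses roundings with Hessian-aware error correction; as a stand-in, here is plain per-channel round-to-nearest quantization showing what the 6- and 7-bit codes store:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    # Per-output-channel symmetric round-to-nearest to `bits` bits.
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6-bit, 63 for 7-bit
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                 # int8 holds 7-bit codes fine

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Per the card: bits=6 for matrix weights, bits=7 for embeddings.
```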
Compression
brotli + lzma
level: null
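With no level recorded, "brotli + lzma" could mean chaining the codecs or picking the better one per artifact; a sketch of the pick-the-smaller reading, with maximum settings assumed:

```python
import lzma

import brotli  # third-party: pip install Brotli

def best_compress(data: bytes) -> tuple[str, bytes]:
    # Try both codecs and keep whichever output is smaller.
    candidates = {
        "brotli": brotli.compress(data, quality=11),
        "lzma": lzma.compress(data, preset=9 | lzma.PRESET_EXTREME),
    }
    winner = min(candidates, key=lambda name: len(candidates[name]))
    return winner, candidates[winner]
```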
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
Evaluation
single left-to-right pass
parameters: null
Other
PPM-D byte mixture
PPM-D byte-level mixture applied at evaluation time, combining neural and byte-context probabilities with a binary lambda gate.
parameters: {"order":5,"subset_tokens":8000000,"lambda_hi":0.9,"lambda_lo":0.05,"confidence_threshold":0.9}
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- Ports the PR #1835 PPM-D byte-level mixture onto the PR #1797 neural base stack.
- Uses a score-first, causal byte-level mixture that updates PPM counts only after scoring each byte.
- Achieves a 3-seed mean val_bpb of 0.90236 with low variance.
- Includes parallel CaseOps re-tokenization for faster data preparation.
- Combines the PR #1797 neural baseline with eval-time PPM-D augmentation while remaining compliant with Issue #1017.