PR #1857
closed
Record: PR #1787 base + PPM-D OMP byte mixture (val_bpb 1.0322 3-seed mean)
by dexhunter
val_bpb
1.0322
Architecture
Transformer
Optimizer
—
Artifact Size
15,998,552 bytes
Training Techniques
Architecture
SmearGate
BOS-masked content-conditioned 1-token causal lookback on first 12 residual dimensions, reset at document boundaries.
parameters: {"window":12,"bos_masked":true}
depth recurrence
Base stack includes recurrent depth/loop structure from prior PR lineage.
parameters: null
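The card records no parameters for this; a generic weight-tied depth-recurrence sketch (the loop count here is purely illustrative):

```python
import torch.nn as nn

class RecurrentDepth(nn.Module):
    """Generic depth recurrence: one weight-tied block applied `loops`
    times, growing effective depth without adding parameters."""
    def __init__(self, block: nn.Module, loops: int = 4):
        super().__init__()
        self.block = block   # any residual transformer block
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):
            x = self.block(x)
        return x
```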
Gated Attention
Attention gating used in the base stack lineage.
parameters: null
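Also inherited without parameters; a common form, shown here as an assumption, gates the attention output elementwise with a sigmoid computed from the same hidden state:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Sketch: attention output modulated by a learned sigmoid gate
    before the residual add."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        return x + torch.sigmoid(self.gate(x)) * a   # gated residual update
```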
Quantization
GPTQ
bits: 6
scope: top-3 weight tensors (MLP-related)
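A minimal sketch of the GPTQ inner loop at 6 bits (unblocked, symmetric per-row scales; the restriction to the top-3 MLP-related tensors happens outside this function, and the damping value is an assumption):

```python
import torch

def gptq_quantize_6bit(W, X, damp=0.01):
    """Minimal GPTQ: quantize W column by column and fold each column's
    quantization error into the not-yet-quantized columns via the inverse
    Hessian (H ~ 2 X^T X from calibration inputs X)."""
    W = W.clone().double()                     # (rows, cols)
    cols = W.shape[1]
    H = 2.0 * X.double().T @ X.double()
    H += damp * H.diagonal().mean() * torch.eye(cols, dtype=torch.float64)
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    scale = W.abs().amax(dim=1, keepdim=True) / 31.0   # 6-bit symmetric: [-31, 31]
    for i in range(cols):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale[:, 0]), -31, 31) * scale[:, 0]
        err = (w - q) / Hinv[i, i]
        W[:, i] = q
        if i + 1 < cols:                       # compensate remaining columns
            W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return W.float()
```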
Regularization
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: null
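No decay is recorded in the card; a standard EMA weight update with an assumed decay:

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.999):
    """Exponential moving average of weights, typically applied once per
    optimizer step; the averaged copy is used for evaluation.
    `decay` is an assumed value, not taken from this PR."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```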
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
Evaluation
stride-based eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
PPM-D order-4 byte-level mixture that adaptively mixes neural logits with byte-level Markov probabilities; strict score-before-update semantics, implemented in OpenMP-parallelized native C.
parameters: {"order":4,"lambda_hi":0.9,"lambda_lo":0.05,"confidence_threshold":0.9,"openmp":true}
Novel Contributions
- PPM-D order-4 byte-level mixture integrated with the neural model
- Strict score-before-update byte scoring for PPM tables
- OpenMP-parallelized native C PPM implementation embedded in train_gpt.py
- SmearGate BOS-masked causal lookback mechanism
- LQER asymmetric rank-4 residual correction (sketched after this list)
- Adaptive mixing between NN and PPM byte distributions
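The card does not spell out the LQER details; below is a minimal sketch of a rank-4 residual correction via truncated SVD of the quantization error (the "asymmetric", activation-aware scaling from LQER is not shown):

```python
import torch

def lqer_rank4(W: torch.Tensor, W_q: torch.Tensor, rank: int = 4):
    """Approximate the quantization error E = W - W_q with a rank-4
    factorization kept in higher precision next to the quantized weight."""
    E = (W - W_q).float()
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank].sqrt()              # (rows, 4)
    B = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (4, cols)
    return A, B   # forward pass: W_q @ x + A @ (B @ x)
```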