PR #1857

closed

Record: PR #1787 base + PPM-D OMP byte mixture (val_bpb 1.0322 3-seed mean)

by dexhunter
val_bpb
1.0322
Architecture
Transformer
Artifact Size
15,998,552 bytes

Training Techniques

Architecture
SmearGate
BOS-masked content-conditioned 1-token causal lookback on first 12 residual dimensions, reset at document boundaries.
parameters: {"window":12,"bos_masked":true}
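A minimal sketch of how such a BOS-masked one-token lookback might look; the gating function and names here are assumptions (the PR's gate is presumably learned), shown only to illustrate the mechanism:

```python
import numpy as np

def smear_gate(x, bos_mask, window=12):
    """Blend the first `window` residual dims of each token with the
    previous token's values, gated by the current token's content.
    Positions flagged as document starts (BOS) get no lookback.

    x:        (T, D) residual stream
    bos_mask: (T,) bool, True where a new document begins
    """
    T, D = x.shape
    out = x.copy()
    # previous-token values for the first `window` dims, shifted right by 1
    prev = np.vstack([np.zeros((1, window)), x[:-1, :window]])
    # content-conditioned gate in (0, 1); a sigmoid of the mean of the
    # gated dims stands in for whatever learned gate the PR uses
    gate = 1.0 / (1.0 + np.exp(-x[:, :window].mean(axis=1, keepdims=True)))
    gate = gate * (~bos_mask)[:, None]  # reset at document boundaries
    out[:, :window] = (1 - gate) * x[:, :window] + gate * prev
    return out
```

Only the first 12 dimensions are touched; the rest of the residual stream passes through unchanged.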
depth recurrence
The base stack inherits a recurrent depth/loop structure from the prior PR lineage.
parameters: null
Gated Attention
Attention gating used in the base stack lineage.
parameters: null
Quantization
GPTQ
bits: 6
scope: top-3 weight tensors (MLP-related)
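GPTQ proper minimizes layer output error with a Hessian-guided column-by-column update; as a much simpler illustration of what a 6-bit weight grid looks like, here is a round-to-nearest sketch (not GPTQ itself):

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric round-to-nearest quantization to a `bits`-bit signed grid.
    GPTQ additionally corrects remaining columns for each rounding error;
    this sketch shows only the grid and scale."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6-bit signed
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits the worst-case rounding error per weight is half a quantization step.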
Regularization
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: null
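The EMA weight average is the usual exponential moving average of parameters alongside training (the decay value is an assumption, since the record lists no parameters):

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step over a dict of parameters: each averaged weight moves
    a fraction (1 - decay) toward the current training weight."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]
    return ema_params
```

Evaluation then runs with `ema_params` rather than the raw training weights.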
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
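"Score-first" here presumably means each document is scored with the weights as they stood before the model adapts on that document, keeping the evaluation causal. A sketch of that loop, with `model_score` and `model_update` as assumed hooks (the record's `phases` and `prefix_docs` details are not modeled):

```python
def score_first_ttt(model_score, model_update, docs):
    """Score each document before updating on it, so no document's own
    content leaks into its score. Returns token-weighted mean loss."""
    total_loss, total_tokens = 0.0, 0
    for doc in docs:
        loss, n = model_score(doc)   # score with pre-update weights
        total_loss += loss * n
        total_tokens += n
        model_update(doc)            # then adapt on the same doc
    return total_loss / max(total_tokens, 1)
```

Later documents still benefit from adaptation on earlier ones, which is the point of TTT.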
Evaluation
stride-based eval
parameters: {"stride":64}
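Stride-based evaluation slides the context window forward 64 tokens at a time and scores only the newly exposed positions, so every scored token (after the first window) sees a long left context. A sketch of the window planning, assuming the standard sliding-window scheme:

```python
def stride_eval_spans(n_tokens, context=2048, stride=64):
    """Plan (win_start, win_end, score_start) spans: each span scores the
    tokens in [score_start, win_end) using context from win_start onward.
    Every token is scored exactly once; windows never exceed `context`."""
    spans = []
    score_from = 0
    while score_from < n_tokens:
        score_to = min(score_from + stride, n_tokens)
        win_start = max(0, score_to - context)
        spans.append((win_start, score_to, score_from))
        score_from = score_to
    return spans
```

The cost is roughly `context / stride` forward passes per scored token's worth of text, traded for near-full context at every position.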
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
PPM-D order-4 byte-level mixture with adaptive blending of neural logits and byte-level Markov probabilities; each byte is scored strictly before the PPM tables are updated, and the native C implementation is parallelized with OpenMP.
parameters: {"order":4,"lambda_hi":0.9,"lambda_lo":0.05,"confidence_threshold":0.9,"openmp":true}
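Using the listed parameters, the adaptive blend might look like the following sketch; the confidence signal is an assumption (here, the PPM model's top-1 probability), and the real PR's C implementation may define it differently:

```python
def mix_distributions(nn_probs, ppm_probs, ppm_conf,
                      lambda_hi=0.9, lambda_lo=0.05,
                      confidence_threshold=0.9):
    """Blend a neural byte distribution with a PPM byte distribution:
    when the PPM model is confident, weight it heavily (lambda_hi);
    otherwise give it only a small weight (lambda_lo)."""
    lam = lambda_hi if ppm_conf > confidence_threshold else lambda_lo
    return [lam * p + (1 - lam) * q for p, q in zip(ppm_probs, nn_probs)]
```

Because the blend is a convex combination, the result is still a valid probability distribution over bytes.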

Novel Contributions

  • PPM-D order-4 byte-level mixture integrated with the neural model
  • Strict score-before-update byte scoring for PPM tables
  • OpenMP-parallelized native C PPM implementation embedded in train_gpt.py
  • SmearGate BOS-masked causal lookback mechanism
  • LQER asymmetric rank-4 residual correction
  • Adaptive mixing between NN and PPM byte distributions