PR #1857

closed

Record: PR #1787 base + PPM-D OMP byte mixture (val_bpb 1.0322 3-seed mean)

by dexhunter
val_bpb
1.0322
Architecture
Transformer
Artifact Size
15,998,552 bytes

Training Techniques

Architecture
SmearGate
BOS-masked content-conditioned 1-token causal lookback on first 12 residual dimensions, reset at document boundaries.
parameters: {"window":12,"bos_masked":true}
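A minimal sketch of how such a BOS-masked one-token lookback might look; the gating function and names here are assumptions (the PR's gate is presumably learned), shown only to illustrate the mechanism:

```python
import numpy as np

def smear_gate(x, bos_mask, window=12):
    """Blend the first `window` residual dims of each token with the
    previous token's values, gated by the current token's content.
    Positions flagged as document starts (BOS) get no lookback.

    x:        (T, D) residual stream
    bos_mask: (T,) bool, True where a new document begins
    """
    T, D = x.shape
    out = x.copy()
    # previous-token values for the first `window` dims, shifted right by 1
    prev = np.vstack([np.zeros((1, window)), x[:-1, :window]])
    # content-conditioned gate in (0, 1); a sigmoid of the mean of the
    # gated dims stands in for whatever learned gate the PR uses
    gate = 1.0 / (1.0 + np.exp(-x[:, :window].mean(axis=1, keepdims=True)))
    gate = gate * (~bos_mask)[:, None]  # reset at document boundaries
    out[:, :window] = (1 - gate) * x[:, :window] + gate * prev
    return out
```

Only the first 12 dimensions are touched; the rest of the residual stream passes through unchanged.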
depth recurrence
The base stack inherits a recurrent depth/loop structure from the prior PR lineage.
parameters: null
Gated Attention
Attention gating used in the base stack lineage.
parameters: null
Quantization
GPTQ
bits: 6
scope: top-3 weight tensors (MLP-related)
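GPTQ proper minimizes layer output error with a Hessian-guided column-by-column update; as a much simpler illustration of what a 6-bit weight grid looks like, here is a round-to-nearest sketch (not GPTQ itself):

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric round-to-nearest quantization to a `bits`-bit signed grid.
    GPTQ additionally corrects remaining columns for each rounding error;
    this sketch shows only the grid and scale."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6-bit signed
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits the worst-case rounding error per weight is half a quantization step.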
Regularization
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: null
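The EMA weight average is the usual exponential moving average of parameters alongside training (the decay value is an assumption, since the record lists no parameters):

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step over a dict of parameters: each averaged weight moves
    a fraction (1 - decay) toward the current training weight."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]
    return ema_params
```

Evaluation then runs with `ema_params` rather than the raw training weights.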
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
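"Score-first" here presumably means each document is scored with the weights as they stood before the model adapts on that document, keeping the evaluation causal. A sketch of that loop, with `model_score` and `model_update` as assumed hooks (the record's `phases` and `prefix_docs` details are not modeled):

```python
def score_first_ttt(model_score, model_update, docs):
    """Score each document before updating on it, so no document's own
    content leaks into its score. Returns token-weighted mean loss."""
    total_loss, total_tokens = 0.0, 0
    for doc in docs:
        loss, n = model_score(doc)   # score with pre-update weights
        total_loss += loss * n
        total_tokens += n
        model_update(doc)            # then adapt on the same doc
    return total_loss / max(total_tokens, 1)
```

Later documents still benefit from adaptation on earlier ones, which is the point of TTT.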
Evaluation
stride-based eval
parameters: {"stride":64}
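Stride-based evaluation slides the context window forward 64 tokens at a time and scores only the newly exposed positions, so every scored token (after the first window) sees a long left context. A sketch of the window planning, assuming the standard sliding-window scheme:

```python
def stride_eval_spans(n_tokens, context=2048, stride=64):
    """Plan (win_start, win_end, score_start) spans: each span scores the
    tokens in [score_start, win_end) using context from win_start onward.
    Every token is scored exactly once; windows never exceed `context`."""
    spans = []
    score_from = 0
    while score_from < n_tokens:
        score_to = min(score_from + stride, n_tokens)
        win_start = max(0, score_to - context)
        spans.append((win_start, score_to, score_from))
        score_from = score_to
    return spans
```

The cost is roughly `context / stride` forward passes per scored token's worth of text, traded for near-full context at every position.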
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
PPM-D order-4 byte-level mixture with adaptive blending of neural logits and byte-level Markov probabilities; each byte is scored strictly before the PPM tables are updated, and the native C implementation is parallelized with OpenMP.
parameters: {"order":4,"lambda_hi":0.9,"lambda_lo":0.05,"confidence_threshold":0.9,"openmp":true}
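Using the listed parameters, the adaptive blend might look like the following sketch; the confidence signal is an assumption (here, the PPM model's top-1 probability), and the real PR's C implementation may define it differently:

```python
def mix_distributions(nn_probs, ppm_probs, ppm_conf,
                      lambda_hi=0.9, lambda_lo=0.05,
                      confidence_threshold=0.9):
    """Blend a neural byte distribution with a PPM byte distribution:
    when the PPM model is confident, weight it heavily (lambda_hi);
    otherwise give it only a small weight (lambda_lo)."""
    lam = lambda_hi if ppm_conf > confidence_threshold else lambda_lo
    return [lam * p + (1 - lam) * q for p, q in zip(ppm_probs, nn_probs)]
```

Because the blend is a convex combination, the result is still a valid probability distribution over bytes.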

Novel Contributions

  • PPM-D order-4 byte-level mixture integrated with the neural model
  • Strict score-before-update byte scoring for PPM tables
  • OpenMP-parallelized native C PPM implementation embedded in train_gpt.py
  • SmearGate BOS-masked causal lookback mechanism
  • LQER asymmetric rank-4 residual correction
  • Adaptive mixing between NN and PPM byte distributions