PR #1873

open

Record: SP10240 Casefold + TTT + GPTQ + PPM-D — val_bpb 0.82005771 (3-seed mean)

by schattenjuwel
val_bpb
0.8201
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: weights and embeddings
Test-Time Training
full TTT
parameters: {"learning_rate":0.008,"epochs":4}
Evaluation
sliding window eval
parameters: {"stride":64}
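A sliding-window evaluation with stride 64 can be sketched as follows. This is a minimal illustration, not the PR's evaluation code: `context_len` is a hypothetical parameter (the card does not state the context length), and the function only plans which token span each window scores so that every token is counted exactly once.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Plan evaluation windows as (start, end, score_from) tuples.

    Each window feeds tokens [start, end) to the model but only scores
    tokens [score_from, end), so no token is scored twice.
    context_len is an assumed value; the source only gives stride=64.
    """
    windows = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == n_tokens:
            break
        start += stride  # slide forward; later windows score `stride` new tokens
    return windows


windows = sliding_windows(n_tokens=300, context_len=128, stride=64)
total_scored = sum(end - score_from for _, end, score_from in windows)
```

After the first window, each step scores only the newest tokens, which therefore see at least `context_len - stride` tokens of preceding context.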
LR Schedule
cosine decay
parameters: null
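The card lists no parameters for the cosine schedule, so the following is only a generic sketch of cosine decay. The names `total_steps`, `base_lr`, and `min_lr` are illustrative assumptions, and the card does not say whether the schedule applies to training, to TTT, or to both.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Generic cosine decay from base_lr (step 0) to min_lr (final step).

    All arguments are illustrative; the source gives no schedule parameters.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_start = cosine_lr(0, 100, base_lr=1.0)
lr_mid = cosine_lr(50, 100, base_lr=1.0)
lr_end = cosine_lr(100, 100, base_lr=1.0)
```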
Other
other
SP10240 casefold tokenizer with Unicode casefolding
parameters: {"vocab_size":10240}
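Unicode casefolding is more aggressive than lowercasing, which matters for how many distinct strings the tokenizer has to cover. The sketch below uses Python's built-in `str.casefold()` to illustrate the idea; the helper name `casefold_text` is hypothetical, and whether the PR applies additional normalization is not stated in the card.

```python
def casefold_text(text: str) -> str:
    """Apply Unicode casefolding, as a stand-in for the tokenizer's
    preprocessing step (assumed; the PR's exact pipeline is not shown)."""
    return text.casefold()


lowered = "Straße".lower()          # lowercasing keeps the sharp s
folded = casefold_text("Straße")    # casefolding expands it to "ss"
```

Casefolding is lossy: the original casing cannot be recovered from the folded text.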
other
Byte-level PPM-D order-5 causal mixture with confidence gating and token-level probability mixing
parameters: {"order":5,"lambda_high_confidence":0.05,"confidence_threshold":0.9}
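One way to read these parameters is sketched below, under stated assumptions. The token-level PPM probability is taken as the product of the PPM byte probabilities for the token's bytes, and lambda is taken to be the weight on the neural probability, so that a confident PPM prediction (confidence at or above 0.9) leaves the neural model only 0.05 of the weight. `lambda_low_confidence` and its default, the `ppm_confidence` input, and the function name are hypothetical; the card does not define them.

```python
import math


def mix_token_prob(p_neural, ppm_byte_probs, ppm_confidence,
                   lambda_high_confidence=0.05, confidence_threshold=0.9,
                   lambda_low_confidence=0.9):
    """Mix neural and PPM probabilities for one target token.

    Assumptions (not confirmed by the source):
      * lambda is the weight on the neural probability;
      * lambda_low_confidence=0.9 is an invented placeholder value;
      * ppm_confidence is supplied by the caller, since its definition
        is not given in the card.
    """
    # Token-level PPM probability: product of the per-byte probabilities.
    p_ppm = math.prod(ppm_byte_probs)
    if ppm_confidence >= confidence_threshold:
        lam = lambda_high_confidence   # PPM is confident: lean on PPM
    else:
        lam = lambda_low_confidence    # otherwise lean on the neural model
    # Linear mixture in probability space (not in log space).
    return lam * p_neural + (1.0 - lam) * p_ppm


confident = mix_token_prob(0.2, [0.99, 0.98], ppm_confidence=0.95)
unsure = mix_token_prob(0.2, [0.5, 0.5], ppm_confidence=0.4)
```

Mixing in probability space keeps the result a valid probability as long as both inputs are probabilities of the same token under normalized distributions.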

Novel Contributions

  • Byte-level PPM-D order-5 mixture added on top of TTT + GPTQ + SP10240 casefold stack
  • Causal score-before-update PPM (each byte is scored before the model is updated on it), run on rank 0 after distributed TTT scoring
  • Token-level probability-space mixing of neural and PPM predictions
  • Confidence-gated mixing based on PPM confidence
  • 3-seed validation with reported mean BPB
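The score-before-update rule from the list above can be sketched with a small byte-level model. This is a simplified illustration, not the PR's implementation: `TinyPPMD` is a hypothetical class that uses the PPM-D escape estimate (symbol probability (2c-1)/(2n), escape probability d/(2n)) but omits symbol exclusion and updates every order, so its probabilities are slightly sub-normalized. It runs in a single process and does not model the rank-0 placement.

```python
import math
from collections import defaultdict


class TinyPPMD:
    """Simplified order-k byte model with PPM-D style escapes (no exclusion)."""

    def __init__(self, order=5):
        self.order = order
        # counts[k][context_bytes][symbol] -> occurrences after that context
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(order + 1)]

    def prob(self, history: bytes, symbol: int) -> float:
        escape = 1.0  # accumulated escape probability from higher orders
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts[k].get(history[len(history) - k:])
            if not table:
                continue  # context never seen: drop to a shorter one
            n, d, c = sum(table.values()), len(table), table.get(symbol, 0)
            if c > 0:
                return escape * (2 * c - 1) / (2 * n)
            escape *= d / (2 * n)
        return escape / 256.0  # order -1: uniform over all byte values

    def update(self, history: bytes, symbol: int) -> None:
        for k in range(min(self.order, len(history)) + 1):
            self.counts[k][history[len(history) - k:]][symbol] += 1


def score_before_update(data: bytes, order=5) -> float:
    """Return mean bits per byte; each byte is scored, then learned."""
    model = TinyPPMD(order)
    total_bits = 0.0
    for i, sym in enumerate(data):
        history = data[max(0, i - order):i]
        total_bits -= math.log2(model.prob(history, sym))  # score first
        model.update(history, sym)                         # then update
    return total_bits / max(len(data), 1)


first_byte_cost = score_before_update(b"a")
repetitive_cost = score_before_update(b"abababab" * 20)
```

Because the update happens only after scoring, the probability assigned to a byte never depends on that byte or on anything after it, which is what makes the resulting bits-per-byte figure causal.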