PR #1881 (open)

Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621

by ndokutovich
val_bpb: 0.9019
Architecture: Transformer
Optimizer:
Artifact Size: 15.95 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
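For orientation, here is what a 6-bit weight grid looks like in isolation. This is plain round-to-nearest quantization, not GPTQ itself: GPTQ additionally compensates each column's rounding error against the not-yet-quantized columns using second-order (Hessian) information. All names below are illustrative.

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor):
    """Round-to-nearest 6-bit quantization, per output channel.

    Sketch only: real GPTQ adds Hessian-based error feedback on top
    of a grid like this one.
    """
    qmax = 2**6 - 1                                   # 6 bits -> 64 levels
    lo = w.min(dim=1, keepdim=True).values
    hi = w.max(dim=1, keepdim=True).values
    scale = (hi - lo) / qmax
    q = torch.clamp(torch.round((w - lo) / scale), 0, qmax)
    return q.to(torch.uint8), scale, lo

def dequantize(q, scale, lo):
    return q.float() * scale + lo
```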
Architecture
GQA
Uses grouped-query attention with fewer key/value heads than query heads.
parameters: {"heads":8,"kv_heads":4}
SmearGate
Causal content-conditioned attention gate with 1-token lookback.
parameters: {"window":12}
CaseOps
Bijective case-transform tokenizer/representation used in the base stack.
parameters: null
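A bijective case transform can be illustrated with a marker-byte scheme: lowercase everything and emit a flag before each originally-uppercase character. This is a generic sketch, not the base stack's actual CaseOps.

```python
MARK = "\x01"  # assumed not to occur in the input text

def case_encode(text: str) -> str:
    """Lowercase text, flagging originally-uppercase characters with MARK."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(MARK)
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def case_decode(text: str) -> str:
    """Exact inverse of case_encode."""
    out, upper_next = [], False
    for ch in text:
        if ch == MARK:
            upper_next = True
        else:
            out.append(ch.upper() if upper_next else ch)
            upper_next = False
    return "".join(out)

assert case_decode(case_encode("Hello World")) == "Hello World"
```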
SparseAttnGate
Sparse per-head gate inside attention.
parameters: null
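With no parameters recorded, one plausible form is a learnable per-head multiplicative gate that can drive heads exactly to zero. The ReLU'd scalar per head below is an assumption, not the submission's code.

```python
import torch
import torch.nn as nn

class SparseAttnGate(nn.Module):
    """Per-head multiplicative gate on attention outputs (sketch).

    A ReLU'd learnable scalar per head can zero out heads entirely,
    which is what makes the gating sparse.
    """
    def __init__(self, n_heads: int):
        super().__init__()
        self.g = nn.Parameter(torch.ones(n_heads))

    def forward(self, head_out):                   # head_out: (B, H, T, D)
        return head_out * torch.relu(self.g).view(1, -1, 1, 1)
```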
RoPE
Rotary positional embeddings used in the transformer.
parameters: {"base":10000,"dims":16}
Regularization
logit softcap
parameters: {"value":30}
Test-Time Training
LoRA TTT
parameters: {"rank":4,"epochs":3,"learning_rate":0.005,"prefix_docs":2000,"phases":3,"score_before_update":true}
Evaluation
sliding window eval
parameters: {"stride":64,"eval_length":2048}
Compression
brotli
level: null
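Since the level is not recorded, here is a generic invocation of the Python brotli bindings; the filename and quality=11 are assumptions:

```python
import brotli

# Hypothetical artifact path; quality=11 is Brotli's maximum setting.
with open("model_artifact.bin", "rb") as f:
    raw = f.read()
packed = brotli.compress(raw, quality=11)
```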
Other
other
PPM-D byte mixture added on top of the base model, using order-5 PPM with binary lambda gating and score-before-update counting.
parameters: {"order":5,"subset_tokens":8000000,"lambda_hi":0.9,"lambda_lo":0.05,"conf_threshold":0.9}
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • Adds a PPM-D byte-mixture layer on top of the PR #1797 base stack.
  • Corrects validation coverage to the full 47.85M-token validation set using --val-docs 50000.
  • Reports both subset-based mix_bpb and full-val neural-only quantized_ttt_phased metrics in one submission.
  • Demonstrates parity with dexhunter's PR #1797 on shared seeds for the neural-only path.