PR #1881 (open)

Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621

by ndokutovich
val_bpb: 0.9019
Architecture: Transformer
Optimizer:
Artifact Size: 15.95 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
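For orientation, here is what a 6-bit weight grid looks like in isolation. This is plain round-to-nearest quantization, not GPTQ itself: GPTQ additionally compensates each column's rounding error against the not-yet-quantized columns using second-order (Hessian) information. All names below are illustrative.

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor):
    """Round-to-nearest 6-bit quantization, per output channel.

    Sketch only: real GPTQ adds Hessian-based error feedback on top
    of a grid like this one.
    """
    qmax = 2**6 - 1                                   # 6 bits -> 64 levels
    lo = w.min(dim=1, keepdim=True).values
    hi = w.max(dim=1, keepdim=True).values
    scale = (hi - lo) / qmax
    q = torch.clamp(torch.round((w - lo) / scale), 0, qmax)
    return q.to(torch.uint8), scale, lo

def dequantize(q, scale, lo):
    return q.float() * scale + lo
```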
Architecture
GQA
Uses grouped-query attention with fewer key/value heads than query heads.
parameters: {"heads":8,"kv_heads":4}
SmearGate
Causal content-conditioned attention gate with 1-token lookback.
parameters: {"window":12}
CaseOps
Bijective case-transform tokenizer/representation used in the base stack.
parameters: null
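A bijective case transform can be illustrated with a marker-byte scheme: lowercase everything and emit a flag before each originally-uppercase character. This is a generic sketch, not the base stack's actual CaseOps.

```python
MARK = "\x01"  # assumed not to occur in the input text

def case_encode(text: str) -> str:
    """Lowercase text, flagging originally-uppercase characters with MARK."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(MARK)
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def case_decode(text: str) -> str:
    """Exact inverse of case_encode."""
    out, upper_next = [], False
    for ch in text:
        if ch == MARK:
            upper_next = True
        else:
            out.append(ch.upper() if upper_next else ch)
            upper_next = False
    return "".join(out)

assert case_decode(case_encode("Hello World")) == "Hello World"
```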
SparseAttnGate
Sparse per-head gate inside attention.
parameters: null
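With no parameters recorded, one plausible form is a learnable per-head multiplicative gate that can drive heads exactly to zero. The ReLU'd scalar per head below is an assumption, not the submission's code.

```python
import torch
import torch.nn as nn

class SparseAttnGate(nn.Module):
    """Per-head multiplicative gate on attention outputs (sketch).

    A ReLU'd learnable scalar per head can zero out heads entirely,
    which is what makes the gating sparse.
    """
    def __init__(self, n_heads: int):
        super().__init__()
        self.g = nn.Parameter(torch.ones(n_heads))

    def forward(self, head_out):                   # head_out: (B, H, T, D)
        return head_out * torch.relu(self.g).view(1, -1, 1, 1)
```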
RoPE
Rotary positional embeddings used in the transformer.
parameters: {"base":10000,"dims":16}
Regularization
logit softcap
parameters: {"value":30}
Test-Time Training
LoRA TTT
parameters: {"rank":4,"epochs":3,"learning_rate":0.005,"prefix_docs":2000,"phases":3,"score_before_update":true}
Evaluation
sliding window eval
parameters: {"stride":64,"eval_length":2048}
Compression
brotli
level: null
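Since the level is not recorded, here is a generic invocation of the Python brotli bindings; the filename and quality=11 are assumptions:

```python
import brotli

# Hypothetical artifact path; quality=11 is Brotli's maximum setting.
with open("model_artifact.bin", "rb") as f:
    raw = f.read()
packed = brotli.compress(raw, quality=11)
```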
Other
other
PPM-D byte mixture added on top of the base model, using order-5 PPM with binary lambda gating and score-before-update counting.
parameters: {"order":5,"subset_tokens":8000000,"lambda_hi":0.9,"lambda_lo":0.05,"conf_threshold":0.9}
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • Adds a PPM-D byte-mixture layer on top of the PR #1797 base stack.
  • Corrects validation coverage to the full 47.85M-token validation set using --val-docs 50000.
  • Reports both subset-based mix_bpb and full-val neural-only quantized_ttt_phased metrics in one submission.
  • Demonstrates parity with dexhunter's PR #1797 on shared seeds for the neural-only path.