PR #1076

closed

Record: Packed Causal N-gram + Dirichlet Backoff — val_bpb 0.0109 (3-seed mean, NEW SOTA)

by sofiabod
val_bpb: 0.0109
Architecture: Transformer
Optimizer:
Artifact Size: ~1.5 MB

Training Techniques

Architecture
GPT
2-layer 128d GPT used as a vestigial neural backbone alongside n-gram cache scoring.
parameters: {"layers":2,"dimensions":128}
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
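The stride-64 sliding-window evaluation above can be sketched with the standard overlapping-window recipe; `nll_fn` is a hypothetical scoring hook (not part of the submission), and the exact windowing used in the PR may differ.

```python
def sliding_window_eval(tokens, nll_fn, seq_len=2048, stride=64):
    """Mean per-token NLL via overlapping windows (standard recipe).

    nll_fn(window, first) is a hypothetical hook returning per-token
    negative log-likelihoods for window[first:], conditioned on
    window[:first]. With stride << seq_len, every token after the
    first window is scored once with nearly seq_len of left context.
    """
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + seq_len, len(tokens))
        window = tokens[begin:end]
        first = prev_end - begin  # tokens before this index already scored
        total += sum(nll_fn(window, first))
        count += len(window) - first
        prev_end = end
        if end == len(tokens):
            break
    return total / count  # nats per token; divide by ln 2 for bits
```

Each token is scored exactly once, so the reported number is a true per-token average rather than double-counting overlapped positions.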
Other
Packed causal n-gram cache with precomputed hash tables for orders 2-12, each order stored in 32,768 buckets and zstd-compressed in the artifact.
parameters: {"orders":"2-12","buckets_per_order":32768}
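A minimal sketch of a multi-order hashed n-gram count cache under the parameters above (orders 2-12, 32,768 buckets per order); the hash function and per-bucket `Counter` storage are illustrative assumptions, not the submission's packed, zstd-compressed layout.

```python
from collections import Counter

class PackedNgramCache:
    """Sketch of a multi-order hashed n-gram count cache.

    Orders and buckets_per_order follow the PR parameters; the hash
    scheme and Counter-per-bucket storage are assumptions made for
    illustration, not the actual packed artifact format.
    """
    def __init__(self, orders=range(2, 13), buckets=32768):
        self.buckets = buckets
        self.tables = {n: {} for n in orders}  # order -> {bucket: Counter}

    def _bucket(self, context):
        # assumed bucketing: hash the context tuple into a fixed table
        return hash(context) % self.buckets

    def update(self, history, token):
        # record one observation of `token` under every order's context
        for n, table in self.tables.items():
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                table.setdefault(self._bucket(ctx), Counter())[token] += 1

    def counts(self, history, n):
        # next-token counts after the length-(n-1) context (empty if unseen)
        ctx = tuple(history[-(n - 1):])
        return self.tables[n].get(self._bucket(ctx), Counter())
```

Hash bucketing trades exactness for a fixed memory footprint: distinct contexts can collide into one bucket, which is part of what keeps the artifact small.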
Dirichlet posterior mixing with greedy highest-order-first backoff and count-confidence gating.
parameters: {"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6],"confidence_scale":12}
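One way to realize the backoff mixing described above, as a hedged sketch: each order's Dirichlet posterior predictive is blended with the lower-order estimate through a count-confidence gate. The gate form g = N / (N + confidence_scale) is an assumption about the PR's gating rule, not a confirmed formula; `concentrations` and `confidence_scale` mirror the listed parameters.

```python
def dirichlet_backoff(counts_by_order, token, vocab_size, concentrations,
                      confidence_scale=12):
    """Dirichlet posterior backoff mixing (illustrative sketch).

    counts_by_order: per-order {token: count} maps, LOWEST order first,
    so the highest order is blended in last and dominates whenever its
    counts are confident (the highest-order-first precedence).
    """
    p = 1.0 / vocab_size  # uniform base distribution
    for counts, alpha in zip(counts_by_order, concentrations):
        total = sum(counts.values())
        # Dirichlet(alpha) posterior predictive for this order
        post = (counts.get(token, 0) + alpha / vocab_size) / (total + alpha)
        gate = total / (total + confidence_scale)  # trust grows with counts
        p = gate * post + (1.0 - gate) * p
    return p
```

Because each per-order predictive is a proper distribution and the gates are convex weights, the blended output remains a valid probability distribution over the vocabulary.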
Score-first, single-pass evaluation with cache lookup before update, plus distributed prefill for a warm cache start.
parameters: null
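The score-first loop can be sketched as follows; `score` and `update` are hypothetical hooks onto the n-gram cache, and the real implementation presumably works over distributed ranks rather than a single list.

```python
def score_first_pass(tokens, score, update):
    """Single-pass, score-then-update evaluation loop (sketch).

    The cache is queried *before* the current token is inserted, so
    every prediction is strictly causal even though scoring and cache
    building share one pass over the data.
    """
    total = 0.0
    history = []
    for tok in tokens:
        total += score(history, tok)  # lookup first: token not yet cached
        update(history, tok)          # then admit it for later predictions
        history.append(tok)
    return total / len(tokens)
```

Updating before scoring would leak the current token into its own prediction; the lookup-before-update ordering is what keeps the evaluation honest.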
Sequence Length
train_length: null
eval_length: 2048

Novel Contributions

  • Packed causal n-gram cache with precomputed multi-order hash tables
  • Dirichlet posterior backoff mixing for n-gram/neural blending
  • Count-confidence gating for adaptive blending
  • Score-first single-pass evaluation with cache update after lookup
  • Distributed prefill to warm caches across ranks
  • Very small artifact (~1.5 MB) while achieving record validation performance (val_bpb 0.0109)