PR #1056

open

Record: Packed Causal N-gram + Dirichlet Backoff — val_bpb 0.0180 (3-seed mean)

by sofiabod

val_bpb

0.0180

Architecture

Transformer

Optimizer

Muon

Artifact Size

~1.4 MB

Training Techniques

Architecture

weight tying

Tied input and output embeddings.

parameters: null

RoPE

Uses rotary positional embeddings with a reduced active dimension.

parameters: {"dimensions":16}

SWA

Stochastic weight averaging used during training.

parameters: null

BigramHash

Adds a bigram hash component to the model stack.

parameters: {"buckets":4096}

SmearGate

Uses SmearGate in the architecture.

parameters: null

VE128

Uses VE128 on later layers.

parameters: {"layers":[9,10]}

LeakyReLU

Uses squared LeakyReLU activation.

parameters: {"squared":true,"negative_slope":0.5}
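
As a rough illustration of the activation listed above, the sketch below applies a LeakyReLU with the stated negative slope of 0.5 and then squares the result. The function name and the choice to square without preserving sign are assumptions; the PR summary does not say how the squaring is applied.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring (one plausible reading of 'squared LeakyReLU')."""
    # Standard LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise.
    y = x if x >= 0.0 else negative_slope * x
    # Assumption: the output is squared directly, so negative inputs map to positive values.
    return y * y
```

Under this reading, a negative input still contributes a nonzero activation, at a quarter of the magnitude a positive input of the same size would produce.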

XSA

Applies XSA across all 11 layers.

parameters: {"layers":11}

Partial RoPE

Applies RoPE to a subset of dimensions.

parameters: {"dimensions":"16/64"}
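
A minimal sketch of what rotating 16 of 64 head dimensions could look like. The pairing convention (dimension i with i + 8), the frequency base of 10000, and the function name are assumptions for illustration; only the 16/64 split comes from the entry above.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Example: a 64-dimensional head vector at position 3.
rotated = partial_rope([1.0] * 64, position=3)
```

The remaining 48 dimensions carry no positional signal in this sketch, which is the usual motivation for a partial rotary embedding.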

KV head count

Uses grouped KV heads.

parameters: {"kv_heads":8}

Weight Averaging

SWA

parameters: null

EMA

parameters: {"decay":0.997}
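
For reference, an exponential moving average with decay 0.997 updates a shadow copy of the weights after each step. The sketch below uses flat Python lists and a hypothetical function name; how the PR combines EMA with SWA is not stated in this summary.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """Return the new EMA of the weights: decay * old_ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With decay 0.997, each step moves the average 0.3% of the way toward the current weights, so the average reflects roughly the last few hundred steps.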

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

Quantization

int6

bits: 6

scope: per-row
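
One common way to realize 6-bit per-row quantization is symmetric rounding with a separate scale per row. The sketch below assumes a symmetric integer range of [-31, 31] and hypothetical function names; the PR's actual scheme (range, rounding, packing) is not described here.

```python
def quantize_rows_int6(matrix):
    """Quantize each row to integers in [-31, 31] with its own scale (assumed scheme)."""
    qmax = 31  # assumption: symmetric 6-bit range, one code left unused
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid division by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize_rows(quantized, scales):
    """Reconstruct approximate float weights from integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

A per-row scale keeps a single large-magnitude row from degrading the resolution of every other row in the matrix.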

Compression

zstd

level: 22

Regularization

logit softcap

parameters: {"value":30}
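
Logit softcapping is commonly implemented as a scaled tanh, which bounds logits smoothly instead of clipping them. The sketch below assumes that form with the cap of 30 listed above; the summary does not confirm the exact function used.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to (-cap, cap) via cap * tanh(logit / cap) (assumed form)."""
    return cap * math.tanh(logit / cap)
```

For logits much smaller than the cap the function is close to the identity, so only extreme values are compressed.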

Evaluation

sliding window eval

parameters: {"stride":64,"seq_len":2048}
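
A sliding-window evaluation with stride 64 and window length 2048 can be pictured as follows: each window after the first advances by 64 tokens and scores only the tokens not yet scored, so every scored token sees as much left context as the window allows. This is a sketch of one common layout, with a hypothetical function name, not the PR's evaluation code.

```python
def sliding_windows(num_tokens, seq_len=2048, stride=64):
    """Return (window_start, window_end, score_start) triples covering each token once."""
    windows = []
    scored_upto = 0
    while scored_upto < num_tokens:
        if scored_upto == 0:
            # The first window scores everything it contains.
            end = min(seq_len, num_tokens)
        else:
            # Later windows extend by one stride and score only the new tokens.
            end = min(scored_upto + stride, num_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows
```

The model is run on tokens `[window_start, window_end)` and the loss is accumulated only over `[score_start, window_end)`.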

Sequence Length

sequence_length

train_length: null

eval_length: 2048

Other

other

Packed causal n-gram cache built from training shards and stored in the artifact for eval-time lookup.

parameters: {"orders":"2-12","buckets_per_order":32768}
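
To make the data structure concrete, here is a minimal sketch of hashed count tables for orders 2 through 12 with 32768 buckets per order. The class name, the hash function, and the use of separate context and (context, token) tables are illustrative assumptions; the PR's packing format and exact hashing are not described in this summary.

```python
class NgramCache:
    """Hashed count tables for n-gram orders min_order..max_order (illustrative)."""

    def __init__(self, min_order=2, max_order=12, buckets=32768):
        self.orders = list(range(min_order, max_order + 1))
        self.buckets = buckets
        self.ctx_counts = {n: [0] * buckets for n in self.orders}
        self.pair_counts = {n: [0] * buckets for n in self.orders}

    def _bucket(self, tokens, seed):
        # Simple multiplicative hash (an assumption); collisions are possible by design.
        h = seed
        for t in tokens:
            h = ((h * 1000003) ^ (t + 0x9E3779B9)) & 0xFFFFFFFF
        return h % self.buckets

    def _context(self, history, order):
        need = order - 1  # an order-n model conditions on n - 1 previous tokens
        if len(history) < need:
            return None
        return tuple(history[len(history) - need:])

    def update(self, history, token):
        """Record `token` following `history` at every order with enough context."""
        for n in self.orders:
            context = self._context(history, n)
            if context is None:
                break
            self.ctx_counts[n][self._bucket(context, seed=n)] += 1
            self.pair_counts[n][self._bucket(context + (token,), seed=n + 101)] += 1

    def lookup(self, history, token, order):
        """Return (count of context followed by token, count of context)."""
        context = self._context(history, order)
        if context is None:
            return 0, 0
        pair = self.pair_counts[order][self._bucket(context + (token,), seed=order + 101)]
        ctx = self.ctx_counts[order][self._bucket(context, seed=order)]
        return pair, ctx
```

Because contexts are hashed into a fixed number of buckets, counts are upper bounds on the true counts, and memory stays constant regardless of how many distinct n-grams appear in the training shards.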

other

Dirichlet posterior backoff mixing with count-confidence gating for eval-time blending of neural and n-gram probabilities.

parameters: {"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6]}
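
The blending step can be sketched as a recursive Dirichlet-smoothed estimate: starting from the neural probability, each n-gram order pulls the estimate toward its observed frequency, with the concentration controlling how many counts are needed to override the prior. The recursion, the mapping of the 11 concentrations to orders 2 through 12, and the `min_ctx_count` gate are assumptions; only the concentration values come from the entry above.

```python
# Concentrations from the PR summary; mapping to orders 2..12 (lowest first) is assumed.
CONCENTRATIONS = [50, 50, 20, 10, 6, 4, 3, 2.5, 2, 1.8, 1.6]

def dirichlet_backoff(p_neural, counts, concentrations=CONCENTRATIONS, min_ctx_count=1):
    """Blend a neural probability with n-gram counts, lowest order first.

    counts: list of (pair_count, ctx_count) per order, where pair_count is how often
    the context was followed by the target token and ctx_count is how often the
    context occurred at all.
    """
    p = p_neural
    for (pair_count, ctx_count), alpha in zip(counts, concentrations):
        if ctx_count < min_ctx_count:
            continue  # hypothetical gate: skip orders with no evidence
        # Guard against hash collisions inflating pair_count beyond ctx_count.
        pair_count = min(pair_count, ctx_count)
        # Posterior mean under a Dirichlet prior centered on the lower-order estimate.
        p = (pair_count + alpha * p) / (ctx_count + alpha)
    return p
```

The decreasing concentrations mean that higher orders need fewer observations to dominate: at order 2 (concentration 50), five matching counts move a prior of 0.1 only to about 0.18, while the same evidence at order 12 (concentration 1.6) would move it to about 0.78.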

Novel Contributions

Packed causal n-gram cache precomputed from training shards and stored in the artifact
Dirichlet posterior backoff mixing with count-confidence gating
Single-pass score-first evaluation with cache update after lookup
Distributed prefill to warm caches across ranks before evaluation
Order-2 to order-12 hash-table backoff with dual hashing
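
The score-first ordering in the list above can be illustrated with a toy bigram cache: each token is scored using only counts gathered from earlier positions, and the cache is updated only after the score is recorded. The function name, the smoothing, and the bits-per-token metric are assumptions for illustration; the PR reports bits per byte and uses the full neural plus n-gram blend.

```python
import math
from collections import defaultdict

def score_first_eval(tokens, vocab_size, alpha=1.0):
    """Single pass over `tokens`: score each token, then add it to the cache."""
    pair_counts = defaultdict(int)
    ctx_counts = defaultdict(int)
    total_bits = 0.0
    for i in range(1, len(tokens)):
        prev, tok = tokens[i - 1], tokens[i]
        # 1) Score with counts from strictly earlier positions (uniform prior, weight alpha).
        p = (pair_counts[(prev, tok)] + alpha / vocab_size) / (ctx_counts[prev] + alpha)
        total_bits += -math.log2(p)
        # 2) Only now update the cache, so the token never informs its own score.
        pair_counts[(prev, tok)] += 1
        ctx_counts[prev] += 1
    return total_bits / (len(tokens) - 1)
```

Reversing the two steps would let each token contribute to its own probability, which would understate the loss.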