PR #986

open

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean)

by sofiabod

val_bpb

0.0830

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

5.76 MB

Training Techniques

Architecture

weight tying

Input and output embeddings are tied in the base model.

parameters: null

BigramHash

Hash-based n-gram cache component used for backoff and phrase matching.

parameters: {"orders":"2-13"}

SmearGate

Auxiliary gating component of the base architecture.

parameters: null

VE128

Value residual / VE128 component used in the architecture.

parameters: null

Partial RoPE

Partial rotary positional embedding applied to part of the model.

parameters: {"fraction":"16/64"}
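
A minimal sketch of what a 16/64 partial rotary embedding could look like for a single head vector. The choice of which 16 dimensions are rotated (the first 16 here), the half-split pairing, and the frequency base of 10000 are assumptions for illustration; the PR does not state them.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector.

    The remaining entries pass through unchanged, so they carry no
    positional signal. The pairing (i, i + rotary_dims // 2) and the
    base are assumed conventions, not taken from the PR.
    """
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

With a 64-dimensional head, only 16 dimensions are rotated and the other 48 are left position-free.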

LeakyReLU

LeakyReLU squared activation used in the MLP.

parameters: {"squared":true,"slope":0.5}
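
As a rough illustration of the listed parameters, the activation can be read as a leaky ReLU with negative slope 0.5 followed by squaring. Whether the PR squares the output directly or preserves the sign is not stated; this sketch assumes a plain square.

```python
def leaky_relu_squared(x, slope=0.5):
    """Leaky ReLU followed by squaring (assumed form, scalar version)."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Under this assumption, an input of -2.0 maps to (0.5 * -2.0)^2 = 1.0, so negative inputs contribute a smaller but nonzero activation.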

Weight Averaging

EMA

parameters: {"decay":0.997}
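
For reference, a standard exponential moving average of weights with the listed decay looks like the sketch below. The flat-list representation and the function name are illustrative only.

```python
def ema_update(shadow, weights, decay=0.997):
    """Blend the running average `shadow` toward the current `weights`."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

With a decay of 0.997, each step moves the averaged weights 0.3% of the way toward the current weights.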

Tight SWA

parameters: null

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: null

Quantization

int5

bits: 5

scope: per-row
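
One plausible reading of per-row int5 quantization is a symmetric scheme with one floating-point scale per weight row. The sketch below assumes that scheme and maps the largest magnitude in a row to 15; the PR does not specify the rounding rule or the exact code range.

```python
def quantize_row_int5(row):
    """Quantize one row to 5-bit signed codes with a shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 15.0 if max_abs > 0 else 1.0
    # Clamp to the signed 5-bit range [-16, 15] as a safety net.
    codes = [max(-16, min(15, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Recover approximate weights from the codes and the row scale."""
    return [c * scale for c in codes]
```

Each reconstructed weight is then off by at most half a quantization step, that is, `scale / 2`.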

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":32}
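
A sliding-window evaluation with stride 32 can be sketched as follows: the window advances 32 tokens at a time and only the tokens not yet scored are counted, so every token is scored exactly once with as much left context as the window allows. The window length and the function name are hypothetical; the PR lists only the stride.

```python
def sliding_windows(num_tokens, window, stride=32):
    """Yield (begin, end, score_from) triples.

    Feed tokens[begin:end] to the model and score only
    tokens[score_from:end]; earlier tokens serve as context.
    """
    if stride > window:
        raise ValueError("stride must not exceed the window length")
    scored_upto = 0
    begin = 0
    while scored_upto < num_tokens:
        end = min(begin + window, num_tokens)
        yield begin, end, scored_upto
        scored_upto = end
        begin += stride
```

For example, `list(sliding_windows(100, 64))` yields the windows `(0, 64, 0)`, `(32, 96, 64)` and `(64, 100, 96)`.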

two-pass full rescore

parameters: null

Other

other

Packed training n-gram artifact built from all training shards and stored as compressed hash tables for warm-started evaluation.

parameters: {"orders":"2-13","buckets":128000}
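
To make the idea of hashed n-gram count tables concrete, here is a small sketch that counts n-grams of each order into fixed-size bucket arrays. The CRC32 hash, the 4-byte token encoding, and the use of 128000 buckets per order (rather than in total) are assumptions; the PR's actual table layout and compression are not described here.

```python
import zlib

NUM_BUCKETS = 128_000  # from the listed parameters; per-order use is assumed

def bucket(ngram):
    """Map a sequence of token ids to a bucket index (hypothetical hash)."""
    data = b"".join(t.to_bytes(4, "little") for t in ngram)
    return zlib.crc32(data) % NUM_BUCKETS

def build_tables(tokens, orders=range(2, 14)):
    """Count every n-gram of each order into a fixed-size table."""
    tables = {n: [0] * NUM_BUCKETS for n in orders}
    for n in orders:
        for i in range(len(tokens) - n + 1):
            tables[n][bucket(tokens[i:i + n])] += 1
    return tables
```

Because distinct n-grams can collide in the same bucket, counts read back from such a table are upper bounds on the true counts.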

other

Hierarchical Dirichlet CTW mixing where each order's posterior becomes the next order's prior.

parameters: {"concentration":5}
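
The hierarchical mixing described above can be sketched with the usual Dirichlet posterior mean, where the estimate from order n-1 acts as the prior for order n: p_n(x) = (c_n(x) + a * p_{n-1}(x)) / (C_n + a), with concentration a = 5. The starting distribution `base_probs` (for instance the neural model's prediction) and the data layout are assumptions made for this illustration.

```python
def dirichlet_backoff(counts_by_order, base_probs, concentration=5.0):
    """Chain Dirichlet posterior means from the lowest to the highest order.

    counts_by_order: list of dicts mapping token -> count for the current
    context, ordered from the shortest context to the longest.
    base_probs: dict mapping every candidate token to a prior probability.
    """
    probs = dict(base_probs)
    for counts in counts_by_order:
        total = sum(counts.values())
        probs = {
            tok: (counts.get(tok, 0) + concentration * prior)
            / (total + concentration)
            for tok, prior in probs.items()
        }
    return probs
```

An order with no observations leaves the distribution unchanged, while an order with many observations dominates its prior, which is what lets long contexts override short ones only when they have evidence.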

other

Phrase cache with variable-length suffix matching.

parameters: {"probe_lengths":[48,36,28,20,16]}
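
A phrase cache with variable-length suffix matching might work along these lines: store the token that followed each suffix of several fixed lengths, then at lookup time probe from the longest length down and return the first hit. The class and method names are hypothetical, and exact tuple keys are used here for clarity where a compact implementation would likely hash them.

```python
from collections import Counter, defaultdict

PROBE_LENGTHS = (48, 36, 28, 20, 16)  # from the listed parameters

class PhraseCache:
    """Sketch of a suffix-keyed next-token cache (assumed design)."""

    def __init__(self, probe_lengths=PROBE_LENGTHS):
        self.probe_lengths = tuple(sorted(probe_lengths, reverse=True))
        self.tables = {n: defaultdict(Counter) for n in self.probe_lengths}

    def update(self, history, next_token):
        """Record that `next_token` followed each probed suffix of `history`."""
        for n in self.probe_lengths:
            if len(history) >= n:
                self.tables[n][tuple(history[-n:])][next_token] += 1

    def lookup(self, history):
        """Return (matched_length, next-token counts) for the longest hit."""
        for n in self.probe_lengths:
            if len(history) >= n:
                hit = self.tables[n].get(tuple(history[-n:]))
                if hit:
                    return n, hit
        return None
```

Probing the longest suffix first means a 48-token match, which is very likely a repeated passage, takes precedence over a shorter and less specific match.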

Novel Contributions

Packed training n-gram artifact precomputed from all training data and stored in the submission artifact
Two-pass full rescore to eliminate cold-start degradation without a second neural forward pass
Hierarchical Dirichlet CTW mixing across n-gram orders
Ratio-preserving count scaling to keep n-gram statistics within compact integer ranges
Variable-length phrase cache with suffix matching
Distributed cache prefill for sequential-equivalent distributed evaluation
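
The ratio-preserving count scaling listed above could be realized roughly as follows: when the largest count in a context exceeds the storage limit, all counts in that context are multiplied by a common factor so that their ratios are approximately kept. The 8-bit limit of 255 and the rule that nonzero counts stay at least 1 are assumptions for this sketch, not details from the PR.

```python
def scale_counts(counts, max_value=255):
    """Rescale counts to fit in [0, max_value] while keeping their ratios."""
    peak = max(counts, default=0)
    if peak <= max_value:
        return list(counts)
    factor = max_value / peak
    # Keep observed events observable: nonzero counts never round to zero.
    return [max(1, round(c * factor)) if c > 0 else 0 for c in counts]
```

Since estimates such as a Dirichlet posterior mean depend mainly on count ratios, a shared scaling factor changes the predictions far less than clipping each count independently would.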