PR #986

open

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean)

by sofiabod
val_bpb: 0.0830
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 5.76 MB

Training Techniques

Architecture
weight tying
Input and output embeddings are tied in the base model.
parameters: null
BigramHash
Hash-based n-gram cache component used for backoff and phrase matching.
parameters: {"orders":"2-13"}
SmearGate
Auxiliary gating component mentioned in the base architecture.
parameters: null
VE128
Value residual / VE128 component used in the architecture.
parameters: null
Partial RoPE
Partial rotary positional embedding applied to part of the model.
parameters: {"fraction":"16/64"}
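A minimal sketch of partial rotary embedding, assuming "16/64" means that 16 of the 64 dimensions in each attention head are rotated and the remaining 48 pass through unchanged. The function name, the adjacent-pair layout, and the frequency base of 10000 are illustrative assumptions, not details taken from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched.

    Assumptions (not from the PR): adjacent-pair layout, base 10000.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    for i in range(rotary_dims // 2):
        # One rotation frequency per pair of dimensions.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair is rotated, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.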
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true,"slope":0.5}
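One plausible reading of these parameters, sketched for a single scalar: apply LeakyReLU with negative slope 0.5, then square the result. The order of operations and the function name are assumptions; the PR may handle the sign of negative inputs differently.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring (assumed order of operations)."""
    y = x if x >= 0.0 else slope * x
    return y * y  # squaring discards the sign under this reading
```

Under this reading, an input of -2.0 becomes -1.0 after the leaky step and 1.0 after squaring.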
Weight Averaging
EMA
parameters: {"decay":0.997}
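For reference, a standard exponential moving average update with the listed decay might look like the sketch below. Weights are shown as a flat list of floats for simplicity, and `ema_update` is a hypothetical helper name, not one from the PR.

```python
def ema_update(ema, weights, decay=0.997):
    """Return the new EMA weights after one training step.

    Each averaged weight moves a fraction (1 - decay) toward the live weight.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

With a decay of 0.997, each step moves the average 0.3% of the way toward the current weights.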
Tight SWA
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Quantization
int5
bits: 5
scope: per-row
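A rough sketch of what per-row 5-bit quantization can look like, assuming a symmetric scheme in which each row stores one floating-point scale and integers in [-15, 15]. The symmetric range, the rounding mode, and the helper names are assumptions; the PR does not spell out its exact scheme here.

```python
def quantize_row_int5(row):
    """Symmetric quantization of one weight row to integers in [-15, 15].

    Assumption: one scale per row, chosen so the largest magnitude maps to 15.
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 15.0 if max_abs > 0 else 1.0
    q = [max(-15, min(15, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from the integers and the row scale."""
    return [v * scale for v in q]
```

The reconstruction error of each weight is at most half the row's scale, which is why a per-row scale helps when rows differ widely in magnitude.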
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
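The window schedule for a stride-32 sliding evaluation can be sketched as follows. Each window supplies up to `context_len` tokens of context, but only the tokens not scored by an earlier window are counted. The context length is not given in this listing, so `context_len` is a hypothetical parameter, and the sketch assumes `stride <= context_len`.

```python
def sliding_windows(num_tokens, context_len, stride=32):
    """Yield (start, end, score_from) triples.

    The model sees tokens [start, end); only tokens [score_from, end)
    contribute to the reported loss, so every token is scored exactly once.
    """
    score_from = 0
    while score_from < num_tokens:
        # The first window scores everything it sees; later ones score `stride` tokens.
        step = context_len if score_from == 0 else stride
        end = min(score_from + step, num_tokens)
        start = max(0, end - context_len)
        yield start, end, score_from
        score_from = end
```

A smaller stride gives each scored token more left context at the cost of more forward passes.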
two-pass full rescore
parameters: null
Other
other
Packed training n-gram artifact built from all training shards and stored as compressed hash tables for warm-started evaluation.
parameters: {"orders":"2-13","buckets":128000}
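To illustrate the idea of hashed n-gram count tables, the sketch below maps each context to one of 128000 buckets and counts the tokens that follow it, for orders 2 through 13. The hash function, the in-memory layout, and whether the bucket count applies per order are all assumptions; the PR's packed format is not described in this listing.

```python
NUM_BUCKETS = 128_000  # from the listed parameters

def bucket_of(context, num_buckets=NUM_BUCKETS):
    """FNV-1a style hash of a token-id tuple (the hash choice is an assumption)."""
    h = 0xCBF29CE484222325
    for tok in context:
        h ^= tok & 0xFFFFFFFF
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h % num_buckets

def build_tables(tokens, orders=range(2, 14), num_buckets=NUM_BUCKETS):
    """One sparse table per order: {bucket: {next_token: count}}.

    An order-n table keys on the n - 1 tokens that precede each position.
    """
    tables = {n: {} for n in orders}
    for n in orders:
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])
            row = tables[n].setdefault(bucket_of(ctx, num_buckets), {})
            row[tokens[i]] = row.get(tokens[i], 0) + 1
    return tables
```

Distinct contexts can collide in the same bucket; a fixed bucket count trades some accuracy for a bounded artifact size.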
other
Hierarchical Dirichlet CTW mixing where each order's posterior becomes the next order's prior.
parameters: {"concentration":5}
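The description above corresponds to a chain of Dirichlet-smoothed estimates, which can be sketched as below: starting from a base probability, each order's estimate is (count + concentration × prior) / (total + concentration), and the result is passed up as the prior for the next order. Treating the neural model's probability as the base prior is an assumption, as are the function and argument names.

```python
def dirichlet_mix(base_prob, counts_by_order, token, concentration=5.0):
    """Probability of `token` after mixing n-gram evidence, lowest order first.

    base_prob: prior probability of `token` (assumed to come from the
        neural model; any normalized distribution works).
    counts_by_order: list of {token: count} dicts for the current context.
    """
    p = base_prob
    for counts in counts_by_order:
        total = sum(counts.values())
        # Posterior mean under a Dirichlet prior centered on the lower order.
        p = (counts.get(token, 0) + concentration * p) / (total + concentration)
    return p
```

For example, with a base probability of 0.1 and a single order that saw the token 5 times out of 5, the estimate is (5 + 0.5) / (5 + 5) = 0.55. Contexts with no counts leave the prior unchanged, so sparse high orders defer to lower ones.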
other
Phrase cache with variable-length suffix matching.
parameters: {"probe_lengths":[48,36,28,20,16]}
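A simplified sketch of variable-length suffix matching with the listed probe lengths: the most recent tokens are looked up longest suffix first, and the first hit supplies the predicted continuation. Storing raw tuples as keys and keeping only the latest continuation are simplifications chosen for illustration; the PR's cache presumably uses a more compact hashed representation.

```python
PROBE_LENGTHS = (48, 36, 28, 20, 16)  # from the listed parameters

def build_phrase_cache(tokens, probe_lengths=PROBE_LENGTHS):
    """Map each length-n suffix seen in `tokens` to the token that followed it."""
    cache = {}
    for i in range(len(tokens)):
        for n in probe_lengths:
            if i >= n:
                cache[tuple(tokens[i - n:i])] = tokens[i]  # latest occurrence wins
    return cache

def lookup(cache, history, probe_lengths=PROBE_LENGTHS):
    """Return (matched_length, predicted_token), trying the longest suffix first."""
    for n in probe_lengths:
        if len(history) >= n:
            hit = cache.get(tuple(history[-n:]))
            if hit is not None:
                return n, hit
    return None
```

Usage with short probe lengths, to keep the example readable:

```python
cache = build_phrase_cache([1, 2, 3, 4, 1, 2, 3, 5], probe_lengths=(3, 2))
print(lookup(cache, [9, 1, 2, 3], probe_lengths=(3, 2)))  # (3, 5)
```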

Novel Contributions

  • Packed training n-gram artifact precomputed from all training data and stored in the submission artifact
  • Two-pass full rescore to eliminate cold-start degradation without a second neural forward pass
  • Hierarchical Dirichlet CTW mixing across n-gram orders
  • Ratio-preserving count scaling to keep n-gram statistics within compact integer ranges
  • Variable-length phrase cache with suffix matching
  • Distributed cache prefill for sequential-equivalent distributed evaluation
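As an illustration of ratio-preserving count scaling, the sketch below shrinks all counts in a bucket by a common factor once the largest one exceeds a cap, so relative frequencies stay roughly the same while values fit a compact integer type. The cap of 255 and the floor of 1 for observed tokens are assumptions made for the example, not values from the PR.

```python
def rescale_counts(counts, max_count=255):
    """Scale {token: count} so the largest count fits in `max_count`.

    Every count is multiplied by the same factor; tokens that were observed
    keep a count of at least 1 so they are not forgotten (assumed policy).
    """
    peak = max(counts.values(), default=0)
    if peak <= max_count:
        return dict(counts)
    factor = max_count / peak
    return {tok: max(1, round(c * factor)) for tok, c in counts.items()}
```

For example, counts of 1020, 340, and 1 become 255, 85, and 1: the 3:1 ratio between the two frequent tokens is kept exactly, while the rare token is held at the floor.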