PR #1030

open

Record: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3-seed mean)

by sofiabod
val_bpb
0.1130
Architecture
Transformer
Optimizer
Muon
Artifact Size
5.76 MB

Training Techniques

Architecture
weight tying
Tied embeddings used in the base Transformer.
parameters: null
RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
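The `{"dimensions":16}` parameter suggests rotary embeddings are applied to only the first 16 dimensions of each head, with the remaining dimensions passed through unrotated. A minimal NumPy sketch of that partial-RoPE scheme (function name, pair layout, and base frequency are assumptions, not this PR's code):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` dimensions of a
    (seq_len, head_dim) tensor; remaining dimensions pass through.
    Illustrative sketch, not the PR's implementation."""
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    # one inverse frequency per rotated pair
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]  # assumed (x1, x2) pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```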
SWA
Stochastic weight averaging used during training.
parameters: null
BigramHash
Bigram hash component used in the model stack.
parameters: {"dimensions":128,"buckets":4096}
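A bigram-hash feature with these parameters can be sketched as hashing each (previous, current) token pair into one of 4096 buckets and gathering a 128-dim embedding row. The hash function and BOS handling below are assumptions for illustration:

```python
import numpy as np

def bigram_hash_features(tokens, table, buckets=4096):
    """Hash each (prev, cur) token pair into a bucket of a
    (buckets, 128) embedding table and gather its row.
    Mixing constant is illustrative, not from the PR."""
    feats = np.zeros((len(tokens), table.shape[1]))
    prev = 0  # assumed BOS id
    for i, cur in enumerate(tokens):
        h = (prev * 1000003 + cur) % buckets  # simple multiplicative hash
        feats[i] = table[h]
        prev = cur
    return feats
```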
SmearGate
SmearGate module used in the model stack.
parameters: null
VE128
Value residual/VE128 component used in later layers.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
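Muon's hyperparameters are unspecified here. In public reference implementations, Muon momentum-averages the gradient and then orthogonalizes the 2-D update with a quintic Newton-Schulz iteration. A hedged sketch along those lines (coefficients, step counts, and learning rate are taken from the public reference, not from this PR):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix via the quintic Newton-Schulz
    iteration used in public Muon implementations. Sketch only."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference-implementation constants
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-norm normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    """One Muon-style update: momentum-average the gradient, then apply
    the orthogonalized update. Hyperparameters are assumptions."""
    buf = momentum * buf + grad
    return param - lr * newton_schulz_orthogonalize(buf), buf
```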
Quantization
int6
bits: 6
scope: per-row
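Per-row int6 quantization gives each weight row its own scale so values round into the signed 6-bit range [-31, 31]. A sketch assuming symmetric round-to-nearest (the PR's exact scheme is unspecified):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric int6 quantization: each row gets its own scale
    mapping values to integers in [-31, 31]. Illustrative sketch."""
    qmax = 31  # 6-bit signed symmetric range
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid divide-by-zero for all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales
```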
Compression
zstd
level: 22
Regularization
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
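The EMA entry with decay 0.997 corresponds to the standard exponential-moving-average update over model parameters. A minimal sketch over a dict of parameters (data layout is an assumption):

```python
def ema_update(avg, new, decay=0.997):
    """One exponential-moving-average step over a dict of parameters:
    avg <- decay * avg + (1 - decay) * new."""
    return {k: decay * avg[k] + (1.0 - decay) * new[k] for k in avg}
```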
Evaluation
sliding window eval
parameters: {"stride":128,"seq_len":2048}
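With stride 128 and seq_len 2048, sliding-window evaluation typically scores every token exactly once while giving later tokens up to 1920 tokens of left context. A sketch of the window schedule, assuming the usual convention that the first window scores all its tokens and each later window scores only its final 128:

```python
def sliding_eval_spans(n_tokens, seq_len=2048, stride=128):
    """Window schedule for strided sliding evaluation: each window covers
    [start, end); only tokens in [score_from, end) are scored, so every
    token is scored exactly once. Assumed convention, not the PR's code."""
    spans = []
    start = 0
    while True:
        end = min(start + seq_len, n_tokens)
        score_from = 0 if start == 0 else start + seq_len - stride
        spans.append((start, end, score_from))
        if end >= n_tokens:
            break
        start += stride
    return spans
```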
Other
other
Single-pass score-first evaluation with packed multi-order n-gram cache and hierarchical Dirichlet CTW mixing.
parameters: {"orders":"2-13","buckets_per_order":131072,"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6,1.4]}
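The concentrations list pairs one Dirichlet concentration with each n-gram order from 2 to 13. A standard way to mix such orders hierarchically is to let each order's predictive distribution back off to the next-lower order's through Dirichlet smoothing: p_k = (count_k[sym] + alpha_k * p_{k-1}) / (total_k + alpha_k). A toy sketch of that recursion (argument layout and data structures are assumptions, not this PR's cache format):

```python
def hierarchical_dirichlet_prob(sym, counts_by_order, totals_by_order,
                                concentrations, base_prob):
    """Back off through n-gram orders (lowest to highest): each order's
    Dirichlet-smoothed estimate interpolates its counts with the
    lower-order prediction. Sketch of the mixing rule only."""
    p = base_prob  # e.g. uniform or unigram probability of sym
    for counts, total, alpha in zip(counts_by_order, totals_by_order,
                                    concentrations):
        p = (counts.get(sym, 0) + alpha * p) / (total + alpha)
    return p
```

Since each order renormalizes against the same total, the mixture stays a proper distribution as long as the base distribution sums to one.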

Novel Contributions

  • Packed multi-order n-gram artifact precomputed from training shards to eliminate cold-start cache issues
  • Hierarchical Dirichlet CTW mixing across n-gram orders
  • Single-pass score-first evaluation with no two-pass rescore
  • Deterministic distributed cache prefill for warm-started evaluation
  • Ratio-preserving packed uint16 n-gram counts stored in a compressed artifact
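"Ratio-preserving packed uint16 counts" plausibly means raw counts are rescaled so the largest fits in 16 bits while relative frequencies are approximately kept. A hedged sketch of such a packing step (the PR's exact rounding and flooring rules are not stated):

```python
import numpy as np

def pack_counts_uint16(counts):
    """Downscale raw n-gram counts so the largest fits in uint16 while
    approximately preserving count ratios. Illustrative sketch."""
    counts = np.asarray(counts, dtype=np.float64)
    m = counts.max()
    if m <= 65535:
        return counts.astype(np.uint16)
    scaled = counts * (65535.0 / m)
    # keep nonzero counts nonzero so seen n-grams stay distinguishable
    packed = np.maximum(np.round(scaled), (counts > 0).astype(np.float64))
    return packed.astype(np.uint16)
```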