PR #1056
open
Record: Packed Causal N-gram + Dirichlet Backoff — val_bpb 0.0180 (3-seed mean)
by sofiabod
val_bpb
0.0180
Architecture
Transformer
Optimizer
Muon
Artifact Size
~1.4 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
RoPE
Uses rotary positional embeddings with a reduced active dimension.
parameters: {"dimensions":16}
SWA
Stochastic weight averaging used during training.
parameters: null
BigramHash
Adds a bigram hash component to the model stack.
parameters: {"buckets":4096}
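As a rough illustration of what a bigram hash with 4096 buckets can look like, the sketch below maps an ordered pair of token ids to a bucket index. The function name `bigram_bucket` and the mixing constant are assumptions for illustration; the record does not show the hash the PR actually uses.

```python
NUM_BUCKETS = 4096  # bucket count taken from the record's parameters


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an ordered (prev_token, token) pair to a bucket index.

    Hypothetical hash: a multiplicative mix with an arbitrary constant,
    truncated to 32 bits. Not the PR's actual hash function.
    """
    mixed = (prev_token * 1_000_003 + token) & 0xFFFFFFFF
    return mixed % num_buckets


# The bucket index would typically select a row of a learned embedding table.
bucket = bigram_bucket(5, 7)
```

Because the pair is ordered, `(5, 7)` and `(7, 5)` generally land in different buckets, while unrelated pairs may collide.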
SmearGate
Uses SmearGate in the architecture.
parameters: null
VE128
Uses VE128 on layers 9 and 10.
parameters: {"layers":[9,10]}
LeakyReLU
Uses squared LeakyReLU activation.
parameters: {"squared":true,"negative_slope":0.5}
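A minimal sketch of a squared LeakyReLU with the listed negative slope, assuming "squared" means the LeakyReLU output is squared elementwise (the record does not spell out the exact form):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring (assumed interpretation of 'squared')."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


activations = [leaky_relu_squared(v) for v in (-2.0, 0.0, 2.0)]
```

Under this reading, negative inputs still contribute a small positive activation instead of being zeroed as in a squared ReLU.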
XSA
Applies XSA across all 11 layers.
parameters: {"layers":11}
Partial RoPE
Applies RoPE to a subset of dimensions.
parameters: {"dimensions":"16/64"}
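The 16/64 setting can be read as rotating 16 of the 64 dimensions in each head and passing the remaining 48 through unchanged. The sketch below shows that idea for a single head vector. The pairing convention (index `i` with `i + 8`) and the frequency base of 10000 are assumptions, not details taken from the PR.

```python
import math


def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of one head vector by position.

    Entries beyond `rot_dims` are returned unchanged. Pairing and base are
    assumed conventions for illustration.
    """
    half = rot_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=3)
```

Each pair is rotated in a plane, so the vector norm is preserved and only the first 16 entries carry positional information.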
KV head count
Uses grouped KV heads.
parameters: {"kv_heads":8}
Weight Averaging
SWA
parameters: null
EMA
parameters: {"decay":0.997}
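For reference, an exponential moving average of weights with decay 0.997 follows the usual update rule, sketched here on plain lists. The function name `ema_update` is illustrative, and how often the PR applies the update is not stated in the record.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 2.0]
ema = ema_update(ema, [1.0, 2.0])
```

With decay 0.997, each step moves the average 0.3% of the way toward the current weights.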
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Quantization
int6
bits: 6
scope: per-row
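A minimal sketch of per-row 6-bit quantization, assuming a symmetric scheme with one floating-point scale per row and integer codes in the signed 6-bit range. The function names and the choice of mapping the row maximum to 31 are assumptions; the record only states "int6, per-row".

```python
def quantize_row_int6(row):
    """Quantize one weight row to signed 6-bit codes with a single scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31.0 if max_abs > 0 else 1.0  # assumed symmetric mapping
    codes = [max(-32, min(31, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Recover approximate weights from codes and the row scale."""
    return [c * scale for c in codes]


codes, scale = quantize_row_int6([0.25, -1.0, 0.0, 0.9])
approx = dequantize_row(codes, scale)
```

Keeping a separate scale per row limits the damage from a single large weight to its own row rather than the whole matrix.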
Compression
zstd
level: 22
Regularization
logit softcap
parameters: {"value":30}
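Logit softcapping is commonly implemented as a scaled tanh, which keeps logits within plus or minus the cap while leaving small values nearly unchanged. The sketch below assumes that form with the listed value of 30; the record does not confirm the exact formula used in the PR.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to (-cap, cap) using cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


capped = [softcap(v) for v in (-1000.0, 1.0, 1000.0)]
```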
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
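One way to picture sliding window evaluation with these settings: each window covers up to 2048 tokens, and after the first window only the newest 64 tokens are scored, so every scored token sees close to the full context. The generator below is a sketch under that assumption. The name `sliding_windows` and the tuple layout are hypothetical.

```python
def sliding_windows(num_tokens, seq_len=2048, stride=64):
    """Yield (start, end, score_from) so that each token is scored exactly once.

    The model would be run on tokens[start:end]; only positions in
    [score_from, end) contribute to the reported loss.
    """
    scored_upto = 0
    while scored_upto < num_tokens:
        end = min(num_tokens, max(seq_len, scored_upto + stride))
        start = max(0, end - seq_len)
        yield start, end, scored_upto
        scored_upto = end


windows = list(sliding_windows(5000))
```

A smaller stride gives each scored token more left context at the cost of more forward passes.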
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Other
other
Packed causal n-gram cache built from training shards and stored in the artifact for eval-time lookup.
parameters: {"orders":"2-12","buckets_per_order":32768}
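To make the cache layout concrete, the sketch below builds hashed count tables for orders 2 through 12 with 32768 buckets per order. Each order keeps one array of context counts and one array of context-plus-token counts. The hash function, the salts, and the names `build_tables` and `lookup` are assumptions for illustration; the PR's packing format and hashes are not shown in this record.

```python
ORDERS = range(2, 13)  # orders 2-12, from the record
BUCKETS = 32768        # buckets per order, from the record


def _hash(tokens, salt):
    # Hypothetical polynomial rolling hash, reduced to a bucket index.
    h = salt
    for t in tokens:
        h = (h * 1_000_003 + t + 1) & 0xFFFFFFFF
    return h % BUCKETS


def build_tables(tokens):
    """Return {order: (ctx_counts, full_counts)} as flat per-order arrays."""
    tables = {n: ([0] * BUCKETS, [0] * BUCKETS) for n in ORDERS}
    for i in range(len(tokens)):
        for n in ORDERS:
            first = i - (n - 1)
            if first < 0:
                break  # higher orders need even more history
            ctx_counts, full_counts = tables[n]
            ctx_counts[_hash(tokens[first:i], n)] += 1
            full_counts[_hash(tokens[first:i + 1], n + 100)] += 1
    return tables


def lookup(tables, context, token, n):
    """Return (full_count, ctx_count) for an order-n query."""
    ctx = list(context[-(n - 1):])
    ctx_counts, full_counts = tables[n]
    return full_counts[_hash(ctx + [token], n + 100)], ctx_counts[_hash(ctx, n)]


tables = build_tables([1, 2, 3, 1, 2, 3, 1, 2])
full, ctx = lookup(tables, [1, 2], 3, n=3)
```

Fixed-size hashed arrays keep the cache size independent of corpus size, at the price of occasional collisions that inflate counts.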
other
Dirichlet posterior backoff mixing with count-confidence gating for eval-time blending of neural and n-gram probabilities.
parameters: {"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6]}
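The eleven concentrations line up with orders 2 through 12. A standard Dirichlet posterior predictive treats the lower-order estimate as the prior mean and the concentration as its pseudo-count, so higher orders only dominate once their context has been seen often. The sketch below assumes that form, starting from the neural probability as the base prior; it does not model any separate count-confidence gate, whose details the record does not give.

```python
# Concentrations for orders 2..12, copied from the record.
CONCENTRATIONS = [50, 50, 20, 10, 6, 4, 3, 2.5, 2, 1.8, 1.6]


def dirichlet_backoff(p_neural, counts):
    """Blend a neural probability with n-gram counts, lowest order first.

    counts: list of (full_count, ctx_count) pairs for orders 2..12.
    Each step computes (full + alpha * prior) / (ctx + alpha), an assumed
    form of the mixing rule.
    """
    p = p_neural
    for (full, ctx), alpha in zip(counts, CONCENTRATIONS):
        p = (full + alpha * p) / (ctx + alpha)
    return p


# No n-gram evidence: the neural probability passes through unchanged.
p_unseen = dirichlet_backoff(0.1, [(0, 0)] * 11)
# Strong, consistent evidence at every order pulls the estimate toward 1.
p_seen = dirichlet_backoff(0.1, [(100, 100)] * 11)
```

The decreasing concentrations mean long contexts need fewer observations before their counts outweigh the prior.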
Novel Contributions
- Packed causal n-gram cache precomputed from training shards and stored in the artifact
- Dirichlet posterior backoff mixing with count-confidence gating
- Single-pass score-first evaluation with cache update after lookup
- Distributed prefill to warm caches across ranks before evaluation
- Order-2 to order-12 hash-table backoff with dual hashing
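The score-first rule in the list above can be sketched as a single pass in which each token is scored using only counts from earlier positions and is added to the cache afterwards. The example below uses a bigram cache with a uniform prior standing in for the neural model. The function name, the prior, and `alpha` are illustrative assumptions, not the PR's evaluation code.

```python
import math
from collections import defaultdict


def score_first_eval(tokens, vocab_size, alpha=1.0):
    """Return mean bits per scored token under a score-then-update cache."""
    ctx_counts = defaultdict(int)
    full_counts = defaultdict(int)
    total_bits = 0.0
    for i in range(1, len(tokens)):
        ctx, tok = tokens[i - 1], tokens[i]
        # 1) Score with counts gathered strictly before this position.
        p = (full_counts[(ctx, tok)] + alpha / vocab_size) / (ctx_counts[ctx] + alpha)
        total_bits -= math.log2(p)
        # 2) Only now reveal the token to the cache.
        full_counts[(ctx, tok)] += 1
        ctx_counts[ctx] += 1
    return total_bits / max(1, len(tokens) - 1)


bits = score_first_eval([0, 1] * 50, vocab_size=2)
```

Updating after lookup keeps the evaluation causal: a token never contributes to its own predicted probability.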