PR #1030
openRecord: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3-seed mean)
by sofiabod
val_bpb: 0.1130
Architecture: Transformer
Optimizer: Muon
Artifact size: 5.76 MB
Training Techniques
Architecture
weight tying
Tied embeddings used in the base Transformer.
parameters: null
RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
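A minimal sketch of rotary embeddings restricted to a 16-dimension subset, as the parameters above describe; the helper below is illustrative, not the PR's code, and the base frequency of 10000 is an assumed default:

```python
import numpy as np

def rope_partial(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    of each position's vector; the remaining dimensions pass through
    unchanged. x has shape (seq_len, head_dim); rot_dims must be even."""
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Position 0 is left unrotated (zero angle), and dimensions beyond the rotated subset are untouched, which is what "applied to a subset of dimensions" amounts to here.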
SWA
Sliding window attention used in the base Transformer.
parameters: null
BigramHash
Bigram hash component used in the model stack.
parameters: {"dimensions":128,"buckets":4096}
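A sketch of how a bigram hash feature with these parameters (4096 buckets, 128-dim embeddings) can work: hash each (previous token, current token) pair into a bucket and look up a learned embedding. The mixing constants and the sentinel previous token are assumptions for illustration, not taken from the PR:

```python
import numpy as np

N_BUCKETS, DIM = 4096, 128  # from the PR's listed parameters

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    # Cheap multiplicative hash of the (prev, current) token pair.
    # The constants 1000003 and 7919 are illustrative choices.
    h = (prev_tok * 1000003 + tok * 7919) & 0xFFFFFFFF
    return h % n_buckets

def bigram_features(tokens, table):
    """Map each position to the embedding of its (prev, cur) bigram
    bucket. Position 0 uses a sentinel previous token of 0 (assumed)."""
    prev = np.concatenate([[0], tokens[:-1]])
    buckets = np.array([bigram_bucket(p, t) for p, t in zip(prev, tokens)])
    return table[buckets]  # (len(tokens), DIM)
```

Identical bigrams collide into the same bucket by construction, so repeated local contexts share a feature vector regardless of position.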
SmearGate
SmearGate module used in the model stack.
parameters: null
VE128
Value residual/VE128 component used in later layers.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
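Since the PR leaves Muon's hyperparameters unspecified, here is a minimal sketch of the optimizer's core idea: accumulate momentum, then orthogonalize the 2-D update via the quintic Newton–Schulz iteration (coefficients from the public reference implementation). The learning rate and momentum values below are illustrative assumptions:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration used by Muon; drives singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so all s_i <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update for a 2-D weight matrix (lr/momentum assumed)."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orth(buf)
    return w, buf
```

The orthogonalization equalizes the scale of the update across directions, which is the property that distinguishes Muon from plain momentum SGD.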
Quantization
int6
bits: 6
scope: per-row
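A sketch of symmetric per-row int6 quantization as listed above: each row gets its own scale so values map into the signed 6-bit range [-31, 31]. Storage packing of 6-bit values is omitted (int8 is used for clarity); this is an illustrative implementation, not the PR's:

```python
import numpy as np

def quant_int6_per_row(w):
    """Symmetric per-row int6 quantization. Each row is scaled by its own
    absolute maximum so rounded values fit in [-31, 31] (6 signed bits)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale
```

With round-to-nearest, the reconstruction error is bounded per element by half that row's scale, so rows with small dynamic range quantize nearly losslessly.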
Compression
zstd
level: 22
Regularization
logit softcap
parameters: {"value":30}
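Logit softcapping with value 30 is the standard tanh bound; a one-line sketch:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap) via tanh; near-identity for
    small logits, saturating at +/-cap. cap=30 matches the listed value."""
    return cap * np.tanh(logits / cap)
```

Because tanh(x) ≈ x for small x, well-behaved logits pass through almost unchanged while outliers are squashed, which stabilizes the loss.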
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
Evaluation
sliding window eval
parameters: {"stride":128,"seq_len":2048}
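A sketch of the window bookkeeping implied by these parameters: overlapping 2048-token windows advance so that each window scores only its final 128 new tokens (the first window scores everything), giving every token up to seq_len - stride tokens of context while scoring each token exactly once. This is an illustrative scheme, not the PR's code:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=128):
    """Yield (start, end, score_from) triples. Each window spans
    [start, end); only tokens in [score_from, end) are scored, so the
    scored ranges partition [0, n_tokens)."""
    if n_tokens <= seq_len:
        yield 0, n_tokens, 0
        return
    yield 0, seq_len, 0          # first window scores all its tokens
    pos = seq_len
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        yield end - seq_len, end, pos  # score only the new stride
        pos = end
```

The per-window losses over the scored ranges are then summed and divided by total bytes to produce bpb.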
Other
other
Single-pass score-first evaluation with packed multi-order n-gram cache and hierarchical Dirichlet CTW mixing.
parameters: {"orders":"2-13","buckets_per_order":131072,"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6,1.4]}
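A sketch of hierarchical Dirichlet mixing across n-gram orders: the order-k prediction interpolates the order-k counts with the order-(k-1) prediction, weighted by that order's concentration parameter (the PR lists one concentration per order, 2 through 13). This is a simplified single-context illustration, not the PR's packed-cache implementation:

```python
import numpy as np

def hd_mix(counts_by_order, concentrations, vocab_size):
    """Hierarchical Dirichlet mixing over n-gram orders.
    counts_by_order: list of next-token count vectors, lowest order
    first, one per matched context. Returns the top-order predictive
    distribution, backed off recursively to a uniform base."""
    p = np.full(vocab_size, 1.0 / vocab_size)  # order-0 base: uniform
    for counts, alpha in zip(counts_by_order, concentrations):
        total = counts.sum()
        # Dirichlet posterior predictive with the lower-order
        # distribution as the prior mean and alpha as its strength.
        p = (counts + alpha * p) / (total + alpha)
    return p
```

Large concentrations (e.g. 50 at low orders) keep the mix close to the smoother low-order estimate; small ones (e.g. 1.4 at order 13) let sparse high-order counts dominate whenever they exist.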
Novel Contributions
- Packed multi-order n-gram artifact precomputed from training shards to eliminate cold-start cache issues
- Hierarchical Dirichlet CTW mixing across n-gram orders
- Single-pass score-first evaluation with no two-pass rescore
- Deterministic distributed cache prefill for warm-started evaluation
- Ratio-preserving packed uint16 n-gram counts stored in a compressed artifact
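The last bullet's ratio-preserving uint16 packing can be sketched as rescaling each count vector so its maximum fits in 16 bits while keeping relative frequencies (what the Dirichlet mix consumes) nearly intact; the rare-event clamp is an assumed detail:

```python
import numpy as np

def pack_counts_uint16(counts):
    """Rescale a count vector so its maximum fits in uint16 (65535),
    preserving count ratios rather than absolute magnitudes."""
    m = counts.max()
    if m <= 65535:
        return counts.astype(np.uint16)
    packed = np.round(counts * (65535.0 / m))
    packed[(counts > 0) & (packed == 0)] = 1  # keep rare events nonzero
    return packed.astype(np.uint16)
```

Since the mixing step normalizes by the context's total, a uniform rescale of a context's counts leaves its predictive distribution essentially unchanged while halving storage versus uint32.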