PR #948 (open)
Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed)
by dentity007
val_bpb
0.1156
Architecture
Transformer
Optimizer
—
Artifact Size
~15.1 MB
Training Techniques
Architecture
EBLS
3 shared transformer blocks looped 3x plus 2 unique blocks, yielding 11 effective layers
parameters: {"layers":11,"shared_blocks":3,"loops":3,"unique_blocks":2}
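A minimal sketch of the looped-block schedule described above, assuming the shared stack is traversed three times before the two unique blocks (whether the unique blocks come before, after, or between loops is not stated in the entry):

```python
# Looped shared-block schedule: 3 shared blocks x 3 loops + 2 unique blocks
# = 11 effective layers from only 5 distinct parameter sets.
def effective_layer_schedule(shared_blocks=3, loops=3, unique_blocks=2):
    """Return the sequence of block indices a token passes through."""
    schedule = []
    for _ in range(loops):                      # reuse the shared stack
        schedule.extend(range(shared_blocks))   # blocks 0..shared_blocks-1
    # unique blocks get their own indices after the shared ones (assumed order)
    schedule.extend(range(shared_blocks, shared_blocks + unique_blocks))
    return schedule

sched = effective_layer_schedule()
print(len(sched), len(set(sched)))  # 11 effective layers, 5 distinct blocks
```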
GQA
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
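A minimal grouped-query attention sketch with the listed 8 query heads sharing 4 KV heads, so each KV head serves a group of 2 query heads (shapes are illustrative; the causal mask and projections are omitted):

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """q: (heads, T, d); k, v: (kv_heads, T, d). Returns (heads, T, d)."""
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)          # expand KV heads to match Q heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)            # softmax over keys
    return w @ v

T, d = 5, 16
out = gqa(np.random.randn(8, T, d), np.random.randn(4, T, d),
          np.random.randn(4, T, d))
print(out.shape)  # (8, 5, 16)
```

Halving the KV heads halves the KV-cache footprint while keeping the full query-head count.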
LeakyReLU
MLP uses a squared LeakyReLU activation
parameters: {"negative_slope":0.5}
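A sketch of the squared LeakyReLU with the listed negative_slope=0.5, assuming a sign-preserving square of the LeakyReLU output (the PR's exact formulation is not shown):

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    y = np.where(x >= 0, x, negative_slope * x)   # standard LeakyReLU
    return np.sign(y) * y**2                      # sign-preserving square (assumed)

print(leaky_relu_sq(np.array([-2.0, 0.0, 3.0])))  # [-1.  0.  9.]
```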
XSA
XSA applied across all layers
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: model weights
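GPTQ proper uses Hessian-aware, column-by-column rounding; the sketch below is only plain symmetric per-channel round-to-nearest, shown to illustrate what a 6-bit weight budget means (31 positive levels per channel):

```python
import numpy as np

def quantize_int6(w):
    """Per-row symmetric 6-bit quantization (round-to-nearest, not GPTQ)."""
    qmax = 2 ** (6 - 1) - 1                            # 31 for signed 6-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int6(w)
print(q.min() >= -32 and q.max() <= 31)  # True: values fit in 6 bits
```

At 6 bits the reconstruction error per weight is bounded by half a quantization step (scale / 2).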
Compression
lzma
level: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
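Minimal weight-averaging sketches with the listed EMA decay=0.997; scalars stand in for weight tensors, and SWA is taken as a plain running mean of sampled checkpoints:

```python
def ema_update(ema, w, decay=0.997):
    return decay * ema + (1 - decay) * w

def swa_update(swa, w, n):
    return swa + (w - swa) / (n + 1)    # running mean after n prior checkpoints

ema, swa = 0.0, 0.0
for step in range(1000):
    ema = ema_update(ema, 1.0)          # weights held at 1.0 for illustration
    swa = swa_update(swa, 1.0, step)
print(ema, swa)  # EMA approaches 1 - 0.997**1000; SWA is exactly 1.0
```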
Evaluation
stride-based eval
parameters: {"stride":64}
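A sketch of stride-based evaluation with the listed stride=64: windows advance by `stride` tokens and only the tokens new to each window are scored, so every scored token keeps a long left context. `logprob` is a dummy model assigning probability 0.5 to every token, not the PR's model:

```python
import math

def stride_eval(tokens, context=256, stride=64,
                logprob=lambda ctx, t: math.log(0.5)):
    nats, scored = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        window = tokens[max(0, end - context):end]   # up to `context` tokens
        for t in window[-stride:]:                   # score only the new ones
            nats -= logprob(window, t)
            scored += 1
    return nats / scored / math.log(2)               # bits per token

print(stride_eval(list(range(512))))  # 1.0 for the uniform-0.5 dummy model
```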
Regularization
weight decay
parameters: null
Other
other
Two-level Dirichlet-Multinomial posterior mixing across neural, n-gram, and phrase components
parameters: null
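A hedged single-level sketch of Dirichlet-multinomial posterior mixing over components (e.g. neural, n-gram, phrase): concentrations act as pseudo-counts, mixture weights are the posterior mean, and after each token each component's count grows by its responsibility for the observation. The exact update rule and the second mixing level (across n-gram orders) are assumptions, not the PR's code:

```python
def posterior_mix(alphas, comp_probs_per_token):
    """alphas: prior concentrations per component; comp_probs_per_token:
    per-component probabilities assigned to each observed token."""
    counts = list(alphas)
    mixed = []
    for probs in comp_probs_per_token:
        total = sum(counts)
        w = [c / total for c in counts]                # posterior mean weights
        p = sum(wi * pi for wi, pi in zip(w, probs))   # mixture probability
        # responsibility-weighted pseudo-counts: increments sum to 1 per token
        counts = [c + wi * pi / p for c, wi, pi in zip(counts, w, probs)]
        mixed.append(p)
    return mixed

probs = [(0.9, 0.2), (0.8, 0.1)]   # two components, two tokens (illustrative)
mixed = posterior_mix([1.0, 1.0], probs)
print(mixed)
```

Components that keep assigning high probability accumulate pseudo-counts and so gain mixture weight on later tokens.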
other
Per-order OBCL concentrations for n-gram posterior mixing
parameters: {"concentrations":[50,50,6.95,2.98,2.05,2.05,2.05,1.86,1.86,1.86,1.86,1.86,1.86,1.86]}
other
Phrase suffix matching cache with probe lengths 20 and 16
parameters: {"probe_lengths":[20,16]}
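A sketch of the phrase suffix cache with the listed probe lengths [20, 16]: try the longest context suffix first and fall back on a miss. The cache here maps a suffix tuple to an opaque continuation payload, which is an assumption about the format:

```python
def phrase_lookup(cache, context, probe_lengths=(20, 16)):
    for n in probe_lengths:                 # longest probe first
        key = tuple(context[-n:])
        if len(context) >= n and key in cache:
            return cache[key]
    return None                             # no phrase match at any length

ctx = list(range(30))
cache = {tuple(range(14, 30)): "continuation"}   # only a 16-token suffix cached
print(phrase_lookup(cache, ctx))  # falls back from probe 20 to probe 16
```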
other
15-gram backoff cache with 4M hash buckets
parameters: {"order_min":2,"order_max":15,"buckets":4194304}
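A hashed n-gram backoff sketch with the listed orders 2-15 and 4,194,304 buckets: each (n-1)-token context hashes into a fixed bucket table, and prediction backs off from the highest order with a match. Collision handling and the real cache's payload format are unknown and omitted here:

```python
from collections import Counter

BUCKETS = 4_194_304

def bucket(ctx):
    return hash(ctx) % BUCKETS

def count_ngrams(table, tokens, order_min=2, order_max=15):
    for n in range(order_min, order_max + 1):
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])           # n-1 context tokens
            table.setdefault(bucket(ctx), Counter())[tokens[i]] += 1

def predict(table, context, order_min=2, order_max=15):
    for n in range(order_max, order_min - 1, -1):      # back off high -> low
        if len(context) < n - 1:
            continue
        hit = table.get(bucket(tuple(context[-(n - 1):])))
        if hit:
            return hit.most_common(1)[0][0]
    return None

table = {}
count_ngrams(table, [1, 2, 3, 1, 2, 3, 1, 2])
print(predict(table, [1, 2]))  # 3
```

Hashing into a fixed 4M-bucket table bounds memory regardless of how many distinct contexts appear, at the cost of rare collisions.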
other
Complementary training on orders 2-5
parameters: {"alpha":0.5,"orders":[2,3,4,5]}
Novel Contributions
- Two-level Dirichlet posterior mixing over neural, n-gram, and phrase predictions
- Per-order OBCL concentration tuning for n-gram smoothing
- Phrase suffix matching cache with multiple probe lengths
- 15-gram backoff with large hash bucket cache
- Complementary training targeting the lower n-gram orders (2-5)
- Combination of GPTQ int6 quantization, EMA, and SWA under the artifact budget