PR #900
Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB
by Robby955
val_bpb
0.1156
Architecture
Transformer
Optimizer
—
Artifact Size
14.9 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"swa_every":50}
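A minimal sketch of how EMA with decay 0.997 and SWA snapshots every 50 steps could interact; treating weights as scalars for clarity. The function name and the choice to snapshot the EMA (rather than raw weights) into the SWA average are assumptions, not details from the PR.

```python
def run_averaging(weights_per_step, decay=0.997, swa_every=50):
    # Exponential moving average of the weights, plus a stochastic
    # weight average built from periodic snapshots of the EMA.
    ema = weights_per_step[0]
    swa_sum, swa_count = 0.0, 0
    for step, w in enumerate(weights_per_step, start=1):
        ema = decay * ema + (1.0 - decay) * w   # EMA update
        if step % swa_every == 0:               # SWA snapshot every 50 steps
            swa_sum += ema
            swa_count += 1
    swa = swa_sum / max(swa_count, 1)
    return ema, swa
```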
Compression
lzma
level: 9
Evaluation
sliding window eval
parameters: {"stride":64}
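One common reading of sliding-window evaluation with a small stride: each window after the first scores only its final `stride` tokens, so every scored token sees close to a full window of left context. This span layout is a sketch of that convention, not necessarily the exact scheme used here.

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    # Returns (context_start, score_start, score_end) triples.
    # The first window scores all its tokens; later windows score only
    # the final `stride` tokens, with earlier tokens as context.
    spans = []
    begin = 0
    while begin < n_tokens:
        end = min(begin + (window if begin == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, begin, end))
        begin = end
    return spans
```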
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
warmdown
parameters: null
Architecture
XSA
Cross-layer/shared transformer blocks used as part of the EBLS architecture.
parameters: {"layers":11}
GQA
Grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
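With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A minimal NumPy sketch of that head mapping (the function name and tensor layout are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """q: (n_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    group = n_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                        # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (T, T) attention logits
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out[h] = attn @ v[kv]
    return out
```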
LeakyReLU
LeakyReLU(0.5)^2 MLP activation.
parameters: null
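Reading "LeakyReLU(0.5)^2" literally as LeakyReLU with negative slope 0.5 followed by squaring, analogous to the squared-ReLU activation; this scalar sketch is that literal interpretation, not confirmed detail:

```python
def sq_leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared.
    y = x if x >= 0 else slope * x
    return y * y
```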
depth recurrence
3 shared transformer blocks looped 3 times for 9 effective layers plus 2 unique layers.
parameters: {"shared_blocks":3,"loops":3,"total_layers":11}
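The layer arithmetic above (3 shared blocks x 3 loops = 9 effective layers, plus 2 unique layers = 11) can be sketched as a forward pass; the block interfaces here are hypothetical:

```python
def forward(x, shared_blocks, unique_blocks, loops=3):
    # Apply the 3 shared blocks `loops` times (9 effective layers),
    # then the 2 unique blocks, for 11 layers total.
    for _ in range(loops):
        for block in shared_blocks:
            x = block(x)
    for block in unique_blocks:
        x = block(x)
    return x
```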
MLP3x
Three-times-expanded MLP.
parameters: null
Other
other
Two-level Dirichlet posterior mixing: neural -> n-gram -> phrase hierarchy.
parameters: {"phrase_probes":[20,16],"ngram_orders":[2,15]}
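The neural -> n-gram -> phrase hierarchy can be sketched as nested Dirichlet posterior predictives: the neural distribution is the base measure smoothed by n-gram context counts, and the resulting posterior is in turn the base measure for phrase-suffix counts. The concentration values below are placeholders (the PR learns them per order via OBCL):

```python
import numpy as np

def dirichlet_mix(counts, base, alpha):
    # Posterior predictive of a Dirichlet with concentration `alpha`
    # and base measure `base`, updated by observed context counts.
    return (counts + alpha * base) / (counts.sum() + alpha)

def two_level_mix(p_neural, ngram_counts, phrase_counts,
                  a_ngram=50.0, a_phrase=5.0):  # illustrative concentrations
    # Level 1: neural base measure smoothed by n-gram counts.
    p_ngram = dirichlet_mix(ngram_counts, p_neural, a_ngram)
    # Level 2: n-gram posterior as base measure for phrase-suffix counts.
    return dirichlet_mix(phrase_counts, p_ngram, a_phrase)
```

With no observed counts both levels fall back to the neural distribution, and the output is always a valid distribution when the base is.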
other
Per-order concentration learning via Bayesian Online Concentration Learning (OBCL).
parameters: {"orders":[2,15]}
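One plausible form of online concentration learning: gradient ascent on the log of the Dirichlet posterior predictive with respect to a log-parameterized concentration, one update per observed token, maintained separately for each n-gram order. This is a sketch of that idea, not the PR's actual OBCL update:

```python
import math

def obcl_update(log_alpha, count_x, total, base_x, lr=0.05):
    # For predictive p(x) = (count_x + alpha*base_x) / (total + alpha),
    # d/d(alpha) log p(x) = base_x/(count_x + alpha*base_x) - 1/(total + alpha).
    alpha = math.exp(log_alpha)
    grad = base_x / (count_x + alpha * base_x) - 1.0 / (total + alpha)
    # Chain rule through alpha = exp(log_alpha) keeps alpha positive.
    return log_alpha + lr * grad * alpha
```

Intuitively, when the base measure predicts the observed token better than the empirical counts do, the concentration rises (more trust in the base); otherwise it falls.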
other
Complementary training to downweight loss on n-gram-predictable tokens.
parameters: {"alpha":0.5,"orders":[2,5],"warmup_steps":200}
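A minimal sketch of complementary loss weighting with the listed parameters (alpha 0.5, 200 warmup steps): tokens the n-gram model already predicts well get their loss downweighted, with the effect ramped in over warmup. The exact weighting rule is a guess on my part:

```python
def complementary_weights(ngram_probs, step, alpha=0.5, warmup_steps=200):
    # Downweight loss on tokens the n-gram model assigns high probability,
    # ramping the effect in linearly over `warmup_steps`.
    ramp = min(step / warmup_steps, 1.0)
    return [1.0 - alpha * ramp * p for p in ngram_probs]
```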
Novel Contributions
- Two-level Dirichlet posterior mixing with a neural base measure
- Per-order OBCL-learned concentrations for n-gram backoff
- Dirichlet-smoothed phrase suffix matching at probe lengths 20 and 16
- Demonstration that Dirichlet mixing substantially outperforms linear interpolation at the phrase level
- Combined neural -> n-gram -> phrase Bayesian hierarchy