PR #968

open

Record: Order-20 Dirichlet Posterior + Phrase Cache — 0.11545 BPB (3-seed)

by dentity007
val_bpb
0.1154
Architecture
Transformer
Optimizer
Artifact Size
~15.1 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
Weight Averaging
EMA + SWA
parameters: {"decay":0.997}
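
As an illustration of the weight-averaging entry, here is a minimal sketch of an exponential moving average update with the listed decay of 0.997. The function name `ema_update` and the flat-list representation of weights are assumptions made for this example; the PR summary does not say how EMA and SWA are combined, so only the EMA step is shown.

```python
def ema_update(ema, params, decay=0.997):
    """Return new EMA weights: decay * old_ema + (1 - decay) * current_params.

    `ema` and `params` are flat lists of floats (a stand-in for model tensors).
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Usage: start the average at the initial weights, then fold in each new step.
ema = [1.0, -2.0]
ema = ema_update(ema, [0.0, 0.0])
```

With a decay of 0.997, each step moves the average only 0.3% of the way toward the current weights, so the average reflects roughly the last few hundred steps.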
Architecture
GQA
Uses grouped query attention with 4 KV heads.
parameters: {"kv_heads":4}
LeakyReLU
The MLP uses a LeakyReLU(0.5)^2 activation (LeakyReLU with negative slope 0.5, then squared).
parameters: {"slope":0.5}
XSA
Applied XSA on all 11 layers.
parameters: {"layers":11}
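
The LeakyReLU(0.5)^2 entry above can be read as "apply LeakyReLU with negative slope 0.5, then square the result". The scalar sketch below follows that reading; it is an assumption about the notation, not code taken from the PR, and the name `leaky_relu_squared` is hypothetical.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y


# Positive inputs are squared directly; negative inputs are halved first.
values = [leaky_relu_squared(v) for v in (-2.0, 0.0, 3.0)]
```

Under this reading the output is never negative, and negative inputs contribute a quarter of what a positive input of the same magnitude would.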
Compression
lzma
level: null
Evaluation
score-first evaluation
parameters: null
stride-based eval
parameters: {"stride":64}
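
To make the stride-based evaluation concrete, the sketch below enumerates overlapping windows that advance by the listed stride of 64, scoring only the tokens not already scored by an earlier window. The window length of 1024 is an assumption borrowed from the training length (the eval length is listed as null), and `strided_windows` is a hypothetical helper, not a function from the PR.

```python
def strided_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans covering n_tokens.

    Each window feeds tokens [start, end) to the model but only scores
    tokens [score_from, end), so every token is scored exactly once.
    Assumes 0 < stride <= window.
    """
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


# Usage: after the first full window, each later window scores `stride` new tokens.
spans = strided_windows(n_tokens=2048)
```

Apart from the first window, every scored token sees at least `window - stride` tokens of left context, which is the usual motivation for a small stride.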
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
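
A warmdown schedule typically holds the learning rate constant and then decays it over the final steps of training. The sketch below assumes a linear decay to zero over the last 4000 steps; the decay shape and the `total_steps` argument are assumptions, since the summary lists only the warmdown length.

```python
def lr_scale(step, total_steps, warmdown_steps=4000):
    """Multiplier for the base learning rate at a given step.

    Returns 1.0 until the final `warmdown_steps`, then decays linearly to 0.
    """
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(remaining, 0) / warmdown_steps


# Usage: multiply the base learning rate by the returned scale at each step.
scale_mid_warmdown = lr_scale(step=8000, total_steps=10000)
```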
Other
other
Order-20 n-gram backoff with per-order Dirichlet concentrations and phrase suffix cache.
parameters: {"ngram_order":20,"phrase_probe_lengths":[20,16]}
other
Complementary training on lower orders.
parameters: {"alpha":0.5,"orders":[2,5]}
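
The n-gram backoff with per-order Dirichlet concentrations can be sketched as a recursive posterior-predictive estimate: each context length blends its own counts with the estimate from the next shorter context, weighted by that order's concentration. The class below is a minimal illustration under stated assumptions, not the PR's implementation: the names `DirichletBackoff`, `prob`, `update`, and `score_stream` are hypothetical, the base distribution is uniform (the PR may use something else, such as the neural model's prediction), and "order n" is taken to mean a context of n - 1 tokens. The `score_stream` helper shows one reading of score-first evaluation: each token is scored before it is added to the counts.

```python
import math


class DirichletBackoff:
    def __init__(self, max_order, alpha, vocab_size):
        self.max_order = max_order    # e.g. 20 in the PR summary
        self.alpha = alpha            # alpha[k]: concentration for context length k
        self.vocab_size = vocab_size
        self.pair_counts = {}         # (context, token) -> count
        self.ctx_counts = {}          # context -> count

    def _contexts(self, history):
        # Yield contexts from shortest (empty) to longest available.
        longest = min(self.max_order - 1, len(history))
        for k in range(longest + 1):
            yield k, tuple(history[len(history) - k:])

    def prob(self, history, token):
        p = 1.0 / self.vocab_size     # assumed uniform base distribution
        for k, ctx in self._contexts(history):
            c_pair = self.pair_counts.get((ctx, token), 0)
            c_ctx = self.ctx_counts.get(ctx, 0)
            # Dirichlet posterior predictive with the shorter-context estimate as prior mean.
            p = (c_pair + self.alpha[k] * p) / (c_ctx + self.alpha[k])
        return p

    def update(self, history, token):
        for k, ctx in self._contexts(history):
            key = (ctx, token)
            self.pair_counts[key] = self.pair_counts.get(key, 0) + 1
            self.ctx_counts[ctx] = self.ctx_counts.get(ctx, 0) + 1


def score_stream(model, tokens):
    """Score each token first, then update the counts; returns total bits."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        bits += -math.log2(model.prob(tokens[:i], tok))
        model.update(tokens[:i], tok)
    return bits
```

Because every level renormalizes by `c_ctx + alpha[k]`, the estimates at each order sum to one over the vocabulary whenever the base distribution does. A larger `alpha[k]` makes that order trust its own sparse counts less and lean more on the shorter context.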

Novel Contributions

  • Extended n-gram backoff from order 15 to order 20
  • Added per-order OBCL concentrations for higher-order n-grams
  • Used phrase suffix matching / phrase cache at probe lengths 20 and 16
  • Validated improvement with a 6-test ablation and 3-seed evaluation
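
The phrase suffix matching in the list above can be pictured as a lookup table keyed by the most recent tokens, probed at the longest length first (20, then 16). The sketch below is an assumption about the general shape of such a cache; the `PhraseCache` class and its methods are hypothetical, and how a hit is blended into the final prediction is not described in the summary, so it is left out.

```python
class PhraseCache:
    def __init__(self, probe_lengths=(20, 16)):
        self.probe_lengths = probe_lengths
        self.table = {}  # (length, suffix tuple) -> {next_token: count}

    def lookup(self, history):
        """Return (matched_length, next-token counts) for the longest hit, else None."""
        for n in self.probe_lengths:
            if len(history) < n:
                continue
            hit = self.table.get((n, tuple(history[-n:])))
            if hit:
                return n, hit
        return None

    def update(self, history, token):
        """Record that `token` followed the current suffix at every probe length."""
        for n in self.probe_lengths:
            if len(history) >= n:
                bucket = self.table.setdefault((n, tuple(history[-n:])), {})
                bucket[token] = bucket.get(token, 0) + 1


# Usage with short probe lengths so the example stays readable.
cache = PhraseCache(probe_lengths=(3, 2))
stream = [5, 6, 7, 8, 5, 6, 7]
for i, tok in enumerate(stream):
    cache.update(stream[:i], tok)
match = cache.lookup(stream)  # suffix (5, 6, 7) was previously followed by 8
```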