PR #900
Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB
by Robby955
val_bpb
0.1156
Architecture
Transformer
Optimizer
—
Artifact Size
14.9 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"swa_every":50}
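A minimal sketch of how EMA with decay 0.997 and SWA snapshots every 50 steps could interact; treating weights as scalars for clarity. The function name and the choice to snapshot the EMA (rather than raw weights) into the SWA average are assumptions, not details from the PR.

```python
def run_averaging(weights_per_step, decay=0.997, swa_every=50):
    # Exponential moving average of the weights, plus a stochastic
    # weight average built from periodic snapshots of the EMA.
    ema = weights_per_step[0]
    swa_sum, swa_count = 0.0, 0
    for step, w in enumerate(weights_per_step, start=1):
        ema = decay * ema + (1.0 - decay) * w   # EMA update
        if step % swa_every == 0:               # SWA snapshot every 50 steps
            swa_sum += ema
            swa_count += 1
    swa = swa_sum / max(swa_count, 1)
    return ema, swa
```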
Compression
lzma
level: 9
Evaluation
sliding window eval
parameters: {"stride":64}
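One common reading of sliding-window evaluation with a small stride: each window after the first scores only its final `stride` tokens, so every scored token sees close to a full window of left context. This span layout is a sketch of that convention, not necessarily the exact scheme used here.

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    # Returns (context_start, score_start, score_end) triples.
    # The first window scores all its tokens; later windows score only
    # the final `stride` tokens, with earlier tokens as context.
    spans = []
    begin = 0
    while begin < n_tokens:
        end = min(begin + (window if begin == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, begin, end))
        begin = end
    return spans
```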
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
warmdown
parameters: null
Architecture
XSA
Cross-layer/shared transformer blocks used as part of the EBLS architecture.
parameters: {"layers":11}
GQA
Grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
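With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A minimal NumPy sketch of that head mapping (the function name and tensor layout are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """q: (n_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    group = n_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                        # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (T, T) attention logits
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out[h] = attn @ v[kv]
    return out
```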
LeakyReLU
LeakyReLU(0.5)^2 MLP activation.
parameters: null
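Reading "LeakyReLU(0.5)^2" literally as LeakyReLU with negative slope 0.5 followed by squaring, analogous to the squared-ReLU activation; this scalar sketch is that literal interpretation, not confirmed detail:

```python
def sq_leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared.
    y = x if x >= 0 else slope * x
    return y * y
```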
depth recurrence
3 shared transformer blocks looped 3 times for 9 effective layers plus 2 unique layers.
parameters: {"shared_blocks":3,"loops":3,"total_layers":11}
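The layer arithmetic above (3 shared blocks x 3 loops = 9 effective layers, plus 2 unique layers = 11) can be sketched as a forward pass; the block interfaces here are hypothetical:

```python
def forward(x, shared_blocks, unique_blocks, loops=3):
    # Apply the 3 shared blocks `loops` times (9 effective layers),
    # then the 2 unique blocks, for 11 layers total.
    for _ in range(loops):
        for block in shared_blocks:
            x = block(x)
    for block in unique_blocks:
        x = block(x)
    return x
```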
MLP3x
Three-times-expanded MLP.
parameters: null
Other
other
Two-level Dirichlet posterior mixing: neural -> n-gram -> phrase hierarchy.
parameters: {"phrase_probes":[20,16],"ngram_orders":[2,15]}
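The neural -> n-gram -> phrase hierarchy can be sketched as nested Dirichlet posterior predictives: the neural distribution is the base measure smoothed by n-gram context counts, and the resulting posterior is in turn the base measure for phrase-suffix counts. The concentration values below are placeholders (the PR learns them per order via OBCL):

```python
import numpy as np

def dirichlet_mix(counts, base, alpha):
    # Posterior predictive of a Dirichlet with concentration `alpha`
    # and base measure `base`, updated by observed context counts.
    return (counts + alpha * base) / (counts.sum() + alpha)

def two_level_mix(p_neural, ngram_counts, phrase_counts,
                  a_ngram=50.0, a_phrase=5.0):  # illustrative concentrations
    # Level 1: neural base measure smoothed by n-gram counts.
    p_ngram = dirichlet_mix(ngram_counts, p_neural, a_ngram)
    # Level 2: n-gram posterior as base measure for phrase-suffix counts.
    return dirichlet_mix(phrase_counts, p_ngram, a_phrase)
```

With no observed counts both levels fall back to the neural distribution, and the output is always a valid distribution when the base is.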
other
Per-order concentration learning via Bayesian Online Concentration Learning (OBCL).
parameters: {"orders":[2,15]}
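One plausible form of online concentration learning: gradient ascent on the log of the Dirichlet posterior predictive with respect to a log-parameterized concentration, one update per observed token, maintained separately for each n-gram order. This is a sketch of that idea, not the PR's actual OBCL update:

```python
import math

def obcl_update(log_alpha, count_x, total, base_x, lr=0.05):
    # For predictive p(x) = (count_x + alpha*base_x) / (total + alpha),
    # d/d(alpha) log p(x) = base_x/(count_x + alpha*base_x) - 1/(total + alpha).
    alpha = math.exp(log_alpha)
    grad = base_x / (count_x + alpha * base_x) - 1.0 / (total + alpha)
    # Chain rule through alpha = exp(log_alpha) keeps alpha positive.
    return log_alpha + lr * grad * alpha
```

Intuitively, when the base measure predicts the observed token better than the empirical counts do, the concentration rises (more trust in the base); otherwise it falls.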
other
Complementary training to downweight loss on n-gram-predictable tokens.
parameters: {"alpha":0.5,"orders":[2,5],"warmup_steps":200}
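A minimal sketch of complementary loss weighting with the listed parameters (alpha 0.5, 200 warmup steps): tokens the n-gram model already predicts well get their loss downweighted, with the effect ramped in over warmup. The exact weighting rule is a guess on my part:

```python
def complementary_weights(ngram_probs, step, alpha=0.5, warmup_steps=200):
    # Downweight loss on tokens the n-gram model assigns high probability,
    # ramping the effect in linearly over `warmup_steps`.
    ramp = min(step / warmup_steps, 1.0)
    return [1.0 - alpha * ramp * p for p in ngram_probs]
```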
Novel Contributions
- Two-level Dirichlet posterior mixing with a neural base measure
- Per-order OBCL-learned concentrations for n-gram backoff
- Dirichlet-smoothed phrase suffix matching at probe lengths 20 and 16
- Demonstration that Dirichlet mixing substantially outperforms linear interpolation at the phrase level
- Combined neural -> n-gram -> phrase Bayesian hierarchy