PR #948 (open)
Two-Level Dirichlet Posterior + Phrase Cache — 0.11556 BPB (3-seed)
by dentity007
val_bpb
0.1156
Architecture
Transformer
Optimizer
—
Artifact Size
~15.1 MB
Training Techniques
Architecture
EBLS
3 shared transformer blocks looped 3x plus 2 unique blocks, yielding 11 effective layers
parameters: {"layers":11,"shared_blocks":3,"loops":3,"unique_blocks":2}
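A minimal sketch of the looped-block schedule described above, assuming the shared stack is traversed three times before the two unique blocks (whether the unique blocks come before, after, or between loops is not stated in the entry):

```python
# Looped shared-block schedule: 3 shared blocks x 3 loops + 2 unique blocks
# = 11 effective layers from only 5 distinct parameter sets.
def effective_layer_schedule(shared_blocks=3, loops=3, unique_blocks=2):
    """Return the sequence of block indices a token passes through."""
    schedule = []
    for _ in range(loops):                      # reuse the shared stack
        schedule.extend(range(shared_blocks))   # blocks 0..shared_blocks-1
    # unique blocks get their own indices after the shared ones (assumed order)
    schedule.extend(range(shared_blocks, shared_blocks + unique_blocks))
    return schedule

sched = effective_layer_schedule()
print(len(sched), len(set(sched)))  # 11 effective layers, 5 distinct blocks
```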
GQA
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
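A minimal grouped-query attention sketch with the listed 8 query heads sharing 4 KV heads, so each KV head serves a group of 2 query heads (shapes are illustrative; the causal mask and projections are omitted):

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """q: (heads, T, d); k, v: (kv_heads, T, d). Returns (heads, T, d)."""
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)          # expand KV heads to match Q heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)            # softmax over keys
    return w @ v

T, d = 5, 16
out = gqa(np.random.randn(8, T, d), np.random.randn(4, T, d),
          np.random.randn(4, T, d))
print(out.shape)  # (8, 5, 16)
```

Halving the KV heads halves the KV-cache footprint while keeping the full query-head count.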
LeakyReLU
MLP uses a squared LeakyReLU activation
parameters: {"negative_slope":0.5}
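A sketch of the squared LeakyReLU with the listed negative_slope=0.5, assuming a sign-preserving square of the LeakyReLU output (the PR's exact formulation is not shown):

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    y = np.where(x >= 0, x, negative_slope * x)   # standard LeakyReLU
    return np.sign(y) * y**2                      # sign-preserving square (assumed)

print(leaky_relu_sq(np.array([-2.0, 0.0, 3.0])))  # [-1.  0.  9.]
```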
XSA
XSA applied across all layers
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: model weights
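GPTQ proper uses Hessian-aware, column-by-column rounding; the sketch below is only plain symmetric per-channel round-to-nearest, shown to illustrate what a 6-bit weight budget means (31 positive levels per channel):

```python
import numpy as np

def quantize_int6(w):
    """Per-row symmetric 6-bit quantization (round-to-nearest, not GPTQ)."""
    qmax = 2 ** (6 - 1) - 1                            # 31 for signed 6-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int6(w)
print(q.min() >= -32 and q.max() <= 31)  # True: values fit in 6 bits
```

At 6 bits the reconstruction error per weight is bounded by half a quantization step (scale / 2).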
Compression
lzma
level: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
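Minimal weight-averaging sketches with the listed EMA decay=0.997; scalars stand in for weight tensors, and SWA is taken as a plain running mean of sampled checkpoints:

```python
def ema_update(ema, w, decay=0.997):
    return decay * ema + (1 - decay) * w

def swa_update(swa, w, n):
    return swa + (w - swa) / (n + 1)    # running mean after n prior checkpoints

ema, swa = 0.0, 0.0
for step in range(1000):
    ema = ema_update(ema, 1.0)          # weights held at 1.0 for illustration
    swa = swa_update(swa, 1.0, step)
print(ema, swa)  # EMA approaches 1 - 0.997**1000; SWA is exactly 1.0
```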
Evaluation
stride-based eval
parameters: {"stride":64}
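A sketch of stride-based evaluation with the listed stride=64: windows advance by `stride` tokens and only the tokens new to each window are scored, so every scored token keeps a long left context. `logprob` is a dummy model assigning probability 0.5 to every token, not the PR's model:

```python
import math

def stride_eval(tokens, context=256, stride=64,
                logprob=lambda ctx, t: math.log(0.5)):
    nats, scored = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        window = tokens[max(0, end - context):end]   # up to `context` tokens
        for t in window[-stride:]:                   # score only the new ones
            nats -= logprob(window, t)
            scored += 1
    return nats / scored / math.log(2)               # bits per token

print(stride_eval(list(range(512))))  # 1.0 for the uniform-0.5 dummy model
```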
Regularization
weight decay
parameters: null
Other
other
Two-level Dirichlet-Multinomial posterior mixing across neural, n-gram, and phrase components
parameters: null
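A hedged single-level sketch of Dirichlet-multinomial posterior mixing over components (e.g. neural, n-gram, phrase): concentrations act as pseudo-counts, mixture weights are the posterior mean, and after each token each component's count grows by its responsibility for the observation. The exact update rule and the second mixing level (across n-gram orders) are assumptions, not the PR's code:

```python
def posterior_mix(alphas, comp_probs_per_token):
    """alphas: prior concentrations per component; comp_probs_per_token:
    per-component probabilities assigned to each observed token."""
    counts = list(alphas)
    mixed = []
    for probs in comp_probs_per_token:
        total = sum(counts)
        w = [c / total for c in counts]                # posterior mean weights
        p = sum(wi * pi for wi, pi in zip(w, probs))   # mixture probability
        # responsibility-weighted pseudo-counts: increments sum to 1 per token
        counts = [c + wi * pi / p for c, wi, pi in zip(counts, w, probs)]
        mixed.append(p)
    return mixed

probs = [(0.9, 0.2), (0.8, 0.1)]   # two components, two tokens (illustrative)
mixed = posterior_mix([1.0, 1.0], probs)
print(mixed)
```

Components that keep assigning high probability accumulate pseudo-counts and so gain mixture weight on later tokens.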
other
Per-order OBCL concentrations for n-gram posterior mixing
parameters: {"concentrations":[50,50,6.95,2.98,2.05,2.05,2.05,1.86,1.86,1.86,1.86,1.86,1.86,1.86]}
other
Phrase suffix matching cache with probe lengths 20 and 16
parameters: {"probe_lengths":[20,16]}
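A sketch of the phrase suffix cache with the listed probe lengths [20, 16]: try the longest context suffix first and fall back on a miss. The cache here maps a suffix tuple to an opaque continuation payload, which is an assumption about the format:

```python
def phrase_lookup(cache, context, probe_lengths=(20, 16)):
    for n in probe_lengths:                 # longest probe first
        key = tuple(context[-n:])
        if len(context) >= n and key in cache:
            return cache[key]
    return None                             # no phrase match at any length

ctx = list(range(30))
cache = {tuple(range(14, 30)): "continuation"}   # only a 16-token suffix cached
print(phrase_lookup(cache, ctx))  # falls back from probe 20 to probe 16
```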
other
15-gram backoff cache with 4M hash buckets
parameters: {"order_min":2,"order_max":15,"buckets":4194304}
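A hashed n-gram backoff sketch with the listed orders 2-15 and 4,194,304 buckets: each (n-1)-token context hashes into a fixed bucket table, and prediction backs off from the highest order with a match. Collision handling and the real cache's payload format are unknown and omitted here:

```python
from collections import Counter

BUCKETS = 4_194_304

def bucket(ctx):
    return hash(ctx) % BUCKETS

def count_ngrams(table, tokens, order_min=2, order_max=15):
    for n in range(order_min, order_max + 1):
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])           # n-1 context tokens
            table.setdefault(bucket(ctx), Counter())[tokens[i]] += 1

def predict(table, context, order_min=2, order_max=15):
    for n in range(order_max, order_min - 1, -1):      # back off high -> low
        if len(context) < n - 1:
            continue
        hit = table.get(bucket(tuple(context[-(n - 1):])))
        if hit:
            return hit.most_common(1)[0][0]
    return None

table = {}
count_ngrams(table, [1, 2, 3, 1, 2, 3, 1, 2])
print(predict(table, [1, 2]))  # 3
```

Hashing into a fixed 4M-bucket table bounds memory regardless of how many distinct contexts appear, at the cost of rare collisions.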
other
Complementary training on orders 2-5
parameters: {"alpha":0.5,"orders":[2,3,4,5]}
Novel Contributions
- Two-level Dirichlet posterior mixing over neural, n-gram, and phrase predictions
- Per-order OBCL concentration tuning for n-gram smoothing
- Phrase suffix matching cache with multiple probe lengths
- 15-gram backoff with large hash bucket cache
- Complementary training targeting the lower n-gram orders (2-5)
- Combination of GPTQ int6 quantization, EMA, and SWA under the artifact budget