PR #963 (closed)
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)
by sunnypatneedi
val_bpb
0.8609
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.8MB
Training Techniques
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
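A minimal NumPy sketch of the configuration above — 8 query heads sharing 4 KV heads, so each KV head serves 2 query heads. Single layer, single example; weight shapes are assumptions for illustration:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads attend over
    n_kv_heads shared key/value heads (here 8 over 4)."""
    T, d = x.shape
    hd = d // n_heads                         # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    # repeat each KV head for its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        # causal mask: each position attends only to itself and the past
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)
```

The KV projections are half the size of the query projection (4 heads instead of 8), which is where GQA saves parameters and KV-cache memory.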
XSA
XSA applied across all 11 transformer layers.
parameters: {"layers":11}
Gated Attention
Uses gated attention in the transformer blocks.
parameters: null
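The record does not name the gating variant, so this is only one common form as an assumption: gate the attention output elementwise with a sigmoid of a learned projection of the block input.

```python
import numpy as np

def gated_attention_output(attn_out, x, wg):
    """One common gated-attention form (an assumption; the PR does not
    specify the variant): the attention output is gated elementwise by
    a sigmoid of a learned projection of the block input x."""
    gate = 1.0 / (1.0 + np.exp(-(x @ wg)))   # values in (0, 1)
    return gate * attn_out
```

Because the gate lies in (0, 1), it can only attenuate the attention output, letting the block learn to suppress unhelpful heads or positions.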
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":"16/64"}
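A sketch of partial RoPE per the "16/64" parameter: rotate only the first 16 of each head's 64 dimensions and pass the rest through unchanged. The frequency schedule (base 10000) is the standard RoPE choice, assumed here:

```python
import numpy as np

def partial_rope(q, rot_dims=16):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head's dimensions (16 of 64 per the record); the remaining
    dimensions pass through unrotated."""
    T, hd = q.shape
    half = rot_dims // 2
    pos = np.arange(T)[:, None]                   # (T, 1)
    freqs = 10000.0 ** (-np.arange(half) / half)  # standard RoPE base
    ang = pos * freqs                             # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rot_dims:]], axis=-1)
```

Rotation preserves the norm of the rotated pairs, and position 0 is the identity; the untouched 48 dimensions carry position-independent content.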
Weight Tying
Tied input and output embeddings.
parameters: null
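Weight tying in miniature: the unembedding reuses the input embedding matrix, so the vocabulary projection adds no parameters (vocabulary and width below are illustrative, not the record's):

```python
import numpy as np

# Weight tying sketch: one shared embedding matrix serves both the
# input lookup and the output logit projection.
vocab, d_model = 1000, 64
E = 0.02 * np.random.default_rng(3).standard_normal((vocab, d_model))

def embed(token_ids):
    return E[token_ids]      # (T,) -> (T, d_model)

def unembed(hidden):
    return hidden @ E.T      # (T, d_model) -> (T, vocab), tied to E
```

For a small model this is a large fraction of the artifact size, which matters for the 15.8MB budget above.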
Weight Averaging
EMA
Exponential moving average of the model weights.
parameters: {"decay":0.997}
SWA
Stochastic weight averaging of periodic checkpoints.
parameters: {"interval":50}
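Both averaging schemes with the record's parameters (EMA decay 0.997, SWA interval 50), sketched over flat lists of weights:

```python
def ema_update(ema, w, decay=0.997):
    """Exponential moving average of weights, decay 0.997 per the record."""
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

class SWA:
    """Stochastic weight averaging: equal-weight running mean of
    checkpoints sampled every `interval` steps (50 per the record)."""
    def __init__(self, interval=50):
        self.interval, self.n, self.avg = interval, 0, None

    def maybe_update(self, step, w):
        if step % self.interval != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(w)
        else:
            # incremental running mean over sampled checkpoints
            self.avg = [a + (x - a) / self.n for a, x in zip(self.avg, w)]
```

EMA weights recent checkpoints geometrically; SWA weights every sampled checkpoint equally.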
Quantization
Late QAT
Quantization-aware training applied late in the run.
bits: null
scope: all
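The record leaves the bit width unspecified (`bits: null`); the fake-quantization forward pass below assumes 8 bits purely for illustration. In late QAT this rounding is switched on only for the final phase of training, with a straight-through estimator passing gradients through it:

```python
import numpy as np

def fake_quant(w, bits=8):
    """Fake-quantization forward pass for QAT: round weights to a
    symmetric `bits`-bit grid (bit width is an assumption; the record
    does not specify it). Gradients would bypass the rounding via a
    straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    return scale * np.clip(np.round(w / scale), -qmax - 1, qmax)
```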
Compression
zstd
level: 22
Evaluation
Sliding Window Eval
Evaluation with an overlapping sliding context window.
parameters: {"stride":64}
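A sketch of sliding-window evaluation with the record's stride of 64 (the window size and the `score_fn` hook are assumptions): the window advances 64 tokens at a time and only the newly exposed tokens are scored, so each token sees close to a full window of left context without being counted twice.

```python
def sliding_window_nll(score_fn, tokens, window=512, stride=64):
    """Average NLL under sliding-window evaluation. `score_fn(ctx, n)`
    is a hypothetical model hook returning the summed NLL of the last
    n tokens of ctx. Window size 512 is an assumption; stride 64 is
    the record's parameter."""
    total, scored = 0.0, 0
    while scored < len(tokens):
        # first call scores the whole window, later calls only `stride`
        n_new = min(window if scored == 0 else stride, len(tokens) - scored)
        end = scored + n_new
        ctx = tokens[max(0, end - window):end]
        total += score_fn(ctx, n_new)
        scored = end
    return total / len(tokens)
```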
11-gram Eval Cache
Hashed multi-order n-gram cache built online during evaluation.
parameters: {"orders":[2,11],"buckets_per_order":4000000}
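A sketch of the multi-order cache per the parameters above: one hashed count table for every order from 2 through 11, each with a fixed bucket count (4,000,000 in the record). Sparse dicts stand in for the flat count arrays a real implementation would use, and blake2b bucketing is an assumption:

```python
import hashlib

class NGramCache:
    """Hashed n-gram eval cache: one bucketed count table per order in
    [2, 11]. Scoring backs off from the longest matching order, with
    add-one smoothing (the smoothing choice is an assumption)."""
    def __init__(self, orders=(2, 11), buckets=4_000_000, vocab=256):
        self.lo, self.hi = orders
        self.buckets, self.vocab = buckets, vocab
        self.tables = {n: {} for n in range(self.lo, self.hi + 1)}

    def _bucket(self, ctx):
        h = hashlib.blake2b(repr(ctx).encode(), digest_size=8).digest()
        return int.from_bytes(h, "little") % self.buckets

    def score(self, ctx, token):
        """Probability of `token`: longest order with counts wins."""
        for n in range(min(self.hi, len(ctx) + 1), self.lo - 1, -1):
            counts = self.tables[n].get(self._bucket(tuple(ctx[-(n - 1):])))
            if counts:
                tot = sum(counts.values())
                return (counts.get(token, 0) + 1) / (tot + self.vocab)
        return 1.0 / self.vocab   # no order matched: uniform

    def update(self, ctx, token):
        """Record `token` after `ctx` for every order that fits."""
        for n in range(self.lo, self.hi + 1):
            if len(ctx) >= n - 1:
                b = self._bucket(tuple(ctx[-(n - 1):]))
                bucket = self.tables[n].setdefault(b, {})
                bucket[token] = bucket.get(token, 0) + 1
```

Calling `score` before `update` on each token realizes the score-first, update-after protocol listed under Novel Contributions.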
Other
Entropy-Adaptive Alpha Blending
Entropy-adaptive alpha blending between neural model logits and n-gram cache logits.
parameters: null
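One plausible realization of the blending described above (the linear entropy-to-alpha map and the `max_alpha` cap are assumptions; the PR does not publish the schedule): lean on the n-gram cache more when the neural model is uncertain, less when it is confident.

```python
import numpy as np

def entropy_adaptive_blend(model_logits, ngram_logits, max_alpha=0.5):
    """Blend neural and n-gram distributions with a weight that grows
    with the neural model's normalized entropy. The linear schedule
    and max_alpha=0.5 are assumptions for illustration."""
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    p_model = softmax(model_logits)
    p_ngram = softmax(ngram_logits)
    ent = -(p_model * np.log(p_model + 1e-12)).sum()
    max_ent = np.log(len(p_model))
    alpha = max_alpha * ent / max_ent   # 0 when confident, max_alpha at uniform
    return (1 - alpha) * p_model + alpha * p_ngram
```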
Hedge Mixer
Online multiplicative-weights (Hedge) ensemble between the base model's predictions and the n-gram-enhanced predictions.
parameters: {"beta":2}
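A sketch of the Hedge mixer over two experts, reading the record's `beta: 2` as the multiplicative-weights learning rate (the exact parameterization in the PR is an assumption):

```python
import math

class HedgeMixer:
    """Hedge / multiplicative-weights online ensemble. Experts here are
    the base model and the n-gram-enhanced predictor; beta=2 is taken
    from the record's parameters."""
    def __init__(self, n_experts=2, beta=2.0):
        self.w = [1.0] * n_experts
        self.beta = beta

    def mix(self, probs):
        """Weighted mixture of the experts' next-token probabilities."""
        tot = sum(self.w)
        return sum(w * p for w, p in zip(self.w, probs)) / tot

    def update(self, probs):
        """After the true token is revealed, scale each expert's weight
        by exp(-beta * log-loss) on that token."""
        for i, p in enumerate(probs):
            loss = -math.log(max(p, 1e-12))
            self.w[i] *= math.exp(-self.beta * loss)
        s = sum(self.w)                     # renormalize to avoid underflow
        self.w = [w / s for w in self.w]
```

Because updates happen only after each token is scored, the mixer is a legal eval-time ensemble: it adapts online without ever seeing a label before predicting it.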
Novel Contributions
- 11-gram eval cache with entropy-adaptive alpha blending
- Hedge Mixer online ensemble between neural and n-gram predictions
- Score-first, update-after n-gram cache protocol
- Sliding window evaluation combined with multi-order n-gram caching
- Eval-time-only improvement with no training objective changes
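The score-first, update-after protocol from the list above can be sketched as a single eval loop (`model_logprob`, `cache_score`, and `cache_update` are hypothetical hooks standing in for the PR's components, and the fixed `alpha` replaces the entropy-adaptive schedule for brevity):

```python
import math

def eval_bits_per_token(model_logprob, cache_score, cache_update, tokens, alpha=0.5):
    """Score-first, update-after: each token is scored by the n-gram
    cache before the cache is updated with it, so the cache never
    peeks at the label it is predicting."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[:i]
        p = (1 - alpha) * math.exp(model_logprob(ctx, tok)) \
            + alpha * cache_score(ctx, tok)
        bits -= math.log2(p)        # score first...
        cache_update(ctx, tok)      # ...update after
    return bits / max(1, len(tokens))
```

This ordering is what makes the whole record an eval-time-only improvement: the training objective and the model weights are untouched.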