PR #909

open

Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)

by sunnypatneedi
val_bpb
0.8609
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Architecture
XSA
Applied XSA across all 11 layers.
parameters: {"layers":11}
Gated Attention
Enabled gated attention in the transformer.
parameters: null
Partial RoPE
Used partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
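Partial RoPE with dimensions 16/64 means rotary position embeddings are applied to only the first 16 of each head's 64 dimensions, leaving the rest position-agnostic. A minimal sketch of that idea (the window size, frequency base, and exact dimension split convention are assumptions; the record only gives 16/64):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # Apply rotary embeddings to the first `rot_dims` of the head dimension;
    # pass the remaining dims through unchanged (partial RoPE).
    T = x.shape[-2]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation rates
    ang = np.arange(T)[:, None] * freqs              # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

At position 0 the rotation is the identity, and the last 48 dims are never touched, so content matching stays fully available there.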
Weight Tying
Tied input and output embeddings.
parameters: null
LeakyReLU
Used a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
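One plausible reading of "squared LeakyReLU" with slope 0.5 (the record does not spell out the exact form, so the squaring convention here is an assumption):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # Squared LeakyReLU: square the LeakyReLU output, so the negative
    # branch contributes (slope * x) ** 2. Assumed form; the record only
    # states "LeakyReLU squared" with slope 0.5.
    y = np.where(x >= 0, x, slope * x)
    return y * y
```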
VE64
Used value embedding on selected layers.
parameters: {"dimensions":64,"layers":"7-10"}
Quantization
late QAT
bits: null
scope: all
int6
bits: 6
scope: all
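The record lists int6 quantization over all weights but no scheme details; a common choice is symmetric per-tensor quantization into the signed 6-bit range [-31, 31], sketched here under that assumption:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor int6: map floats into [-31, 31] with one scale.
    # (Assumed scheme; the record only states bits=6, scope=all.)
    qmax = 2 ** (6 - 1) - 1                      # 31
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Per-tensor quantization error is bounded by half a scale step, which the late-QAT phase lets the network adapt to before the final checkpoint.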
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
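The listed parameters suggest an EMA updated every step with decay 0.997 plus an SWA running mean snapshotted every 50 steps; how the two averages are combined for the final weights is not stated, so this sketch just maintains both:

```python
class WeightAverager:
    # EMA (decay 0.997, every step) plus SWA (running mean, every 50 steps),
    # matching the listed parameters. Scalar params stand in for tensors.
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = dict(params)
        self.swa = {k: 0.0 for k in params}
        self.n_swa = 0
        self.decay = ema_decay
        self.swa_every = swa_every
        self.step = 0

    def update(self, params):
        self.step += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.swa_every == 0:       # snapshot into the SWA mean
            self.n_swa += 1
            for k, v in params.items():
                self.swa[k] += (v - self.swa[k]) / self.n_swa
```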
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
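Sliding-window evaluation with stride 64 scores each chunk of 64 new tokens with as much left context as the window allows, so every token is scored exactly once. A sketch of the window schedule (the window size of 2048 is an assumption; the record lists only the stride):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Yield (start, end, score_from): the model sees tokens [start, end),
    # but only tokens [score_from, end) are scored, so each token is
    # evaluated once with near-maximal left context.
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        yield start, end, pos
        pos = end
```

Per-token cost scales with window/stride forward passes, trading compute for longer effective context at eval time.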
Other
other
11-gram eval cache using multi-order n-gram tables with score-first, update-after protocol and entropy-adaptive mixing.
parameters: {"orders":"2-11","buckets_per_order":4194304}
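The score-first, update-after protocol means each position is first scored from counts accumulated over earlier positions only, then the observed (context, token) pair is inserted, so the cache never sees the token it is predicting. A simplified sketch of the multi-order tables with an entropy-adaptive backoff (the hashing, mixing weights, and fallback are assumptions; the record specifies orders 2-11 and 2^22 buckets per order):

```python
import math
from collections import defaultdict

class NGramEvalCache:
    # Hashed n-gram count tables for orders 2..11. Bucket count is reduced
    # here for illustration (the record uses 4194304 = 2**22 per order).
    def __init__(self, orders=range(2, 12), buckets=1 << 16, vocab=256):
        self.orders = list(orders)
        self.buckets = buckets
        self.vocab = vocab
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _key(self, ctx):
        return hash(ctx) % self.buckets

    def score(self, history, token):
        # Score-first: mix per-order predictions from counts seen so far,
        # weighting well-supported, low-entropy (confident) orders more.
        p, wsum = 1.0 / self.vocab, 1.0          # uniform fallback expert
        for n in self.orders:
            if len(history) < n - 1:
                continue
            counts = self.tables[n].get(self._key(tuple(history[-(n - 1):])))
            if not counts:
                continue
            total = sum(counts.values())
            probs = [c / total for c in counts.values()]
            ent = -sum(q * math.log(q) for q in probs)
            w = total / (1.0 + ent)              # entropy-adaptive weight (assumed form)
            p += w * counts.get(token, 0) / total
            wsum += w
        return p / wsum

    def update(self, history, token):
        # Update-after: only now does the observed token enter the tables.
        for n in self.orders:
            if len(history) >= n - 1:
                self.tables[n][self._key(tuple(history[-(n - 1):]))][token] += 1
```

Because scoring strictly precedes updating, the cache exploits within-document repetition without leaking the target token into its own prediction.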
other
Hedge Mixer online multiplicative-weights ensemble blending neural and n-gram predictions.
parameters: {"beta":2}
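The Hedge Mixer blends the neural and n-gram predictions with online multiplicative weights: after each true token is revealed, every expert's weight is scaled by its assigned probability raised to beta=2, then renormalized. A minimal sketch (the exact loss/update form is an assumption; the record lists only beta=2):

```python
class HedgeMixer:
    # Online multiplicative-weights (Hedge) ensemble over two experts,
    # e.g. the neural LM and the n-gram eval cache.
    def __init__(self, n_experts=2, beta=2.0):
        self.w = [1.0 / n_experts] * n_experts
        self.beta = beta

    def mix(self, expert_probs):
        # Weighted mixture of the experts' current predictions.
        z = sum(self.w)
        return sum(w * p for w, p in zip(self.w, expert_probs)) / z

    def update(self, expert_probs_true_token):
        # Multiply each weight by p_true ** beta, then renormalize, so
        # experts that assign the true token high probability gain mass.
        self.w = [w * (p ** self.beta) for w, p in zip(self.w, expert_probs_true_token)]
        z = sum(self.w)
        self.w = [w / z for w in self.w]
```

Hedge adapts per token, so the mixture leans on the n-gram cache inside repetitive spans and falls back to the neural model elsewhere.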
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}

Novel Contributions

  • 11-gram eval cache with entropy-adaptive mixing
  • Score-first, update-after n-gram protocol
  • Order-adaptive entropy gating for higher-order n-gram matches
  • Hedge Mixer online multiplicative-weights ensemble
  • Sliding-window evaluation with n-gram cache replacing TTT