PR #909

open

Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)

by sunnypatneedi
val_bpb
0.8609
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Architecture
XSA
Applied XSA across all 11 layers.
parameters: {"layers":11}
Gated Attention
Enabled gated attention in the transformer.
parameters: null
Partial RoPE
Used partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
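Partial RoPE with dimensions 16/64 means rotary position embeddings are applied to only the first 16 of each head's 64 dimensions, leaving the rest position-agnostic. A minimal sketch of that idea (the window size, frequency base, and exact dimension split convention are assumptions; the record only gives 16/64):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # Apply rotary embeddings to the first `rot_dims` of the head dimension;
    # pass the remaining dims through unchanged (partial RoPE).
    T = x.shape[-2]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation rates
    ang = np.arange(T)[:, None] * freqs              # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

At position 0 the rotation is the identity, and the last 48 dims are never touched, so content matching stays fully available there.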
Weight Tying
Tied input and output embeddings.
parameters: null
LeakyReLU
Used a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
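One plausible reading of "squared LeakyReLU" with slope 0.5 (the record does not spell out the exact form, so the squaring convention here is an assumption):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # Squared LeakyReLU: square the LeakyReLU output, so the negative
    # branch contributes (slope * x) ** 2. Assumed form; the record only
    # states "LeakyReLU squared" with slope 0.5.
    y = np.where(x >= 0, x, slope * x)
    return y * y
```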
VE64
Used value embedding on selected layers.
parameters: {"dimensions":64,"layers":"7-10"}
Quantization
late QAT
bits: null
scope: all
int6
bits: 6
scope: all
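The record lists int6 quantization over all weights but no scheme details; a common choice is symmetric per-tensor quantization into the signed 6-bit range [-31, 31], sketched here under that assumption:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor int6: map floats into [-31, 31] with one scale.
    # (Assumed scheme; the record only states bits=6, scope=all.)
    qmax = 2 ** (6 - 1) - 1                      # 31
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Per-tensor quantization error is bounded by half a scale step, which the late-QAT phase lets the network adapt to before the final checkpoint.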
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
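The listed parameters suggest an EMA updated every step with decay 0.997 plus an SWA running mean snapshotted every 50 steps; how the two averages are combined for the final weights is not stated, so this sketch just maintains both:

```python
class WeightAverager:
    # EMA (decay 0.997, every step) plus SWA (running mean, every 50 steps),
    # matching the listed parameters. Scalar params stand in for tensors.
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = dict(params)
        self.swa = {k: 0.0 for k in params}
        self.n_swa = 0
        self.decay = ema_decay
        self.swa_every = swa_every
        self.step = 0

    def update(self, params):
        self.step += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.swa_every == 0:       # snapshot into the SWA mean
            self.n_swa += 1
            for k, v in params.items():
                self.swa[k] += (v - self.swa[k]) / self.n_swa
```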
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
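Sliding-window evaluation with stride 64 scores each chunk of 64 new tokens with as much left context as the window allows, so every token is scored exactly once. A sketch of the window schedule (the window size of 2048 is an assumption; the record lists only the stride):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Yield (start, end, score_from): the model sees tokens [start, end),
    # but only tokens [score_from, end) are scored, so each token is
    # evaluated once with near-maximal left context.
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        yield start, end, pos
        pos = end
```

Per-token cost scales with window/stride forward passes, trading compute for longer effective context at eval time.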
Other
other
11-gram eval cache using multi-order n-gram tables with score-first, update-after protocol and entropy-adaptive mixing.
parameters: {"orders":"2-11","buckets_per_order":4194304}
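The score-first, update-after protocol means each position is first scored from counts accumulated over earlier positions only, then the observed (context, token) pair is inserted, so the cache never sees the token it is predicting. A simplified sketch of the multi-order tables with an entropy-adaptive backoff (the hashing, mixing weights, and fallback are assumptions; the record specifies orders 2-11 and 2^22 buckets per order):

```python
import math
from collections import defaultdict

class NGramEvalCache:
    # Hashed n-gram count tables for orders 2..11. Bucket count is reduced
    # here for illustration (the record uses 4194304 = 2**22 per order).
    def __init__(self, orders=range(2, 12), buckets=1 << 16, vocab=256):
        self.orders = list(orders)
        self.buckets = buckets
        self.vocab = vocab
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _key(self, ctx):
        return hash(ctx) % self.buckets

    def score(self, history, token):
        # Score-first: mix per-order predictions from counts seen so far,
        # weighting well-supported, low-entropy (confident) orders more.
        p, wsum = 1.0 / self.vocab, 1.0          # uniform fallback expert
        for n in self.orders:
            if len(history) < n - 1:
                continue
            counts = self.tables[n].get(self._key(tuple(history[-(n - 1):])))
            if not counts:
                continue
            total = sum(counts.values())
            probs = [c / total for c in counts.values()]
            ent = -sum(q * math.log(q) for q in probs)
            w = total / (1.0 + ent)              # entropy-adaptive weight (assumed form)
            p += w * counts.get(token, 0) / total
            wsum += w
        return p / wsum

    def update(self, history, token):
        # Update-after: only now does the observed token enter the tables.
        for n in self.orders:
            if len(history) >= n - 1:
                self.tables[n][self._key(tuple(history[-(n - 1):]))][token] += 1
```

Because scoring strictly precedes updating, the cache exploits within-document repetition without leaking the target token into its own prediction.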
other
Hedge Mixer online multiplicative-weights ensemble blending neural and n-gram predictions.
parameters: {"beta":2}
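The Hedge Mixer blends the neural and n-gram predictions with online multiplicative weights: after each true token is revealed, every expert's weight is scaled by its assigned probability raised to beta=2, then renormalized. A minimal sketch (the exact loss/update form is an assumption; the record lists only beta=2):

```python
class HedgeMixer:
    # Online multiplicative-weights (Hedge) ensemble over two experts,
    # e.g. the neural LM and the n-gram eval cache.
    def __init__(self, n_experts=2, beta=2.0):
        self.w = [1.0 / n_experts] * n_experts
        self.beta = beta

    def mix(self, expert_probs):
        # Weighted mixture of the experts' current predictions.
        z = sum(self.w)
        return sum(w * p for w, p in zip(self.w, expert_probs)) / z

    def update(self, expert_probs_true_token):
        # Multiply each weight by p_true ** beta, then renormalize, so
        # experts that assign the true token high probability gain mass.
        self.w = [w * (p ** self.beta) for w, p in zip(self.w, expert_probs_true_token)]
        z = sum(self.w)
        self.w = [w / z for w in self.w]
```

Hedge adapts per token, so the mixture leans on the n-gram cache inside repetitive spans and falls back to the neural model elsewhere.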
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}

Novel Contributions

  • 11-gram eval cache with entropy-adaptive mixing
  • Score-first, update-after n-gram protocol
  • Order-adaptive entropy gating for higher-order n-gram matches
  • Hedge Mixer online multiplicative-weights ensemble
  • Sliding-window evaluation with n-gram cache replacing TTT