PR #963 (closed)
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)
by sunnypatneedi
val_bpb
0.8609
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.8MB
Training Techniques
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
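A minimal NumPy sketch of the configuration above — 8 query heads sharing 4 KV heads, so each KV head serves 2 query heads. Single layer, single example; weight shapes are assumptions for illustration:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads attend over
    n_kv_heads shared key/value heads (here 8 over 4)."""
    T, d = x.shape
    hd = d // n_heads                         # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    # repeat each KV head for its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        # causal mask: each position attends only to itself and the past
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)
```

The KV projections are half the size of the query projection (4 heads instead of 8), which is where GQA saves parameters and KV-cache memory.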
XSA
XSA applied across all 11 transformer layers.
parameters: {"layers":11}
Gated Attention
Uses gated attention in the transformer blocks.
parameters: null
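The record does not name the gating variant, so this is only one common form as an assumption: gate the attention output elementwise with a sigmoid of a learned projection of the block input.

```python
import numpy as np

def gated_attention_output(attn_out, x, wg):
    """One common gated-attention form (an assumption; the PR does not
    specify the variant): the attention output is gated elementwise by
    a sigmoid of a learned projection of the block input x."""
    gate = 1.0 / (1.0 + np.exp(-(x @ wg)))   # values in (0, 1)
    return gate * attn_out
```

Because the gate lies in (0, 1), it can only attenuate the attention output, letting the block learn to suppress unhelpful heads or positions.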
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":"16/64"}
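A sketch of partial RoPE per the "16/64" parameter: rotate only the first 16 of each head's 64 dimensions and pass the rest through unchanged. The frequency schedule (base 10000) is the standard RoPE choice, assumed here:

```python
import numpy as np

def partial_rope(q, rot_dims=16):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head's dimensions (16 of 64 per the record); the remaining
    dimensions pass through unrotated."""
    T, hd = q.shape
    half = rot_dims // 2
    pos = np.arange(T)[:, None]                   # (T, 1)
    freqs = 10000.0 ** (-np.arange(half) / half)  # standard RoPE base
    ang = pos * freqs                             # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rot_dims:]], axis=-1)
```

Rotation preserves the norm of the rotated pairs, and position 0 is the identity; the untouched 48 dimensions carry position-independent content.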
Weight Tying
Tied input and output embeddings.
parameters: null
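Weight tying in miniature: the unembedding reuses the input embedding matrix, so the vocabulary projection adds no parameters (vocabulary and width below are illustrative, not the record's):

```python
import numpy as np

# Weight tying sketch: one shared embedding matrix serves both the
# input lookup and the output logit projection.
vocab, d_model = 1000, 64
E = 0.02 * np.random.default_rng(3).standard_normal((vocab, d_model))

def embed(token_ids):
    return E[token_ids]      # (T,) -> (T, d_model)

def unembed(hidden):
    return hidden @ E.T      # (T, d_model) -> (T, vocab), tied to E
```

For a small model this is a large fraction of the artifact size, which matters for the 15.8MB budget above.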
Weight Averaging
EMA
Exponential moving average of the model weights.
parameters: {"decay":0.997}
SWA
Stochastic weight averaging of periodic checkpoints.
parameters: {"interval":50}
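Both averaging schemes with the record's parameters (EMA decay 0.997, SWA interval 50), sketched over flat lists of weights:

```python
def ema_update(ema, w, decay=0.997):
    """Exponential moving average of weights, decay 0.997 per the record."""
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

class SWA:
    """Stochastic weight averaging: equal-weight running mean of
    checkpoints sampled every `interval` steps (50 per the record)."""
    def __init__(self, interval=50):
        self.interval, self.n, self.avg = interval, 0, None

    def maybe_update(self, step, w):
        if step % self.interval != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(w)
        else:
            # incremental running mean over sampled checkpoints
            self.avg = [a + (x - a) / self.n for a, x in zip(self.avg, w)]
```

EMA weights recent checkpoints geometrically; SWA weights every sampled checkpoint equally.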
Quantization
Late QAT
Quantization-aware training applied late in the run.
bits: null
scope: all
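The record leaves the bit width unspecified (`bits: null`); the fake-quantization forward pass below assumes 8 bits purely for illustration. In late QAT this rounding is switched on only for the final phase of training, with a straight-through estimator passing gradients through it:

```python
import numpy as np

def fake_quant(w, bits=8):
    """Fake-quantization forward pass for QAT: round weights to a
    symmetric `bits`-bit grid (bit width is an assumption; the record
    does not specify it). Gradients would bypass the rounding via a
    straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    return scale * np.clip(np.round(w / scale), -qmax - 1, qmax)
```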
Compression
zstd
level: 22
Evaluation
Sliding Window Eval
Evaluation with an overlapping sliding context window.
parameters: {"stride":64}
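A sketch of sliding-window evaluation with the record's stride of 64 (the window size and the `score_fn` hook are assumptions): the window advances 64 tokens at a time and only the newly exposed tokens are scored, so each token sees close to a full window of left context without being counted twice.

```python
def sliding_window_nll(score_fn, tokens, window=512, stride=64):
    """Average NLL under sliding-window evaluation. `score_fn(ctx, n)`
    is a hypothetical model hook returning the summed NLL of the last
    n tokens of ctx. Window size 512 is an assumption; stride 64 is
    the record's parameter."""
    total, scored = 0.0, 0
    while scored < len(tokens):
        # first call scores the whole window, later calls only `stride`
        n_new = min(window if scored == 0 else stride, len(tokens) - scored)
        end = scored + n_new
        ctx = tokens[max(0, end - window):end]
        total += score_fn(ctx, n_new)
        scored = end
    return total / len(tokens)
```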
11-gram Eval Cache
Hashed multi-order n-gram cache built online during evaluation.
parameters: {"orders":[2,11],"buckets_per_order":4000000}
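A sketch of the multi-order cache per the parameters above: one hashed count table for every order from 2 through 11, each with a fixed bucket count (4,000,000 in the record). Sparse dicts stand in for the flat count arrays a real implementation would use, and blake2b bucketing is an assumption:

```python
import hashlib

class NGramCache:
    """Hashed n-gram eval cache: one bucketed count table per order in
    [2, 11]. Scoring backs off from the longest matching order, with
    add-one smoothing (the smoothing choice is an assumption)."""
    def __init__(self, orders=(2, 11), buckets=4_000_000, vocab=256):
        self.lo, self.hi = orders
        self.buckets, self.vocab = buckets, vocab
        self.tables = {n: {} for n in range(self.lo, self.hi + 1)}

    def _bucket(self, ctx):
        h = hashlib.blake2b(repr(ctx).encode(), digest_size=8).digest()
        return int.from_bytes(h, "little") % self.buckets

    def score(self, ctx, token):
        """Probability of `token`: longest order with counts wins."""
        for n in range(min(self.hi, len(ctx) + 1), self.lo - 1, -1):
            counts = self.tables[n].get(self._bucket(tuple(ctx[-(n - 1):])))
            if counts:
                tot = sum(counts.values())
                return (counts.get(token, 0) + 1) / (tot + self.vocab)
        return 1.0 / self.vocab   # no order matched: uniform

    def update(self, ctx, token):
        """Record `token` after `ctx` for every order that fits."""
        for n in range(self.lo, self.hi + 1):
            if len(ctx) >= n - 1:
                b = self._bucket(tuple(ctx[-(n - 1):]))
                bucket = self.tables[n].setdefault(b, {})
                bucket[token] = bucket.get(token, 0) + 1
```

Calling `score` before `update` on each token realizes the score-first, update-after protocol listed under Novel Contributions.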
Other
Entropy-Adaptive Alpha Blending
Entropy-adaptive alpha blending between neural model logits and n-gram cache logits.
parameters: null
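One plausible realization of the blending described above (the linear entropy-to-alpha map and the `max_alpha` cap are assumptions; the PR does not publish the schedule): lean on the n-gram cache more when the neural model is uncertain, less when it is confident.

```python
import numpy as np

def entropy_adaptive_blend(model_logits, ngram_logits, max_alpha=0.5):
    """Blend neural and n-gram distributions with a weight that grows
    with the neural model's normalized entropy. The linear schedule
    and max_alpha=0.5 are assumptions for illustration."""
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    p_model = softmax(model_logits)
    p_ngram = softmax(ngram_logits)
    ent = -(p_model * np.log(p_model + 1e-12)).sum()
    max_ent = np.log(len(p_model))
    alpha = max_alpha * ent / max_ent   # 0 when confident, max_alpha at uniform
    return (1 - alpha) * p_model + alpha * p_ngram
```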
Hedge Mixer
Online multiplicative-weights (Hedge) ensemble between the base model's predictions and the n-gram-enhanced predictions.
parameters: {"beta":2}
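A sketch of the Hedge mixer over two experts, reading the record's `beta: 2` as the multiplicative-weights learning rate (the exact parameterization in the PR is an assumption):

```python
import math

class HedgeMixer:
    """Hedge / multiplicative-weights online ensemble. Experts here are
    the base model and the n-gram-enhanced predictor; beta=2 is taken
    from the record's parameters."""
    def __init__(self, n_experts=2, beta=2.0):
        self.w = [1.0] * n_experts
        self.beta = beta

    def mix(self, probs):
        """Weighted mixture of the experts' next-token probabilities."""
        tot = sum(self.w)
        return sum(w * p for w, p in zip(self.w, probs)) / tot

    def update(self, probs):
        """After the true token is revealed, scale each expert's weight
        by exp(-beta * log-loss) on that token."""
        for i, p in enumerate(probs):
            loss = -math.log(max(p, 1e-12))
            self.w[i] *= math.exp(-self.beta * loss)
        s = sum(self.w)                     # renormalize to avoid underflow
        self.w = [w / s for w in self.w]
```

Because updates happen only after each token is scored, the mixer is a legal eval-time ensemble: it adapts online without ever seeing a label before predicting it.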
Novel Contributions
- 11-gram eval cache with entropy-adaptive alpha blending
- Hedge Mixer online ensemble between neural and n-gram predictions
- Score-first, update-after n-gram cache protocol
- Sliding window evaluation combined with multi-order n-gram caching
- Eval-time-only improvement with no training objective changes
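The score-first, update-after protocol from the list above can be sketched as a single eval loop (`model_logprob`, `cache_score`, and `cache_update` are hypothetical hooks standing in for the PR's components, and the fixed `alpha` replaces the entropy-adaptive schedule for brevity):

```python
import math

def eval_bits_per_token(model_logprob, cache_score, cache_update, tokens, alpha=0.5):
    """Score-first, update-after: each token is scored by the n-gram
    cache before the cache is updated with it, so the cache never
    peeks at the label it is predicting."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[:i]
        p = (1 - alpha) * math.exp(model_logprob(ctx, tok)) \
            + alpha * cache_score(ctx, tok)
        bits -= math.log2(p)        # score first...
        cache_update(ctx, tok)      # ...update after
    return bits / max(1, len(tokens))
```

This ordering is what makes the whole record an eval-time-only improvement: the training objective and the model weights are untouched.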