PR #876

open

10L + Two-Pass Order-11 N-gram Backoff (0.5863 BPB)

by Bortlesboat
val_bpb
0.5863
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.4-15.6 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
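A minimal NumPy sketch of the 8-head / 4-KV-head sharing pattern described above. Shapes and the per-head loop are illustrative only, not the PR's implementation; masking and output projection are omitted.

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped query attention sketch: n_heads query heads share n_kv_heads
    KV heads. q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads  # query heads per KV head (2 here)
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # map each query head to its shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With this grouping, query heads 0 and 1 read the same KV head, halving KV-cache size relative to full multi-head attention.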
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
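One plausible reading of "LeakyReLU squared" with slope 0.5, sketched below: apply LeakyReLU, then square elementwise (the ReLU² pattern generalized to a leaky negative branch). Note that a plain square drops the sign of negative outputs; a sign-preserving variant is another possible reading the PR does not specify.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU (negative slope 0.5, per the PR) followed by an
    elementwise square. Assumed form; the exact variant is not specified."""
    l = np.where(x > 0, x, slope * x)
    return l * l
```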
Partial RoPE
Partial rotary positional embeddings.
parameters: null
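Partial RoPE rotates only a fraction of each head's dimensions and passes the rest through unrotated. A sketch, with a hypothetical rotation fraction of 0.5 and base 10000 (the PR gives no parameters):

```python
import numpy as np

def partial_rope(x, rot_frac=0.5, base=10000.0):
    """Apply rotary embeddings to the first rot_frac of head dims only.
    x: (T, d) for one head. rot_frac and base are assumed defaults."""
    T, d = x.shape
    d_rot = int(d * rot_frac)
    d_rot -= d_rot % 2  # rotated dims come in pairs
    x_rot, x_pass = x[:, :d_rot], x[:, d_rot:]
    half = d_rot // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Leaving part of each head unrotated gives the model some position-independent channels alongside the rotary ones.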
XSA
XSA applied in the last 4 layers.
parameters: {"layers":4}
Value Residual
Value residual connections in the transformer blocks.
parameters: null
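A common form of value residual blends each layer's value vectors with the first layer's values; the sketch below assumes that form with a fixed mixing weight (in practice the weight is often learned, and the PR gives no parameters).

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Blend the current layer's values with the first layer's values.
    lam is a hypothetical mixing weight; the PR does not specify it."""
    return lam * v_layer + (1.0 - lam) * v_first
```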
Regularization
LN scale
parameters: null
Quantization
mixed int5/int6
bits: 5 (MLP), 6 (attention)
scope: MLP and attention
Compression
zstd
level: 22
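The quantize-then-compress pipeline above can be sketched as symmetric integer quantization followed by entropy coding. The per-tensor scale and the use of zlib (standing in for zstd level 22, to keep the sketch stdlib-only) are assumptions; the PR does not detail its packing.

```python
import zlib

import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Sketch only; per-channel scales or other refinements are not specified."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# int5 for MLP weights, int6 for attention weights, per the PR's scheme
w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q5, s5 = quantize(w, bits=5)
# the PR compresses the packed artifact with zstd at level 22;
# zlib stands in here so the sketch needs no third-party dependency
blob = zlib.compress(q5.tobytes(), level=9)
```

Quantized weights have low entropy per byte, which is what makes the subsequent compression pass effective.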
Weight Averaging
EMA
parameters: {"decay":0.997}
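The EMA step with the PR's decay of 0.997 is straightforward; a sketch over a list of parameters:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over parameter values:
    avg <- decay * avg + (1 - decay) * params (decay 0.997 per the PR)."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

Evaluation then typically uses the averaged copy rather than the raw training weights.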
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.03}
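Muon's core step is momentum SGD whose 2-D update is approximately orthogonalized by a Newton-Schulz iteration. The sketch below uses the quintic coefficients from the public Muon reference implementation and the PR's hyperparameters; the transpose handling for non-square matrices and decoupled weight decay are simplifying assumptions.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf, lr=0.03, momentum=0.99, wd=0.04):
    """One Muon update for a square 2-D weight matrix, using the PR's
    matrix_lr, momentum, and weight_decay. Decay form is an assumption."""
    buf = momentum * buf + grad
    W = (1.0 - lr * wd) * W - lr * newton_schulz(buf)
    return W, buf
```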
Evaluation
sliding window eval
parameters: {"pass_1":"score-first","pass_2":"frozen cache rescore"}
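The two passes above can be sketched with a toy cache: pass 1 scores each token before adding it to the cache ("score-first", so a token never conditions on itself), and pass 2 rescores everything against the frozen final cache. `CountCache` is a stand-in unigram cache invented for illustration; the PR's cache is the hashed n-gram structure described below.

```python
import math
from collections import Counter

class CountCache:
    """Toy add-one-smoothed unigram cache standing in for the PR's cache."""
    def __init__(self, vocab):
        self.counts = Counter()
        self.vocab = vocab
        self.frozen = False

    def add(self, tok):
        if not self.frozen:
            self.counts[tok] += 1

    def freeze(self):
        self.frozen = True

    def logprob(self, tok):
        total = sum(self.counts.values())
        return math.log((self.counts[tok] + 1) / (total + self.vocab))

def two_pass_eval(tokens, cache):
    # Pass 1 (score-first): score each token, then update the cache with it.
    pass1 = []
    for t in tokens:
        pass1.append(cache.logprob(t))
        cache.add(t)
    # Pass 2: freeze the cache and rescore every token against it.
    cache.freeze()
    pass2 = [cache.logprob(t) for t in tokens]
    return pass1, pass2
```

The second pass lets early tokens benefit from statistics gathered over the whole window.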
Other
other
Two-pass order-11 n-gram backoff with hashed cache and entropy gating during evaluation.
parameters: {"orders":[2,11]}
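A sketch of the hashed n-gram backoff cache: contexts of orders 2 through 11 are hashed into fixed buckets, and prediction backs off from the longest matching order downward. The table layout, bucket count, and collision behavior are assumptions; the PR only names the structure.

```python
from collections import defaultdict

class HashedNgramCache:
    """Order-11 hashed n-gram backoff cache sketch (orders 2..11).
    Hash collisions alias distinct contexts into one bucket by design."""
    def __init__(self, max_order=11, min_order=2, n_buckets=1 << 20):
        self.max_order, self.min_order = max_order, min_order
        self.n_buckets = n_buckets
        # order -> bucket -> {next_token: count}
        self.tables = {k: defaultdict(lambda: defaultdict(int))
                       for k in range(min_order, max_order + 1)}

    def _bucket(self, ctx):
        return hash(ctx) % self.n_buckets

    def update(self, history, tok):
        for k in range(self.min_order, self.max_order + 1):
            if len(history) >= k - 1:
                ctx = tuple(history[-(k - 1):])
                self.tables[k][self._bucket(ctx)][tok] += 1

    def predict(self, history):
        # Back off from the highest order down; return the first order
        # whose (hashed) context has any counts.
        for k in range(self.max_order, self.min_order - 1, -1):
            if len(history) >= k - 1:
                ctx = tuple(history[-(k - 1):])
                counts = self.tables[k].get(self._bucket(ctx))
                if counts:
                    total = sum(counts.values())
                    return k, {t: c / total for t, c in counts.items()}
        return 0, {}
```

Hashing bounds memory at the cost of occasional context aliasing, which the entropy gate below can help absorb.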
other
Order-adaptive entropy gating that trusts higher-order n-gram matches more when model uncertainty is lower.
parameters: null
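One way to realize the gating described above, sketched below: weight the n-gram prediction more when the matched order is high and the model's own entropy is low. The functional form and `alpha` are guesses; the PR gives no parameters.

```python
import math

def gate_weight(model_probs, order, max_order=11, alpha=1.0):
    """Order-adaptive entropy gate (assumed form). Returns the mixing
    weight given to the n-gram prediction: it grows with the matched
    order and shrinks as the model's entropy grows."""
    entropy = -sum(p * math.log(p) for p in model_probs if p > 0)
    return (order / max_order) / (1.0 + alpha * entropy)

def mix(model_probs, ngram_probs, w):
    """Blend (1 - w) * model + w * n-gram over a shared vocabulary."""
    return [(1 - w) * m + w * n for m, n in zip(model_probs, ngram_probs)]
```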

Novel Contributions

  • Two-pass evaluation with a frozen-cache rescore of already-evaluated tokens
  • Order-11 hashed n-gram backoff cache with order-adaptive entropy gating
  • Score-first sliding window evaluation that updates cache only after scoring
  • Mixed int5 MLP / int6 attention quantization with zstd compression
  • EMA-averaged training with Muon optimizer and GQA/XSA-based transformer architecture