PR #876

open

10L + Two-Pass Order-11 N-gram Backoff (0.5863 BPB)

by Bortlesboat
val_bpb
0.5863
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.4-15.6 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
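A minimal NumPy sketch of the 8-head / 4-KV-head sharing pattern described above. Shapes and the per-head loop are illustrative only, not the PR's implementation; masking and output projection are omitted.

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped query attention sketch: n_heads query heads share n_kv_heads
    KV heads. q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads  # query heads per KV head (2 here)
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # map each query head to its shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With this grouping, query heads 0 and 1 read the same KV head, halving KV-cache size relative to full multi-head attention.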
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
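One plausible reading of "LeakyReLU squared" with slope 0.5, sketched below: apply LeakyReLU, then square elementwise (the ReLU² pattern generalized to a leaky negative branch). Note that a plain square drops the sign of negative outputs; a sign-preserving variant is another possible reading the PR does not specify.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU (negative slope 0.5, per the PR) followed by an
    elementwise square. Assumed form; the exact variant is not specified."""
    l = np.where(x > 0, x, slope * x)
    return l * l
```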
Partial RoPE
Partial rotary positional embeddings.
parameters: null
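Partial RoPE rotates only a fraction of each head's dimensions and passes the rest through unrotated. A sketch, with a hypothetical rotation fraction of 0.5 and base 10000 (the PR gives no parameters):

```python
import numpy as np

def partial_rope(x, rot_frac=0.5, base=10000.0):
    """Apply rotary embeddings to the first rot_frac of head dims only.
    x: (T, d) for one head. rot_frac and base are assumed defaults."""
    T, d = x.shape
    d_rot = int(d * rot_frac)
    d_rot -= d_rot % 2  # rotated dims come in pairs
    x_rot, x_pass = x[:, :d_rot], x[:, d_rot:]
    half = d_rot // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Leaving part of each head unrotated gives the model some position-independent channels alongside the rotary ones.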
XSA
XSA applied in the last 4 layers.
parameters: {"layers":4}
Value Residual
Value residual connections in the transformer blocks.
parameters: null
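A common form of value residual blends each layer's value vectors with the first layer's values; the sketch below assumes that form with a fixed mixing weight (in practice the weight is often learned, and the PR gives no parameters).

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Blend the current layer's values with the first layer's values.
    lam is a hypothetical mixing weight; the PR does not specify it."""
    return lam * v_layer + (1.0 - lam) * v_first
```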
Regularization
LN scale
parameters: null
Quantization
mixed int5/int6
bits: 5 (MLP), 6 (attention)
scope: MLP and attention
Compression
zstd
level: 22
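The quantize-then-compress pipeline above can be sketched as symmetric integer quantization followed by entropy coding. The per-tensor scale and the use of zlib (standing in for zstd level 22, to keep the sketch stdlib-only) are assumptions; the PR does not detail its packing.

```python
import zlib

import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Sketch only; per-channel scales or other refinements are not specified."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# int5 for MLP weights, int6 for attention weights, per the PR's scheme
w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q5, s5 = quantize(w, bits=5)
# the PR compresses the packed artifact with zstd at level 22;
# zlib stands in here so the sketch needs no third-party dependency
blob = zlib.compress(q5.tobytes(), level=9)
```

Quantized weights have low entropy per byte, which is what makes the subsequent compression pass effective.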
Weight Averaging
EMA
parameters: {"decay":0.997}
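The EMA step with the PR's decay of 0.997 is straightforward; a sketch over a list of parameters:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over parameter values:
    avg <- decay * avg + (1 - decay) * params (decay 0.997 per the PR)."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

Evaluation then typically uses the averaged copy rather than the raw training weights.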
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.03}
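Muon's core step is momentum SGD whose 2-D update is approximately orthogonalized by a Newton-Schulz iteration. The sketch below uses the quintic coefficients from the public Muon reference implementation and the PR's hyperparameters; the transpose handling for non-square matrices and decoupled weight decay are simplifying assumptions.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf, lr=0.03, momentum=0.99, wd=0.04):
    """One Muon update for a square 2-D weight matrix, using the PR's
    matrix_lr, momentum, and weight_decay. Decay form is an assumption."""
    buf = momentum * buf + grad
    W = (1.0 - lr * wd) * W - lr * newton_schulz(buf)
    return W, buf
```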
Evaluation
sliding window eval
parameters: {"pass_1":"score-first","pass_2":"frozen cache rescore"}
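The two passes above can be sketched with a toy cache: pass 1 scores each token before adding it to the cache ("score-first", so a token never conditions on itself), and pass 2 rescores everything against the frozen final cache. `CountCache` is a stand-in unigram cache invented for illustration; the PR's cache is the hashed n-gram structure described below.

```python
import math
from collections import Counter

class CountCache:
    """Toy add-one-smoothed unigram cache standing in for the PR's cache."""
    def __init__(self, vocab):
        self.counts = Counter()
        self.vocab = vocab
        self.frozen = False

    def add(self, tok):
        if not self.frozen:
            self.counts[tok] += 1

    def freeze(self):
        self.frozen = True

    def logprob(self, tok):
        total = sum(self.counts.values())
        return math.log((self.counts[tok] + 1) / (total + self.vocab))

def two_pass_eval(tokens, cache):
    # Pass 1 (score-first): score each token, then update the cache with it.
    pass1 = []
    for t in tokens:
        pass1.append(cache.logprob(t))
        cache.add(t)
    # Pass 2: freeze the cache and rescore every token against it.
    cache.freeze()
    pass2 = [cache.logprob(t) for t in tokens]
    return pass1, pass2
```

The second pass lets early tokens benefit from statistics gathered over the whole window.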
Other
other
Two-pass order-11 n-gram backoff with hashed cache and entropy gating during evaluation.
parameters: {"orders":[2,11]}
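A sketch of the hashed n-gram backoff cache: contexts of orders 2 through 11 are hashed into fixed buckets, and prediction backs off from the longest matching order downward. The table layout, bucket count, and collision behavior are assumptions; the PR only names the structure.

```python
from collections import defaultdict

class HashedNgramCache:
    """Order-11 hashed n-gram backoff cache sketch (orders 2..11).
    Hash collisions alias distinct contexts into one bucket by design."""
    def __init__(self, max_order=11, min_order=2, n_buckets=1 << 20):
        self.max_order, self.min_order = max_order, min_order
        self.n_buckets = n_buckets
        # order -> bucket -> {next_token: count}
        self.tables = {k: defaultdict(lambda: defaultdict(int))
                       for k in range(min_order, max_order + 1)}

    def _bucket(self, ctx):
        return hash(ctx) % self.n_buckets

    def update(self, history, tok):
        for k in range(self.min_order, self.max_order + 1):
            if len(history) >= k - 1:
                ctx = tuple(history[-(k - 1):])
                self.tables[k][self._bucket(ctx)][tok] += 1

    def predict(self, history):
        # Back off from the highest order down; return the first order
        # whose (hashed) context has any counts.
        for k in range(self.max_order, self.min_order - 1, -1):
            if len(history) >= k - 1:
                ctx = tuple(history[-(k - 1):])
                counts = self.tables[k].get(self._bucket(ctx))
                if counts:
                    total = sum(counts.values())
                    return k, {t: c / total for t, c in counts.items()}
        return 0, {}
```

Hashing bounds memory at the cost of occasional context aliasing, which the entropy gate below can help absorb.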
other
Order-adaptive entropy gating that trusts higher-order n-gram matches more when model uncertainty is lower.
parameters: null
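One way to realize the gating described above, sketched below: weight the n-gram prediction more when the matched order is high and the model's own entropy is low. The functional form and `alpha` are guesses; the PR gives no parameters.

```python
import math

def gate_weight(model_probs, order, max_order=11, alpha=1.0):
    """Order-adaptive entropy gate (assumed form). Returns the mixing
    weight given to the n-gram prediction: it grows with the matched
    order and shrinks as the model's entropy grows."""
    entropy = -sum(p * math.log(p) for p in model_probs if p > 0)
    return (order / max_order) / (1.0 + alpha * entropy)

def mix(model_probs, ngram_probs, w):
    """Blend (1 - w) * model + w * n-gram over a shared vocabulary."""
    return [(1 - w) * m + w * n for m, n in zip(model_probs, ngram_probs)]
```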

Novel Contributions

  • Two-pass evaluation with a frozen-cache rescore of already-evaluated tokens
  • Order-11 hashed n-gram backoff cache with order-adaptive entropy gating
  • Score-first sliding window evaluation that updates cache only after scoring
  • Mixed int5 MLP / int6 attention quantization with zstd compression
  • EMA-averaged training with Muon optimizer and GQA/XSA-based transformer architecture