PR #916

open

10L + PPM Full-Rescore Order-12 N-gram (0.3461 BPB)

by Bortlesboat
val_bpb
0.3461
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.3-15.6 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Squared LeakyReLU activation in the MLP, i.e. LeakyReLU(0.5)^2 with negative slope 0.5.
parameters: {"slope":0.5}
Partial RoPE
Partial rotary positional embeddings applied to 16 of 64 head dimensions.
parameters: {"ratio":"16/64"}
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
Value Residual
Value residual connections are included in the architecture.
parameters: null
BigramHash
Bigram hash module with 4096 buckets.
parameters: {"dimensions":4096}
Quantization
mixed int5/int6
bits: null
scope: MLP/attention
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.03}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"pass_1":"store per-token model probabilities without n-gram blending","pass_2":"rescore with frozen cache"}
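The two-pass flow above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the linear blending rule, the `lam` weight, and the function names are assumptions, since the card only states that pass 1 stores raw model probabilities and pass 2 rescores against a frozen cache.

```python
import numpy as np

def score_first_ttt(model_probs, ngram_probs, lam=0.5):
    """Score-first TTT sketch.

    Pass 1 stored per-token model probabilities without n-gram blending
    (model_probs). Pass 2 rescores them against probabilities drawn from
    a frozen n-gram cache (ngram_probs). The linear blend and lam=0.5
    are illustrative assumptions.
    """
    model_probs = np.asarray(model_probs, dtype=float)
    ngram_probs = np.asarray(ngram_probs, dtype=float)
    blended = (1 - lam) * model_probs + lam * ngram_probs
    return -np.log2(blended).mean()  # mean bits per token of the blend
```

Keeping the cache frozen in pass 2 means the n-gram statistics are identical for every token, so rescoring is a pure post-hoc reweighting of the stored pass-1 scores.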
Other
other
PPM-style all-order blend across matching n-gram orders 2-12 using escape probabilities.
parameters: {"orders":"2-12"}
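A minimal sketch of an all-order PPM blend over orders 2-12: each matching order contributes probability mass scaled by the escape probabilities of all higher orders, rather than a hard backoff to the single longest match. The PPM-C-style escape estimate (`distinct / (total + distinct)`) and the uniform order-0 fallback are assumptions; the PR does not state its exact escape formula.

```python
import numpy as np

def ppm_blend(context_counts, vocab_size, orders=range(12, 1, -1)):
    """Blend n-gram predictions across all matching orders, PPM-style.

    context_counts: dict mapping order -> next-token count vector for the
                    current context at that order (absent if no match).
    Returns a probability vector over the vocabulary.
    """
    probs = np.zeros(vocab_size)
    escape_mass = 1.0  # mass escaped down from all higher orders so far
    for k in orders:   # highest order first, as in classic PPM
        counts = context_counts.get(k)
        if counts is None or counts.sum() == 0:
            continue  # no match at this order: escape with full mass
        total = counts.sum()
        distinct = np.count_nonzero(counts)
        escape = distinct / (total + distinct)  # PPM-C style escape (assumed)
        probs += escape_mass * (1 - escape) * counts / total
        escape_mass *= escape
    probs += escape_mass / vocab_size  # order-0 uniform fallback (assumed)
    return probs
```

Because every order's contribution is discounted by the escape mass of the orders above it, the result is a proper distribution without ever committing to a single backoff order.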
other
Leave-one-out self-exclusion during full-cache rescoring to subtract each token's own contribution before scoring.
parameters: null
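A minimal sketch of leave-one-out self-exclusion during full-cache rescoring, assuming a (context bucket, next token) count matrix built over all tokens. Since each token's own (context, target) pair is in the cache, its count is decremented before scoring it; the function name and count-matrix layout are hypothetical.

```python
import numpy as np

def rescore_leave_one_out(cache_counts, contexts, targets):
    """Score each token against the full cache minus its own contribution.

    cache_counts: (num_contexts, vocab) count matrix built over ALL tokens,
                  so each token's own (context, target) count is included.
    contexts, targets: per-token context bucket ids and next-token ids.
    Returns per-token probabilities with the self-count subtracted.
    """
    probs = np.empty(len(targets))
    for i, (c, t) in enumerate(zip(contexts, targets)):
        row = cache_counts[c].astype(float)
        row[t] -= 1.0  # leave-one-out: remove this token's own count
        total = row.sum()
        # Uniform fallback (assumed) when the token was the only occurrence.
        probs[i] = row[t] / total if total > 0 else 1.0 / row.size
    return probs
```

Without this subtraction, every token would partially predict itself, inflating the cache-based scores.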

Novel Contributions

  • PPM-style all-order blend across n-gram orders 2-12 instead of hard backoff
  • Leave-one-out self-exclusion during full-cache rescoring to remove self-inclusion bias
  • Two-pass score-first evaluation pipeline with frozen cache rescoring
  • Vectorized cache construction over all tokens using np.bincount
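The last bullet's vectorized cache construction can be sketched as below: fuse each (context bucket, next token) pair into a single flat index and count all pairs with one np.bincount call, avoiding a Python-level loop over tokens. The rolling polynomial hash and its multiplier are illustrative assumptions, not the PR's hash.

```python
import numpy as np

def build_ngram_cache(tokens, order, vocab_size, num_buckets):
    """Build an (num_buckets, vocab_size) n-gram count cache in one pass.

    Hashes every length-`order` context window to a bucket, fuses each
    (bucket, next_token) pair into a flat index, and counts all pairs at
    once with np.bincount.
    """
    tokens = np.asarray(tokens, dtype=np.int64)
    n_ctx = len(tokens) - order
    # Rolling polynomial hash over each context window (illustrative choice).
    ctx = np.zeros(n_ctx, dtype=np.int64)
    for j in range(order):
        ctx = (ctx * 1000003 + tokens[j : n_ctx + j]) % num_buckets
    nxt = tokens[order:]
    flat = ctx * vocab_size + nxt  # fuse (bucket, token) into one flat index
    counts = np.bincount(flat, minlength=num_buckets * vocab_size)
    return counts.reshape(num_buckets, vocab_size)
```

Only the short loop over the `order` hash terms remains in Python; the per-token work is entirely vectorized, which is what makes full-corpus cache construction cheap.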