- **val_bpb:** 0.3461
- **Architecture:** Transformer
- **Optimizer:** Muon
- **Artifact size:** 15.3-15.6 MB
## Training Techniques

### Architecture
**GQA.** Grouped-query attention with 8 query heads sharing 4 KV heads. Parameters: `{"heads": 8, "kv_heads": 4}`.
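A minimal sketch of the grouped-query attention layout described above, in pure Python for clarity: 8 query heads share 4 KV heads, so each pair of query heads reads the same cached K/V. Head dimension and the single-position framing are illustrative, not from the report.

```python
import math

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention for one query position.

    q: [n_heads][d]            query vectors for the current position
    k, v: [seq][n_kv_heads][d] cached keys/values
    Each group of n_heads // n_kv_heads query heads shares one KV head.
    """
    group = n_heads // n_kv_heads
    d = len(q[0])
    out = []
    for h in range(n_heads):
        kvh = h // group  # which shared KV head this query head uses
        scores = [sum(a * b for a, b in zip(q[h], k[t][kvh])) / math.sqrt(d)
                  for t in range(len(k))]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(weights[t] * v[t][kvh][j] for t in range(len(v)))
                    for j in range(d)])
    return out
```

With 4 KV heads instead of 8, the KV cache is half the size of standard multi-head attention, which matters for a small-artifact submission like this one.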
**LeakyReLU.** LeakyReLU activation in the MLP with negative slope 0.5. Parameters: `{"slope": 0.5}`.
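For concreteness, the activation with the report's slope of 0.5 (much larger than the conventional 0.01):

```python
def leaky_relu(x, slope=0.5):
    # LeakyReLU: identity for non-negative inputs, scaled by `slope`
    # (0.5 per the report) for negative inputs.
    return x if x >= 0.0 else slope * x
```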
**Partial RoPE.** Rotary positional embeddings applied to only part of each head's dimensions. Parameters: none recorded.
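Since no parameters were recorded for the partial RoPE, the sketch below assumes the common formulation: rotary embedding is applied to the first `rot_dim` dimensions of each head vector and the rest pass through unrotated. The base of 10000 is the usual default, not a value from the report.

```python
import math

def partial_rope(x, pos, rot_dim, base=10000.0):
    """Apply rotary position embedding to the first `rot_dim` dimensions
    of head vector `x` at position `pos`; remaining dims are unchanged."""
    out = list(x)
    for i in range(0, rot_dim, 2):
        theta = pos * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s      # 2D rotation of the pair
        out[i + 1] = x[i] * s + x[i + 1] * c  # (x[i], x[i+1])
    return out
```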
**XSA.** Applied in the last 4 layers. Parameters: `{"layers": 4}`.
**Value Residual.** Value residual connections in the model. Parameters: none recorded.
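No parameters are recorded for the value residual, so the sketch below assumes the usual formulation: each later layer's attention value vectors are mixed with the first layer's. The mixing weight `lam` is illustrative, not a value from the report.

```python
def value_residual(v_layer, v_first, lam=0.5):
    # Value residual connection: blend the current layer's value vectors
    # with the first layer's. `lam` is an illustrative mixing weight.
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]
```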
### Regularization

**LN scale.** LayerNorm scale regularization. Parameters: none recorded.
### Quantization

**Mixed int5/int6.** MLP weights quantized to int5, attention weights to int6.
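A sketch of symmetric per-tensor quantization at the two bit widths used here (int5 for MLP weights, int6 for attention). The exact scheme, packing, and per-channel details aren't recorded, so this shows only the generic idea; the quantized streams would then be compressed with zstd (next section).

```python
def quantize(weights, bits):
    """Symmetric linear quantization to signed `bits`-bit integers.
    Returns integer codes and the scale needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

Rounding error per weight is bounded by half the scale, and int6 halves that error relative to int5 at a cost of one extra bit per weight, which motivates spending the extra bit on the attention weights.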
### Compression

**zstd** at compression level 22.
### Weight Averaging

**EMA.** Exponential moving average of weights. Parameters: `{"decay": 0.997}`.
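The EMA update with the recorded decay of 0.997 is one line per parameter; evaluation then uses the averaged weights rather than the raw training weights:

```python
def ema_update(avg, params, decay=0.997):
    # Exponential moving average: avg <- decay * avg + (1 - decay) * params.
    # With decay 0.997 the average tracks roughly the last ~300 steps.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```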
### Optimizer

**Muon.** Learning rate 0.03; weight decay and momentum not recorded.
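Only the learning rate (0.03) is recorded. For context, Muon's core step approximately orthogonalizes the momentum matrix with a quintic Newton-Schulz iteration before applying it as the update. The sketch below uses the commonly published coefficients; treat it as a simplified illustration, not this submission's exact configuration. Note the iteration pushes singular values toward 1 only approximately, not exactly onto 1.

```python
import math

def _matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def _lin_comb(terms):
    # Elementwise sum of coeff * matrix over [(coeff, matrix), ...].
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(cols)]
            for i in range(rows)]

def newton_schulz(g, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Approximately orthogonalize `g` via the quintic Newton-Schulz
    iteration used by Muon: X <- aX + b(XX^T)X + c(XX^T)^2 X."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]  # normalize into the
    for _ in range(steps):                      # iteration's basin
        xxt = _matmul(x, [list(col) for col in zip(*x)])
        x2 = _matmul(xxt, x)    # (X X^T) X
        x3 = _matmul(xxt, x2)   # (X X^T)^2 X
        x = _lin_comb([(a, x), (b, x2), (c, x3)])
    return x
```

The actual Muon update would then be `w <- w - lr * newton_schulz(momentum)` with `lr = 0.03` per the report.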
### Evaluation

**Sliding-window eval.** Parameters: none recorded.

**Stride-based eval.** Parameters: `{"stride": 64}`.
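A sketch of stride-based sliding-window scoring with the recorded stride of 64: windows advance by the stride, and each window only contributes the tokens not already covered by earlier windows, so every token is scored exactly once under window-limited context. `score_fn` and the window size are illustrative; the report records only the stride.

```python
def strided_nll(score_fn, tokens, window, stride=64):
    """Average per-token NLL. `score_fn(seq)` returns one NLL per
    position of `seq`, each conditioned on the preceding in-window
    tokens. Assumes window >= stride."""
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        per_token = score_fn(tokens[begin:end])
        new = end - prev_end          # tokens no earlier window scored
        total += sum(per_token[-new:])
        count += new
        prev_end = end
        if end == len(tokens):
            break
    return total / count
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes per token.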
### Test-Time Training

**LoRA TTT.** Per-document LoRA adaptation at test time. Parameters: `{"rank": 8}`.
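In LoRA test-time training, only a rank-8 adapter pair (A, B) is updated on the test document while the base weights stay frozen. The forward-pass sketch below shows the adapter structure; the scaling factor `alpha` and the zero initialization of B (which makes the adapter a no-op before training) are standard LoRA conventions, not values from the report.

```python
def _matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def lora_forward(x, w, a, b, alpha=16.0, rank=8):
    """y = W x + (alpha / rank) * B (A x).

    w: frozen base weight [d_out x d_in]
    a: adapter down-projection [rank x d_in]  (trained at test time)
    b: adapter up-projection [d_out x rank]   (trained, zero-initialized)
    """
    base = _matvec(w, x)
    delta = _matvec(b, _matvec(a, x))
    s = alpha / rank
    return [y + s * d for y, d in zip(base, delta)]
```

Because B starts at zero, per-document adaptation begins from exactly the base model and only the 2 * rank * d adapter parameters per layer need gradients, keeping TTT cheap.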
### Other

**PPM-style all-order blend.** Blends all matching n-gram orders from 2 to 12 using escape probabilities, with leave-one-out self-exclusion during full rescore. Parameters: `{"orders": [2, 12]}`.
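A minimal sketch of the blend: n-gram counts are kept for orders 2-12, and prediction walks from the longest matching context down, giving each order its count-based probability mass and passing the escape mass to the next-shorter order, ending in a uniform fallback. The report doesn't specify the escape scheme, so a PPM-C-style escape (`distinct / (total + distinct)`) is assumed here. The `exclude_self` flag sketches the leave-one-out rescore: when scoring the token actually present at a position against counts built over the full document, its own occurrence is first subtracted at every order.

```python
def build_counts(tokens, min_order=2, max_order=12):
    """counts[order][context_tuple][symbol] -> occurrence count."""
    counts = {o: {} for o in range(min_order, max_order + 1)}
    for o in counts:
        for i in range(o, len(tokens)):
            dist = counts[o].setdefault(tuple(tokens[i - o:i]), {})
            dist[tokens[i]] = dist.get(tokens[i], 0) + 1
    return counts

def ppm_prob(counts, context, symbol, vocab_size,
             min_order=2, max_order=12, exclude_self=False):
    p, escape_mass = 0.0, 1.0
    for order in range(max_order, min_order - 1, -1):
        if len(context) < order:
            continue
        dist = counts[order].get(tuple(context[-order:]))
        if not dist:
            continue  # no match at this order; mass passes down intact
        total = sum(dist.values())
        distinct = len(dist)
        c = dist.get(symbol, 0)
        if exclude_self and c > 0:
            # Leave-one-out: drop this position's own occurrence so the
            # full-document cache doesn't score the token against itself.
            c -= 1
            total -= 1
            if c == 0:
                distinct -= 1
        if total == 0:
            continue
        denom = total + distinct
        p += escape_mass * (c / denom)
        escape_mass *= distinct / denom   # PPM-C escape (assumed)
    return p + escape_mass / vocab_size   # uniform fallback
```

The escape construction telescopes, so the blended probabilities over the vocabulary sum to exactly one.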
## Novel Contributions
- PPM-style all-order blend across matching n-gram orders 2-12 using escape probabilities
- Leave-one-out self-exclusion in full-rescore to remove self-inclusion bias
- Two-pass evaluation pipeline with GPU sliding-window scoring, cache build, and full-token rescore
- Mixed int5/int6 quantization with zstd compression
- Neural cache and per-document LoRA test-time training described in the branch README