PR #843 (open)

Record: Order-12 N-gram Backoff + 256K Chunks — 0.2834 BPB

by quietsmile
val_bpb: 0.2834
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~13.4 MB

Training Techniques

Quantization
  • GPTQ (bits: 5, scope: model)
Architecture
  • XSA: uses XSA-4 as part of the model architecture; parameters: {"variant":4}
  • BigramHash: includes a BigramHash component with 4096 buckets; parameters: {"buckets":4096}
  • MLP3x: uses an expanded MLP with 3.0x width; parameters: {"multiplier":3}
Weight Averaging
  • EMA: parameters: {"decay":0.997}
  • SWA: parameters: null
Optimizer
  • Parallel Muon: weight_decay: null, momentum: null, other_params: null
Compression
  • LZMA: level: null
Evaluation
  • n-gram backoff cache: parameters: {"order":12,"chunk_tokens":256000,"alpha_max":0.7,"hash_primes_added":6}
Other
  • Entropy-adaptive n-gram mixing with per-order multipliers and score-first cache updates after each chunk; parameters: {"score_first":true,"cache_update_timing":"after scoring each chunk","low_order_multiplier":0.3,"high_order_multiplier":2}
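The entropy-adaptive mixing above can be sketched as follows. This is a hypothetical reconstruction, not the PR's code: the function names, the linear per-order schedule, the `entropy_scale` constant, and the linear interpolation form are all assumptions; only `alpha_max` (0.7), the 0.3/2.0 low- and high-order multipliers, and the maximum order (12) come from the listed parameters.

```python
def ngram_mix_weight(model_entropy, order, alpha_max=0.7,
                     low_order_multiplier=0.3, high_order_multiplier=2.0,
                     max_order=12, entropy_scale=4.0):
    """Sketch: scale the n-gram mixing weight by the model's per-token
    entropy and by a per-order multiplier (assumed linear schedule)."""
    # Higher model entropy -> lean more on the n-gram cache, capped at alpha_max.
    base = alpha_max * min(1.0, model_entropy / entropy_scale)
    # Low orders (generic contexts) get damped, high orders (specific
    # matches) get boosted, interpolating between the two multipliers.
    t = (order - 1) / (max_order - 1)  # 0.0 at order 1, 1.0 at order 12
    mult = low_order_multiplier + t * (high_order_multiplier - low_order_multiplier)
    return min(alpha_max, base * mult)

def mix(p_model, p_ngram, alpha):
    """Linear interpolation of model and n-gram probabilities (assumed form)."""
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

The cap at `alpha_max` means that even a fully confident high-order match never contributes more than 70% of the mixed probability.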

Novel Contributions

  • Extended eval-time n-gram backoff from order 9 to order 12
  • Added 6 additional hash primes for the n-gram cache
  • Reduced eval chunk size from 1M to 256K tokens for faster cache refresh
  • Increased alpha_max from 0.60 to 0.70 for stronger high-entropy n-gram mixing
  • Purely eval-time changes with no training modifications
  • Score-first compliant: the cache is updated only after each chunk has been scored
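The contributions above can be sketched as a score-first chunked evaluation loop. This is an illustrative reconstruction, not the PR's implementation: the `BackoffCache`, `ctx_hash`, and `evaluate` names, the count-based cache, and the specific prime values are assumptions (the PR adds 6 hash primes but does not list them); only the order (12), chunk size (256K tokens), and score-then-update timing come from the PR.

```python
from collections import defaultdict

MAX_ORDER = 12          # backoff considers contexts of up to 11 tokens
CHUNK_TOKENS = 256_000  # cache is refreshed after scoring each 256K-token chunk

# Illustrative primes for context hashing; the PR's actual values are not given.
HASH_PRIMES = [1000003, 10000019, 100000007, 998244353, 104729, 67867979]

def ctx_hash(context):
    """Fold a variable-length context tuple into a single cache key."""
    h = 0
    for i, tok in enumerate(context):
        h = (h * HASH_PRIMES[i % len(HASH_PRIMES)] + tok) & ((1 << 61) - 1)
    return h

class BackoffCache:
    """Count-based n-gram cache queried by longest-match backoff."""
    def __init__(self, max_order=MAX_ORDER):
        self.max_order = max_order
        self.counts = {}  # (order, ctx_hash) -> {next_token: count}

    def update(self, tokens):
        """Add every n-gram of order 1..max_order in `tokens` to the cache."""
        for i, tok in enumerate(tokens):
            for order in range(1, self.max_order + 1):
                start = i - (order - 1)
                if start < 0:
                    break
                key = (order, ctx_hash(tuple(tokens[start:i])))
                self.counts.setdefault(key, defaultdict(int))[tok] += 1

    def prob(self, history, tok):
        """Cache probability of `tok` after `history`, backing off from the
        longest context with counts; None if no order has any counts."""
        for order in range(self.max_order, 0, -1):
            if len(history) < order - 1:
                continue
            ctx = tuple(history[len(history) - (order - 1):])
            dist = self.counts.get((order, ctx_hash(ctx)))
            if dist:
                return dist.get(tok, 0) / sum(dist.values())
        return None

def evaluate(tokens, score_chunk, chunk_tokens=CHUNK_TOKENS):
    """Score-first loop: each chunk is scored against a cache built only
    from earlier chunks, and only then folded into the cache."""
    cache = BackoffCache()
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        score_chunk(chunk, cache)  # score first: no leakage from this chunk
        cache.update(chunk)        # then refresh the cache
    return cache
```

Shrinking the chunk from 1M to 256K tokens makes `cache.update` run four times as often over the same stream, so the cache tracks the recent token distribution more closely while the score-first ordering still keeps each chunk out of its own cache.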