PR #840 (open)

Record: 0.2873 BPB — Fine-Grained N-gram Cache (65K chunks)

by quietsmile

val_bpb: 0.2873
Architecture: 11L 512d GQA 8/4 Transformer
Optimizer: Parallel Muon
Artifact Size: ~13.4 MB

Training Techniques

Quantization
  • GPTQ (bits: 5, scope: all)
Architecture
  • GQA: grouped-query attention with 8/4 head configuration (heads: 8/4)
  • MLP3x: MLP width expanded to 3.0x (multiplier: 3)
  • XSA-4: XSA-4 architectural component (variant: 4)
  • LeakyReLU: squared LeakyReLU activation (slope: 0.9, power: 2)
  • BigramHash: bigram hash component for n-gram-related modeling (buckets: 4096)
Optimizer
  • Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Weight Averaging
  • EMA (decay: 0.997)
  • SWA (parameters: null)
Compression
  • lzma (level: null)
Evaluation
  • Fine-grained n-gram cache chunked evaluation (chunk_tokens: 65536, backoff_order: 2-9, hash_buckets: 4000000, score_first: true, cache_update_after_chunk: true)
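The chunked, score-first evaluation listed above can be sketched as follows. This is a minimal illustration, not the PR's code: `model_logprob`, the dict-based hashed cache, and the fixed blend weight `alpha` are all assumptions; only the chunk size, backoff orders 2-9, bucket count, and the score-then-update ordering come from the PR.

```python
import math
from collections import defaultdict

CHUNK_TOKENS = 65_536          # reduced from 1,000,000 in this PR
BACKOFF_ORDERS = range(2, 10)  # backoff_order: 2-9
HASH_BUCKETS = 4_000_000

def ngram_key(ctx, order):
    # hash the last (order - 1) context tokens into a fixed bucket space
    return hash((order, tuple(ctx[-(order - 1):]))) % HASH_BUCKETS

def cache_prob(counts, ctx, tok):
    # back off from the highest matching order to the lowest
    for order in reversed(BACKOFF_ORDERS):
        bucket = counts.get(ngram_key(ctx, order))
        if bucket:
            return bucket[tok] / sum(bucket.values())
    return 0.0

def evaluate(tokens, model_logprob, alpha=0.3, chunk_tokens=CHUNK_TOKENS):
    """Chunked, backward-looking evaluation: score each chunk with the
    current cache, then fold that chunk into the cache (score-first)."""
    counts = defaultdict(lambda: defaultdict(int))  # key -> next-token counts
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) score-first: the cache seen here excludes the current chunk
        for i, tok in enumerate(chunk):
            ctx = tokens[:start + i]
            p_model = math.exp(model_logprob(ctx, tok))
            p_cache = cache_prob(counts, ctx, tok)
            p = (1 - alpha) * p_model + alpha * p_cache if p_cache else p_model
            total_nll += -math.log(p)
        # 2) only after scoring, update the cache with the chunk's n-grams
        for i, tok in enumerate(chunk):
            ctx = tokens[:start + i]
            for order in BACKOFF_ORDERS:
                if len(ctx) >= order - 1:
                    counts[ngram_key(ctx, order)][tok] += 1
    # bits per token (the leaderboard's per-byte normalization is omitted)
    return total_nll / (len(tokens) * math.log(2))
```

Smaller chunks mean the cache lags the scored position by at most 65,536 tokens instead of 1M, which is the frequency effect the contributions below describe.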
Other
  • Entropy-adaptive alpha for the n-gram backoff cache, varying with model confidence and n-gram order
  • Per-order multipliers for the n-gram cache, suppressing low orders (multiplier: 0.3) and boosting high orders (multiplier: 2)
  • Perplexity-sorted shard ordering during training
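The first two "Other" entries can be read as a single blending rule: lean on the cache more when the model is uncertain and when the matched n-gram order is high. A hypothetical sketch follows; the entropy normalization, the order-5 split between "low" and "high", and `base_alpha` are assumptions, since the PR only states the 0.3/2 multipliers.

```python
import math

LOW_ORDER_MULT = 0.3   # suppress low orders (split at order 5 is an assumption)
HIGH_ORDER_MULT = 2.0  # boost high orders

def adaptive_alpha(model_probs, matched_order, base_alpha=0.3, max_alpha=0.9):
    """Blend weight for the n-gram cache, grown when the model is uncertain
    and scaled by the order of the matched n-gram."""
    # model confidence via entropy of its predictive distribution (nats)
    entropy = -sum(p * math.log(p) for p in model_probs if p > 0)
    max_entropy = math.log(len(model_probs))
    uncertainty = entropy / max_entropy      # 0 = confident, 1 = uniform
    mult = HIGH_ORDER_MULT if matched_order >= 5 else LOW_ORDER_MULT
    return min(max_alpha, base_alpha * uncertainty * mult)
```

A confident model (low entropy) drives alpha toward zero, so the cache only intervenes where the model is unsure and a long n-gram match exists.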

Novel Contributions

  • Reducing NGRAM_EVAL_CHUNK_TOKENS from 1,000,000 to 65,536 for much more frequent n-gram cache updates
  • Demonstrating that cache update frequency is the dominant factor in n-gram BPB performance
  • Score-first evaluation where the cache is updated only after each chunk is fully scored
  • Fine-grained backward-looking n-gram cache evaluation without test-time training (TTT) or additional compute
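The perplexity-sorted shard ordering listed under Training Techniques can be sketched as below. The PR gives no parameters, so scoring each shard with a reference model and visiting easiest (lowest-perplexity) shards first are both assumptions.

```python
def order_shards(shards, shard_perplexity):
    """Return training shards sorted by a precomputed reference-model
    perplexity score, ascending (easiest shards first)."""
    scored = [(shard_perplexity(s), s) for s in shards]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored]
```

Sorting once before training fixes a curriculum without adding any per-step cost.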