PR #797

open

Record: 7-gram N-gram Cache (0.8960 bpb)

by armantsaturian
val_bpb
0.8960
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.92 MB

Training Techniques

Architecture
XSA
Extended XSA to all layers instead of only the last few layers.
parameters: {"layers":11}
MLP3x
Uses a 3x MLP with LeakyReLU(0.5)^2 activation.
parameters: null
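The MLP activation listed above (LeakyReLU with negative slope 0.5, then squared) can be sketched in NumPy; the function shape comes from the record, while the tensor library and the exact weight layout are assumptions:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared, per the record card
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w_in, w_out):
    # "3x" MLP: hidden width is 3x the model dim (w_in: d -> 3d, w_out: 3d -> d)
    return leaky_relu_sq(x @ w_in) @ w_out
```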
BigramHash
Includes a BigramHash component in the model.
parameters: {"size":2048}
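The record gives only the table size (2048) for the BigramHash component. A common realization (a sketch, not the author's confirmed scheme) is a hashed embedding table indexed by the (previous token, current token) pair; the multiplier below is a hypothetical mixing constant:

```python
def bigram_hash_index(prev_tok, cur_tok, size=2048):
    # hypothetical hash; the record specifies only the table size (2048)
    return (prev_tok * 0x9E3779B1 + cur_tok) % size
```

The looked-up vector would typically be added to the token embedding at each position; that placement is an assumption here.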
RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
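Partial rotary embeddings rotate only a slice of each head's dimensions, here 16 of 64 per the record; the remaining 48 pass through untouched. A minimal sketch (the half-split pairing convention is an assumption):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims of the head dim (16 of 64 per the record),
    # leaving the rest of the vector unchanged. x: (..., head_dim)
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[..., rot_dims:]], axis=-1)
```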
tied embeddings
Uses tied FP16 embeddings with softcap.
parameters: {"softcap":30}
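Tying reuses the input embedding matrix as the output head, and the softcap bounds the resulting logits. A sketch with the listed cap of 30; the tanh form of the softcap is a common choice, assumed here:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # smoothly bounds logits to (-cap, cap); cap=30 per the record
    return cap * np.tanh(logits / cap)

def tied_head(x, emb):
    # output head reuses the embedding matrix (stored in FP16 per the record)
    return softcap(x @ emb.T)
```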
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":"every 50 steps"}
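The two averaging schemes above can be combined; a dict-of-arrays sketch using the listed EMA decay (0.997) and SWA snapshot frequency (every 50 steps):

```python
def ema_update(avg, params, decay=0.997):
    # exponential moving average of weights; decay=0.997 per the record
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

class SWA:
    # stochastic weight averaging: running mean of snapshots every 50 steps
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_snapshot(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            self.avg = {k: v + (params[k] - v) / self.n for k, v in self.avg.items()}
```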
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
lzma
level: null
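Quantized weights compress well, which is how the ~15.92 MB artifact size is reached. The record doesn't spell out the GPTQ-lite procedure, so the sketch below uses plain round-to-nearest 6-bit quantization as a stand-in (GPTQ proper would additionally correct quantization error using second-order statistics), followed by stdlib lzma:

```python
import lzma
import numpy as np

def quantize_rtn(w, bits=6):
    # symmetric round-to-nearest quantization; a stand-in for GPTQ-lite
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = quantize_rtn(w)                           # bits=6, scope=all per the record
blob = lzma.compress(q.tobytes(), preset=9)          # lzma level unspecified in the record
```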
Evaluation
sliding window eval
parameters: {"stride":64}
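Sliding-window evaluation re-runs the model on overlapping windows, scoring only the new tokens of each window so every position gets (near-)full left context. A sketch of the span bookkeeping with the listed stride of 64; the window length is an assumption:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # Returns (ctx_start, ctx_end, score_start) triples: tokens in
    # [score_start, ctx_end) are scored with context [ctx_start, ctx_end),
    # so each token is scored exactly once.
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, 0))
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans
```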
7-gram n-gram cache
parameters: {"orders":"2-7","backoff_beta":0.000001,"alpha":0.2,"score_before_update":true}
Test-Time Training
disabled
parameters: null
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
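The listed formula scales each layer's normalization output by 1/sqrt(layer+1), so deeper layers get progressively smaller gain; with 0-indexed layers (an assumption) that is:

```python
import math

def ln_scale(layer_idx):
    # 1/sqrt(layer+1) per the record; layer_idx assumed 0-indexed
    return 1.0 / math.sqrt(layer_idx + 1)
```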
Other
other
LeakyReLU(0.5)^2 activation in the MLP.
parameters: null
other
Streaming single-pass cache built during evaluation with recursive backoff and fixed cache/neural blending.
parameters: {"cache_neural_mix":"80/20","near_zero_backoff_beta":0.000001}
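The cache described above can be sketched as follows: streaming counts for orders 2-7 built during evaluation, recursive backoff interpolation in which the near-zero beta (1e-6) makes the longest matched order dominate, score-before-update ordering, and a fixed 80/20 cache-to-neural mix. The exact backoff formula is an assumption, and alpha=0.2 is read here as the neural share of the blend:

```python
from collections import defaultdict

class StreamingNgramCache:
    def __init__(self, orders=range(2, 8), beta=1e-6, vocab_size=256):
        self.orders = list(orders)      # n-gram orders 2..7 per the record
        self.beta = beta                # near-zero backoff weight
        self.vocab_size = vocab_size
        # counts[n][context_tuple][token] -> count, filled while evaluating
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def prob(self, history, token):
        p = 1.0 / self.vocab_size       # order-0 fallback: uniform
        for n in self.orders:           # low to high: longest order applied last
            if len(history) < n - 1:
                break
            dist = self.counts[n].get(tuple(history[-(n - 1):]))
            if dist:
                total = sum(dist.values())
                # interpolate with the lower-order estimate; with beta ~ 0 the
                # observed counts at the longest matching order dominate
                p = (dist.get(token, 0) + self.beta * p) / (total + self.beta)
        return p

    def update(self, history, token):
        for n in self.orders:
            if len(history) >= n - 1:
                self.counts[n][tuple(history[-(n - 1):])][token] += 1

def blended_prob(cache, history, token, neural_prob, cache_weight=0.8):
    # score BEFORE update: the token being scored never counts toward itself
    p = cache_weight * cache.prob(history, token) + (1 - cache_weight) * neural_prob
    cache.update(history, token)
    return p
```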

Novel Contributions

  • Streaming single-pass 7-gram n-gram cache applied during evaluation
  • Near-zero backoff beta so longest n-gram match dominates
  • Fixed 80/20 cache-to-neural blending
  • Extended XSA to all 11 layers
  • LeakyReLU(0.5)^2 MLP and Parallel Muon base model
  • Score-before-update cache protocol that preserves evaluation legality (the cache never sees the token it is currently scoring)