val_bpb: 1.0945
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB
Training Techniques
Architecture
BigramHash
Uses a bigram hash component in the model stack.
parameters: {"size":1536}
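The bigram-hash component can be sketched as a hashed embedding table: each (previous, current) byte pair hashes into one of 1536 rows, and the looked-up vector is added to the ordinary token embedding. Only the table size comes from the card; the hash function, the model width `D_MODEL`, and all names below are illustrative assumptions.

```python
import numpy as np

TABLE_SIZE = 1536   # from parameters {"size": 1536}
D_MODEL = 64        # assumed model width for this sketch

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, D_MODEL)) * 0.02

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash into the table; any stable hash works.
    return (prev_tok * 31 + cur_tok) % TABLE_SIZE

def bigram_features(tokens: list) -> np.ndarray:
    # Position 0 has no predecessor; use 0 as a padding token.
    prev = [0] + tokens[:-1]
    idx = [bigram_hash(p, c) for p, c in zip(prev, tokens)]
    return bigram_table[idx]  # (seq_len, D_MODEL), added to token embeddings

feats = bigram_features([72, 101, 108, 108, 111])  # "Hello" as bytes
```

Hashing collisions are accepted by design: with 1536 rows and 65536 possible byte pairs, many bigrams share a row, which keeps the component tiny.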
XSA
Applies XSA to the last 4 layers of the model.
parameters: {"layers":4}
RoPE
Uses partial rotary positional embeddings, rotating 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
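Partial RoPE can be sketched as rotating only the first 16 of the 64 head dimensions and passing the remaining 48 through unchanged. The split matches the card's parameters; the base frequency 10000 and the dimension pairing are standard-RoPE assumptions, not taken from the card.

```python
import numpy as np

ROT_DIMS, HEAD_DIM = 16, 64  # {"dimensions": 16, "total_dimensions": 64}

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    # x: (seq, HEAD_DIM); pos: (seq,) integer positions
    half = ROT_DIMS // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # (half,)
    angles = pos[:, None] * freqs[None, :]             # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dimensions beyond ROT_DIMS carry no positional signal.
    return np.concatenate([rotated, x[:, ROT_DIMS:]], axis=-1)

x = np.random.default_rng(0).standard_normal((8, HEAD_DIM))
out = partial_rope(x, np.arange(8))
```

Rotating only a prefix of the head dimensions is a common trick to keep some channels position-free while still encoding relative positions.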
VE128
Adds value residual enhancement in layers 9 and 10.
parameters: {"layers":[9,10],"dimension":128}
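The card gives only the name, layers, and dimension for VE128, so this is a speculative sketch assuming a ResFormer-style value residual: in layers 9 and 10, the attention values are mixed with the first layer's values. The mixing weight `lam` is illustrative, not from the card.

```python
import numpy as np

VALUE_DIM = 128            # from parameters {"dimension": 128}
ENHANCED_LAYERS = {9, 10}  # from parameters {"layers": [9, 10]}

def mix_values(v_layer: np.ndarray, v_first: np.ndarray,
               layer_idx: int, lam: float = 0.5) -> np.ndarray:
    # v_layer, v_first: (seq, VALUE_DIM); blend only in the listed layers.
    if layer_idx in ENHANCED_LAYERS:
        return lam * v_layer + (1.0 - lam) * v_first
    return v_layer
```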
MLP3x
Uses a 3x MLP stack.
parameters: null
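A minimal sketch of the MLP block, assuming "3x" refers to a 3x hidden expansion (hidden width = 3 * d_model); the card does not spell this out, and the model width here is illustrative.

```python
import numpy as np

D_MODEL = 64
HIDDEN = 3 * D_MODEL  # "3x" assumed to mean a 3x hidden expansion

rng = np.random.default_rng(0)
w_in = rng.standard_normal((D_MODEL, HIDDEN)) * 0.02
w_out = rng.standard_normal((HIDDEN, D_MODEL)) * 0.02

def mlp(x: np.ndarray) -> np.ndarray:
    # Plain ReLU as a placeholder; the card's actual activation is a
    # squared LeakyReLU (see the activation entry below).
    h = np.maximum(x @ w_in, 0.0)
    return h @ w_out

y = mlp(np.ones((2, D_MODEL)))
```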
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"squared":true,"negative_slope":0.5}
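A sketch of the activation, assuming "squared" means the LeakyReLU output is squared elementwise, by analogy with the ReLU-squared activation; the card does not define the exact form.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope: float = 0.5):
    # LeakyReLU with slope 0.5 on the negative side, then squared.
    # Note: squaring maps negative branch outputs to positive values.
    x = np.asarray(x, dtype=float)
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```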
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
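The weight-averaging scheme can be sketched as two running averages over training: an EMA with decay 0.997 updated every step, and a "tight" SWA that folds in a snapshot every 50 steps. Both constants come from the card; how the two averages are combined at the end is an assumption left open here.

```python
import numpy as np

EMA_DECAY, SWA_EVERY = 0.997, 50  # {"ema_decay": 0.997, "swa_every": 50}

def run_averaging(weights_per_step):
    # weights_per_step: sequence of parameter arrays, one per step.
    ema = weights_per_step[0].astype(float).copy()
    swa_sum = np.zeros_like(ema)
    swa_count = 0
    for step, w in enumerate(weights_per_step):
        ema = EMA_DECAY * ema + (1.0 - EMA_DECAY) * w
        if step % SWA_EVERY == 0:  # "tight" = frequent snapshots
            swa_sum += w
            swa_count += 1
    return ema, swa_sum / swa_count
```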
Quantization
GPTQ-lite
bits: 6
scope: model
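"GPTQ-lite" is not described further in the card, so this sketch shows only the basic mechanics of a 6-bit grid (64 levels) with a per-row scale, using simple round-to-nearest rather than GPTQ's error-compensating updates.

```python
import numpy as np

BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for symmetric signed quantization

def quantize_rows(w: np.ndarray):
    # One scale per row, chosen so the row's max maps to +/- QMAX.
    scale = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Six bits per weight (before entropy coding) is the main lever behind the sub-16 MB artifact size.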
Compression
lzma
level: null
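The final artifact step can be sketched with the standard-library `lzma` module: the quantized weight bytes are LZMA-compressed for storage and decompressed at load time. The helper names and the serialization format are illustrative.

```python
import lzma
import numpy as np

def compress_weights(q: np.ndarray) -> bytes:
    # level: null in the card; lzma.compress uses its default preset here.
    return lzma.compress(q.tobytes())

def decompress_weights(blob: bytes, dtype, shape) -> np.ndarray:
    return np.frombuffer(lzma.decompress(blob), dtype=dtype).reshape(shape)
```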
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adam_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
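The momentum warmup implied by other_params can be sketched as a ramp from 0.92 to the final 0.99 over the first 1500 steps; the linear shape of the ramp is an assumption, only the endpoints and step count come from the card.

```python
# From other_params: momentum_warmup_start 0.92, momentum 0.99,
# momentum_warmup_steps 1500.
START, END, WARMUP_STEPS = 0.92, 0.99, 1500

def momentum_at(step: int) -> float:
    # Linear ramp (assumed), then constant at the final momentum.
    if step >= WARMUP_STEPS:
        return END
    frac = step / WARMUP_STEPS
    return START + frac * (END - START)
```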
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
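The LN-scale regularizer can be sketched as a depth-dependent damping: a layer's normalized output is multiplied by 1/sqrt(layer + 1), so deeper layers contribute smaller updates to the residual stream. Whether the scale applies to the LayerNorm output or the whole residual branch is not specified in the card.

```python
import numpy as np

def ln_scale(x: np.ndarray, layer: int) -> np.ndarray:
    # {"scale": "1/sqrt(layer+1)"}; layer is 0-indexed here (assumed).
    return x / np.sqrt(layer + 1)
```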
Evaluation
sliding window eval
parameters: {"stride":64}
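Sliding-window evaluation with stride 64 can be sketched as scoring overlapping windows where only the last 64 positions of each window count toward the loss, so every token is evaluated with the longest available left context. The window length and `score_fn` (a stand-in for the model forward pass returning per-token NLLs) are assumptions.

```python
import numpy as np

STRIDE, WINDOW = 64, 256  # stride from the card; WINDOW is assumed

def sliding_window_nll(tokens, score_fn) -> float:
    total, count = 0.0, 0
    start = 0
    while start < len(tokens):
        end = min(start + STRIDE, len(tokens))
        ctx_start = max(0, end - WINDOW)
        nlls = score_fn(tokens[ctx_start:end])  # per-token NLLs
        fresh = end - start                     # only new tokens count
        total += float(np.sum(nlls[-fresh:]))
        count += fresh
        start = end
    return total / count
```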
Other
N-gram cache with entropy-adaptive alpha interpolates byte-level N-gram predictions with model logits during evaluation.
parameters: {"max_order":7,"alpha":0.5,"nll_threshold":2.5,"adaptive_range":[0.1,2],"backoff":"strict"}
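The N-gram cache can be sketched as follows, with several stated assumptions: counts are collected online over the bytes seen so far; "strict backoff" means the longest matching context from order 7 down to order 2 is used and lower orders are consulted only when higher ones have no counts; and the interpolation weight is the base alpha scaled by the model's own NLL relative to nll_threshold, clipped to adaptive_range.

```python
import numpy as np
from collections import defaultdict

MAX_ORDER, BASE_ALPHA = 7, 0.5
NLL_THRESHOLD = 2.5
ALPHA_MIN, ALPHA_MAX = 0.1, 2.0  # "adaptive_range": [0.1, 2]

class NgramCache:
    def __init__(self):
        # context tuple (of bytes) -> {next byte: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, ctx: bytes, nxt: int):
        # Record the new byte under every context length up to order-1.
        for n in range(1, MAX_ORDER):
            if len(ctx) >= n:
                self.counts[tuple(ctx[-n:])][nxt] += 1

    def predict(self, ctx: bytes):
        # Strict backoff: the longest context with any counts wins.
        for n in range(MAX_ORDER - 1, 0, -1):
            if len(ctx) < n:
                continue
            c = self.counts.get(tuple(ctx[-n:]))
            if c:
                total = sum(c.values())
                probs = np.zeros(256)
                for b, k in c.items():
                    probs[b] = k / total
                return probs
        return None  # no match at any order: cache abstains

def adaptive_alpha(model_nll: float) -> float:
    # Lean harder on the cache when the model is uncertain.
    scale = float(np.clip(model_nll / NLL_THRESHOLD, ALPHA_MIN, ALPHA_MAX))
    return min(1.0, BASE_ALPHA * scale)

def mix(model_probs, cache_probs, model_nll):
    if cache_probs is None:
        return model_probs
    a = adaptive_alpha(model_nll)
    return (1.0 - a) * model_probs + a * cache_probs
```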
Test-Time Training
TTT
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
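A minimal sketch of a warmdown schedule: the learning rate is held at its base value and then decays to zero over the final 3500 steps. Linear decay is an assumption; only warmdown_steps is given in the card.

```python
WARMDOWN_STEPS = 3500  # from parameters {"warmdown_steps": 3500}

def lr_at(step: int, total_steps: int, base_lr: float) -> float:
    remaining = total_steps - step
    if remaining >= WARMDOWN_STEPS:
        return base_lr
    return base_lr * remaining / WARMDOWN_STEPS  # linear warmdown (assumed)
```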
Novel Contributions
- N-gram cache replaces TTT for evaluation-time adaptation
- Entropy-adaptive alpha scales cache interpolation by token uncertainty
- Strict backoff N-gram cache with order 7 to 2
- CPU-overlapped N-gram scoring alongside GPU sliding window evaluation
- Achieves 1.0945 BPB with 3-seed consistency and sub-16 MB artifacts