PR #890

open

Record: Order-Adaptive 9-gram Backoff + Distributed Prefill — val_bpb 0.4405 (3-seed mean)

by sofiabod
val_bpb
0.4405
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,101,371 bytes

Training Techniques

Architecture
XSA
XSA applied across all layers of the transformer.
parameters: {"layers":11}
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
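The record does not spell out the squared-LeakyReLU form; a minimal sketch, assuming it squares the LeakyReLU output elementwise (analogous to the ReLU² activation):

```python
def leaky_relu_squared(x, slope=0.5):
    # slope=0.5 from the record; squaring the output is an assumption --
    # a sign-preserving variant y * abs(y) is another plausible reading.
    y = x if x > 0.0 else slope * x
    return y * y
```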
Partial RoPE
Partial rotary position embeddings.
parameters: {"dimensions":"16/64"}
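Partial RoPE rotates only a prefix of each head's dimensions and leaves the rest position-independent; per the record, 16 of 64 dims. A pure-Python sketch (the base frequency of 10000 is an assumption):

```python
import math

def apply_partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims of a head vector x (list of floats) by
    position pos; dims beyond rot_dims are left untouched."""
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```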
BigramHash
Bigram hash module used in the architecture.
parameters: {"buckets":4096,"dimensions":128}
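A plausible reading of the bigram hash module: hash each (previous token, current token) pair into one of 4096 buckets and look up a learned 128-dim embedding. The hash function and initialization below are assumptions:

```python
import random

random.seed(0)  # deterministic init for the sketch
BUCKETS, DIM = 4096, 128  # from the record

table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    # simple multiplicative hash of the ordered pair (assumed, not specified)
    return (prev_id * 1000003 + cur_id) % buckets

def bigram_embedding(prev_id, cur_id):
    # looked-up vector would typically be added to the token's hidden state
    return table[bigram_bucket(prev_id, cur_id)]
```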
SmearGate
SmearGate component used in the architecture.
parameters: null
VE128
Value residual enhancement on later layers.
parameters: {"layers":[9,10]}
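Value residual typically blends a late layer's attention values with the first layer's values. A sketch under that assumption (the record only fixes which layers participate; the scalar blend weight is hypothetical):

```python
LATE_LAYERS = [9, 10]  # layers receiving the value residual, per the record

def value_residual(v_layer, v_first, lam=0.5):
    # blend this layer's value vector with layer 0's; lam would be learned
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]
```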
weight tying
Tied input and output embeddings.
parameters: null
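Weight tying means the input embedding matrix is reused as the output projection, so one `vocab x dim` table serves both roles. A minimal sketch:

```python
import random

random.seed(0)
VOCAB, DIM = 8, 4  # toy sizes for illustration

emb = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token_id):
    return emb[token_id]

def logits(hidden):
    # the output head reuses the same rows as the input embedding
    return [sum(w * h for w, h in zip(row, hidden)) for row in emb]
```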
Regularization
logit softcap
parameters: {"value":30}
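Logit softcapping squashes logits smoothly into a bounded range with a scaled tanh; the record's cap is 30:

```python
import math

def softcap(logits, cap=30.0):
    # near-identity for small logits, saturates at +/- cap for large ones
    return [cap * math.tanh(z / cap) for z in logits]
```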
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997}
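The EMA half is standard: an exponential moving average of the weights with decay 0.997. "Tight SWA" is not specified; the sketch below assumes it means a plain average over the last few checkpoints:

```python
def ema_update(avg, params, decay=0.997):
    # per-step exponential moving average of the weights (decay from the record)
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def swa_average(checkpoints):
    # uniform average over a tight window of late checkpoints (assumed reading)
    n = len(checkpoints)
    return [sum(c[i] for c in checkpoints) / n for i in range(len(checkpoints[0]))]
```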
Quantization
int5
bits: 5
scope: per-row
QAT
bits: null
scope: all
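Per-row int5 quantization stores each weight row as signed 5-bit integers with one scale per row; QAT would then apply the same quantize-dequantize in the forward pass during training, with gradients passed straight through. A sketch (the symmetric [-15, 15] range is an assumption):

```python
def quantize_row_int5(row):
    qmax = 15  # signed 5-bit symmetric range [-15, 15] (assumed layout)
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

def fake_quantize_row(row):
    # QAT forward pass: quantize-dequantize; the backward pass would treat
    # this as identity (straight-through estimator)
    q, s = quantize_row_int5(row)
    return dequantize_row(q, s)
```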
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
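Sliding-window evaluation advances the context window by `stride` tokens and scores only the final `stride` tokens of each window, so every scored token sees near-full left context. The record fixes stride=32; the window size below is a placeholder:

```python
def sliding_window_spans(n_tokens, window=256, stride=32):
    """Yield (ctx_start, score_start, end) triples: the model conditions on
    tokens[ctx_start:end] but only tokens[score_start:end] count toward bpb."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, 0, end))  # first window scores everything it covers
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```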
Other
other
Order-adaptive entropy-gated multi-order n-gram backoff cache with per-order thresholds and distributed cache prefill during evaluation.
parameters: {"orders":"2-9","hash_buckets_per_order":4000000,"min_count":2,"alpha_range":[0.05,0.6]}
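One plausible reading of this mechanism, sketched with the record's parameters (orders 2-9, 4M hash buckets per order, min_count 2, alpha in [0.05, 0.6]): back off from the 9-gram table to the 2-gram table, taking the highest order whose hashed context is well-observed, then blend with the neural distribution using a weight gated by the n-gram distribution's entropy. The hash layout and the exact gating rule are assumptions:

```python
import math
from collections import defaultdict

ORDERS = range(2, 10)        # 2-gram .. 9-gram (record: orders "2-9")
BUCKETS = 4_000_000          # hash_buckets_per_order
MIN_COUNT = 2
ALPHA_LO, ALPHA_HI = 0.05, 0.6

def ctx_bucket(ctx):
    # int-tuple hashing is deterministic; the real bucket layout is an assumption
    return hash(ctx) % BUCKETS

class BackoffCache:
    def __init__(self):
        # per order: bucket -> {next_token: count}
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

    def add(self, tokens):
        for n in ORDERS:
            for i in range(len(tokens) - n + 1):
                ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                self.tables[n][ctx_bucket(ctx)][nxt] += 1

    def predict(self, context):
        # back off from the highest order whose hashed context has
        # at least MIN_COUNT observations
        for n in sorted(ORDERS, reverse=True):
            if len(context) < n - 1:
                continue
            counts = self.tables[n][ctx_bucket(tuple(context[-(n - 1):]))]
            total = sum(counts.values())
            if total >= MIN_COUNT:
                return {t: c / total for t, c in counts.items()}, n
        return None, 0

def blend(neural_probs, ngram_probs):
    # entropy gate (assumed form): sharper n-gram distribution -> alpha near 0.6,
    # near-uniform n-gram distribution -> alpha near 0.05
    ent = -sum(p * math.log(p) for p in ngram_probs.values() if p > 0)
    max_ent = math.log(max(len(ngram_probs), 2))
    alpha = ALPHA_HI - (ALPHA_HI - ALPHA_LO) * min(ent / max_ent, 1.0)
    return [(1 - alpha) * p + alpha * ngram_probs.get(t, 0.0)
            for t, p in enumerate(neural_probs)]
```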

Novel Contributions

  • Order-adaptive entropy-gated 9-gram backoff cache
  • Per-order entropy thresholds for blending neural and n-gram predictions
  • Distributed cache prefill to avoid cold-start caches across ranks
  • Score-first backward-looking evaluation cache
  • Multi-order backoff from 2-gram through 9-gram
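The distributed-prefill idea can be sketched as rank-local counting followed by a global merge, so no rank begins evaluation with a cold cache. The sketch below simulates the cross-rank merge in-process (bigrams only for brevity; the record's cache spans orders 2-9, and a real run would use a collective such as all-gather):

```python
from collections import Counter

def rank_local_counts(shard, order=2):
    # each rank counts n-grams over its own data shard
    return Counter(tuple(shard[i:i + order]) for i in range(len(shard) - order + 1))

def distributed_prefill(shards):
    # stand-in for the cross-rank exchange: merge every rank's counts so
    # all ranks share one warm cache before evaluation starts
    merged = Counter()
    for counts in map(rank_local_counts, shards):
        merged += counts
    return merged
```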