PR #797

open

Record: 7-gram N-gram Cache (0.8960 bpb)

by armantsaturian
val_bpb
0.8960
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.92 MB

Training Techniques

Architecture
XSA
Extended XSA to all layers instead of only the last few layers.
parameters: {"layers":11}
MLP3x
Uses a 3x MLP with LeakyReLU(0.5)^2 activation.
parameters: null
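The MLP activation listed above (LeakyReLU with negative slope 0.5, then squared) can be sketched in NumPy; the function shape comes from the record, while the tensor library and the exact weight layout are assumptions:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared, per the record card
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w_in, w_out):
    # "3x" MLP: hidden width is 3x the model dim (w_in: d -> 3d, w_out: 3d -> d)
    return leaky_relu_sq(x @ w_in) @ w_out
```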
BigramHash
Includes a BigramHash component in the model.
parameters: {"size":2048}
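The record gives only the table size (2048) for the BigramHash component. A common realization (a sketch, not the author's confirmed scheme) is a hashed embedding table indexed by the (previous token, current token) pair; the multiplier below is a hypothetical mixing constant:

```python
def bigram_hash_index(prev_tok, cur_tok, size=2048):
    # hypothetical hash; the record specifies only the table size (2048)
    return (prev_tok * 0x9E3779B1 + cur_tok) % size
```

The looked-up vector would typically be added to the token embedding at each position; that placement is an assumption here.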
RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
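Partial rotary embeddings rotate only a slice of each head's dimensions, here 16 of 64 per the record; the remaining 48 pass through untouched. A minimal sketch (the half-split pairing convention is an assumption):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims of the head dim (16 of 64 per the record),
    # leaving the rest of the vector unchanged. x: (..., head_dim)
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[..., rot_dims:]], axis=-1)
```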
tied embeddings
Uses tied FP16 embeddings with softcap.
parameters: {"softcap":30}
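Tying reuses the input embedding matrix as the output head, and the softcap bounds the resulting logits. A sketch with the listed cap of 30; the tanh form of the softcap is a common choice, assumed here:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # smoothly bounds logits to (-cap, cap); cap=30 per the record
    return cap * np.tanh(logits / cap)

def tied_head(x, emb):
    # output head reuses the embedding matrix (stored in FP16 per the record)
    return softcap(x @ emb.T)
```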
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":"every 50 steps"}
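The two averaging schemes above can be combined; a dict-of-arrays sketch using the listed EMA decay (0.997) and SWA snapshot frequency (every 50 steps):

```python
def ema_update(avg, params, decay=0.997):
    # exponential moving average of weights; decay=0.997 per the record
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

class SWA:
    # stochastic weight averaging: running mean of snapshots every 50 steps
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_snapshot(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            self.avg = {k: v + (params[k] - v) / self.n for k, v in self.avg.items()}
```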
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
lzma
level: null
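Quantized weights compress well, which is how the ~15.92 MB artifact size is reached. The record doesn't spell out the GPTQ-lite procedure, so the sketch below uses plain round-to-nearest 6-bit quantization as a stand-in (GPTQ proper would additionally correct quantization error using second-order statistics), followed by stdlib lzma:

```python
import lzma
import numpy as np

def quantize_rtn(w, bits=6):
    # symmetric round-to-nearest quantization; a stand-in for GPTQ-lite
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = quantize_rtn(w)                           # bits=6, scope=all per the record
blob = lzma.compress(q.tobytes(), preset=9)          # lzma level unspecified in the record
```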
Evaluation
sliding window eval
parameters: {"stride":64}
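Sliding-window evaluation re-runs the model on overlapping windows, scoring only the new tokens of each window so every position gets (near-)full left context. A sketch of the span bookkeeping with the listed stride of 64; the window length is an assumption:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # Returns (ctx_start, ctx_end, score_start) triples: tokens in
    # [score_start, ctx_end) are scored with context [ctx_start, ctx_end),
    # so each token is scored exactly once.
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, 0))
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans
```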
7-gram n-gram cache
parameters: {"orders":"2-7","backoff_beta":0.000001,"alpha":0.2,"score_before_update":true}
Test-Time Training
disabled
parameters: null
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
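The listed formula scales each layer's normalization output by 1/sqrt(layer+1), so deeper layers get progressively smaller gain; with 0-indexed layers (an assumption) that is:

```python
import math

def ln_scale(layer_idx):
    # 1/sqrt(layer+1) per the record; layer_idx assumed 0-indexed
    return 1.0 / math.sqrt(layer_idx + 1)
```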
Other
other
LeakyReLU(0.5)^2 activation in the MLP.
parameters: null
other
Streaming single-pass cache built during evaluation with recursive backoff and fixed cache/neural blending.
parameters: {"cache_neural_mix":"80/20","near_zero_backoff_beta":0.000001}
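The cache described above can be sketched as follows: streaming counts for orders 2-7 built during evaluation, recursive backoff interpolation in which the near-zero beta (1e-6) makes the longest matched order dominate, score-before-update ordering, and a fixed 80/20 cache-to-neural mix. The exact backoff formula is an assumption, and alpha=0.2 is read here as the neural share of the blend:

```python
from collections import defaultdict

class StreamingNgramCache:
    def __init__(self, orders=range(2, 8), beta=1e-6, vocab_size=256):
        self.orders = list(orders)      # n-gram orders 2..7 per the record
        self.beta = beta                # near-zero backoff weight
        self.vocab_size = vocab_size
        # counts[n][context_tuple][token] -> count, filled while evaluating
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def prob(self, history, token):
        p = 1.0 / self.vocab_size       # order-0 fallback: uniform
        for n in self.orders:           # low to high: longest order applied last
            if len(history) < n - 1:
                break
            dist = self.counts[n].get(tuple(history[-(n - 1):]))
            if dist:
                total = sum(dist.values())
                # interpolate with the lower-order estimate; with beta ~ 0 the
                # observed counts at the longest matching order dominate
                p = (dist.get(token, 0) + self.beta * p) / (total + self.beta)
        return p

    def update(self, history, token):
        for n in self.orders:
            if len(history) >= n - 1:
                self.counts[n][tuple(history[-(n - 1):])][token] += 1

def blended_prob(cache, history, token, neural_prob, cache_weight=0.8):
    # score BEFORE update: the token being scored never counts toward itself
    p = cache_weight * cache.prob(history, token) + (1 - cache_weight) * neural_prob
    cache.update(history, token)
    return p
```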

Novel Contributions

  • Streaming single-pass 7-gram n-gram cache applied during evaluation
  • Near-zero backoff beta so longest n-gram match dominates
  • Fixed 80/20 cache-to-neural blending
  • Extended XSA to all 11 layers
  • LeakyReLU(0.5)^2 MLP and Parallel Muon base model
  • Score-before-update cache protocol that preserves evaluation legality (the cache never sees the token it is currently scoring)