PR #890

open

Record: Order-Adaptive 9-gram Backoff + Distributed Prefill — val_bpb 0.4405 (3-seed mean)

by sofiabod
val_bpb
0.4405
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,101,371 bytes

Training Techniques

Architecture
XSA
XSA applied across all layers of the transformer.
parameters: {"layers":11}
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
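The record does not spell out the squared-LeakyReLU form; a minimal sketch, assuming it squares the LeakyReLU output elementwise (analogous to the ReLU² activation):

```python
def leaky_relu_squared(x, slope=0.5):
    # slope=0.5 from the record; squaring the output is an assumption --
    # a sign-preserving variant y * abs(y) is another plausible reading.
    y = x if x > 0.0 else slope * x
    return y * y
```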
Partial RoPE
Partial rotary position embeddings.
parameters: {"dimensions":"16/64"}
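Partial RoPE rotates only a prefix of each head's dimensions and leaves the rest position-independent; per the record, 16 of 64 dims. A pure-Python sketch (the base frequency of 10000 is an assumption):

```python
import math

def apply_partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims of a head vector x (list of floats) by
    position pos; dims beyond rot_dims are left untouched."""
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```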
BigramHash
Bigram hash module used in the architecture.
parameters: {"buckets":4096,"dimensions":128}
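A plausible reading of the bigram hash module: hash each (previous token, current token) pair into one of 4096 buckets and look up a learned 128-dim embedding. The hash function and initialization below are assumptions:

```python
import random

random.seed(0)  # deterministic init for the sketch
BUCKETS, DIM = 4096, 128  # from the record

table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    # simple multiplicative hash of the ordered pair (assumed, not specified)
    return (prev_id * 1000003 + cur_id) % buckets

def bigram_embedding(prev_id, cur_id):
    # looked-up vector would typically be added to the token's hidden state
    return table[bigram_bucket(prev_id, cur_id)]
```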
SmearGate
SmearGate component used in the architecture.
parameters: null
VE128
Value residual enhancement on later layers.
parameters: {"layers":[9,10]}
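Value residual typically blends a late layer's attention values with the first layer's values. A sketch under that assumption (the record only fixes which layers participate; the scalar blend weight is hypothetical):

```python
LATE_LAYERS = [9, 10]  # layers receiving the value residual, per the record

def value_residual(v_layer, v_first, lam=0.5):
    # blend this layer's value vector with layer 0's; lam would be learned
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]
```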
weight tying
Tied input and output embeddings.
parameters: null
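Weight tying means the input embedding matrix is reused as the output projection, so one `vocab x dim` table serves both roles. A minimal sketch:

```python
import random

random.seed(0)
VOCAB, DIM = 8, 4  # toy sizes for illustration

emb = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token_id):
    return emb[token_id]

def logits(hidden):
    # the output head reuses the same rows as the input embedding
    return [sum(w * h for w, h in zip(row, hidden)) for row in emb]
```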
Regularization
logit softcap
parameters: {"value":30}
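Logit softcapping squashes logits smoothly into a bounded range with a scaled tanh; the record's cap is 30:

```python
import math

def softcap(logits, cap=30.0):
    # near-identity for small logits, saturates at +/- cap for large ones
    return [cap * math.tanh(z / cap) for z in logits]
```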
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997}
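The EMA half is standard: an exponential moving average of the weights with decay 0.997. "Tight SWA" is not specified; the sketch below assumes it means a plain average over the last few checkpoints:

```python
def ema_update(avg, params, decay=0.997):
    # per-step exponential moving average of the weights (decay from the record)
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def swa_average(checkpoints):
    # uniform average over a tight window of late checkpoints (assumed reading)
    n = len(checkpoints)
    return [sum(c[i] for c in checkpoints) / n for i in range(len(checkpoints[0]))]
```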
Quantization
int5
bits: 5
scope: per-row
QAT
bits: null
scope: all
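Per-row int5 quantization stores each weight row as signed 5-bit integers with one scale per row; QAT would then apply the same quantize-dequantize in the forward pass during training, with gradients passed straight through. A sketch (the symmetric [-15, 15] range is an assumption):

```python
def quantize_row_int5(row):
    qmax = 15  # signed 5-bit symmetric range [-15, 15] (assumed layout)
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

def fake_quantize_row(row):
    # QAT forward pass: quantize-dequantize; the backward pass would treat
    # this as identity (straight-through estimator)
    q, s = quantize_row_int5(row)
    return dequantize_row(q, s)
```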
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
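Sliding-window evaluation advances the context window by `stride` tokens and scores only the final `stride` tokens of each window, so every scored token sees near-full left context. The record fixes stride=32; the window size below is a placeholder:

```python
def sliding_window_spans(n_tokens, window=256, stride=32):
    """Yield (ctx_start, score_start, end) triples: the model conditions on
    tokens[ctx_start:end] but only tokens[score_start:end] count toward bpb."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, 0, end))  # first window scores everything it covers
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```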
Other
other
Order-adaptive entropy-gated multi-order n-gram backoff cache with per-order thresholds and distributed cache prefill during evaluation.
parameters: {"orders":"2-9","hash_buckets_per_order":4000000,"min_count":2,"alpha_range":[0.05,0.6]}
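One plausible reading of this mechanism, sketched with the record's parameters (orders 2-9, 4M hash buckets per order, min_count 2, alpha in [0.05, 0.6]): back off from the 9-gram table to the 2-gram table, taking the highest order whose hashed context is well-observed, then blend with the neural distribution using a weight gated by the n-gram distribution's entropy. The hash layout and the exact gating rule are assumptions:

```python
import math
from collections import defaultdict

ORDERS = range(2, 10)        # 2-gram .. 9-gram (record: orders "2-9")
BUCKETS = 4_000_000          # hash_buckets_per_order
MIN_COUNT = 2
ALPHA_LO, ALPHA_HI = 0.05, 0.6

def ctx_bucket(ctx):
    # int-tuple hashing is deterministic; the real bucket layout is an assumption
    return hash(ctx) % BUCKETS

class BackoffCache:
    def __init__(self):
        # per order: bucket -> {next_token: count}
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

    def add(self, tokens):
        for n in ORDERS:
            for i in range(len(tokens) - n + 1):
                ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                self.tables[n][ctx_bucket(ctx)][nxt] += 1

    def predict(self, context):
        # back off from the highest order whose hashed context has
        # at least MIN_COUNT observations
        for n in sorted(ORDERS, reverse=True):
            if len(context) < n - 1:
                continue
            counts = self.tables[n][ctx_bucket(tuple(context[-(n - 1):]))]
            total = sum(counts.values())
            if total >= MIN_COUNT:
                return {t: c / total for t, c in counts.items()}, n
        return None, 0

def blend(neural_probs, ngram_probs):
    # entropy gate (assumed form): sharper n-gram distribution -> alpha near 0.6,
    # near-uniform n-gram distribution -> alpha near 0.05
    ent = -sum(p * math.log(p) for p in ngram_probs.values() if p > 0)
    max_ent = math.log(max(len(ngram_probs), 2))
    alpha = ALPHA_HI - (ALPHA_HI - ALPHA_LO) * min(ent / max_ent, 1.0)
    return [(1 - alpha) * p + alpha * ngram_probs.get(t, 0.0)
            for t, p in enumerate(neural_probs)]
```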

Novel Contributions

  • Order-adaptive entropy-gated 9-gram backoff cache
  • Per-order entropy thresholds for blending neural and n-gram predictions
  • Distributed cache prefill to avoid cold-start caches across ranks
  • Score-first backward-looking evaluation cache
  • Multi-order backoff from 2-gram through 9-gram
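The distributed-prefill idea can be sketched as rank-local counting followed by a global merge, so no rank begins evaluation with a cold cache. The sketch below simulates the cross-rank merge in-process (bigrams only for brevity; the record's cache spans orders 2-9, and a real run would use a collective such as all-gather):

```python
from collections import Counter

def rank_local_counts(shard, order=2):
    # each rank counts n-grams over its own data shard
    return Counter(tuple(shard[i:i + order]) for i in range(len(shard) - order + 1))

def distributed_prefill(shards):
    # stand-in for the cross-rank exchange: merge every rank's counts so
    # all ranks share one warm cache before evaluation starts
    merged = Counter()
    for counts in map(rank_local_counts, shards):
        merged += counts
    return merged
```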