PR #770

open

Record: 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.6672)

by minh-stakc
val_bpb
0.6672
Architecture
11L Transformer
Optimizer
Artifact Size
15.0 MB

Training Techniques

Architecture
XSA
Uses XSA in the last 4 layers of the 11-layer model.
parameters: {"layers":4}
Partial RoPE
Applies partial rotary positional embeddings with a 16/64 split.
parameters: {"train_length":null,"eval_length":null}
MLP3x
Uses a 3x MLP expansion.
parameters: null
SmearGate
Includes SmearGate as part of the architecture.
parameters: null
BigramHash
Adds BigramHash with 2048 buckets.
parameters: {"buckets":2048}
Weight Averaging
EMA (exponential moving average of weights)
parameters: {"decay":0.997}
Initialization
OrthoInit
Quantization
int6
bits: 6
scope: per-row
GPTQ-lite
bits: null
scope: all
Regularization
layerwise LN scale
parameters: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Multi-order n-gram backoff cache interpolation during evaluation, using orders 2 through 7 with highest-order-first cascading on miss.
parameters: {"min_order":2,"max_order":7}
other
Entropy-adaptive interpolation weight alpha, computed from the model's predictive entropy, for blending LM and n-gram cache predictions.
parameters: {"formula":"alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
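The alpha formula above can be sketched directly. Only the formula itself comes from the PR; the `blend` helper and the list-of-probabilities representation are illustrative assumptions:

```python
import math

def entropy_adaptive_alpha(H: float) -> float:
    """Cache interpolation weight, as given in the PR:
    alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)).
    Low model entropy (confident LM) keeps alpha near 0.05;
    high entropy raises it toward 0.60, leaning on the cache."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def blend(p_model, p_cache, H):
    """One blended distribution per token (no min-NLL selection).
    p_model / p_cache are assumed to be probability lists over the vocab."""
    a = entropy_adaptive_alpha(H)
    return [(1.0 - a) * pm + a * pc for pm, pc in zip(p_model, p_cache)]
```

At the pivot entropy H = 4.0 nats the sigmoid is 0.5, so alpha = 0.325; the weight is bounded in (0.05, 0.60) regardless of entropy.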

Novel Contributions

  • Multi-order n-gram backoff cache interpolation (orders 2-7)
  • Entropy-adaptive alpha for blending neural and n-gram predictions
  • Score-first, backward-looking n-gram cache built only from previously scored tokens
  • Single blended prediction per token without min(NLL) selection
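The cache contributions above can be sketched as follows. The class name and count layout are assumptions, but the behavior matches the PR description: orders 2 through 7, highest-order-first cascading on miss, and a backward-looking cache built only from tokens that have already been scored:

```python
from collections import defaultdict

class BackoffNgramCache:
    """Sketch of the score-first, backward-looking n-gram cache.
    predict() is called for the current token *before* update(), so the
    cache never contains the token it is asked to predict."""

    def __init__(self, min_order=2, max_order=7):
        self.min_order, self.max_order = min_order, max_order
        # counts[n][context_tuple] -> {next_token: count}
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(min_order, max_order + 1)}

    def predict(self, history):
        # Highest-order-first cascade: try order 7, back off on miss.
        for n in range(self.max_order, self.min_order - 1, -1):
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            if ctx in self.counts[n]:
                dist = self.counts[n][ctx]
                total = sum(dist.values())
                return {tok: c / total for tok, c in dist.items()}
        return None  # no order matched; fall back to the LM alone

    def update(self, history, token):
        # Record the just-scored token under every order's context.
        for n in range(self.min_order, self.max_order + 1):
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                self.counts[n][ctx][token] += 1
```

A miss at every order returns `None`, in which case the blending weight effectively collapses to the pure LM prediction for that token.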