PR #770

open

Record: 11L + Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.6672)

by minh-stakc
val_bpb
0.6672
Architecture
11L Transformer
Optimizer
Artifact Size
15.0 MB

Training Techniques

Architecture
XSA
Uses XSA in the last 4 layers of the 11-layer model.
parameters: {"layers":4}
Partial RoPE
Applies partial rotary positional embeddings with a 16/64 split.
parameters: {"train_length":null,"eval_length":null}
MLP3x
Uses a 3x MLP expansion.
parameters: null
SmearGate
Includes SmearGate as part of the architecture.
parameters: null
BigramHash
Adds BigramHash with 2048 buckets.
parameters: {"buckets":2048}
Weight Averaging
EMA (exponential moving average of weights)
parameters: {"decay":0.997}
Initialization
OrthoInit
Quantization
int6
bits: 6
scope: per-row
GPTQ-lite
bits: null
scope: all
Regularization
layerwise LN scale
parameters: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Multi-order n-gram backoff cache interpolation during evaluation, using orders 2 through 7 with highest-order-first cascading on miss.
parameters: {"min_order":2,"max_order":7}
other
Entropy-adaptive interpolation weight alpha, computed from the model's predictive entropy, for blending LM and n-gram cache predictions.
parameters: {"formula":"alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
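The alpha formula above can be sketched directly. Only the formula itself comes from the PR; the `blend` helper and the list-of-probabilities representation are illustrative assumptions:

```python
import math

def entropy_adaptive_alpha(H: float) -> float:
    """Cache interpolation weight, as given in the PR:
    alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)).
    Low model entropy (confident LM) keeps alpha near 0.05;
    high entropy raises it toward 0.60, leaning on the cache."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def blend(p_model, p_cache, H):
    """One blended distribution per token (no min-NLL selection).
    p_model / p_cache are assumed to be probability lists over the vocab."""
    a = entropy_adaptive_alpha(H)
    return [(1.0 - a) * pm + a * pc for pm, pc in zip(p_model, p_cache)]
```

At the pivot entropy H = 4.0 nats the sigmoid is 0.5, so alpha = 0.325; the weight is bounded in (0.05, 0.60) regardless of entropy.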

Novel Contributions

  • Multi-order n-gram backoff cache interpolation (orders 2-7)
  • Entropy-adaptive alpha for blending neural and n-gram predictions
  • Score-first, backward-looking n-gram cache built only from previously scored tokens
  • Single blended prediction per token without min(NLL) selection
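The cache contributions above can be sketched as follows. The class name and count layout are assumptions, but the behavior matches the PR description: orders 2 through 7, highest-order-first cascading on miss, and a backward-looking cache built only from tokens that have already been scored:

```python
from collections import defaultdict

class BackoffNgramCache:
    """Sketch of the score-first, backward-looking n-gram cache.
    predict() is called for the current token *before* update(), so the
    cache never contains the token it is asked to predict."""

    def __init__(self, min_order=2, max_order=7):
        self.min_order, self.max_order = min_order, max_order
        # counts[n][context_tuple] -> {next_token: count}
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(min_order, max_order + 1)}

    def predict(self, history):
        # Highest-order-first cascade: try order 7, back off on miss.
        for n in range(self.max_order, self.min_order - 1, -1):
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            if ctx in self.counts[n]:
                dist = self.counts[n][ctx]
                total = sum(dist.values())
                return {tok: c / total for tok, c in dist.items()}
        return None  # no order matched; fall back to the LM alone

    def update(self, history, token):
        # Record the just-scored token under every order's context.
        for n in range(self.min_order, self.max_order + 1):
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                self.counts[n][ctx][token] += 1
```

A miss at every order returns `None`, in which case the blending weight effectively collapses to the pure LM prediction for that token.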