PR #889

open

Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)

by anthony-maio

val_bpb

0.9642

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.95 MB

Training Techniques

Architecture

LeakyReLU

Uses squared LeakyReLU activation in the MLP.

parameters: {"power":2,"slope":0.5}
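As a rough illustration of the activation, the sketch below applies LeakyReLU with slope 0.5 and then squares the result (power 2). It assumes the square is taken on the LeakyReLU output directly, so negative inputs map to small positive values; the PR summary does not say whether the sign is preserved. The function name is hypothetical.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0.0 else slope * x  # LeakyReLU
    return y * y                      # power = 2


# Positive inputs are squared as-is; negative inputs are scaled by 0.5 first.
values = [leaky_relu_squared(v) for v in (3.0, -2.0, 0.0)]
```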

GQA

Grouped query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
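With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A minimal sketch of that mapping follows, assuming the usual contiguous grouping; the function name is hypothetical and the PR's actual layout is not shown in this summary.

```python
def kv_head_for_query_head(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index that a given query head attends with."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
```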

VRL

Value Residual Learning module.

parameters: null

VE128

Value embeddings with dimension 128.

parameters: {"dimensions":128}

BigramHash

Bigram hash feature with 2048 buckets.

parameters: {"dimensions":2048}
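A bigram hash feature maps each (previous token, current token) pair to one of a fixed number of buckets, which can then index an embedding table. The sketch below shows one way to compute the bucket; the multiplier and the function name are assumptions for illustration, not the hash used in the PR.

```python
def bigram_bucket(prev_token: int, token: int, num_buckets: int = 2048) -> int:
    """Hash a (prev_token, token) pair into one of num_buckets buckets."""
    # Hypothetical multiplicative hash; any well-mixing function would do.
    return (prev_token * 1_000_003 + token) % num_buckets


bucket = bigram_bucket(17, 42)  # always in the range [0, 2048)
```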

XSA

XSA4 attention/sequence module.

parameters: {"variant":4}

Partial RoPE

Partial rotary positional embedding applied to a subset of dimensions.

parameters: {"train":16,"eval":64}

SmearGate

SmearGate gating mechanism.

parameters: null

U-Net skip connections

U-Net style skip connections in the network.

parameters: null

Weight Averaging

EMA + Tight SWA

parameters: {"decay":0.997}
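The EMA part of the weight averaging can be sketched as below, assuming the standard exponential moving average update with the listed decay of 0.997. The "Tight SWA" part is not described in this summary and is not shown. Names are hypothetical.

```python
def ema_update(ema: list[float], weights: list[float], decay: float = 0.997) -> list[float]:
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [1.0, 0.0]
ema = ema_update(ema, [0.0, 1.0])  # moves 0.3% of the way toward the new weights
```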

Quantization

GPTQ-lite

bits: 6

scope: model
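For a sense of what 6-bit storage implies, the sketch below does plain symmetric round-to-nearest quantization of one weight row. This is an assumption for illustration only: it omits the error compensation that GPTQ-style methods apply, and the summary does not describe what "GPTQ-lite" does in detail.

```python
def quantize_6bit(row: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one row to signed 6-bit codes."""
    qmax = 2 ** (6 - 1) - 1  # 31; signed 6-bit codes span [-32, 31]
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [1.0, -0.5, 0.0, 0.25]
codes, scale = quantize_6bit(row)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`).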

Compression

lzma

level: null
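Since no compression level is listed, a sketch using Python's standard `lzma` module with its default preset looks like this. The payload here is placeholder bytes, not the real model artifact, and the function name is hypothetical.

```python
import lzma


def compress_artifact(payload: bytes) -> bytes:
    """Compress serialized weights with LZMA at the library's default preset."""
    return lzma.compress(payload)


payload = b"\x00\x01\x02\x03" * 10_000   # placeholder for quantized weights
packed = compress_artifact(payload)
unpacked = lzma.decompress(packed)       # round-trips losslessly
```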

Optimizer

Muon

weight_decay: 0.04

momentum: null

other_params: null

Initialization

OrthoInit

Orthogonal initialization.
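One way to build an orthogonal initial weight matrix is to orthonormalize random Gaussian rows. The pure-Python Gram-Schmidt sketch below is meant only to show the property; a real training script would more likely use a framework routine, and the function name and seed are assumptions.

```python
import math
import random


def orthogonal_rows(n_rows: int, n_cols: int, seed: int = 0) -> list[list[float]]:
    """Return n_rows mutually orthonormal vectors of length n_cols (n_rows <= n_cols)."""
    if n_rows > n_cols:
        raise ValueError("need n_rows <= n_cols")
    rng = random.Random(seed)
    basis: list[list[float]] = []
    for _ in range(n_rows):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for b in basis:  # remove the component along each earlier row
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis


w = orthogonal_rows(3, 5)
```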

Regularization

LN scale

parameters: null

Evaluation

sliding window eval

parameters: null
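Sliding window evaluation typically scores each token once while giving it as much left context as the window allows. The sketch below generates such windows; the window and stride values are hypothetical, since the summary lists no parameters.

```python
from typing import Iterator


def sliding_windows(n_tokens: int, window: int, stride: int) -> Iterator[tuple[int, int, int]]:
    """Yield (ctx_start, score_start, end) triples.

    Tokens in [score_start, end) are scored; tokens in [ctx_start, score_start)
    serve only as context. Assumes stride <= window.
    """
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        yield ctx_start, score_start, end
        score_start = end


windows = list(sliding_windows(n_tokens=10, window=4, stride=2))
```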

Other

other

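An entropy-adaptive n-gram backoff cache, built causally from tokens that have already been scored. Neural and n-gram probabilities are mixed with a weight that depends on the model's entropy, and each token is scored before the cache is updated with it.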

parameters: {"orders":"2-7gram","alpha_formula":"0.05 + 0.55 * sigmoid(2*(H-4))","min_count":2,"hash_buckets_per_order":4000000}

Novel Contributions

Entropy-adaptive multi-order n-gram backoff cache
Score-first causal n-gram table updates during evaluation
Linear interpolation of neural and n-gram probabilities based on model entropy
Record result: 0.9642 mean val_bpb across 3 seeds
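The score-first idea in the list above can be sketched as a small cache that is queried for a token before that token is added to it, so every prediction uses only earlier tokens. The code below is an illustration under assumptions: the context hash is hypothetical, `min_count` is interpreted as the minimum number of observations of a context before its order is trusted, and hash collisions are simply tolerated.

```python
from collections import defaultdict
from typing import Optional


class NGramBackoffCache:
    def __init__(self, min_order: int = 2, max_order: int = 7,
                 min_count: int = 2, buckets: int = 4_000_000) -> None:
        self.orders = list(range(max_order, min_order - 1, -1))  # longest first
        self.min_count = min_count
        self.buckets = buckets
        # One table per order: context bucket -> {next token -> count}
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _bucket(self, context: list[int]) -> int:
        h = 0
        for t in context:  # hypothetical rolling hash
            h = (h * 1_000_003 + t + 1) % self.buckets
        return h

    def prob(self, history: list[int], token: int) -> Optional[float]:
        """Back off from the longest order; None if no order has enough counts."""
        for n in self.orders:
            k = n - 1  # context length for an n-gram
            if len(history) < k:
                continue
            counts = self.tables[n].get(self._bucket(history[-k:]))
            if counts is None:
                continue
            total = sum(counts.values())
            if total >= self.min_count:
                return counts.get(token, 0) / total
        return None

    def update(self, history: list[int], token: int) -> None:
        for n in self.orders:
            k = n - 1
            if len(history) >= k:
                self.tables[n][self._bucket(history[-k:])][token] += 1


def score_stream(tokens: list[int], cache: NGramBackoffCache) -> list[Optional[float]]:
    history: list[int] = []
    scores: list[Optional[float]] = []
    for tok in tokens:
        scores.append(cache.prob(history, tok))  # score first, past tokens only
        cache.update(history, tok)               # then add the token to the cache
        history.append(tok)
    return scores


scores = score_stream([1, 2, 1, 2, 1, 2], NGramBackoffCache())
```

On this toy stream the cache returns no estimate until the bigram context `1` has been seen twice, after which it predicts `2` from the two earlier occurrences.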