PR #727

open

Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed)

by Asukabot0
val_bpb
0.9674
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Architecture
11L Transformer
11-layer Transformer with 512-dimensional hidden size, grouped-query attention (8 query heads, 4 KV heads), 3× MLP expansion, and several custom architectural components.
parameters: {"layers":11,"d_model":512,"gqa_heads":8,"kv_heads":4,"mlp_multiplier":3}
Squared LeakyReLU
Squared LeakyReLU activation with negative slope 0.5.
parameters: {"negative_slope":0.5}
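The activation can be sketched as follows. The PR does not spell out the exact form of "squared LeakyReLU"; this assumes an elementwise square applied after a LeakyReLU with negative slope 0.5:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU followed by an elementwise square (one plausible reading
    # of "Squared LeakyReLU"; the PR only states negative_slope=0.5).
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```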
XSA
Cross-sequence attention across all layers.
parameters: {"last_n":11}
Value Residual
Adds value residual connections.
parameters: null
Gated Attention
Uses gated attention mechanism.
parameters: null
SmearGate
Uses SmearGate embedding/attention component.
parameters: null
BigramHash
Bigram hashing feature with 4096 buckets.
parameters: {"buckets":4096}
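A minimal sketch of bucketed bigram hashing, assuming the previous and current token IDs are hashed jointly into one of 4096 buckets (the PR states only the bucket count, not the hash function):

```python
def bigram_bucket(prev_tok: int, tok: int, buckets: int = 4096) -> int:
    # Hypothetical hashing scheme: hash the (prev, current) token pair
    # and fold it into the stated 4096 buckets.
    return hash((prev_tok, tok)) % buckets
```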
Partial RoPE
Applies rotary positional embeddings to 16 of the 64 head dimensions, leaving the rest unrotated.
parameters: {"dimensions":"16/64"}
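A minimal sketch of partial RoPE under the stated 16/64 split: rotate only the first 16 head dimensions and pass the remaining 48 through unchanged (the choice of which dimensions to rotate, and the 10000 base frequency, are assumptions):

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16):
    # x: (..., head_dim). Rotate only the first `rope_dims` dimensions
    # (16 of 64 per the PR's "16/64"); the rest are left untouched.
    half = rope_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # assumed base
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```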
tied embeddings
Input and output embeddings are tied.
parameters: null
Regularization
LN Scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
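The EMA update with the stated decay of 0.997 can be sketched as:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights: keep 99.7% of the running
    # average and mix in 0.3% of the current parameters each step.
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in params}
```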
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
Quantization
int6
bits: 6
scope: per-row weights
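A sketch of symmetric per-row int6 quantization, assuming a signed range of [-31, 31] and one scale per weight-matrix row (the PR states only "int6, per-row weights"):

```python
import numpy as np

def quantize_int6_per_row(w):
    # One scale per row, mapping each row's max magnitude to 31 (6-bit signed).
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```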
Compression
zstd
level: 16
Evaluation
n-gram eval cache
parameters: {"orders":"2-7","backoff":true,"entropy_adaptive_alpha":true}
Other
other
Entropy-adaptive interpolation between LM logits and n-gram statistics using model entropy.
parameters: {"alpha_formula":"0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
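The interpolation above can be sketched directly from the stated formula: the blend weight alpha grows with the model's own predictive entropy H, so the n-gram statistics matter most where the LM is uncertain. The probability-space mixing (rather than logit-space) is an assumption:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_adaptive_mix(lm_logits, ngram_probs):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), per the PR's formula,
    # with H the entropy (in bits) of the LM's own distribution.
    p = softmax(lm_logits)
    H = -(p * np.log2(np.clip(p, 1e-12, None))).sum(axis=-1, keepdims=True)
    alpha = 0.05 + 0.55 / (1.0 + np.exp(-2.0 * (H - 4.0)))
    return (1 - alpha) * p + alpha * ngram_probs
```

Note that alpha stays in [0.05, 0.60]: even a fully confident model keeps a 5% floor of n-gram mass, and a maximally uncertain one leans on the n-grams for at most 60%.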

Novel Contributions

  • Multi-order n-gram backoff over orders 2 through 7
  • Entropy-adaptive alpha for interpolating LM and n-gram statistics
  • First legal sub-1.0 BPB record claim
  • Score-first, backward-looking eval-time n-gram cache
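The backoff over orders 2-7 can be sketched as follows: try the longest matching context first and fall back to shorter orders, with a uniform fallback when nothing matches (the count-store layout and the fallback are assumptions; the PR states only the order range and that backoff is used):

```python
def ngram_predict(counts, context, vocab_size, max_order=7, min_order=2):
    # counts: {order: {context_tuple: {token: count}}}.
    # Back off from order 7 down to order 2, using the first order whose
    # (order-1)-token context suffix has been seen.
    for order in range(max_order, min_order - 1, -1):
        ctx = tuple(context[-(order - 1):])
        if ctx in counts.get(order, {}):
            dist = counts[order][ctx]
            total = sum(dist.values())
            return {tok: c / total for tok, c in dist.items()}
    # Assumed fallback: uniform over the vocabulary when no order matches.
    return {t: 1.0 / vocab_size for t in range(vocab_size)}
```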