PR #871
Non-record (WIP): Multi-Order N-gram Backoff — val_bpb=0.8004 (1xH100 proxy)
by greqone
val_bpb: 0.8004
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.18 MB
Training Techniques
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
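A minimal sketch of the grouped-query attention layout above (8 query heads sharing 4 KV heads, so each group of 2 query heads reads one KV head). Shapes, the head dimension, and the causal-mask convention are illustrative assumptions, not the PR's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each group of n_heads // n_kv_heads query
    heads shares one KV head. q: (T, n_heads, d); k, v: (T, n_kv_heads, d)."""
    T, _, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                 # (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)      # mask future positions
    w = softmax(scores, axis=-1)
    return np.einsum('hqk,khd->qhd', w, v)          # (T, n_heads, d)
```

Halving the KV heads halves the KV-cache size while leaving the number of query heads (and most of the compute) unchanged.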
BigramHash
Hashed bigram embedding component.
parameters: {"buckets":4096,"dim":128}
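A sketch of how a hashed bigram embedding with 4096 buckets and dim 128 could work: the (previous token, current token) pair is hashed into a bucket, and that bucket's learned vector is added alongside the ordinary token embedding. The hash multiplier and class names here are hypothetical.

```python
import numpy as np

def bigram_hash(prev_tok, tok, buckets=4096):
    # Hash the (previous, current) token pair into one of `buckets` ids.
    # 1000003 is an arbitrary odd multiplier chosen for this sketch.
    return (prev_tok * 1000003 + tok) % buckets

class BigramEmbedding:
    """Looks up a 128-dim embedding for each hashed bigram in a sequence."""
    def __init__(self, buckets=4096, dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((buckets, dim)) * 0.02

    def __call__(self, tokens):
        tokens = np.asarray(tokens)
        prev = np.concatenate([[0], tokens[:-1]])   # shift; first pair uses 0
        ids = bigram_hash(prev, tokens, self.table.shape[0])
        return self.table[ids]                      # (T, dim)
```

Hashing trades exact bigram identity for a fixed 4096 x 128 table, so the component stays small regardless of vocabulary size.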
SmearGate
SmearGate gating mechanism.
parameters: null
Value Residual
Value residual pathway in the attention stack.
parameters: null
Gated Attention
Attention mechanism with gating.
parameters: null
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"train":16,"total":64}
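One plausible reading of partial RoPE with train=16 of total=64: only the first 16 dimensions of each head are rotated, the remaining 48 pass through unchanged. The frequency convention (standard base-10000 inverse frequencies) is an assumption.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head's dimensions (here 16 of 64); the rest are left unrotated.
    x: (T, n_heads, head_dim)."""
    T, H, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half)
    cos = np.cos(ang)[:, None, :]                       # broadcast over heads
    sin = np.sin(ang)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```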
LN Scale
LayerNorm scale modification.
parameters: null
U-Net skip connections
U-Net style skip connections in the model.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
LeakyReLU
MLP uses a squared LeakyReLU activation (negative slope 0.5) with a 3x hidden multiplier.
parameters: {"multiplier":"3x","squared":true,"slope":0.5}
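A sketch of one reading of "squared LeakyReLU" with slope 0.5: apply the leaky nonlinearity, then square while preserving sign (a plain square would discard the sign of the leaky branch). Whether the PR squares with or without sign preservation is an assumption.

```python
import numpy as np

def squared_leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5 ...
    y = np.where(x >= 0, x, slope * x)
    # ... followed by a sign-preserving square (assumed variant).
    return y * np.abs(y)
```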
Regularization
logit softcap
parameters: {"value":30}
magnitude pruning
parameters: {"sparsity":"3%"}
Quantization
mixed int5/int6
bits: null
scope: MLP/attn
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"lr":0.03}
Weight Averaging
EMA
parameters: {"decay":0.997}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Sequence Length
sequence_length
train_length: null
eval_length: null
Evaluation
score-first n-gram backoff
parameters: {"orders":"2-7","entropy_adaptive_alpha":true,"min_count":2,"hash_buckets":4000000}
Novel Contributions
- Multi-order backward-looking n-gram backoff evaluation cache
- Entropy-adaptive alpha for mixing model and n-gram scores
- Score-first evaluation that stays legal: the cache is updated only after each token has been scored, so a token never influences its own score
- Highest-matching-order backoff from 7-gram to bigram
- Proxy-validated 1xH100 run showing 0.8004 val_bpb
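The contributions above can be sketched as a single cache class: counts for orders 2..7 are hashed into 4M buckets, scoring backs off from the highest matching order, the mixing weight alpha adapts to the model's entropy, and the cache is updated only after scoring (score-first). Class and method names and the exact alpha schedule are hypothetical, not taken from the PR.

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Backward-looking n-gram evaluation cache, orders 2..7, hashed into
    a fixed number of buckets. score() mixes model and n-gram probabilities
    at the highest order whose context has enough counts; update() runs
    only AFTER scoring, so the current token never sees itself."""
    def __init__(self, orders=range(2, 8), buckets=4_000_000, min_count=2):
        self.orders = sorted(orders, reverse=True)   # try 7-gram first
        self.buckets = buckets
        self.min_count = min_count
        self.counts = defaultdict(lambda: defaultdict(int))

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def score(self, history, token, model_probs):
        for n in self.orders:                        # highest matching order wins
            if len(history) < n - 1:
                continue
            dist = self.counts[self._bucket(tuple(history[-(n - 1):]))]
            total = sum(dist.values())
            if total >= self.min_count:
                ngram_p = dist.get(token, 0) / total
                # Entropy-adaptive alpha (hypothetical schedule): lean on
                # the n-gram cache more when the model is uncertain.
                ent = -sum(p * math.log(p) for p in model_probs if p > 0)
                alpha = ent / (ent + 1.0)
                return alpha * ngram_p + (1 - alpha) * model_probs[token]
        return model_probs[token]                    # no match: pure model

    def update(self, history, token):
        # Called after score(), per the score-first rule.
        for n in self.orders:
            if len(history) >= n - 1:
                self.counts[self._bucket(tuple(history[-(n - 1):]))][token] += 1
```

The evaluation loop would call `score(history, token, probs)` first, then `update(history, token)`, then append the token to the history.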