val_bpb: 0.3922
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.94 MB
Training Techniques
- Quantization: QAT (bits: 6, scope: all)
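The QAT entry can be read as fake quantization in the forward pass: weights are snapped to a 6-bit grid while the optimizer still updates full-precision copies. A minimal numpy sketch; the symmetric per-tensor scaling is an assumption, since the card does not specify the scheme.

```python
import numpy as np

def fake_quantize(w, bits=6):
    # Symmetric per-tensor fake quantization: round weights onto a
    # 6-bit grid (64 levels) in the forward pass; training updates the
    # underlying float weights (straight-through estimator).
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                            # dequantized values

w = np.random.default_rng(0).standard_normal((4, 4))
wq = fake_quantize(w, bits=6)
```

Because the grid has at most 64 levels, the rounding error per weight is bounded by half a quantization step.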
- Compression: lzma (level: null)
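The artifact compression is a straightforward use of Python's standard `lzma` module; quantized weights have low entropy, so they compress well. A round-trip sketch with illustrative shapes:

```python
import lzma
import numpy as np

# Serialize quantized weights to bytes, then LZMA-compress them.
rng = np.random.default_rng(0)
weights = rng.integers(-32, 32, size=(256, 64), dtype=np.int8)  # 6-bit value range
blob = lzma.compress(weights.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(256, 64)
```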
- Optimizer: Muon (weight_decay: 0.04, momentum: null, matrix_lr: 0.025)
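Muon applies momentum to the gradient of each 2-D weight matrix and then approximately orthogonalizes the update before applying it with the matrix learning rate. A rough numpy sketch using the textbook cubic Newton-Schulz iteration; the reference Muon implementation uses a tuned quintic, and the momentum coefficient below is an assumption since the card leaves it null.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Maps g toward the nearest (semi-)orthogonal matrix. Normalizing by
    # the Frobenius norm puts all singular values in (0, 1], where the
    # cubic iteration x <- 1.5x - 0.5 x x^T x pushes them toward 1.
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, g, momentum_buf, lr=0.025, beta=0.95, weight_decay=0.04):
    # Momentum on the raw gradient, orthogonalize the buffered update,
    # then apply with matrix_lr and decoupled weight decay (card values:
    # lr=0.025, weight_decay=0.04; beta is an assumed placeholder).
    momentum_buf = beta * momentum_buf + g
    update = newton_schulz_orthogonalize(momentum_buf)
    w = w * (1 - lr * weight_decay) - lr * update
    return w, momentum_buf
```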
- Weight Averaging: EMA (decay: 0.997)
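EMA keeps a shadow copy of the weights that trails training with decay 0.997; the shadow copy, not the raw weights, is what gets evaluated and exported. A minimal sketch:

```python
def ema_update(shadow, params, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * params, elementwise.
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

shadow = [0.0, 1.0]
shadow = ema_update(shadow, [1.0, 1.0])
```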
- Architecture:
  - weight tying: tied input and output embeddings
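Weight tying reuses the input embedding matrix as the output projection, which matters for artifact size because the vocabulary embedding is a large share of a small model's parameters. A sketch with hypothetical sizes:

```python
import numpy as np

class TiedLM:
    # The input embedding and the output (logit) projection share one
    # matrix, saving vocab_size * d_model parameters.
    def __init__(self, vocab_size=256, d_model=32, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.standard_normal((vocab_size, d_model)) * 0.02

    def embed_tokens(self, ids):
        return self.embed[ids]            # (seq, d_model)

    def logits(self, hidden):
        return hidden @ self.embed.T      # (seq, vocab_size), same matrix
```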
  - U-Net skip connections: U-Net-style skip connections in the model
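The U-Net pattern saves activations from the first half of the layer stack and adds them back at the mirrored depth in the second half. A minimal sketch of the wiring (additive skips are an assumption; the card does not specify how the branches are merged):

```python
def unet_transformer_pass(x, layers):
    # First half of the stack: push activations. Second half: pop the
    # mirrored activation and add it before running the layer.
    n = len(layers)
    stack = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            stack.append(x)
        elif stack:
            x = x + stack.pop()
        x = layer(x)
    return x
```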
  - SmearGate: per-dimension gate blending each token with the previous token
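As described, SmearGate learns one gate per channel that mixes the current token's vector with the previous token's. A numpy sketch; treating the gate as a learned value in [0, 1] and passing the first token through unchanged are assumptions:

```python
import numpy as np

def smear_gate(x, gate):
    # x: (seq, d); gate: (d,) with entries in [0, 1].
    # Per dimension: y_t = (1 - g) * x_t + g * x_{t-1};
    # the first token has no predecessor and is left unchanged.
    prev = np.concatenate([x[:1], x[:-1]], axis=0)
    return (1.0 - gate) * x + gate * prev
```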
  - BigramHash: hash-table embedding for token bigrams (dimensions: 2048x128)
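BigramHash gives the model direct access to bigram information without a full vocab-squared table: each (previous, current) token pair is hashed into a fixed table of 2048 rows of dimension 128, per the card. The hash function and the zero-padding of the first position are illustrative assumptions:

```python
import numpy as np

class BigramHash:
    # Hashed embedding for token bigrams: collisions share a row, trading
    # accuracy for a fixed 2048 x 128 parameter budget.
    def __init__(self, rows=2048, dim=128, seed=0):
        self.rows = rows
        self.table = np.random.default_rng(seed).standard_normal((rows, dim)) * 0.02

    def lookup(self, ids):
        ids = np.asarray(ids, dtype=np.int64)
        prev = np.concatenate([np.zeros(1, dtype=np.int64), ids[:-1]])
        h = (prev * 1000003 + ids) % self.rows  # simple multiplicative hash (illustrative)
        return self.table[h]                    # (seq, dim)
```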
  - XSA: exclusive self-attention applied to the last 4 layers to reduce self-value bias
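The card does not define XSA beyond this line. One reading consistent with "exclusive self-attention ... to reduce self-value bias" is causal attention with the diagonal masked, so a position's output never mixes in its own value vector. A single-head sketch under that assumption:

```python
import numpy as np

def exclusive_causal_attention(q, k, v):
    # Assumed reading of XSA: position t attends only to positions < t,
    # never to itself. Position 0 has no valid target, so it falls back
    # to attending to itself.
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((seq, seq), dtype=bool), k=-1)  # strictly below diagonal
    mask[0, 0] = True                                      # fallback for position 0
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```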
  - MLP3x: wider MLP with 3x expansion
  - GQA: grouped-query attention with 8 query heads and 4 KV heads
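With 8 query heads sharing 4 KV heads, each K/V projection serves two query heads, halving the KV parameters and cache relative to standard multi-head attention. A sketch (non-causal, for brevity):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv=4):
    # q: (n_heads, seq, hd); k, v: (n_kv, seq, hd).
    # Query head h reads from KV head h // (n_heads // n_kv).
    group = n_heads // n_kv
    out = []
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]
        s = q[h] @ kh.T / np.sqrt(q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        out.append(w @ vh)
    return np.stack(out)
```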
  - GELU pre-enrichment: wider nonlinear pre-enrichment block before the transformer layers (dimensions: 512 -> 768 -> 512)
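Per the listed dimensions, the pre-enrichment block expands each embedding 512 -> 768, applies GELU, and projects back to 512 before the first transformer layer. A sketch; the residual connection is an assumption:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

class PreEnrich:
    # Nonlinear enrichment of embeddings before the transformer stack,
    # with the 512 -> 768 -> 512 shape from the card.
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((512, 768)) * 0.02
        self.w2 = rng.standard_normal((768, 512)) * 0.02

    def __call__(self, x):                       # x: (seq, 512)
        return x + gelu(x @ self.w1) @ self.w2   # residual (assumed)
```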
- Evaluation: sliding-window evaluation
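Sliding-window evaluation scores text longer than the model's context by re-reading overlapping windows and counting each token exactly once, so later tokens keep substantial left context. A sketch of the span bookkeeping; the window and stride values are assumptions:

```python
def sliding_windows(n_tokens, window=2048, stride=1024):
    # Returns (read_start, score_start, end) triples: the model reads
    # [read_start, end) but only positions [score_start, end) count
    # toward val_bpb, so scored tokens keep window - stride of context.
    spans, scored, start = [], 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, scored, end))
        scored = end
        start = max(0, end - (window - stride))
    return spans
```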
- Test-Time Training: score-first TTT
- Sequence Length: train_length 2048, eval_length null
- LR Schedule: warmdown (warmdown_steps: 3500)
- Regularization: weight decay (value: 0.04)
Novel Contributions
- Full-vocab 1024-token normalized n-gram scoring across all tokens
- Bayesian first-match blending with a neural prior
- Collision premium analysis showing inflated pseudo-probabilities from hash collisions
- Fixed 0.5 blend outperforming adaptive gating schemes
- Two-phase shared n-gram cache with global sequential cache construction
- GELU pre-enrichment block
- XSA on the last 4 layers