val_bpb: 0.0972
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB
Training Techniques
Architecture
BigramHash: Uses hashed n-gram/bigram-style context matching in the model.
parameters: {"dimensions":128,"buckets":4096}
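The entry reports only the embedding width and bucket count, so the following is a hedged sketch of what a hashed bigram lookup with these parameters could look like; the mixing function and all names are illustrative assumptions, not the model's code.

```python
# Hypothetical hashed-bigram lookup with the reported parameters
# (128-dim embeddings, 4096 buckets).
DIM, BUCKETS = 128, 4096

def bigram_bucket(prev_token: int, token: int) -> int:
    """Hash a (previous, current) token pair into a fixed bucket index."""
    # A simple multiplicative mix; the real model may use any hash.
    h = (prev_token * 1_000_003 + token) * 2_654_435_761
    return (h ^ (h >> 16)) % BUCKETS

# One embedding row per bucket; zeros here, learned in a real model.
table = [[0.0] * DIM for _ in range(BUCKETS)]

def bigram_feature(prev_token: int, token: int) -> list:
    """Context feature to be combined with the token representation."""
    return table[bigram_bucket(prev_token, token)]
```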
SmearGate: Included as part of the architecture.
parameters: null
Value Residual: Uses value residual connections.
parameters: null
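Value residual connections, as commonly described, mix each attention layer's value vectors with the first layer's values. A one-line sketch; the mixing weight `lam` is an arbitrary stand-in here (in practice it is typically a learned per-layer scalar), since no parameters are reported.

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    # Blend this layer's values with the first layer's values.
    # `lam` is an assumed illustration, not a reported parameter.
    return lam * v_layer + (1.0 - lam) * v_first
```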
GQA: Grouped query attention with separate query and key/value head counts.
parameters: {"query_heads":8,"kv_heads":8}
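A minimal grouped-query attention sketch: each K/V head is shared by a group of query heads. Note that with the reported query_heads=8 and kv_heads=8 the group size is 1, so this configuration coincides with standard multi-head attention. Causal masking is omitted for brevity.

```python
import numpy as np

def gqa(q, k, v):
    # q: (query_heads, T, d); k, v: (kv_heads, T, d)
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each K/V head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```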
ReLU²: MLP uses squared ReLU activations.
parameters: {"mlp_multiplier":3}
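The squared-ReLU MLP is straightforward; per the reported mlp_multiplier of 3, the hidden width is three times the model dimension. A minimal sketch:

```python
import numpy as np

def relu_squared(x):
    # ReLU²: zero for x <= 0, x**2 otherwise.
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # w_in: (d, 3*d), w_out: (3*d, d) per the mlp_multiplier of 3.
    return relu_squared(x @ w_in) @ w_out
```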
XSA: Used across all layers.
parameters: {"layers":11}
Partial RoPE: Applies partial rotary positional embeddings.
parameters: {"train_eval_ratio":"16/64"}
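Partial RoPE usually means rotating only a subset of each head's channels. Interpreting the reported "16/64" as 16 rotary channels out of a 64-dim head is an assumption for illustration (the parameter name suggests it could instead relate a train-time to an eval-time setting); under that assumption:

```python
import numpy as np

def partial_rope(x, pos, rot=16, base=10000.0):
    # Rotate only the first `rot` channels of a head vector `x`; the
    # remaining channels pass through unchanged. "16 of 64" rotary
    # channels is an assumed reading of the reported "16/64".
    half = rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    out = x.astype(float)
    x1 = out[:half].copy()
    x2 = out[half:rot].copy()
    out[:half] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[half:rot] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```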
Regularization
LN scale
parameters: null
logit softcap
parameters: {"value":30}
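Logit softcapping with value 30 is commonly the tanh-based smooth bound; assuming that standard formulation:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly squashes logits into (-cap, cap); near zero it is close
    # to the identity, so only extreme logits are noticeably affected.
    return cap * np.tanh(logits / cap)
```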
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"lr":0.02,"momentum_schedule_end":0.99}
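Muon updates 2-D weight matrices with an approximately orthogonalized momentum direction computed by a quintic Newton-Schulz iteration. A hedged sketch using the reported hyperparameters; the iteration coefficients follow the public Muon reference implementation, and the exact variant used for this entry is not specified (momentum is scheduled up to 0.99 over training per the parameters above).

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Quintic Newton-Schulz iteration pushing singular values toward 1;
    # coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(W, M, grad, lr=0.02, momentum=0.92, weight_decay=0.04):
    # One hypothetical update with the reported lr / momentum / decay.
    M = momentum * M + grad
    W = W * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(M)
    return W, M
```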
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
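The EMA entry above keeps a shadow copy of the weights, updated each step with decay 0.997; the shadow copy, not the raw weights, is what gets evaluated. A minimal sketch:

```python
def ema_update(shadow, weights, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * weights, applied every step.
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```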
Quantization
mixed int6
bits: 6
scope: model
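"Mixed int6" suggests most tensors are stored as 6-bit integers with accompanying scales, though the exact scheme (per-channel vs per-tensor scales, which tensors stay in higher precision) is not reported. A minimal symmetric per-tensor sketch:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric 6-bit codes in [-31, 31] (one sign bit + 5 magnitude bits).
    # The actual "mixed" scheme is an unreported detail.
    scale = np.abs(w).max() / 31.0 + 1e-12
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```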
Compression
lzma
level: null
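The artifact is lzma-compressed with an unreported level; a round-trip with Python's stdlib for illustration (the blob here is a repetitive stand-in, not the real artifact):

```python
import lzma

blob = bytes(range(256)) * 64       # stand-in for a serialized weight artifact
packed = lzma.compress(blob)        # preset left at lzma's default, since
                                    # the entry does not report the level
restored = lzma.decompress(packed)
```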
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
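A warmdown schedule holds the base LR constant and then decays it to zero over the final warmdown_steps (3,500 here). Linear decay is the usual form and is assumed in this sketch:

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3500):
    # Constant phase, then linear decay to 0 over the last `warmdown_steps`.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```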
Novel Contributions
- Extended n-gram backoff to order-14
- Enabled full-rescore two-pass evaluation with stored neural probabilities
- Increased alpha max to 0.70 for stronger high-order n-gram trust
- Reduced chunk size to 262,144 tokens for more frequent cache updates
- Maintained score-first legal evaluation while rescoring all chunks with a warm cache
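The contributions above can be illustrated with a minimal backoff lookup: try the longest stored context (order-14 means up to 13 prior tokens) and weight a hit by alpha, capped at 0.70. How alpha interacts with the stored neural probabilities in the full two-pass rescore is not specified, so this is only a sketch of the n-gram side.

```python
def backoff_prob(context, token, counts, max_order=14, alpha_max=0.70):
    # Walk down from the longest context; the first context with counts
    # wins, weighted by alpha_max. Blending the remainder with the neural
    # model's stored probabilities (the two-pass rescore) is omitted.
    for order in range(min(max_order - 1, len(context)), 0, -1):
        ctx = tuple(context[-order:])
        if ctx in counts and token in counts[ctx]:
            return alpha_max * counts[ctx][token] / sum(counts[ctx].values())
    return 0.0
```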