PR #887
Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)
by anthony-maio
val_bpb
0.9642
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.95 MB
Training Techniques
Architecture
LeakyReLU²
Squared LeakyReLU activation (negative slope 0.5) used in the MLP feedforward blocks.
parameters: {"power":2,"slope":0.5}
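The PR records only `power=2` and `slope=0.5`, so the exact functional form is an assumption; a minimal NumPy sketch of a sign-preserving squared LeakyReLU:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared LeakyReLU: apply LeakyReLU (negative slope 0.5), then square.

    The sign-preserving square on the negative branch is an assumption;
    the PR only gives power=2 and slope=0.5.
    """
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * y * y  # sign-preserving square keeps the activation monotonic
```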
VRL
Value Residual Learning: each attention layer's values are mixed with a residual of the first layer's values.
parameters: null
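VRL carries no parameters in the PR; a minimal sketch of the usual formulation, blending layer l's value projections with the first layer's values (the 0.5 mixing weight is illustrative, not from the PR):

```python
import numpy as np

def vrl_values(v_layer: np.ndarray, v_first: np.ndarray, lam: float = 0.5) -> np.ndarray:
    # Value Residual Learning: blend this layer's value projections with a
    # residual of the first layer's values before attention is applied.
    # lam may be fixed or learned per layer; 0.5 here is illustrative.
    return lam * v_layer + (1.0 - lam) * v_first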
VE128
Value embedding dimension set to 128.
parameters: {"dimensions":128}
BigramHash
Bigram hash feature with 2048 buckets.
parameters: {"buckets":2048}
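The bigram hash feature maps each (previous, current) token pair into one of 2048 buckets; only the bucket count comes from the PR, and the mixing constants below are illustrative:

```python
def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = 2048) -> int:
    # Hash the token pair into a fixed bucket; each bucket would index a
    # learned feature embedding. Mixing constants are illustrative, not
    # taken from the PR.
    h = (prev_tok * 0x9E3779B1 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```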
XSA
XSA attention variant used in the architecture.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"train_length":16,"eval_length":64}
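Partial RoPE rotates only a subset of each head's dimensions and leaves the rest position-free; the recorded train_length/eval_length (16/64) suggest the rotation is tuned for length extrapolation, but the PR does not say which dimensions are rotated. A minimal sketch with a hypothetical rotary slice of 16 dims:

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: int, rotary_dims: int = 16) -> np.ndarray:
    """Rotate the first `rotary_dims` dims of a head vector by position-
    dependent angles; pass the remaining dims through unchanged.
    The half-split pairing convention is an assumption."""
    half = rotary_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rotary_dims:]])
```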
SmearGate
SmearGate component included in the model.
parameters: null
U-Net skip connections
U-Net style skip connections in the network.
parameters: null
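The U-Net skips carry no parameters in the PR; a minimal sketch assuming the common first-in/last-out pairing between the encoder half and decoder half of the block stack, with additive skips:

```python
def unet_forward(x, encoder_blocks, decoder_blocks):
    # U-Net style skips: stash each encoder block's output and add it back
    # to the input of the matching decoder block (last skip pops first).
    # Additive combination is an assumption; some variants concatenate.
    skips = []
    for block in encoder_blocks:
        x = block(x)
        skips.append(x)
    for block in decoder_blocks:
        x = block(x + skips.pop())
    return x
```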
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
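With decay 0.997, the EMA weights track the training weights over an effective horizon of roughly 1/(1-0.997) ≈ 333 steps; a minimal sketch of the per-step update:

```python
def ema_update(ema, params, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, applied every step;
    # the EMA copy (not the raw weights) would be used for evaluation.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}
```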
Tight SWA
Stochastic weight averaging over a narrow checkpoint window.
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: all
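GPTQ-lite's exact scheme is not described in the PR; as a baseline intuition, a plain symmetric round-to-nearest int6 quantizer (integer levels in [-31, 31]) looks like the sketch below. GPTQ-style methods improve on this by compensating rounding error column by column.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-tensor int6: 6 bits -> integer levels in [-31, 31].
    # Plain round-to-nearest baseline; GPTQ-lite's error compensation and
    # any grouping are not shown here.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```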
Compression
lzma
level: null
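The compression level is not recorded; with Python's stdlib `lzma`, the artifact round-trip is just:

```python
import lzma

def pack(blob: bytes) -> bytes:
    # LZMA-compress the serialized (quantized) weight blob; the preset used
    # by the PR is not recorded, so the stdlib default applies here.
    return lzma.compress(blob)

def unpack(packed: bytes) -> bytes:
    return lzma.decompress(packed)
```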
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Regularization
LN scale
parameters: null
weight decay
parameters: {"value":0.04}
Evaluation
sliding window eval
parameters: null
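Sliding-window eval has no recorded parameters; a sketch of the usual windowing, where each token is scored exactly once with bounded left context (the window/stride values below are illustrative):

```python
def sliding_window_spans(n_tokens: int, window: int = 64, stride: int = 32):
    """Yield (ctx_start, score_start, end) triples: the model reads
    tokens[ctx_start:end] but only tokens[score_start:end] count toward
    val_bpb, so every token is scored exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        if pos == 0:
            end = min(window, n_tokens)
            spans.append((0, 0, end))
        else:
            ctx_start = max(0, pos - (window - stride))
            end = min(pos + stride, n_tokens)
            spans.append((ctx_start, pos, end))
        pos = end
    return spans
```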
Test-Time Training
score-first TTT
parameters: {"ngram_backoff":true,"orders":"2-7"}
Novel Contributions
- Multi-order causal n-gram backoff cache built from already-scored tokens
- Entropy-adaptive mixing between neural predictions and n-gram predictions
- Highest-order-wins backoff across 2-7 gram contexts with min_count gating
- Score-first evaluation compliance with post-token table updates only
- Combination of VRL, LeakyReLU², and compressed GPTQ-lite int6 model artifact
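The backoff and mixing bullets above can be sketched end to end. Everything below is a hypothetical reconstruction from the listed parameters (orders 2-7, a min_count gate, score-first updates); the actual min_count value, mixing cap, and entropy normalization are not given in the PR:

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Multi-order causal n-gram cache built only from already-scored tokens."""

    def __init__(self, orders=range(2, 8), min_count=2):
        self.orders = list(orders)   # 2-gram .. 7-gram contexts
        self.min_count = min_count   # hypothetical gate value
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}
        self.history = []

    def predict(self):
        # Highest-order-wins backoff: try the 7-gram context first and back
        # off toward 2-grams; return None if no context passes min_count,
        # in which case the caller uses the neural prediction alone.
        for n in sorted(self.orders, reverse=True):
            if len(self.history) < n - 1:
                continue
            ctx = tuple(self.history[-(n - 1):])
            table = self.counts[n][ctx]
            total = sum(table.values())
            if total >= self.min_count:
                return {tok: c / total for tok, c in table.items()}
        return None

    def update(self, token):
        # Score-first compliance: called only AFTER `token` has been scored.
        for n in self.orders:
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                self.counts[n][ctx][token] += 1
        self.history.append(token)


def entropy_mix_weight(neural_probs, max_weight=0.5):
    # Entropy-adaptive mixing (hypothetical form): lean on the n-gram table
    # more when the neural distribution is high-entropy, i.e. uncertain.
    h = -sum(p * math.log(p) for p in neural_probs if p > 0.0)
    h_max = math.log(len(neural_probs))
    return max_weight * (h / h_max if h_max > 0 else 0.0)
```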