PR #806

open

Record: Backoff N-gram Cache + LeakyReLU(0.9)² (val_bpb=0.6678)

by ibarrajo
val_bpb: 0.6678
Architecture: Transformer
Optimizer: Muon
Artifact Size: 8.6 MB

Training Techniques

Architecture
SmearGate
Added SmearGate to the Transformer architecture.
parameters: null
BigramHash
Added a BigramHash component with a vocabulary size of 2048.
parameters: {"vocab_size":2048}
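The PR does not spell out the BigramHash scheme beyond its vocabulary size, so the following is a hedged sketch: adjacent token ids are mixed with a simple multiplicative hash into a table of 2048 buckets (the hash constant and the bucket-0 convention for position 0 are assumptions, not the PR's implementation).

```python
# Hypothetical sketch of a BigramHash feature with vocab_size=2048.
BIGRAM_VOCAB = 2048

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Map a (prev, cur) token-id pair to one of BIGRAM_VOCAB buckets."""
    # Knuth-style multiplicative mixing; any decent integer hash works here.
    h = (prev_id * 0x9E3779B1 + cur_id) & 0xFFFFFFFF
    return h % BIGRAM_VOCAB

def bigram_buckets(token_ids):
    """Bucket index per position; position 0 has no predecessor, so use bucket 0."""
    out = [0]
    for prev, cur in zip(token_ids, token_ids[1:]):
        out.append(bigram_bucket(prev, cur))
    return out
```

In a model these bucket indices would typically index a small embedding table whose output is added to the token embedding stream.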
MLP3x
Uses a 3x MLP expansion in the Transformer.
parameters: null
tied embeddings
Uses tied input/output embeddings (weight tying).
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64}
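Sliding-window eval with stride 64 presumably follows the standard strided-perplexity scheme: overlapping windows advance by the stride, and only the newly covered tokens count toward the loss, so every token is scored exactly once with extra left context. A minimal index-generating sketch (the window length is a free parameter not stated in this PR):

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Yield (begin, end, n_scored) spans for sliding-window evaluation.

    Each window covers tokens [begin, end); only the n_scored tokens not
    covered by the previous window contribute to the loss, so every token
    is scored once with up to (window - stride) tokens of left context.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end  # only newly covered tokens count
        prev_end = end
        if end == n_tokens:
            break
```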
multi-order backoff n-gram cache
parameters: {"orders":[2,3,4,5,6,7],"entropy_adaptive_alpha":true,"score_first":true,"min_count":2}
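The backoff cache can be sketched as follows, using the PR's stated orders (2-7) and min_count=2: counts are kept per context, and prediction backs off to the longest context seen at least min_count times. The smoothing and fallback details here are assumptions, not the PR's exact implementation.

```python
from collections import defaultdict

class BackoffNgramCache:
    """Minimal sketch of a multi-order backoff n-gram cache (orders 2-7)."""

    def __init__(self, orders=(2, 3, 4, 5, 6, 7), min_count=2):
        self.orders = sorted(orders, reverse=True)  # try longest context first
        self.min_count = min_count
        # context tuple -> {next_token: count}
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, history, token):
        """Record `token` under every order's trailing context of `history`."""
        for n in self.orders:
            ctx = tuple(history[-(n - 1):])
            if len(ctx) == n - 1:  # history long enough for this order
                self.counts[ctx][token] += 1
                self.totals[ctx] += 1

    def probs(self, history):
        """Distribution from the longest context with >= min_count observations;
        returns {} when no context qualifies (caller uses the model alone)."""
        for n in self.orders:
            ctx = tuple(history[-(n - 1):])
            if len(ctx) == n - 1 and self.totals[ctx] >= self.min_count:
                t = self.totals[ctx]
                return {w: c / t for w, c in self.counts[ctx].items()}
        return {}
```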
distributed cache pre-fill
parameters: {"multi_gpu":true,"rank":7,"prefill_tokens":54000000,"prefill_time_seconds":68}
Other
other
LeakyReLU(0.9)^2 activation replacing relu^2.
parameters: {"slope":0.9}
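The activation itself is simple to state: square the output of a LeakyReLU with negative slope 0.9, so negative inputs still carry signal (scaled by 0.81 after squaring) where relu^2 would be exactly flat. A scalar sketch:

```python
def leaky_relu_sq(x: float, slope: float = 0.9) -> float:
    """LeakyReLU(0.9)^2, the drop-in replacement for relu^2:
    x**2 for x >= 0, (slope * x)**2 for x < 0."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

In the model this is applied elementwise; note that with slope 0.9 the function is nearly even, so almost all sign information is discarded, as with relu^2's zero branch.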
other
Entropy-adaptive alpha mixing between model softmax and n-gram cache probabilities.
parameters: {"alpha_formula":"0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))"}
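The alpha formula is given explicitly above; a sketch of the mixing step follows. Two details are assumptions here: that H is the entropy of the model's softmax (computed in nats below; the PR does not state the units) and that alpha weights the cache side of the blend.

```python
import math

def mix_alpha(H: float) -> float:
    """PR formula: alpha = 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0)).
    Ranges over (0.05, 0.60): uncertain (high-entropy) model predictions
    lean harder on the n-gram cache."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(p_model, p_cache):
    """Blend model and cache distributions with the entropy-adaptive alpha."""
    H = -sum(p * math.log(p) for p in p_model if p > 0.0)  # model entropy (nats assumed)
    a = mix_alpha(H)
    return [(1.0 - a) * pm + a * pc for pm, pc in zip(p_model, p_cache)]
```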

Novel Contributions

  • Multi-order backoff n-gram eval cache with orders 2-7
  • Entropy-adaptive alpha mixing between neural predictions and n-gram cache probabilities
  • Distributed cache pre-fill for multi-GPU coherence
  • LeakyReLU(0.9)^2 activation replacing relu^2
  • Score-first legality: scoring every token under inference_mode before cache update
  • Removal of illegal pre-eval test-time training
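The score-first legality bullet above can be sketched as an eval loop: each token is scored using only cache state built from strictly earlier tokens, and the cache ingests the token afterwards, so the cache never sees the token it is scoring. In the PR this runs under torch.inference_mode(); here a plain bigram count cache and a fixed alpha stand in for the real multi-order cache and the entropy-adaptive mixing, and model_prob is an assumed interface.

```python
import math
from collections import defaultdict

def eval_bpb_with_cache(tokens, model_prob, alpha=0.3):
    """Score-first evaluation: model_prob(history, token) -> neural probability.
    Returns bits per token over the sequence."""
    counts = defaultdict(lambda: defaultdict(int))  # prev token -> next-token counts
    totals = defaultdict(int)
    nats = 0.0
    for i, tok in enumerate(tokens):
        hist = tokens[:i]
        p = model_prob(hist, tok)
        if hist and totals[hist[-1]] > 0:           # cache has seen this context
            p_cache = counts[hist[-1]][tok] / totals[hist[-1]]
            p = (1.0 - alpha) * p + alpha * p_cache
        nats += -math.log(max(p, 1e-12))            # score BEFORE the update
        if hist:                                    # only now may the cache see (prev, tok)
            counts[hist[-1]][tok] += 1
            totals[hist[-1]] += 1
    return nats / (len(tokens) * math.log(2))
```

On a repetitive byte stream the cache term lifts the probability of repeated bigrams, so the mixed score beats the model alone while remaining legal (no token ever informs its own score).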