| Field | Value |
|---|---|
| val_bpb | 0.4188 |
| Architecture | Transformer |
| Optimizer | Parallel Muon |
| Artifact Size | 15.66 MB |
Training Techniques
Quantization
- mixed int5/int6: MLP weights quantized to int5; attention and embedding weights to int6.
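A minimal sketch of the round-to-nearest scheme such mixed int5/int6 quantization implies. Symmetric per-tensor scaling is an assumption, not a detail recorded in the entry.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to a signed `bits`-wide grid."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5, 31 for int6
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Per the entry: int5 for MLP weights, int6 for attention/embedding weights.
w_mlp = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q5, s5 = quantize(w_mlp, 5)
q6, s6 = quantize(w_mlp, 6)
```

With a per-tensor scale chosen from the max-magnitude weight, the reconstruction error is bounded by half a quantization step.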
Architecture
- GQA: grouped-query attention with 8 query heads sharing 4 KV heads (heads: 8, kv_heads: 4).
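The head layout above (8 query heads over 4 KV heads, so each KV head serves a group of 2 query heads) can be sketched in plain NumPy; causal masking and batching are omitted for brevity.

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention. Shapes: q (T, 8, d); k, v (T, 4, d)."""
    group = q.shape[1] // k.shape[1]             # query heads per KV head
    k = np.repeat(k, group, axis=1)              # broadcast KV to (T, 8, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax over source positions
    return np.einsum('hts,shd->thd', w, v)
```

Sharing KV heads cuts the KV cache in half here while keeping 8-way query diversity.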
- MLP3x: transformer MLP hidden layer widened to 3.0x (multiplier: 3).
- LeakyReLU: MLP activation is a squared LeakyReLU with negative slope 0.5 (slope: 0.5).
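One literal reading of "LeakyReLU(0.5)^2": apply a LeakyReLU with slope 0.5, then square. Note that squaring makes the negative branch positive; whether the submission preserves the sign there is not recorded, so treat this as a sketch.

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared LeakyReLU: LeakyReLU(x; slope) followed by squaring.
    Sign handling on the negative branch is an assumption."""
    y = np.where(x > 0, x, slope * x)
    return y * y
```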
- SmearGate: SmearGate module included in the base neural stack.
- BigramHash: hashed bigram features used in the model stack and n-gram context handling (table size: 2048).
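A hashed bigram table of size 2048 can be sketched as follows; the multiply-and-xor mixing is illustrative, not the submission's actual hash function.

```python
def bigram_hash(prev_tok: int, cur_tok: int, table_size: int = 2048) -> int:
    """Map a (previous, current) token pair to a slot in a fixed-size
    table (2048 per the entry). Mixing constants are illustrative."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16                      # fold high bits into the low bits
    return h % table_size
```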
- VE128: value-residual embeddings with 128 dimensions.
Initialization
- OrthoInit: orthogonal initialization used in the base stack.
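Orthogonal initialization is conventionally built from the QR decomposition of a Gaussian matrix; a sketch, assuming that standard construction (the entry only names the technique).

```python
import numpy as np

def orthogonal_init(shape, rng=None):
    """Orthogonal init via QR of a Gaussian matrix: the resulting
    columns are orthonormal, which keeps activation norms stable."""
    rng = rng if rng is not None else np.random.default_rng()
    q, r = np.linalg.qr(rng.standard_normal(shape))
    return q * np.sign(np.diag(r))   # fix the QR sign ambiguity
```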
Optimizer
- Parallel Muon (weight decay, momentum, and other hyperparameters not recorded).
Compression
- lzma (compression level not recorded).
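Compressing the serialized artifact with Python's stdlib lzma module might look like this; the preset is a guess, since the entry does not record the level used.

```python
import lzma

def compress_artifact(raw: bytes, preset: int = 9) -> bytes:
    """Compress serialized weights with lzma. preset=9 (max) is an
    illustrative choice, not the submission's recorded setting."""
    return lzma.compress(raw, preset=preset)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```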
Other
- Complementary training: down-weights tokens that are easily predicted by bigram statistics (loss weighting: 1 - alpha * p_bigram(token)).
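The recorded weighting 1 - alpha * p_bigram(token) can be sketched directly; the value of alpha and the weight normalization are illustrative choices.

```python
import numpy as np

def complementary_weights(p_bigram: np.ndarray, alpha: float) -> np.ndarray:
    """Per-token loss weights w = 1 - alpha * p_bigram(token), as recorded:
    tokens a bigram model already predicts well contribute less."""
    return 1.0 - alpha * p_bigram

def weighted_nll(logp_model: np.ndarray, p_bigram: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Weight-normalized negative log-likelihood. alpha = 0.5 and the
    normalization are illustrative, not recorded settings."""
    w = complementary_weights(p_bigram, alpha)
    return float(-(w * logp_model).sum() / w.sum())
```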
- N-gram mixing: strictly causal backoff n-gram mixer with entropy-adaptive blending.
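A sketch of the two ingredients, assuming a standard longest-match backoff and an entropy gate normalized by the maximum entropy log(vocab_size); the submission's exact rules are not recorded.

```python
import numpy as np

def backoff_lookup(context, tables):
    """Strictly causal backoff: try the longest n-gram context first and
    fall back to shorter orders. `tables[n]` maps n-token contexts to a
    next-token distribution (structure assumed)."""
    for n in range(len(context), 0, -1):
        key = tuple(context[-n:])
        if key in tables.get(n, {}):
            return tables[n][key]
    return None

def entropy_blend(p_model, p_ngram, vocab_size):
    """Entropy-adaptive blending: a confident (low-entropy) n-gram
    distribution gets more weight; a near-uniform one defers to the model."""
    h = -(p_ngram * np.log(p_ngram + 1e-12)).sum(-1)
    lam = np.clip(1.0 - h / np.log(vocab_size), 0.0, 1.0)
    return lam * p_ngram + (1.0 - lam) * p_model
```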
- Evaluation: score-first, DDP-safe evaluation protocol with synchronization before cache updates (ddp_safe: true, score_first: true).
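The score-first property can be illustrated without torch.distributed: score every batch against a frozen cache, synchronize, then apply the deferred updates. The `barrier` argument stands in for `dist.barrier()`; all names here are illustrative, not the submission's code.

```python
def evaluate_score_first(batches, score_fn, cache, barrier=lambda: None):
    """Score-first, DDP-safe evaluation sketch: every rank scores all of
    its batches against a frozen cache, synchronizes, and only then
    applies deferred cache updates, so scores cannot depend on how
    ranks interleave their writes."""
    scores, pending = [], []
    for batch in batches:
        s, update = score_fn(batch, cache)   # read-only pass over the cache
        scores.append(s)
        pending.append(update)
    barrier()                                # all ranks finish scoring first
    for update in pending:
        cache.update(update)                 # cache writes happen after sync
    return scores
```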
Novel Contributions
- Mixed precision quantization with int5 MLP weights and int6 attention/embedding weights
- Complementary training focused on tokens poorly predicted by n-grams
- Strictly causal backoff n-gram mixer with entropy-adaptive blending
- Score-first, DDP-safe cache update protocol for multi-GPU evaluation
- Artifact compression with lzma to fit within the 16MB limit