PR #1379

open

Record: 0.4162 BPB mixed quant ngram (post-fix reruns)

by LucasErcolano
val_bpb
0.4162
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,623,718 bytes

Training Techniques

Quantization
mixed int5/int6
bits: null
scope: MLP int5; attention/embeddings int6
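The record lists the bit widths and scope (MLP at int5, attention/embeddings at int6) but not the quantization scheme itself. A minimal sketch, assuming per-tensor symmetric round-to-nearest quantization:

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Per-tensor symmetric round-to-nearest (illustrative; the record
    # does not specify the actual scheme used).
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.8, -0.3, 0.05, -1.0], dtype=np.float32)
q5, s5 = quantize_symmetric(w, 5)   # MLP weights -> int5
q6, s6 = quantize_symmetric(w, 6)   # attention/embedding weights -> int6
```

The extra bit at int6 halves the step size, so attention and embedding weights reconstruct more accurately than the int5 MLP weights.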
Architecture
GQA
Grouped query attention in the base transformer
parameters: {"heads":8,"kv_heads":4}
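With heads=8 and kv_heads=4, each key/value head is shared by two query heads. A NumPy sketch of that head-sharing pattern (shapes and details are illustrative, not the repo's code):

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_heads, T, d); k, v: (n_kv_heads, T, d).
    # Each KV head serves n_heads // n_kv_heads query heads.
    n_heads, T, d = q.shape
    group = n_heads // k.shape[0]            # 2 query heads per KV head
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                      # query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores = np.where(mask, -np.inf, scores)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
T, d = 5, 16
q = rng.standard_normal((8, T, d))
k = rng.standard_normal((4, T, d))
v = rng.standard_normal((4, T, d))
y = gqa(q, k, v)
```

Halving the KV heads halves the KV-cache size while keeping the full count of query heads.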
MLP3x
Expanded MLP width
parameters: {"multiplier":3}
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"slope":0.5}
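The MLP pairs the 3x width multiplier with a squared LeakyReLU (slope 0.5). The exact squaring convention isn't given; an elementwise square of the LeakyReLU output is one plausible reading:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU followed by an elementwise square (one reading of
    # "LeakyReLU squared"; the record gives only slope=0.5, squared=true).
    lrelu = np.where(x >= 0.0, x, slope * x)
    return lrelu ** 2

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = leaky_relu_squared(x)   # [1.0, 0.0625, 0.0, 2.25]
```

Note that squaring makes the negative branch non-negative; a sign-preserving variant (lrelu * |lrelu|) is also possible.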
SmearGate
SmearGate component in the base neural stack
parameters: null
BigramHash
Bigram hash component used in the model
parameters: {"size":2048}
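A bigram hash component of size 2048 suggests hashing (previous, current) token pairs into a fixed table, e.g. to index a learned embedding. A sketch; only the table size comes from the record, the multiplier and layout are assumptions:

```python
import numpy as np

TABLE_SIZE = 2048
D_MODEL = 64                                     # assumed embedding width
bigram_table = np.zeros((TABLE_SIZE, D_MODEL))   # learned during training

def bigram_slot(prev_tok, tok, size=TABLE_SIZE):
    # Multiplicative hash of the ordered pair into [0, size).
    # The odd constant is arbitrary, not from the record.
    return (prev_tok * 1000003 + tok) % size

def bigram_features(tokens):
    # One hashed-bigram embedding per position (position 0 has no
    # predecessor, so pair it with itself as a simple convention).
    slots = [bigram_slot(tokens[max(i - 1, 0)], t)
             for i, t in enumerate(tokens)]
    return bigram_table[slots]

feats = bigram_features([5, 17, 17, 900])
```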
VE128
Value-Residual Embeddings
parameters: {"dimensions":128}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Compression
lzma
level: null
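The 15,623,718-byte artifact is LZMA-compressed; the level is unspecified (listed as null). Low-bit quantized weights have few distinct byte values, which LZMA exploits. A sketch with the standard-library `lzma` module, assuming preset=9:

```python
import lzma
import numpy as np

weights = np.zeros(100_000, dtype=np.int8)   # stand-in quantized weights
raw = weights.tobytes()
blob = lzma.compress(raw, preset=9)          # preset is an assumption
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```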
Evaluation
sliding window eval
parameters: {"stride":256}
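Sliding-window evaluation advances in strides of 256 so every token after the first window is scored with substantial left context, while each token is scored exactly once. A sketch assuming a 1024-token context (the record gives only stride=256):

```python
def sliding_windows(n_tokens, context=1024, stride=256):
    # Yield (start, end, score_from): the model sees tokens
    # [start, end) but only [score_from, end) are scored, so the
    # scored spans partition the sequence with no double counting.
    spans, scored, start = [], 0, 0
    while scored < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, end, scored))
        scored = end
        start += stride
    return spans

spans = sliding_windows(1100)
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.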
Test-Time Training
score-first TTT
parameters: null
Initialization
OrthoInit
Orthogonal initialization
Regularization
weight decay
parameters: null
Other
other
Complementary training that down-weights tokens easily predicted by n-grams
parameters: {"loss_reweighting":"1 - alpha * p_bigram(token)"}
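The listed formula weights each token's loss by 1 - alpha * p_bigram(token), so the neural model's capacity concentrates on tokens the n-gram cannot predict. A sketch; alpha=0.5 is an assumed value, not from the record:

```python
import numpy as np

def complementary_loss(nll, p_bigram, alpha=0.5):
    # Per-token weight from the record's formula:
    # weight = 1 - alpha * p_bigram(token). Bigram-easy tokens
    # (high p_bigram) contribute less to the training gradient.
    w = 1.0 - alpha * p_bigram
    return float(np.mean(w * nll))

nll = np.array([2.0, 2.0])          # equal per-token losses
p_bigram = np.array([0.9, 0.1])     # first token is bigram-easy
loss = complementary_loss(nll, p_bigram)
```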
other
Causal backoff n-gram mixer with entropy-adaptive blending
parameters: null
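Entropy-adaptive blending mixes the n-gram and neural next-token distributions, trusting the n-gram more when it is confident (low entropy). The record gives no parameters, so the functional form and the 0.3 cap below are assumptions, and the backoff across n-gram orders is omitted:

```python
import numpy as np

def entropy_adaptive_blend(p_neural, p_ngram, max_weight=0.3):
    # Mixing weight decays with the n-gram entropy: a peaked
    # (confident) n-gram gets up to max_weight, a flat one nearly 0.
    H = -np.sum(p_ngram * np.log(np.clip(p_ngram, 1e-12, 1.0)))
    lam = max_weight * np.exp(-H)
    p = (1.0 - lam) * p_neural + lam * p_ngram
    return p / p.sum()

p_neural = np.full(4, 0.25)
peaked = np.array([1.0, 0.0, 0.0, 0.0])
flat = np.full(4, 0.25)
p1 = entropy_adaptive_blend(p_neural, peaked)
p2 = entropy_adaptive_blend(p_neural, flat)
```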
other
DDP-safe score-first update protocol with synchronization before cache update
parameters: null

Novel Contributions

  • Post-hash-fix rerun of the mixed quant n-gram record
  • Mixed precision quantization with int5 MLP weights and int6 attention/embedding weights
  • Complementary training to focus the neural model on tokens poorly predicted by n-grams
  • Causal backoff n-gram mixer with entropy-adaptive blending
  • DDP-safe score-first update protocol for multi-GPU evaluation
  • Aligned higher-order n-gram hash ordering between update and score