PR #921 (open)

Record: Order-13 Full-Rescore N-gram + 11L Int6 GPTQ — val_bpb 0.0939 (3-seed mean)

by TimPietrusky
val_bpb
0.0939
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.8MB

Training Techniques

Architecture
Gated Attention
Attention mechanism modified with gating.
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
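One common form of attention gating, sketched below for the record's dim=512: a sigmoid gate computed from the layer input multiplies the attention output elementwise. The exact gate placement in this run is not specified, and `W_g` is an illustrative name.

```python
import numpy as np

def gated_attn_out(attn_out, x, W_g):
    """Sigmoid gate computed from the layer input modulates the attention
    output elementwise (one common gating variant; an assumption here)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_g)))   # sigmoid, in (0, 1)
    return gate * attn_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))             # 4 tokens, dim=512 per the record
attn_out = rng.standard_normal((4, 512))
W_g = 0.02 * rng.standard_normal((512, 512))
y = gated_attn_out(attn_out, x, W_g)
```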
Value Residual
Adds value residual connections and value embeddings in later layers.
parameters: {"value_embedding_layers":[8,9,10]}
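The value-residual idea can be sketched as mixing each later layer's value projection with the first layer's values; in practice the mixing weight is typically learned per layer, and the fixed weight below is only illustrative.

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix the current layer's value projection with the first layer's
    values. lam is usually a learned scalar; fixed here to illustrate."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(0)
v_first = rng.standard_normal((16, 512))   # values from the first layer
v_layer = rng.standard_normal((16, 512))   # values from a later layer
v = value_residual(v_layer, v_first)
```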
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
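With the record's 8 query heads sharing 4 KV heads, each KV head is repeated heads // kv_heads = 2 times along the head axis before standard multi-head attention, as in this sketch:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    """GQA: repeat each KV head so that groups of query heads share it.
    (kv has shape (kv_heads, seq, head_dim).)"""
    return np.repeat(kv, heads // kv_heads, axis=0)

k = np.zeros((4, 16, 64))      # (kv_heads, seq, head_dim)
k_full = expand_kv(k)
assert k_full.shape == (8, 16, 64)
```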
Partial RoPE
Uses rotary position embeddings on only part of the head dimension.
parameters: {"dimensions":64}
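A minimal partial-RoPE sketch: rotary embeddings are applied to only the first 64 dimensions of each head (per the record's `dimensions: 64`), with the remaining dimensions passed through untouched. The surrounding head size here is illustrative.

```python
import numpy as np

def partial_rope(x, rot_dims=64, base=10000.0):
    """Rotate the first rot_dims of each head vector; leave the rest as-is."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.ones((16, 128))    # 16 positions, illustrative head dim 128
y = partial_rope(x)
```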
BigramHash
Hash-based bigram embedding with tied embeddings.
parameters: {"vocab":1024,"dim":256}
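The hash-bigram idea can be sketched as follows: each (previous, current) token pair is hashed into one of 1024 buckets and looked up in a small learned table of dim 256 (sizes per the record). The mixing constants below are illustrative, not the run's actual hash.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=1024):
    """Hash each (prev, cur) token pair into a bucket and look up its
    embedding (multiplicative hash constants are illustrative)."""
    prev = np.concatenate([[0], tokens[:-1]])          # previous token, 0-padded
    h = (prev * 2654435761 + tokens * 40503) % vocab   # cheap pair hash
    return table[h]                                    # (T, dim)

rng = np.random.default_rng(0)
table = 0.02 * rng.standard_normal((1024, 256))        # vocab=1024, dim=256
tokens = rng.integers(0, 50257, size=32)
emb = bigram_hash_embed(tokens, table)
```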
weight tying
Input and output embeddings are tied.
parameters: null
LeakyReLU squared
MLP uses a squared LeakyReLU activation.
parameters: {"negative_slope":0.5}
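One plausible reading of "LeakyReLU squared", by analogy with the ReLU² activation common in speedrun-style MLPs: apply LeakyReLU with the record's negative_slope=0.5, then square. The exact handling of the negative branch in this run is an assumption.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU (slope 0.5 per the record) followed by squaring."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 2.0])))  # [1. 0. 4.]
```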
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
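One EMA step with the record's decay of 0.997 is simply:

```python
def ema_update(avg, current, decay=0.997):
    """One EMA step: avg <- 0.997 * avg + 0.003 * current."""
    return decay * avg + (1.0 - decay) * current

avg = 0.0
for _ in range(1000):
    avg = ema_update(avg, 1.0)   # converges toward the (constant) weight
```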
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
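A warmdown schedule in the speedrun style holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps (3500 per the record); base_lr and total_steps below are illustrative.

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    """Constant LR, then a linear 'warmdown' to zero over the last
    warmdown_steps (schedule shape assumed from speedrun convention)."""
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```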
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.05}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: null
Compression
lzma
level: 8
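The artifact-compression step amounts to round-tripping the packed weight buffer through `lzma` at preset 8; the buffer below is a stand-in for the actual packed int6 weights.

```python
import lzma
import numpy as np

# Stand-in for packed int6 weights; lzma is lossless, so the round trip
# recovers the bytes exactly.
packed = (np.arange(4096) % 256).astype(np.uint8).tobytes()
blob = lzma.compress(packed, preset=8)
assert lzma.decompress(blob) == packed
```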
Regularization
magnitude pruning
parameters: {"prune_rate":0.05}
logit softcap
parameters: {"value":20}
Other
other
Two-pass order-13 backward-looking n-gram evaluation cache with entropy-adaptive mixing and full-rescore pass.
parameters: {"order":13,"passes":2,"entropy_center":3,"entropy_scale":2}
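The entropy-adaptive mixing can be sketched as a sigmoid gate on the model's predictive entropy: the weight on the n-gram cache grows as the model becomes more uncertain. The record gives entropy_center=3 and entropy_scale=2; the gate's direction and exact functional form are assumptions.

```python
import numpy as np

def entropy_adaptive_mix(p_model, p_cache, center=3.0, scale=2.0):
    """Mix model and n-gram-cache distributions; mixing weight is a sigmoid
    of (entropy - center) * scale (form assumed, parameters per the record)."""
    h = -np.sum(p_model * np.log2(np.maximum(p_model, 1e-12)))  # bits
    lam = 1.0 / (1.0 + np.exp(-(h - center) * scale))           # cache weight
    return (1.0 - lam) * p_model + lam * p_cache

p_model = np.full(16, 1 / 16)      # maximally uncertain: entropy = 4 bits
p_cache = np.eye(16)[3]            # cache is confident about token 3
mixed = entropy_adaptive_mix(p_model, p_cache)
```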

Novel Contributions

  • Two-pass order-13 backward-looking n-gram eval cache
  • Full-rescore pass using the complete cache without additional forward passes
  • Entropy-adaptive mixing between model probabilities and n-gram cache
  • Int6 GPTQ with descending actorder and dead-column handling
  • Pure NumPy vectorized cache implementation with XOR-of-products hashing and np.bincount updates
  • Artifact compression with lzma to fit int6 model within the submission limit
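The XOR-of-products hashing and `np.bincount` update mentioned above can be sketched as follows: each of the `order=13` preceding tokens is multiplied by its own odd constant and the products are XOR-folded into one bucket index per position, all vectorized in NumPy. The multipliers and bucket count here are illustrative, not the run's actual values.

```python
import numpy as np

def context_hash(tokens, order=13, buckets=1 << 20):
    """Vectorized XOR-of-products hash of the `order` preceding tokens
    (illustrative constants; the run's real hash is not specified)."""
    rng = np.random.default_rng(13)
    mults = rng.integers(1, 1 << 31, size=order, dtype=np.int64) | 1  # odd
    h = np.zeros(len(tokens), dtype=np.int64)
    for k in range(1, order + 1):                       # token k positions back
        shifted = np.concatenate([np.zeros(k, dtype=np.int64), tokens[:-k]])
        h ^= shifted * mults[k - 1]
    return h % buckets

tokens = np.arange(64, dtype=np.int64)
idx = context_hash(tokens)
counts = np.bincount(idx, minlength=1 << 20)   # bincount-style cache update
```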