PR #961

open

Record: 0.0881 BPB — 11L Int5 GPTQ + Order-12 N-gram + Phrase Cache + 65K Chunks

by callithyia
val_bpb
0.0881
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~13.0 MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
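A minimal NumPy sketch of the 8-query/4-KV grouped-query attention, where each KV head is shared by two query heads. Head dimension, weight shapes, and the causal mask are illustrative assumptions; batching and RoPE are omitted:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2:1)."""
    T, d = x.shape
    hd = wq.shape[1] // n_q_heads            # per-head dimension (assumed)
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads          # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # map query head -> shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -1e9), k=1)   # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, -1)
```

Halving the KV heads shrinks the KV projection weights, which matters directly for the ~13 MB artifact budget.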
BigramHash
Bigram hash module used in the architecture.
parameters: {"dimensions":128}
XSA
XSA-4 architectural component.
parameters: {"variant":4}
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"numerator":16,"denominator":64}
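A sketch of partial RoPE with the stated 16/64 split: rotary embeddings are applied to the first 16 of 64 per-head dimensions and the rest pass through unrotated. The rotate-half layout and the 10000 base are common-convention assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16):
    """Rotate only the first `rot_dims` of each head's dimensions (16/64
    here); the remaining dimensions are left position-independent."""
    T, d = x.shape
    half = rot_dims // 2
    pos = np.arange(T)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = pos * freqs                        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```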
SmearGate
SmearGate component in the model architecture.
parameters: null
VE128
VE128 applied on layers 9-10.
parameters: {"layers":[9,10]}
LeakyReLU
LeakyReLU(0.5) squared activation in the MLP.
parameters: {"negative_slope":0.5,"squared":true}
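One plausible reading of "LeakyReLU(0.5) squared" as the MLP activation, sketched below. Note the square makes the output non-negative; a sign-preserving variant (y·|y|) is also conceivable, and the PR does not specify which:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```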
weight tying
Input embedding and output head share one weight matrix.
parameters: null
Weight Averaging
Exponential moving average (EMA) of weights.
parameters: {"decay":0.997}
Stochastic weight averaging (SWA) over checkpoints.
parameters: {"checkpoints":15,"final_warmdown":true}
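Both averaging schemes, sketched over plain parameter dicts (the combination order of EMA and the 15-checkpoint SWA is not stated in the PR and is assumed here to be independent):

```python
def ema_update(ema_params, params, decay=0.997):
    """Per-step EMA of weights: ema <- decay*ema + (1-decay)*w."""
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in params}

def swa_average(checkpoints):
    """Uniform average over the saved checkpoints (15 here), taken
    after the final LR warmdown."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n
            for k in checkpoints[0]}
```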
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true}
Quantization
GPTQ
bits: 5
scope: all
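For orientation only: the sketch below shows the 5-bit symmetric grid that GPTQ quantizes onto, using naive per-row round-to-nearest. GPTQ proper additionally compensates each column's rounding error using a Hessian estimate from calibration data, which this sketch deliberately omits:

```python
import numpy as np

def quantize_5bit_rtn(w):
    """Per-row round-to-nearest onto a symmetric int5 grid [-15, 15].
    Not GPTQ itself: no Hessian-based error propagation."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15
    q = np.clip(np.round(w / scale), -15, 15)
    return q, scale          # dequantize with q * scale
```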
Compression
lzma
level: 9
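The artifact compression step is standard-library LZMA at the maximum preset:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """LZMA-compress the serialized model at preset 9 (slowest, smallest)."""
    return lzma.compress(raw, preset=9)
```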
Evaluation
order-12 n-gram cache
parameters: {"orders":[2,3,4,5,6,7,8,9,10,11,12],"backoff":true,"score_first":true,"backward_looking":true}
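A minimal sketch of the backoff part of the cache: count n-grams of orders 2..12 over tokens already seen in the eval stream, and at prediction time back off from the longest matching context. The `score_first` and `backward_looking` options, and how the cache distribution is blended with the model, are not modeled here:

```python
from collections import defaultdict

class BackoffNgramCache:
    """Orders 2..12 with longest-match backoff (naive counting sketch)."""
    def __init__(self, max_order=12):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))  # ctx -> next -> count

    def update(self, tokens):
        for n in range(2, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[ctx][tokens[i + n - 1]] += 1

    def predict(self, context):
        # Back off from the longest context that has been seen.
        for n in range(self.max_order, 1, -1):
            if len(context) < n - 1:
                continue
            ctx = tuple(context[-(n - 1):])
            if ctx in self.counts:
                nxt = self.counts[ctx]
                total = sum(nxt.values())
                return {t: c / total for t, c in nxt.items()}
        return {}
```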
long phrase cache
parameters: {"probe_lengths":[64,56,48,36,28,20,16]}
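The phrase cache probes progressively shorter suffixes of the current context (the listed probe lengths) against earlier text; on a repeat, it predicts the token that followed last time. The linear scan below is purely illustrative; a real implementation would index suffixes (e.g. by hash):

```python
def phrase_cache_predict(tokens, probe_lengths=(64, 56, 48, 36, 28, 20, 16)):
    """Return the continuation of the longest repeated suffix, or None."""
    for L in probe_lengths:
        if len(tokens) < L + 1:
            continue
        suffix = tokens[-L:]
        # search earlier occurrences, most recent first
        for start in range(len(tokens) - L - 1, -1, -1):
            if tokens[start:start + L] == suffix:
                return tokens[start + L]
    return None
```

Long exact repeats are common in evaluation text, so a confident hit here can assign near-certain probability to the next token.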
temperature sharpening
parameters: {"temperature":0.85}
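Temperature sharpening at T=0.85 rescales the predictive distribution toward its mode, which lowers BPB whenever the top prediction is usually correct:

```python
import numpy as np

def sharpen(probs, temperature=0.85):
    """Raise probabilities to 1/T (T<1 sharpens) and renormalize."""
    p = probs ** (1.0 / temperature)
    return p / p.sum()
```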
65K chunking
parameters: {"chunk_size":65000}
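Chunking is a straightforward split of the eval stream into 65,000-token pieces so each pass fits the 600 s wall-clock budget:

```python
def chunk_tokens(tokens, chunk_size=65000):
    """Split the token stream into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
```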
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
warmdown
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: 65000

Novel Contributions

  • Order-12 backoff n-gram cache combined with a long phrase cache
  • Entropy-adaptive alpha for cache blending
  • Temperature sharpening at T=0.85
  • 65K-token chunking to keep evaluation under the 600s budget
  • Demonstration that cache-heavy evaluation can largely erase large pre-quantization model-quality gaps
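The entropy-adaptive alpha from the list above could look something like the sketch below: trust the cache more when its distribution is low-entropy (confident). The `max_alpha` value and the linear schedule are assumptions; the PR only states that the blend weight adapts to entropy:

```python
import numpy as np

def blend(model_probs, cache_probs, max_alpha=0.9):
    """Convex blend of model and cache distributions, with the cache
    weight scaled by the cache's (normalized) confidence."""
    eps = 1e-12
    p = np.clip(cache_probs, eps, 1.0)
    ent = -(p * np.log(p)).sum() / np.log(len(p))  # normalized entropy in [0, 1]
    alpha = max_alpha * (1.0 - ent)                # confident cache -> high alpha
    return alpha * cache_probs + (1 - alpha) * model_probs
```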