val_bpb: 1.4775
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 7.9 MB
Training Techniques
Architecture
BigramHash
Expanded bigram hash embedding table to capture richer local context.
parameters: {"vocab_size":4096}
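A minimal sketch of how a bigram hash embedding lookup can work, assuming the common hash-and-fold scheme; only the table size of 4096 comes from this report, while the mixing constant, embedding width, and BOS handling are illustrative:

```python
import numpy as np

VOCAB_HASH = 4096  # expanded from 2048 per the report
D_MODEL = 64       # illustrative embedding width

rng = np.random.default_rng(0)
bigram_table = rng.normal(size=(VOCAB_HASH, D_MODEL)).astype(np.float32)

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    """Mix adjacent token ids and fold into the table size (hash is a stand-in)."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % VOCAB_HASH

def bigram_embed(tokens: list[int]) -> np.ndarray:
    """One hashed-bigram vector per position; position 0 pairs with an assumed BOS id 0."""
    prev = [0] + tokens[:-1]
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]

emb = bigram_embed([5, 17, 42])
```

The hashed-bigram vectors would typically be added to the ordinary unigram token embeddings to inject local two-token context.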
RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
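The 16/64 split can be sketched as partial RoPE that rotates only the first 16 of 64 head dimensions and passes the rest through unchanged; which dimensions rotate, the pairing convention, and the frequency base of 10000 are assumptions:

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16  # the 16/64 split is from the report

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """x: (seq, HEAD_DIM). Rotate pairs within the first ROT_DIM dims, copy the rest."""
    half = ROT_DIM // 2
    freqs = 10000.0 ** (-np.arange(half) / half)  # (half,)
    ang = pos[:, None] * freqs[None, :]           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]      # NeoX-style half/half pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIM:]], axis=-1)

seq = 4
x = np.random.default_rng(1).normal(size=(seq, HEAD_DIM))
out = partial_rope(x, np.arange(seq))
```

At position 0 the rotation is the identity, and the unrotated 48 dimensions carry no positional signal at any position.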
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
MLP3x
MLP with a 3x hidden expansion and squared LeakyReLU activation.
parameters: {"activation":"LeakyReLU(0.5)^2"}
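A sketch of the MLP block, assuming "three-times" means a 3x hidden expansion; the weights are random stand-ins, and only the expansion factor and the LeakyReLU(0.5)^2 activation come from the report:

```python
import numpy as np

D = 32  # illustrative model width

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(0.5) followed by squaring, per the report's activation."""
    y = np.where(x >= 0, x, slope * x)
    return y * y  # squaring makes the output non-negative and smooths the kink

rng = np.random.default_rng(2)
W_in = rng.normal(size=(D, 3 * D)) / np.sqrt(D)       # expand to 3x width
W_out = rng.normal(size=(3 * D, D)) / np.sqrt(3 * D)  # project back down

def mlp3x(x: np.ndarray) -> np.ndarray:
    return leaky_relu_sq(x @ W_in) @ W_out

out = mlp3x(rng.normal(size=(5, D)))
```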
Weight Averaging
EMA
parameters: {"schedule":"cosine","start_decay":0.99,"end_decay":0.999}
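The cosine EMA schedule can be sketched as a half-cosine ramp of the decay from 0.99 to 0.999 over training; the endpoints come from the report, while the exact ramp shape and the normalization by total steps are assumptions:

```python
import math

START, END = 0.99, 0.999  # start_decay / end_decay from the report

def ema_decay(step: int, total_steps: int) -> float:
    """Decay follows an assumed half-cosine ramp from START to END."""
    t = min(step / total_steps, 1.0)
    ramp = 0.5 * (1.0 - math.cos(math.pi * t))  # 0 -> 1 as t goes 0 -> 1
    return START + (END - START) * ramp

def ema_update(ema_w: float, w: float, step: int, total: int) -> float:
    d = ema_decay(step, total)
    return d * ema_w + (1.0 - d) * w
```

The low early decay lets the average track the fast-moving weights at the start; the high late decay smooths heavily near convergence.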
SWA
parameters: {"frequency":50}
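A sketch of stochastic weight averaging at the stated frequency, assuming "frequency: 50" means folding the current weights into a running mean every 50 optimizer steps:

```python
FREQ = 50  # averaging frequency from the report

class SWA:
    def __init__(self):
        self.avg = None
        self.n = 0

    def maybe_update(self, step: int, weights: list[float]) -> None:
        if step % FREQ != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]

swa = SWA()
for step in range(1, 151):
    w = [float(step)]  # stand-in for the model's weights at this step
    swa.maybe_update(step, w)
```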
Quantization
GPTQ-lite int6
bits: 6
scope: all
QAT
bits: 6
scope: all
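Both quantization entries above use symmetric 6-bit integers. A round-to-nearest sketch of the shared fake-quantization step follows; the GPTQ-lite error-compensation pass and the straight-through backward used in QAT are not shown, and per-tensor scaling is an assumption:

```python
import numpy as np

BITS = 6                     # bit width from the report
QMAX = 2 ** (BITS - 1) - 1   # 31 for symmetric int6

def fake_quant(w: np.ndarray) -> np.ndarray:
    """Quantize to int6 and immediately dequantize, as in a QAT forward pass."""
    scale = np.abs(w).max() / QMAX
    return np.round(w / scale).clip(-QMAX - 1, QMAX) * scale

w = np.array([0.013, -0.4, 0.93, -1.0], dtype=np.float32)
w_q = fake_quant(w)
```

During QAT the network trains against exactly this rounding error, so the final int6 export loses little accuracy.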
Compression
lzma
level: 9
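The compression step maps directly onto Python's standard `lzma` module at preset 9; the exact serialization pipeline around it is not specified in the report:

```python
import lzma

# Pack the serialized artifact bytes with LZMA at the report's preset 9
# (highest compression, slowest). The payload here is a stand-in.
data = b"example model artifact bytes " * 100
packed = lzma.compress(data, preset=9)
restored = lzma.decompress(packed)
```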
Other
Late QAT trigger
Lowers the late-QAT activation threshold so quantization-aware training switches on sooner during warmdown.
parameters: {"threshold":0.1}
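A sketch of the earlier late-QAT trigger, assuming the 0.10 threshold (lowered from 0.15 per the contributions list) is compared against the remaining fraction of training steps during warmdown:

```python
THRESHOLD = 0.10  # lowered from 0.15; the quantity it gates is an assumption

def qat_active(step: int, total_steps: int) -> bool:
    """QAT turns on once the remaining fraction of training drops below THRESHOLD."""
    remaining = 1.0 - step / total_steps
    return remaining <= THRESHOLD
```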
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
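The layerwise LN scale can be sketched as multiplying each layer's normalized output by 1/sqrt(layer+1); 0-indexed layers and a parameter-free LayerNorm are assumptions:

```python
import numpy as np

def ln_scale(layer: int) -> float:
    """Per-layer scale 1/sqrt(layer+1) from the report; deeper layers shrink."""
    return 1.0 / np.sqrt(layer + 1)

def scaled_layernorm(x: np.ndarray, layer: int, eps: float = 1e-5) -> np.ndarray:
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return ln_scale(layer) * (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(3).normal(size=(2, 8))
y0 = scaled_layernorm(x, layer=0)
y3 = scaled_layernorm(x, layer=3)
```

Damping deeper layers this way keeps residual-stream contributions roughly balanced across depth.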
Novel Contributions
- Expanded BigramHash vocabulary from 2048 to 4096
- Replaced fixed EMA decay with a cosine EMA schedule from 0.99 to 0.999
- Activated late QAT earlier by lowering the threshold from 0.15 to 0.10
- Increased LZMA compression preset from 6 to 9
- Used ShinkaEvolve with GPT-5.4 and Gemini 3 Pro as mutation operators