PR #1538
openrecords: David Ghazaryan — MoE + BigramHash4096 (val_bpb 1.11799)
by davie2009kh
val_bpb: 1.1180
Architecture: Transformer
Optimizer: —
Artifact Size: 15,891,605 bytes
Training Techniques
Architecture
BigramHash
Expanded bigram hash table for richer local context at the embedding stage.
parameters: {"buckets":4096,"dim":96}
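A minimal sketch of what a hashed-bigram embedding table with these parameters could look like. This is illustrative, not the repo's actual code: the hash function, padding convention, and function names here are assumptions; only the 4096 buckets and 96-dim vectors come from the entry.

```python
import numpy as np

BUCKETS, DIM = 4096, 96  # from the entry's parameters

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Illustrative multiplicative hash of the (prev, cur) pair;
    # the record's actual hash may differ.
    return ((prev_tok * 1000003) ^ cur_tok) % BUCKETS

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02

def bigram_features(tokens: list[int]) -> np.ndarray:
    # One 96-dim vector per position, looked up by the hashed bigram
    # ending at that position; these are typically added to the token
    # embeddings to give cheap local context.
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i, tok in enumerate(tokens):
        # Position 0 has no previous token; bucket 0 serves as a pad here.
        b = bigram_bucket(tokens[i - 1], tok) if i > 0 else 0
        out[i] = bigram_table[b]
    return out

feats = bigram_features([5, 17, 17, 99])
```

With hashing, the table stays fixed-size (4096 × 96) regardless of vocabulary size; distinct bigrams can collide into the same bucket, which is the usual trade-off.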
MLP
Mixture-of-Experts MLP with 4 experts and top-2 routing.
parameters: {"experts":4,"top_k":2}
LeakyReLU
MLP activation uses LeakyReLU.
parameters: {"slope":0.5}
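The two MLP entries above (4 experts with top-2 routing, LeakyReLU with slope 0.5) can be sketched together. This is a simplified single-matrix expert and a plain softmax router, assumed for illustration; only the expert count, top-k, and slope come from the entry.

```python
import numpy as np

EXPERTS, TOP_K, D = 4, 2, 8  # D is a toy hidden size for the sketch
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((D, EXPERTS)) * 0.1
W_experts = rng.standard_normal((EXPERTS, D, D)) * 0.1

def leaky_relu(h, slope=0.5):
    # LeakyReLU with the entry's slope of 0.5.
    return np.where(h > 0, h, slope * h)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    # x: (tokens, D). The router scores all 4 experts per token, keeps the
    # top 2, renormalizes their scores with a softmax, and mixes the two
    # expert outputs with those weights.
    logits = x @ W_gate                       # (tokens, EXPERTS)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = np.argsort(logits[t])[-TOP_K:]  # indices of the top-2 experts
        w = softmax(logits[t][sel])
        for weight, e in zip(w, sel):
            out[t] += weight * leaky_relu(x[t] @ W_experts[e])
    return out

y = moe_forward(rng.standard_normal((3, D)))
```

Each token activates only 2 of the 4 expert MLPs, so parameter count grows 4× while per-token compute grows roughly 2×.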
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
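A sketch of partial RoPE under the listed parameters: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The pairing scheme and base frequency here are conventional assumptions, not taken from the record.

```python
import numpy as np

ROT_DIMS, HEAD_DIM = 16, 64  # rotate 16 of 64 dims, per the entry

def partial_rope(x, pos, base=10000.0):
    # x: (HEAD_DIM,) query/key vector at sequence position `pos`.
    # Dims [0:8] pair with dims [8:16]; each pair is rotated by a
    # position-dependent angle. Dims [16:64] are left untouched.
    half = ROT_DIMS // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half].copy(), x[half:ROT_DIMS].copy()
    out = x.astype(float).copy()
    out[:half] = x1 * cos - x2 * sin
    out[half:ROT_DIMS] = x1 * sin + x2 * cos
    return out

rotated = partial_rope(np.ones(HEAD_DIM), pos=5)
```

Rotating only a slice leaves the remaining 48 dimensions position-agnostic, a common way to blend positional and content-only channels.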
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
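The EMA update with the listed decay of 0.997 is one line per parameter; a minimal sketch (function name illustrative):

```python
def ema_update(ema_params, params, decay=0.997):
    # Shadow weights track the training weights with exponential decay:
    # ema <- decay * ema + (1 - decay) * current
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy usage: one update step moves the shadow 0.3% toward the live weight.
shadow = ema_update([1.0], [0.0])
```

The averaged (shadow) weights, not the raw training weights, are typically what gets evaluated and shipped.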
Regularization
LN scale
Per-layer LayerNorm scale set to 1/sqrt(layer+1).
parameters: {"formula":"1/sqrt(layer+1)"}
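The LN-scale formula above is depth-dependent: deeper layers get smaller gains. A tiny sketch (0-indexed layer assumed; whether the scale is a fixed value or an initialization is not stated in the entry):

```python
import math

def ln_scale(layer: int) -> float:
    # LayerNorm scale for layer `layer` per the entry's formula
    # 1/sqrt(layer+1); damps the residual contribution of deeper layers.
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(i) for i in range(11)]  # 11 layers, per the XSA entry
```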
Quantization
GPTQ
bits: null
scope: full model
Compression
lzma
level: null
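Compressing the final artifact with Python's stdlib `lzma` is straightforward; since the level is listed as null, this sketch just uses the library default rather than guessing a preset:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # Default preset; the record does not specify a compression level.
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)

blob = compress_artifact(b"model weights go here" * 100)
```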
LR Schedule
warmdown
Learning-rate warmdown over the final 4,000 iterations.
parameters: {"warmdown_iters":4000}
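A sketch of a warmdown schedule: the entry only fixes the 4000-iteration warmdown length, so the constant-then-linear-decay shape here is an assumption.

```python
def lr_at(step: int, total_iters: int, base_lr: float,
          warmdown_iters: int = 4000) -> float:
    # Hold base_lr, then decay linearly to zero over the last
    # `warmdown_iters` steps (shape assumed; length from the entry).
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```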
Novel Contributions
- BigramHash4096 — expanded from SOTA's 3072 to 4096 buckets
- MoE MLP — first Mixture-of-Experts exploration in this repo (4 experts, top-2 routing)