PR #1538
openrecords: David Ghazaryan — MoE + BigramHash4096 (val_bpb 1.11799)
by davie2009kh
val_bpb: 1.1180
Architecture: Transformer
Optimizer: —
Artifact Size: 15,891,605 bytes
Training Techniques
Architecture
BigramHash
Expanded bigram hash table for richer local context at the embedding stage.
parameters: {"buckets":4096,"dim":96}
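A minimal sketch of what a hashed-bigram embedding table with these parameters could look like. This is illustrative, not the repo's actual code: the hash function, padding convention, and function names here are assumptions; only the 4096 buckets and 96-dim vectors come from the entry.

```python
import numpy as np

BUCKETS, DIM = 4096, 96  # from the entry's parameters

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Illustrative multiplicative hash of the (prev, cur) pair;
    # the record's actual hash may differ.
    return ((prev_tok * 1000003) ^ cur_tok) % BUCKETS

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02

def bigram_features(tokens: list[int]) -> np.ndarray:
    # One 96-dim vector per position, looked up by the hashed bigram
    # ending at that position; these are typically added to the token
    # embeddings to give cheap local context.
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i, tok in enumerate(tokens):
        # Position 0 has no previous token; bucket 0 serves as a pad here.
        b = bigram_bucket(tokens[i - 1], tok) if i > 0 else 0
        out[i] = bigram_table[b]
    return out

feats = bigram_features([5, 17, 17, 99])
```

With hashing, the table stays fixed-size (4096 × 96) regardless of vocabulary size; distinct bigrams can collide into the same bucket, which is the usual trade-off.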
MLP
Mixture-of-Experts MLP with 4 experts and top-2 routing.
parameters: {"experts":4,"top_k":2}
LeakyReLU
MLP activation uses LeakyReLU.
parameters: {"slope":0.5}
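The two MLP entries above (4 experts with top-2 routing, LeakyReLU with slope 0.5) can be sketched together. This is a simplified single-matrix expert and a plain softmax router, assumed for illustration; only the expert count, top-k, and slope come from the entry.

```python
import numpy as np

EXPERTS, TOP_K, D = 4, 2, 8  # D is a toy hidden size for the sketch
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((D, EXPERTS)) * 0.1
W_experts = rng.standard_normal((EXPERTS, D, D)) * 0.1

def leaky_relu(h, slope=0.5):
    # LeakyReLU with the entry's slope of 0.5.
    return np.where(h > 0, h, slope * h)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    # x: (tokens, D). The router scores all 4 experts per token, keeps the
    # top 2, renormalizes their scores with a softmax, and mixes the two
    # expert outputs with those weights.
    logits = x @ W_gate                       # (tokens, EXPERTS)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = np.argsort(logits[t])[-TOP_K:]  # indices of the top-2 experts
        w = softmax(logits[t][sel])
        for weight, e in zip(w, sel):
            out[t] += weight * leaky_relu(x[t] @ W_experts[e])
    return out

y = moe_forward(rng.standard_normal((3, D)))
```

Each token activates only 2 of the 4 expert MLPs, so parameter count grows 4× while per-token compute grows roughly 2×.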
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
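A sketch of partial RoPE under the listed parameters: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The pairing scheme and base frequency here are conventional assumptions, not taken from the record.

```python
import numpy as np

ROT_DIMS, HEAD_DIM = 16, 64  # rotate 16 of 64 dims, per the entry

def partial_rope(x, pos, base=10000.0):
    # x: (HEAD_DIM,) query/key vector at sequence position `pos`.
    # Dims [0:8] pair with dims [8:16]; each pair is rotated by a
    # position-dependent angle. Dims [16:64] are left untouched.
    half = ROT_DIMS // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half].copy(), x[half:ROT_DIMS].copy()
    out = x.astype(float).copy()
    out[:half] = x1 * cos - x2 * sin
    out[half:ROT_DIMS] = x1 * sin + x2 * cos
    return out

rotated = partial_rope(np.ones(HEAD_DIM), pos=5)
```

Rotating only a slice leaves the remaining 48 dimensions position-agnostic, a common way to blend positional and content-only channels.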
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
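The EMA update with the listed decay of 0.997 is one line per parameter; a minimal sketch (function name illustrative):

```python
def ema_update(ema_params, params, decay=0.997):
    # Shadow weights track the training weights with exponential decay:
    # ema <- decay * ema + (1 - decay) * current
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy usage: one update step moves the shadow 0.3% toward the live weight.
shadow = ema_update([1.0], [0.0])
```

The averaged (shadow) weights, not the raw training weights, are typically what gets evaluated and shipped.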
Regularization
LN scale
Per-layer LayerNorm scale set to 1/sqrt(layer+1).
parameters: {"formula":"1/sqrt(layer+1)"}
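The LN-scale formula above is depth-dependent: deeper layers get smaller gains. A tiny sketch (0-indexed layer assumed; whether the scale is a fixed value or an initialization is not stated in the entry):

```python
import math

def ln_scale(layer: int) -> float:
    # LayerNorm scale for layer `layer` per the entry's formula
    # 1/sqrt(layer+1); damps the residual contribution of deeper layers.
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(i) for i in range(11)]  # 11 layers, per the XSA entry
```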
Quantization
GPTQ
bits: null
scope: full model
Compression
lzma
level: null
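Compressing the final artifact with Python's stdlib `lzma` is straightforward; since the level is listed as null, this sketch just uses the library default rather than guessing a preset:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # Default preset; the record does not specify a compression level.
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)

blob = compress_artifact(b"model weights go here" * 100)
```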
LR Schedule
warmdown
Learning-rate warmdown over the final 4,000 iterations.
parameters: {"warmdown_iters":4000}
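A sketch of a warmdown schedule: the entry only fixes the 4000-iteration warmdown length, so the constant-then-linear-decay shape here is an assumption.

```python
def lr_at(step: int, total_iters: int, base_lr: float,
          warmdown_iters: int = 4000) -> float:
    # Hold base_lr, then decay linearly to zero over the last
    # `warmdown_iters` steps (shape assumed; length from the entry).
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```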
Novel Contributions
- BigramHash4096 — expanded from SOTA's 3072 to 4096 buckets
- MoE MLP — first Mixture-of-Experts exploration in this repo (4 experts, top-2 routing)