PR #266
Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb)
Status: open
by User123331
val_bpb: 1.3932
Architecture: Transformer
Optimizer: —
Artifact Size: 12.8 MB
Training Techniques

Architecture
- tied embeddings: Uses tied input/output embeddings in the baseline model. (parameters: null)
- Mixture of Softmax: Replaces the standard tied-embedding softmax with a K=2 mixture of softmaxes to break the softmax bottleneck. (parameters: {"k":2,"rank":64})
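The forward math of the mixture head can be sketched as follows. This is a NumPy illustration under assumptions, not the PR's actual implementation: the per-component projection is assumed to be factored as two rank-64 matrices (`A[k] @ B[k]`), with a tanh nonlinearity and a learned prior head, following the usual mixture-of-softmax formulation; all names and toy sizes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mos_probs(h, E, A, B, W_pi):
    """K-component mixture of softmaxes over a tied embedding matrix E.

    h    : (d,)       final hidden state
    E    : (V, d)     tied input/output embeddings
    A, B : (K, d, r), (K, r, d)  low-rank (rank-r) per-component projections
    W_pi : (K, d)     prior (mixture-weight) head
    """
    pi = softmax(W_pi @ h)                        # (K,) mixture weights
    probs = np.zeros(E.shape[0])
    for k in range(pi.shape[0]):
        hk = np.tanh(A[k] @ (B[k] @ h))           # rank-r component context
        probs += pi[k] * softmax(E @ hk)          # mix in probability space
    return probs

# toy sizes: d=512 hidden, V=1000 vocab, K=2 components, rank r=64
rng = np.random.default_rng(0)
d, V, K, r = 512, 1000, 2, 64
probs = mos_probs(rng.standard_normal(d),
                  rng.standard_normal((V, d)) * 0.02,
                  rng.standard_normal((K, d, r)) * 0.05,
                  rng.standard_normal((K, r, d)) * 0.05,
                  rng.standard_normal((K, d)) * 0.05)
```

Note that the mixing happens in probability space, not logit space; this is what lets the model escape the single-softmax rank bound. The rank-64 factorization keeps the extra parameters at roughly `2*K*d*r` instead of `K*d*d`.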
Compression
- zlib (level: null)
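The int8+zlib roundtrip mentioned below can be sketched like this. This is a hypothetical symmetric per-tensor quantizer, not the PR's exact scheme; it shows why the degradation is bounded: the reconstruction error of each weight is at most half a quantization step, and zlib is lossless over the int8 bytes.

```python
import numpy as np
import zlib

def quantize_int8(w):
    # symmetric per-tensor int8 quantization (hypothetical scheme;
    # the PR's exact quantizer is not shown here)
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

q, scale = quantize_int8(w)
blob = zlib.compress(q.tobytes(), level=9)      # zlib over the int8 bytes

# roundtrip: decompress, dequantize, measure worst-case error
q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
w2 = q2.astype(np.float32) * scale
max_err = float(np.abs(w - w2).max())           # bounded by scale / 2
```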
Novel Contributions
- Applies Mixture of Softmax (MoS) to the baseline 9x512 architecture.
- Uses low-rank factorization with rank 64 to keep parameter overhead minimal.
- Demonstrates that MoS adds negligible artifact overhead while remaining within the 16MB budget.
- Reports minimal quantization degradation after int8+zlib roundtrip.
- Explores the theoretical benefit of lifting the softmax log-probability rank limit from d+1 to K*d+? , closer to the full vocabulary dimensionality.
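The rank argument behind the last point can be checked numerically on toy sizes (these dimensions are illustrative, not the PR's 9x512 configuration). For a single softmax, the N-by-V log-probability matrix is `H @ E.T` minus a per-row normalizer, so its rank is at most d+1; the log of a mixture of softmaxes has no such bound. The mixture weights below are random per context, a simplification of a learned prior head.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# toy sizes: N contexts, V vocab, d hidden, K mixture components
rng = np.random.default_rng(0)
N, V, d, K = 30, 20, 4, 2
H = rng.standard_normal((N, d))
E = rng.standard_normal((V, d))

# single softmax: log P = H E^T - log Z(h) 1^T, hence rank <= d + 1
logp_single = log_softmax(H @ E.T)
rank_single = int(np.linalg.matrix_rank(logp_single))

# mixture of softmaxes: the log of a sum of softmaxes is not low-rank
W = rng.standard_normal((K, d, d))
pi = np.exp(rng.standard_normal((N, K)))
pi /= pi.sum(axis=1, keepdims=True)              # per-context mixture weights
mix = sum(pi[:, k:k + 1] * np.exp(log_softmax(np.tanh(H @ W[k]) @ E.T))
          for k in range(K))
rank_mos = int(np.linalg.matrix_rank(np.log(mix)))
```

With random inputs the single-softmax matrix stays at rank d+1 = 5 while the mixture's log-probability matrix generically reaches a much higher rank, which is the bottleneck-breaking effect the PR targets.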