PR #266
Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb)
Status: open
by User123331
val_bpb: 1.3932
Architecture: Transformer
Optimizer: —
Artifact Size: 12.8 MB
Training Techniques

Architecture
- tied embeddings: Uses tied input/output embeddings in the baseline model. (parameters: null)
- Mixture of Softmax: Replaces the standard tied-embedding softmax with a K=2 mixture of softmaxes to break the softmax bottleneck. (parameters: {"k":2,"rank":64})
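The forward math of the mixture head can be sketched as follows. This is a NumPy illustration under assumptions, not the PR's actual implementation: the per-component projection is assumed to be factored as two rank-64 matrices (`A[k] @ B[k]`), with a tanh nonlinearity and a learned prior head, following the usual mixture-of-softmax formulation; all names and toy sizes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mos_probs(h, E, A, B, W_pi):
    """K-component mixture of softmaxes over a tied embedding matrix E.

    h    : (d,)       final hidden state
    E    : (V, d)     tied input/output embeddings
    A, B : (K, d, r), (K, r, d)  low-rank (rank-r) per-component projections
    W_pi : (K, d)     prior (mixture-weight) head
    """
    pi = softmax(W_pi @ h)                        # (K,) mixture weights
    probs = np.zeros(E.shape[0])
    for k in range(pi.shape[0]):
        hk = np.tanh(A[k] @ (B[k] @ h))           # rank-r component context
        probs += pi[k] * softmax(E @ hk)          # mix in probability space
    return probs

# toy sizes: d=512 hidden, V=1000 vocab, K=2 components, rank r=64
rng = np.random.default_rng(0)
d, V, K, r = 512, 1000, 2, 64
probs = mos_probs(rng.standard_normal(d),
                  rng.standard_normal((V, d)) * 0.02,
                  rng.standard_normal((K, d, r)) * 0.05,
                  rng.standard_normal((K, r, d)) * 0.05,
                  rng.standard_normal((K, d)) * 0.05)
```

Note that the mixing happens in probability space, not logit space; this is what lets the model escape the single-softmax rank bound. The rank-64 factorization keeps the extra parameters at roughly `2*K*d*r` instead of `K*d*d`.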
Compression
- zlib (level: null)
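The int8+zlib roundtrip mentioned below can be sketched like this. This is a hypothetical symmetric per-tensor quantizer, not the PR's exact scheme; it shows why the degradation is bounded: the reconstruction error of each weight is at most half a quantization step, and zlib is lossless over the int8 bytes.

```python
import numpy as np
import zlib

def quantize_int8(w):
    # symmetric per-tensor int8 quantization (hypothetical scheme;
    # the PR's exact quantizer is not shown here)
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

q, scale = quantize_int8(w)
blob = zlib.compress(q.tobytes(), level=9)      # zlib over the int8 bytes

# roundtrip: decompress, dequantize, measure worst-case error
q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
w2 = q2.astype(np.float32) * scale
max_err = float(np.abs(w - w2).max())           # bounded by scale / 2
```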
Novel Contributions
- Applies Mixture of Softmax (MoS) to the baseline 9x512 architecture.
- Uses low-rank factorization with rank 64 to keep parameter overhead minimal.
- Demonstrates that MoS adds negligible artifact overhead while remaining within the 16MB budget.
- Reports minimal quantization degradation after int8+zlib roundtrip.
- Explores the theoretical benefit of lifting the softmax log-probability rank limit from d+1 to K*d+? , closer to the full vocabulary dimensionality.
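The rank argument behind the last point can be checked numerically on toy sizes (these dimensions are illustrative, not the PR's 9x512 configuration). For a single softmax, the N-by-V log-probability matrix is `H @ E.T` minus a per-row normalizer, so its rank is at most d+1; the log of a mixture of softmaxes has no such bound. The mixture weights below are random per context, a simplification of a learned prior head.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# toy sizes: N contexts, V vocab, d hidden, K mixture components
rng = np.random.default_rng(0)
N, V, d, K = 30, 20, 4, 2
H = rng.standard_normal((N, d))
E = rng.standard_normal((V, d))

# single softmax: log P = H E^T - log Z(h) 1^T, hence rank <= d + 1
logp_single = log_softmax(H @ E.T)
rank_single = int(np.linalg.matrix_rank(logp_single))

# mixture of softmaxes: the log of a sum of softmaxes is not low-rank
W = rng.standard_normal((K, d, d))
pi = np.exp(rng.standard_normal((N, K)))
pi /= pi.sum(axis=1, keepdims=True)              # per-context mixture weights
mix = sum(pi[:, k:k + 1] * np.exp(log_softmax(np.tanh(H @ W[k]) @ E.T))
          for k in range(K))
rank_mos = int(np.linalg.matrix_rank(np.log(mix)))
```

With random inputs the single-softmax matrix stays at rank d+1 = 5 while the mixture's log-probability matrix generically reaches a much higher rank, which is the bottleneck-breaking effect the PR targets.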