PR #1608

open

Non-record: 1x H100 SXM5 Explorations

by User123331 (View on GitHub)
val_bpb: 1.3921
Architecture: Transformer
Optimizer:
Artifact Size: 12.8 MB

Training Techniques

Architecture
  • weight tying: Tied embeddings used in the pilot MoS run. (parameters: null)
  • Mixture of Softmax: MoS output head with low-rank factorization. (parameters: {"k":2,"rank":64})
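The entry names a Mixture of Softmax head with k=2 components, rank-64 factorization, and tied embeddings. A minimal NumPy sketch of such a head, assuming illustrative dimensions (`d_model`, `vocab`, and the tanh nonlinearity are not stated in the PR), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the PR only specifies k=2 and rank=64.
d_model, vocab, K, rank = 128, 1000, 2, 64

# Tied embedding: one matrix both embeds tokens and produces output logits.
embed = rng.normal(0, 0.02, (vocab, d_model))

# Low-rank per-component projections: d_model -> rank -> d_model.
down = rng.normal(0, 0.02, (K, d_model, rank))
up = rng.normal(0, 0.02, (K, rank, d_model))

# Gate producing the mixture weights over the K softmaxes.
gate = rng.normal(0, 0.02, (d_model, K))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mos_head(h):
    """h: (batch, d_model) -> (batch, vocab) next-token probabilities."""
    pi = softmax(h @ gate)                             # (batch, K) mixture weights
    probs = np.zeros((h.shape[0], vocab))
    for k in range(K):
        ctx = np.tanh(h @ down[k] @ up[k])             # low-rank context vector
        probs += pi[:, [k]] * softmax(ctx @ embed.T)   # tied output projection
    return probs

p = mos_head(rng.normal(size=(4, d_model)))
```

Because the mixture is taken over probabilities rather than logits, each row of `p` is a convex combination of valid distributions and itself sums to 1, which is the property that lets MoS escape the softmax-bottleneck rank limit of a single output softmax.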
Sequence Length
  • sequence_length (train_length: 2048, eval_length: 2048)
Evaluation
  • sliding window eval (parameters: {"stride":64})
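Sliding-window evaluation with a stride of 64 typically means each window of up to `eval_length` tokens scores only the tokens not covered by the previous window, so every token is evaluated exactly once with as much left context as the window allows. A sketch of the span-planning step, assuming this standard strided scheme (the PR does not spell out its exact variant):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Plan (context_begin, context_end, scored_begin, scored_end) spans.

    Each window sees up to `window` tokens of context but scores only the
    tokens after the previous window's end, so the scored spans partition
    the sequence and no token is counted twice in the loss.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(300, window=100, stride=30)
```

A smaller stride gives each scored token more context (at most `window - stride` tokens are scored with truncated context per step) at the cost of proportionally more forward passes.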
Quantization
  • int8 (bits: 8, scope: all)
Compression
  • zlib (level: null)
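The int8 + zlib pipeline can be sketched as symmetric per-tensor quantization followed by lossless zlib compression of the int8 bytes. This is an assumed scheme (the PR does not state whether scaling is per-tensor or per-channel); only the quantization step is lossy, with per-weight error bounded by half the scale:

```python
import zlib
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: max |w| maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def roundtrip(w):
    """Quantize, zlib-compress, decompress, dequantize."""
    q, scale = quantize_int8(w)
    blob = zlib.compress(q.tobytes())  # lossless coding on top of int8
    q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
    return q2.astype(np.float32) * scale, len(blob)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, (256, 256)).astype(np.float32)
w_hat, nbytes = roundtrip(w)
```

Since zlib is exact, any "minimal quantization degradation" observed after the roundtrip is attributable entirely to the int8 step; the compressed blob is also never larger than the fp32 original here (int8 alone is already a 4x reduction).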
Other
  • Mixture of Softmax pilot run on a 1x H100 SXM5 setup with a 10-minute wallclock budget. (parameters: {"wallclock_seconds":600,"hardware":"NVIDIA H100 SXM5 80GB"})

Novel Contributions

  • Pilot run of Mixture of Softmax (MoS) on 1x H100 SXM5
  • Low-rank MoS factorization with K=2 and rank=64
  • Tied embeddings with int8 roundtrip compression
  • Demonstrated minimal quantization degradation after int8+zlib compression