| Field | Value |
| --- | --- |
| val_bpb | 1.3921 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 12.8 MB |
Training Techniques
Architecture
- weight tying: Tied embeddings used in the pilot MoS run (sketch below).
  - parameters: null
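A minimal PyTorch sketch of the tied-embedding setup, assuming a standard embedding-plus-linear LM head (module names are illustrative, not from the run):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Skeleton decoder-only LM showing tied input/output embeddings."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights: the output projection reuses the embedding matrix,
        # so only one (vocab_size, d_model) tensor is stored and trained.
        self.lm_head.weight = self.embed.weight
```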
- Mixture of Softmax: MoS output head with low-rank factorization (sketch below).
  - parameters: {"k": 2, "rank": 64}
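A hedged sketch of a Mixture-of-Softmaxes head with low-rank per-component projections at the recorded k=2, rank=64; the run's exact factorization and initialization are not documented, so shapes and init here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankMoSHead(nn.Module):
    """Mixture-of-Softmaxes output head with low-rank component projections.

    Sketch only: each of the k components gets its own rank-`rank`
    projection of the hidden state, all feed a shared vocab projection,
    and the k softmax distributions are mixed by a learned prior.
    """
    def __init__(self, d_model: int, vocab_size: int, k: int = 2, rank: int = 64):
        super().__init__()
        self.prior = nn.Linear(d_model, k)  # mixture weights pi_k(h)
        self.down = nn.Parameter(torch.randn(k, d_model, rank) * 0.02)
        self.up = nn.Parameter(torch.randn(k, rank, d_model) * 0.02)
        self.vocab = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, T, d_model)
        pi = F.softmax(self.prior(h), dim=-1)                         # (B, T, k)
        hk = torch.einsum('btd,kdr->btkr', h, self.down)              # low-rank down-projection
        hk = torch.tanh(torch.einsum('btkr,krd->btkd', hk, self.up))  # back to d_model
        probs = torch.einsum('btk,btkv->btv', pi,
                             F.softmax(self.vocab(hk), dim=-1))       # mix k softmaxes
        return torch.log(probs.clamp_min(1e-9))                       # log-probs for NLL
```

Mixing happens in probability space rather than logit space, which is what lets MoS escape the single-softmax rank bottleneck; tying `vocab.weight` to the embedding matrix would combine this with the weight-tying entry above.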
Sequence Length
- sequence_length
  - train_length: 2048
  - eval_length: 2048
Evaluation
- sliding window eval (sketch below)
  - parameters: {"stride": 64}
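A sketch of sliding-window evaluation at the recorded stride of 64 with the 2048-token eval_length above; each window scores only its final `stride` tokens, so almost every token is predicted with full left context (byte-level vocabulary and the harness's exact batching are assumptions):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window: int = 2048, stride: int = 64) -> float:
    """Bits-per-byte over a byte-level token stream via sliding windows."""
    total_nll, total_scored = 0.0, 0
    for start in range(0, len(tokens) - window, stride):
        chunk = torch.tensor(tokens[start:start + window + 1])
        logits = model(chunk[:-1].unsqueeze(0)).squeeze(0)  # (window, vocab)
        targets = chunk[1:]
        # Score only the final `stride` positions; earlier positions are
        # scored by neighbouring windows that give them more context.
        nll = F.cross_entropy(logits[-stride:], targets[-stride:], reduction='sum')
        total_nll += nll.item()
        total_scored += stride
    return total_nll / total_scored / math.log(2)  # nats -> bits per byte
```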
Quantization
- int8 (sketch below)
  - bits: 8
  - scope: all
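One plausible reading of the int8 roundtrip, as symmetric per-tensor quantization applied to every weight tensor (scope: all); per-channel scales or zero points, if used, are not recorded:

```python
import numpy as np

def int8_roundtrip(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-tensor int8 quantize + dequantize."""
    scale = float(np.abs(w).max()) / 127.0 or 1.0   # fall back to 1.0 for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, q.astype(np.float32) * scale          # int8 payload + dequantized weights
```

Re-running eval on the dequantized weights is presumably what the "minimal quantization degradation" claim under Novel Contributions measures.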
Compression
- zlib (sketch below)
  - level: null
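The artifact size then comes from zlib-compressing the int8 payload; `level: null` is read here as zlib's default compression level, an assumption:

```python
import zlib

def artifact_size_mb(int8_tensors) -> float:
    """zlib-compress concatenated int8 weight bytes and report size in MB."""
    blob = b"".join(q.tobytes() for q in int8_tensors)
    return len(zlib.compress(blob)) / 1e6   # default level, per `level: null` above
```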
Other
- other: Mixture of Softmax pilot run on a 1x H100 SXM5 setup with a 10-minute wallclock budget.
  - parameters: {"wallclock_seconds": 600, "hardware": "NVIDIA H100 SXM5 80GB"}
Novel Contributions
- Pilot run of Mixture of Softmax (MoS) on 1x H100 SXM5
- Low-rank MoS factorization with k=2 and rank=64
- Tied embeddings with int8 roundtrip compression
- Demonstrated minimal quantization degradation after int8+zlib compression