PR #45

closed

Modal 8xH100 LowerLR FP16Embed 960 (val_bpb 1.22395)

by kiankyars
val_bpb: 1.2240
Architecture: Transformer
Optimizer: (unspecified)
Artifact Size: 15844118 bytes (~15.1 MiB)

Training Techniques

  • Architecture: tied embeddings. Input and output embeddings are tied, with the tied embedding kept at higher precision in the record snapshot. (parameters: null)
  • Other: reduced MLP hidden size to 960 to stay under the 16MB cap. (parameters: {"mlp_hidden": 960})
  • Compression: zlib. (level: null)
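The mixed-precision snapshot described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: every tensor is symmetrically quantized to int8 with a single fp32 scale, except the tied embedding, which is kept at fp16. The parameter names (`embed.weight`, `mlp.w1`) are placeholders.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def snapshot(params: dict) -> dict:
    """Quantize all tensors to int8 except the tied embedding (kept fp16)."""
    out = {}
    for name, w in params.items():
        if name == "embed.weight":  # hypothetical name for the tied embedding
            out[name] = ("fp16", w.astype(np.float16))
        else:
            out[name] = ("int8", quantize_int8(w))
    return out
```

Keeping only the tied embedding at fp16 is a reasonable trade: it is shared by input and output layers, so its precision affects the logits directly, while the remaining weights tolerate int8 with little bpb loss.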

Novel Contributions

  • Uses an 8xH100 Modal single-node torchrun setup with a 600s wallclock cap
  • Keeps tied embeddings at higher precision in the record snapshot
  • Reduces MLP hidden size to 960 to fit under the 16MB submission cap
  • Stores the final artifact as an int8 + zlib-compressed model plus code
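The int8+zlib artifact format implied by the last bullet can be sketched as below. This is an assumed layout, not the PR's actual serializer: a JSON manifest of tensor metadata followed by raw int8 bytes, zlib-compressed at the maximum level, then checked against the 16MB submission cap.

```python
import io
import json
import zlib

import numpy as np

CAP_BYTES = 16 * 1024 * 1024  # 16MB submission cap from the PR description

def pack(tensors: dict) -> bytes:
    """Serialize tensors as [header length | JSON manifest | raw bytes], zlib-compressed."""
    buf = io.BytesIO()
    manifest = []
    for name, t in tensors.items():
        raw = t.tobytes()
        manifest.append({"name": name, "dtype": str(t.dtype),
                         "shape": list(t.shape), "nbytes": len(raw)})
        buf.write(raw)
    header = json.dumps(manifest).encode()
    payload = len(header).to_bytes(4, "little") + header + buf.getvalue()
    return zlib.compress(payload, level=9)

# Hypothetical usage: pack an int8 weight matrix and verify it fits the cap.
blob = pack({"mlp.w": np.zeros((960, 768), dtype=np.int8)})
assert len(blob) <= CAP_BYTES
```

A size check like the final assertion is what makes the mlp_hidden=960 choice actionable: shrinking the MLP hidden size reduces the raw int8 byte count before compression, which is what ultimately keeps the artifact under the cap.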