PR #45

closed

Modal 8xH100 LowerLR FP16Embed 960 (val_bpb 1.22395)

by kiankyars
val_bpb: 1.2240
Architecture: Transformer
Optimizer: (unspecified)
Artifact Size: 15844118 bytes (~15.1 MiB)

Training Techniques

  • Architecture: tied embeddings. Input and output embeddings are tied, with the tied embedding kept at higher precision in the record snapshot. (parameters: null)
  • Other: reduced MLP hidden size to 960 to stay under the 16MB cap. (parameters: {"mlp_hidden": 960})
  • Compression: zlib. (level: null)
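The mixed-precision snapshot described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: every tensor is symmetrically quantized to int8 with a single fp32 scale, except the tied embedding, which is kept at fp16. The parameter names (`embed.weight`, `mlp.w1`) are placeholders.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def snapshot(params: dict) -> dict:
    """Quantize all tensors to int8 except the tied embedding (kept fp16)."""
    out = {}
    for name, w in params.items():
        if name == "embed.weight":  # hypothetical name for the tied embedding
            out[name] = ("fp16", w.astype(np.float16))
        else:
            out[name] = ("int8", quantize_int8(w))
    return out
```

Keeping only the tied embedding at fp16 is a reasonable trade: it is shared by input and output layers, so its precision affects the logits directly, while the remaining weights tolerate int8 with little bpb loss.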

Novel Contributions

  • Uses an 8xH100 Modal single-node torchrun setup with a 600s wallclock cap
  • Keeps tied embeddings at higher precision in the record snapshot
  • Reduces MLP hidden size to 960 to fit under the 16MB submission cap
  • Stores the final artifact as an int8 + zlib-compressed model plus code
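The int8+zlib artifact format implied by the last bullet can be sketched as below. This is an assumed layout, not the PR's actual serializer: a JSON manifest of tensor metadata followed by raw int8 bytes, zlib-compressed at the maximum level, then checked against the 16MB submission cap.

```python
import io
import json
import zlib

import numpy as np

CAP_BYTES = 16 * 1024 * 1024  # 16MB submission cap from the PR description

def pack(tensors: dict) -> bytes:
    """Serialize tensors as [header length | JSON manifest | raw bytes], zlib-compressed."""
    buf = io.BytesIO()
    manifest = []
    for name, t in tensors.items():
        raw = t.tobytes()
        manifest.append({"name": name, "dtype": str(t.dtype),
                         "shape": list(t.shape), "nbytes": len(raw)})
        buf.write(raw)
    header = json.dumps(manifest).encode()
    payload = len(header).to_bytes(4, "little") + header + buf.getvalue()
    return zlib.compress(payload, level=9)

# Hypothetical usage: pack an int8 weight matrix and verify it fits the cap.
blob = pack({"mlp.w": np.zeros((960, 768), dtype=np.int8)})
assert len(blob) <= CAP_BYTES
```

A size check like the final assertion is what makes the mlp_hidden=960 choice actionable: shrinking the MLP hidden size reduces the raw int8 byte count before compression, which is what ultimately keeps the artifact under the cap.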