PR #41

closed

Add Modal 8xH100 timed validation non-record submission

by kiankyars
val_bpb: 1.2296
Architecture: Transformer
Optimizer:
Artifact Size: 15,853,604 bytes

Training Techniques

Quantization
int8 (bits: 8, scope: all)
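The int8 tag above implies weights stored as 8-bit integers plus a scale factor. The PR does not show the exact scheme, so the following is a minimal sketch assuming symmetric per-tensor quantization; the function names are illustrative, not the submission's code:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: one float scale, values clamped to [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values and the scale."""
    return [v * scale for v in q]

q, s = quantize_int8([0.5, -1.27, 0.01])
restored = dequantize_int8(q, s)  # approximately [0.5, -1.27, 0.01]
```

With scope "all", every weight tensor would be stored this way, which is how a Transformer of this size can fit in a ~15.9 MB artifact.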
Architecture
weight tying: tied output and input embeddings (parameters: null)
KV head count: fewer KV heads than attention heads (parameters: {"num_heads":8,"num_kv_heads":4})
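The KV head tag describes grouped-query attention: 8 query heads share 4 key/value heads, halving the KV-cache size relative to standard multi-head attention. A minimal sketch of the head-to-group mapping (the function name is illustrative):

```python
# Grouped-query attention: with 8 query heads and 4 KV heads, each KV head
# serves num_heads // num_kv_heads = 2 consecutive query heads.
num_heads, num_kv_heads = 8, 4
group_size = num_heads // num_kv_heads

def kv_head_for(query_head):
    """Index of the shared KV head a given query head attends with."""
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(num_heads)]
```

Here `mapping` pairs query heads (0, 1), (2, 3), (4, 5), (6, 7) with KV heads 0 through 3 respectively.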
Sequence Length
sequence_length (train_length: 1024, eval_length: null)
Compression
zlib (level: null)
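The zlib tag (level: null suggests the default level) points at compressing the serialized checkpoint to shrink the artifact. A hedged sketch of the lossless round trip with Python's stdlib zlib; the weight data here is a placeholder, not the submission's checkpoint:

```python
import pickle
import zlib

# Placeholder standing in for real checkpoint tensors.
weights = {"wte.weight": [0.1] * 4096}

raw = pickle.dumps(weights)
packed = zlib.compress(raw)  # level omitted -> zlib's default compression level
assert zlib.decompress(packed) == raw  # compression is lossless
```

Because decompression is exact, zlib reduces artifact bytes without touching `val_bpb`.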
Other
Timed validation run on Modal 8xH100 with a 600-second wallclock cap and single-node torchrun (parameters: {"hardware":"8xH100","wallclock_seconds":600,"nproc_per_node":8})
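The timed run enforces a 600-second wallclock cap around a single-node torchrun launch. The actual launch script is not shown in this PR excerpt; a hypothetical wrapper using a subprocess timeout could look like:

```python
import subprocess
import sys

def run_with_wallclock_cap(cmd, seconds):
    """Run cmd; return True if it finished inside the cap, False on timeout."""
    try:
        subprocess.run(cmd, timeout=seconds, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False

# Hypothetical reconstruction of the launch (not taken from the PR):
# run_with_wallclock_cap(
#     ["torchrun", "--standalone", "--nproc_per_node", "8", "train_gpt.py"], 600
# )

# Demonstrate the cap with a command that deliberately overruns a 1-second limit:
finished = run_with_wallclock_cap(
    [sys.executable, "-c", "import time; time.sleep(5)"], 1
)
```

`subprocess.run` kills the child when the timeout expires, which matches the hard wallclock semantics described above.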

Novel Contributions

  • Added a non-record submission folder for a timed validation run on Modal 8xH100.
  • Preserved the exact train_gpt.py snapshot, train.log, and submission.json used for the run.
  • Documented the 600-second Modal setup and the reason the submission is on the non-record track.
  • Used a published FineWeb sp1024 export staged in a persistent Modal Volume.
  • Submitted a model that fit under the 16,000,000-byte artifact cap but did not set a new leaderboard record.
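The last bullet cites a 16,000,000-byte artifact cap against a 15,853,604-byte submission. A small sketch of such a cap check; the helper name and temporary file are illustrative:

```python
import os
import tempfile

ARTIFACT_CAP_BYTES = 16_000_000  # cap stated in the contribution notes

def fits_cap(path):
    """True if the packed artifact at `path` is within the submission cap."""
    return os.path.getsize(path) <= ARTIFACT_CAP_BYTES

# The reported 15,853,604-byte artifact leaves 146,396 bytes of headroom.
headroom = ARTIFACT_CAP_BYTES - 15_853_604

# Self-check with a tiny temporary file standing in for the real artifact:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 1024)
    tmp_path = f.name
small_fits = fits_cap(tmp_path)
os.unlink(tmp_path)
```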