val_bpb: 1.2296
Architecture: Transformer
Optimizer: —
Artifact Size: 15,853,604 bytes
Training Techniques

Quantization: int8 (bits: 8, scope: all)
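For reference, a minimal sketch of symmetric per-tensor int8 quantization applied to every parameter tensor ("scope: all"). The function names and the choice of a single per-tensor scale are assumptions for illustration, not details taken from the preserved train_gpt.py.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    # Map the largest magnitude to 127; clamp the scale to avoid division by zero.
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate float32 tensor for evaluation.
    return q.to(torch.float32) * scale

def quantize_state_dict(state_dict):
    # "scope: all": quantize every parameter tensor before serialization.
    return {name: quantize_int8(p) for name, p in state_dict.items()}
```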
Architecture: weight tying. Tied output and input embeddings (parameters: null).
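A minimal sketch of the weight tying listed above: the output projection shares the token-embedding matrix, so only one copy of that matrix is stored in the artifact. Class and attribute names are illustrative, not taken from train_gpt.py.

```python
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie output and input embeddings: both modules point at one tensor,
        # so the tied matrix counts only once toward the artifact size.
        self.lm_head.weight = self.tok_emb.weight
```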
Architecture: KV head count. Used fewer KV heads than attention heads (num_heads: 8, num_kv_heads: 4).
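A minimal sketch of grouped-query attention with the listed head counts (num_heads=8, num_kv_heads=4): the K/V projections carry half as many heads as the queries, shrinking both the parameter count and the KV activations. The model dimension and all names are assumptions, not values from train_gpt.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, num_kv_heads=4):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        # K/V projections hold only num_kv_heads heads' worth of parameters.
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves num_heads // num_kv_heads query heads.
        k = k.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        v = v.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```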
Sequence Length: sequence_length (train_length: 1024, eval_length: null)
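A minimal sketch of how a flat token stream can be cut into train_length=1024 windows for next-token prediction. The batching scheme and names are assumptions; eval_length is left null in the card, so only the training length is shown.

```python
import numpy as np
import torch

TRAIN_LENGTH = 1024  # train_length from the card

def get_batch(tokens: np.ndarray, batch_size: int, device: str = "cuda"):
    # Sample random windows of TRAIN_LENGTH + 1 tokens; targets are the
    # inputs shifted by one position.
    ix = np.random.randint(0, len(tokens) - TRAIN_LENGTH - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(tokens[i:i + TRAIN_LENGTH].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(tokens[i + 1:i + 1 + TRAIN_LENGTH].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```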
Compression: zlib (level: null)
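A minimal sketch of writing the submission artifact with zlib. The card leaves level: null, so zlib's default compression level is assumed, and the file name is illustrative.

```python
import io
import zlib
import torch

def write_artifact(state_dict, path: str = "artifact.bin.zlib") -> int:
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    compressed = zlib.compress(buf.getvalue())  # default compression level
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed)  # bytes on disk, compared against the artifact size cap

def read_artifact(path: str = "artifact.bin.zlib"):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    return torch.load(io.BytesIO(raw))
```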
Other: timed validation run on Modal 8xH100 with a 600-second wallclock cap and single-node torchrun (hardware: 8xH100, wallclock_seconds: 600, nproc_per_node: 8).
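A minimal sketch of the wallclock cap implied by wallclock_seconds=600: the training loop stops cleanly once 600 seconds have elapsed, after which validation runs and the artifact is written. The launch command in the comment is the standard single-node torchrun form for nproc_per_node=8; the loop itself and its names are assumptions, not code from the preserved train_gpt.py.

```python
# Launched single-node, e.g.: torchrun --standalone --nproc_per_node=8 train_gpt.py
import time

WALLCLOCK_SECONDS = 600

def train(model, optimizer, batches):
    start = time.monotonic()
    for x, y in batches:
        if time.monotonic() - start > WALLCLOCK_SECONDS:
            break  # hit the 600-second cap; proceed to validation and artifact export
        loss = model(x, y)  # model is assumed to return the training loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```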
Novel Contributions
- Added a non-record submission folder for a timed validation run on Modal 8xH100.
- Preserved the exact train_gpt.py snapshot, train.log, and submission.json used for the run.
- Documented the 600-second Modal setup and the reason the submission is on the non-record track.
- Used a published FineWeb sp1024 export staged in a persistent Modal Volume.
- Submitted a model that fit under the 16,000,000-byte artifact cap but did not set a new leaderboard record.