val_bpb: 1.2296
Architecture: Transformer
Optimizer: —
Artifact Size: 15,853,604 bytes
Training Techniques

Quantization: int8 (bits: 8, scope: all)
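For reference, a minimal sketch of symmetric per-tensor int8 quantization applied to every parameter tensor ("scope: all"). The function names and the choice of a single per-tensor scale are assumptions for illustration, not details taken from the preserved train_gpt.py.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    # Map the largest magnitude to 127; clamp the scale to avoid division by zero.
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate float32 tensor for evaluation.
    return q.to(torch.float32) * scale

def quantize_state_dict(state_dict):
    # "scope: all": quantize every parameter tensor before serialization.
    return {name: quantize_int8(p) for name, p in state_dict.items()}
```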
Architecture: weight tying. Tied output and input embeddings (parameters: null).
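A minimal sketch of the weight tying listed above: the output projection shares the token-embedding matrix, so only one copy of that matrix is stored in the artifact. Class and attribute names are illustrative, not taken from train_gpt.py.

```python
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie output and input embeddings: both modules point at one tensor,
        # so the tied matrix counts only once toward the artifact size.
        self.lm_head.weight = self.tok_emb.weight
```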
Architecture: KV head count. Used fewer KV heads than attention heads (num_heads: 8, num_kv_heads: 4).
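A minimal sketch of grouped-query attention with the listed head counts (num_heads=8, num_kv_heads=4): the K/V projections carry half as many heads as the queries, shrinking both the parameter count and the KV activations. The model dimension and all names are assumptions, not values from train_gpt.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, num_kv_heads=4):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        # K/V projections hold only num_kv_heads heads' worth of parameters.
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves num_heads // num_kv_heads query heads.
        k = k.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        v = v.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```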
Sequence Length: sequence_length (train_length: 1024, eval_length: null)
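A minimal sketch of how a flat token stream can be cut into train_length=1024 windows for next-token prediction. The batching scheme and names are assumptions; eval_length is left null in the card, so only the training length is shown.

```python
import numpy as np
import torch

TRAIN_LENGTH = 1024  # train_length from the card

def get_batch(tokens: np.ndarray, batch_size: int, device: str = "cuda"):
    # Sample random windows of TRAIN_LENGTH + 1 tokens; targets are the
    # inputs shifted by one position.
    ix = np.random.randint(0, len(tokens) - TRAIN_LENGTH - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(tokens[i:i + TRAIN_LENGTH].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(tokens[i + 1:i + 1 + TRAIN_LENGTH].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```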
Compression: zlib (level: null)
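A minimal sketch of writing the submission artifact with zlib. The card leaves level: null, so zlib's default compression level is assumed, and the file name is illustrative.

```python
import io
import zlib
import torch

def write_artifact(state_dict, path: str = "artifact.bin.zlib") -> int:
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    compressed = zlib.compress(buf.getvalue())  # default compression level
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed)  # bytes on disk, compared against the artifact size cap

def read_artifact(path: str = "artifact.bin.zlib"):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    return torch.load(io.BytesIO(raw))
```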
Other: timed validation run on Modal 8xH100 with a 600-second wallclock cap and single-node torchrun (hardware: 8xH100, wallclock_seconds: 600, nproc_per_node: 8).
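A minimal sketch of the wallclock cap implied by wallclock_seconds=600: the training loop stops cleanly once 600 seconds have elapsed, after which validation runs and the artifact is written. The launch command in the comment is the standard single-node torchrun form for nproc_per_node=8; the loop itself and its names are assumptions, not code from the preserved train_gpt.py.

```python
# Launched single-node, e.g.: torchrun --standalone --nproc_per_node=8 train_gpt.py
import time

WALLCLOCK_SECONDS = 600

def train(model, optimizer, batches):
    start = time.monotonic()
    for x, y in batches:
        if time.monotonic() - start > WALLCLOCK_SECONDS:
            break  # hit the 600-second cap; proceed to validation and artifact export
        loss = model(x, y)  # model is assumed to return the training loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```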
Novel Contributions
- Added a non-record submission folder for a timed validation run on Modal 8xH100.
- Preserved the exact train_gpt.py snapshot, train.log, and submission.json used for the run.
- Documented the 600-second Modal setup and the reason the submission is on the non-record track.
- Used a published FineWeb sp1024 export staged in a persistent Modal Volume.
- Submitted a model that fit under the 16,000,000-byte artifact cap but did not set a new leaderboard record.