PR #111

open

Non-record unlimited-compute: 1-hour 1xH100 warmdown 9x512

by aamodbhatt
val_bpb
1.2540
Architecture
Transformer
Optimizer
Artifact Size
15,858,552 bytes

Training Techniques

Architecture
KV head count
Uses a fixed Transformer layout: 9 layers, model dimension 512, 8 attention heads, and 4 KV heads (grouped-query attention).
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2,"vocab_size":1024}
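The parameter dict above fully determines the per-layer projection shapes. A minimal sketch of the derived dimensions (this helper code is not part of the submission; only the config dict is given in the PR):

```python
# Fixed 9x512 layout from the submission's parameter dict.
config = {"layers": 9, "model_dim": 512, "num_heads": 8,
          "num_kv_heads": 4, "mlp_mult": 2, "vocab_size": 1024}

# Each of the 8 query heads works in a 64-dim subspace.
head_dim = config["model_dim"] // config["num_heads"]      # 512 / 8 = 64

# With 4 KV heads (GQA), the K and V projections are half as wide
# as the Q projection: 4 * 64 = 256 instead of 512.
kv_dim = config["num_kv_heads"] * head_dim

# MLP hidden width with the 2x multiplier.
mlp_hidden = config["mlp_mult"] * config["model_dim"]      # 1024

print(head_dim, kv_dim, mlp_hidden)
```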
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":100}
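A warmdown schedule holds the learning rate constant and then decays it linearly over the final iterations. A minimal sketch, assuming linear decay to zero (the PR specifies only `warmdown_iters=100`; `base_lr`, `total_steps`, and the linear shape are assumptions here):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=100):
    """Constant LR until the last `warmdown_iters` steps, then linear decay to 0.

    Hypothetical helper: the submission only records warmdown_iters=100.
    """
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    # Remaining fraction of the warmdown window.
    return base_lr * (total_steps - step) / warmdown_iters

# Example: 1000-step run at base LR 0.01.
print(warmdown_lr(0, 1000, 0.01))    # constant phase
print(warmdown_lr(950, 1000, 0.01))  # halfway through warmdown
print(warmdown_lr(1000, 1000, 0.01)) # fully decayed
```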
Compression
zlib
level: null

Novel Contributions

  • Extended training from a 10-minute run to a 1-hour run on a single H100 GPU.
  • Used a 100-iteration warmdown to concentrate LR decay at the end of training.
  • Kept the baseline 9x512 sp1024 architecture and tokenizer/data pipeline fixed while improving validation bpb.
  • Produced a submission under the 16MB artifact cap using int8 quantization plus zlib compression.
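The int8-plus-zlib packing in the last bullet can be sketched as follows. This is an assumed scheme (per-tensor symmetric quantization, one zlib stream per tensor); the PR states only "int8 quantization plus zlib compression" and does not give the code:

```python
import zlib
import numpy as np

def pack_weights(weights, level=9):
    """Quantize each tensor to int8 (per-tensor symmetric scale), then zlib-compress.

    Sketch under assumptions; the submission's exact packing is not shown in the PR.
    """
    blobs = []
    for w in weights:
        # Map the largest-magnitude value to 127; guard against all-zero tensors.
        scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(zlib.compress(q.tobytes(), level))
    return blobs

# Usage on placeholder random weights (not the submitted model).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((512, 512)).astype(np.float32) for _ in range(4)]
blobs = pack_weights(weights)
total = sum(len(b) for b in blobs)
assert total < 16 * 2**20  # the packed artifact must fit under the 16MB cap
```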