val_bpb: 1.3486
Architecture: Transformer
Optimizer: —
Artifact Size: 14,698,858 bytes
Training Techniques

Architecture
- KV head count: baseline architecture kept fixed at 9 layers, 512 model dim, 8 attention heads, and 4 KV heads
- parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
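With 8 attention heads but only 4 KV heads, the listed parameters imply grouped-query attention. A minimal sketch of what those numbers work out to, assuming the standard GQA layout where each KV head serves `num_heads // num_kv_heads` query heads (the model code itself is not part of this card):

```python
# Derived quantities from the card's architecture parameters
# (illustrative only; variable names are not from the actual codebase).
params = {"layers": 9, "model_dim": 512, "num_heads": 8,
          "num_kv_heads": 4, "mlp_mult": 2}

head_dim = params["model_dim"] // params["num_heads"]       # 512 / 8 = 64
group_size = params["num_heads"] // params["num_kv_heads"]  # 2 query heads share each KV head

# Per-layer attention projection parameter counts under grouped-query attention:
q_proj = params["model_dim"] * params["num_heads"] * head_dim          # full-width Q
kv_proj = 2 * params["model_dim"] * params["num_kv_heads"] * head_dim  # K and V together,
                                                                       # half of what full MHA needs
print(head_dim, group_size, q_proj, kv_proj)
```

Halving the KV heads shrinks the KV projections and the KV cache without touching the query width, which is one common way to save parameters under a tight artifact-size budget.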
LR Schedule
- warmdown
- parameters: {"warmdown_iters":100}
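A sketch of what a warmdown schedule with `warmdown_iters=100` could look like: constant learning rate until the final window, then a linear decay to zero. The exact shape (warmup, decay floor) is not recorded on this card, so this is an assumption, not the run's actual scheduler:

```python
WARMDOWN_ITERS = 100  # from the card's parameters

def lr_scale(step: int, total_iters: int,
             warmdown_iters: int = WARMDOWN_ITERS) -> float:
    """Multiplier on the base LR: 1.0 until the warmdown window,
    then linear decay to 0 over the final `warmdown_iters` steps.

    Hedged sketch; the real scheduler may differ in shape.
    """
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

# A shorter warmdown keeps the LR at full value for longer,
# which matters when slow step times cap the total iteration count:
print(lr_scale(800, total_iters=1000))  # 1.0 (before warmdown)
print(lr_scale(950, total_iters=1000))  # 0.5 (halfway through warmdown)
```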
Quantization
- int8 (bits: 8, scope: all)
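The card records only "int8, bits: 8, scope: all", so the details below are assumptions. A minimal sketch of per-tensor symmetric int8 weight quantization, one common scheme consistent with those fields:

```python
# Assumed scheme: per-tensor symmetric int8 (the card does not specify
# symmetric vs asymmetric or per-tensor vs per-channel scales).
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.003, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print(q, s)  # quantized values span [-127, 127]; s is the per-tensor scale
```

Storing 8-bit weights plus one float scale per tensor cuts the artifact to roughly a quarter of float32 size, which is the main lever for fitting under a 16MB cap.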
Compression
- zlib (level: null)
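A sketch of the final packaging step: zlib-compressing a serialized artifact and checking it against the 16MB cap mentioned in the contributions. "level: null" is read here as zlib's default compression level, and the 16 MiB interpretation of the cap is an assumption (the recorded 14,698,858 bytes fits either reading):

```python
import zlib

CAP_BYTES = 16 * 1024 * 1024  # 16 MiB cap (assumed binary megabytes)

def compress_artifact(raw: bytes) -> bytes:
    # Level omitted -> zlib default (Z_DEFAULT_COMPRESSION),
    # matching the card's "level: null".
    return zlib.compress(raw)

raw = bytes(1024) * 1024          # 1 MiB of zeros as a stand-in payload
blob = compress_artifact(raw)
assert len(blob) <= CAP_BYTES     # the check the cap implies
assert zlib.decompress(blob) == raw  # lossless round trip
print(len(blob), "bytes after compression")
```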
Novel Contributions
- Adjusted the scheduler's warmdown behavior for a single-GPU (1xH100) step-time regime
- Used a shorter warmdown period (WARMDOWN_ITERS=100) so the learning rate does not decay too early on slower 1xH100 runs
- Improved the same-session 1xH100 baseline validation bpb while staying within the 16MB artifact size cap
- Kept the baseline 9x512 sp1024 architecture and data pipeline fixed