PR #580

open

[Non-record] Azure 1xH100 frontier-family engineering run (val_bpb=1.2623)

by micoverde
val_bpb: 1.2623
Architecture: Transformer
Optimizer:
Artifact Size: 12.7MB

Training Techniques

  • Quantization: int8 (bits: 8, scope: all)
  • Architecture: BigramHash with vocab size 10240 and dimension 128; parameters: {"BIGRAM_VOCAB_SIZE":10240,"BIGRAM_DIM":128}
  • Architecture: MLP3x, MLP multiplier of 3; parameters: {"MLP_MULT":3}
  • KV head count: separate number of key-value heads; parameters: {"NUM_KV_HEADS":4,"NUM_HEADS":8}
  • Weight Averaging: SWA; parameters: null
  • Regularization: weight decay; parameters: {"weight_decay":0.04}
  • Sequence Length: train_length 2048, eval_length 1024
  • Compression: zlib (level: null)
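The int8 quantization and zlib entries above combine into the "exact roundtrip" idea mentioned in the contributions: quantization is the only lossy step, and zlib is lossless, so the compressed artifact decompresses back to the identical int8 tensor. A minimal sketch, assuming symmetric per-tensor scaling (the helper names and scale convention are illustrative, not the PR's actual code):

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(q: np.ndarray) -> bytes:
    """zlib-compress the raw int8 bytes; this stage is lossless."""
    return zlib.compress(q.tobytes())

def unpack(blob: bytes, shape) -> np.ndarray:
    """Invert pack() exactly: decompress and reinterpret as int8."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
blob = pack(q)            # bytes written into the artifact
q2 = unpack(blob, q.shape)  # identical to q, byte for byte
```

Because the zlib stage is exact, the on-disk artifact size depends only on how compressible the int8 weights are, which is what lets the run fit under the 16MB cap.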
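The KV head entry (NUM_KV_HEADS=4 vs NUM_HEADS=8) implies grouped-query attention: each key/value head is shared by NUM_HEADS // NUM_KV_HEADS = 2 query heads. A hedged sketch of the head bookkeeping, with illustrative names and shapes (not the PR's actual code):

```python
import numpy as np

NUM_HEADS, NUM_KV_HEADS, HEAD_DIM, T = 8, 4, 16, 32

def expand_kv(kv: np.ndarray, num_heads: int) -> np.ndarray:
    """Repeat each KV head so it lines up with its group of query heads."""
    num_kv = kv.shape[0]
    return np.repeat(kv, num_heads // num_kv, axis=0)

rng = np.random.default_rng(1)
q = rng.normal(size=(NUM_HEADS, T, HEAD_DIM))      # 8 query heads
k = rng.normal(size=(NUM_KV_HEADS, T, HEAD_DIM))   # only 4 KV heads stored
k_full = expand_kv(k, NUM_HEADS)                   # (8, T, HEAD_DIM)
scores = q @ k_full.transpose(0, 2, 1)             # (8, T, T) attention logits
```

Storing 4 KV heads instead of 8 halves the KV projection parameters and the KV cache, which matters for a size-capped artifact.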

Novel Contributions

  • First verified single-GPU (1x NVIDIA H100 NVL 94GB) frontier-family run reaching the low-1.2x BPB regime
  • Engineering artifact bridging proxy-only T4 work and future 8xH100 submission-grade runs
  • Exact int8+zlib roundtrip compression to fit under the 16MB artifact cap
  • Longer engineering wallclock cap (1800s) to allow telemetry and validation
  • Transparent publication despite the trailing eval being interrupted by SIGTERM
  • BigramHash with a large vocabulary, plus an MLP multiplier of 3, in the architecture
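The BigramHash technique listed above can be sketched as a small auxiliary embedding table indexed by a hash of each consecutive token pair, using the stated BIGRAM_VOCAB_SIZE=10240 and BIGRAM_DIM=128. The hashing scheme and mixing constant below are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

BIGRAM_VOCAB_SIZE, BIGRAM_DIM = 10240, 128

def bigram_ids(tokens: np.ndarray) -> np.ndarray:
    """Hash each (prev, cur) token pair into [0, BIGRAM_VOCAB_SIZE)."""
    prev = np.concatenate([[0], tokens[:-1]])            # pad first position
    mixed = prev.astype(np.int64) * 1000003 + tokens     # illustrative mix
    return mixed % BIGRAM_VOCAB_SIZE

# Hypothetical bigram embedding table; in a model this would be learned and
# its output added to (or concatenated with) the token embedding.
table = np.random.default_rng(3).normal(size=(BIGRAM_VOCAB_SIZE, BIGRAM_DIM))

tokens = np.array([5, 17, 5, 17, 42])
ids = bigram_ids(tokens)        # same (5, 17) pair hashes to the same id
emb = table[ids]                # (5, BIGRAM_DIM) bigram features
```

A hashed bigram table this small (10240 x 128) adds local context features at a fraction of the parameter cost of a full bigram vocabulary.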