PR #580 (open)
[Non-record] Azure 1xH100 frontier-family engineering run (val_bpb=1.2623)
by micoverde
val_bpb: 1.2623
Architecture: Transformer
Optimizer: —
Artifact Size: 12.7 MB
Training Techniques

- Quantization: int8 (bits: 8, scope: all)
- Architecture:
  - BigramHash — use of BigramHash with vocab size 10240 and dimension 128; parameters: {"BIGRAM_VOCAB_SIZE": 10240, "BIGRAM_DIM": 128}
  - MLP3x — MLP multiplier of 3; parameters: {"MLP_MULT": 3}
  - KV head count — separate number of key-value heads; parameters: {"NUM_KV_HEADS": 4, "NUM_HEADS": 8}
- Weight Averaging: SWA; parameters: null
- Regularization: weight decay; parameters: {"weight_decay": 0.04}
- Sequence Length: train_length: 2048, eval_length: 1024
- Compression: zlib (level: null)
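The int8 quantization and zlib compression listed above are described as an exact roundtrip on the stored artifact. The PR does not publish its implementation, so the following is only a minimal sketch of how such a scheme could work: symmetric per-tensor int8 quantization (lossy once, at quantize time) followed by zlib packing of the int8 codes, which roundtrips those codes bit-exactly. All function names and the per-tensor scaling choice are assumptions.

```python
import zlib

import numpy as np


def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization; returns codes and scale.

    The per-tensor (rather than per-channel) scale is an assumption.
    """
    scale = float(np.max(np.abs(w))) / 127.0 or 1.0  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def pack(q: np.ndarray, level: int = 9) -> bytes:
    """zlib-compress the int8 codes; decompression recovers them exactly."""
    return zlib.compress(q.tobytes(), level)


def unpack(blob: bytes, shape: tuple[int, ...]) -> np.ndarray:
    """Inverse of pack: decompress and restore the int8 code array."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 128)).astype(np.float32)

q, scale = quantize_int8(w)
blob = pack(q)

# The int8 -> zlib -> int8 roundtrip is exact; only the initial
# float -> int8 step loses precision.
assert np.array_equal(unpack(blob, q.shape), q)
print(f"{len(blob)} compressed bytes for {q.nbytes} int8 bytes")
```

Dequantization would multiply the recovered codes by `scale`, so only the quantized codes (plus one float per tensor) need to fit under the artifact cap.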
Novel Contributions
- First verified single-GPU 1x NVIDIA H100 NVL 94GB frontier-family run reaching the low-1.2x BPB regime
- Engineering artifact bridging proxy-only T4 work and future submission-grade 8xH100 runs
- Exact int8+zlib roundtrip compression to fit under the 16 MB artifact cap
- Longer engineering wall-clock cap (1800 s) for telemetry and validation
- Transparent publication despite the trailing eval being interrupted by SIGTERM
- BigramHash with a large vocab size and an MLP multiplier of 3 in the architecture
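The PR names BigramHash only by its parameters (vocab size 10240, dimension 128) and does not specify its mechanics. One plausible reading, sketched below under that assumption, is a hashed bigram embedding: each (previous, current) token-id pair is hashed into a fixed auxiliary vocabulary of 10240 buckets and looked up in a 128-dimensional table, giving the model cheap bigram features without a full vocab-squared table. The class name, hash mix, and first-position convention are all assumptions, not the PR's code.

```python
import numpy as np

# Values taken from the PR's parameters block.
BIGRAM_VOCAB_SIZE = 10240
BIGRAM_DIM = 128


def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Hash a (prev, cur) token-id pair into a fixed bucket range.

    The multiplicative mixing constant is an arbitrary choice for this
    sketch, not something specified by the PR.
    """
    h = (prev_id * 1000003 + cur_id) & 0xFFFFFFFF
    return h % BIGRAM_VOCAB_SIZE


class BigramHashEmbedding:
    """Per-position hashed-bigram embedding lookup (illustrative sketch)."""

    def __init__(self, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # In a real model this table would be a trained parameter.
        self.table = rng.standard_normal(
            (BIGRAM_VOCAB_SIZE, BIGRAM_DIM)
        ).astype(np.float32)

    def __call__(self, token_ids: list[int]) -> np.ndarray:
        out = np.empty((len(token_ids), BIGRAM_DIM), dtype=np.float32)
        prev = token_ids[0]  # position 0 pairs with itself (a convention)
        for i, cur in enumerate(token_ids):
            out[i] = self.table[bigram_bucket(prev, cur)]
            prev = cur
        return out


emb = BigramHashEmbedding()
vecs = emb([5, 17, 17, 9])
print(vecs.shape)  # (4, 128)
```

The output would typically be added to (or concatenated with) the ordinary token embeddings before the transformer blocks; hashing keeps the table at 10240 rows regardless of the base vocabulary size.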