PR #580

open

[Non-record] Azure 1xH100 frontier-family engineering run (val_bpb=1.2623)

by micoverde
val_bpb: 1.2623
Architecture: Transformer
Optimizer:
Artifact Size: 12.7MB

Training Techniques

  • Quantization: int8 (bits: 8, scope: all)
  • Architecture: BigramHash with vocab size 10240 and dimension 128; parameters: {"BIGRAM_VOCAB_SIZE":10240,"BIGRAM_DIM":128}
  • Architecture: MLP3x, MLP multiplier of 3; parameters: {"MLP_MULT":3}
  • KV head count: separate number of key-value heads; parameters: {"NUM_KV_HEADS":4,"NUM_HEADS":8}
  • Weight Averaging: SWA; parameters: null
  • Regularization: weight decay; parameters: {"weight_decay":0.04}
  • Sequence Length: train_length 2048, eval_length 1024
  • Compression: zlib (level: null)
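The int8 quantization and zlib entries above combine into the "exact roundtrip" idea mentioned in the contributions: quantization is the only lossy step, and zlib is lossless, so the compressed artifact decompresses back to the identical int8 tensor. A minimal sketch, assuming symmetric per-tensor scaling (the helper names and scale convention are illustrative, not the PR's actual code):

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(q: np.ndarray) -> bytes:
    """zlib-compress the raw int8 bytes; this stage is lossless."""
    return zlib.compress(q.tobytes())

def unpack(blob: bytes, shape) -> np.ndarray:
    """Invert pack() exactly: decompress and reinterpret as int8."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
blob = pack(q)            # bytes written into the artifact
q2 = unpack(blob, q.shape)  # identical to q, byte for byte
```

Because the zlib stage is exact, the on-disk artifact size depends only on how compressible the int8 weights are, which is what lets the run fit under the 16MB cap.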
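The KV head entry (NUM_KV_HEADS=4 vs NUM_HEADS=8) implies grouped-query attention: each key/value head is shared by NUM_HEADS // NUM_KV_HEADS = 2 query heads. A hedged sketch of the head bookkeeping, with illustrative names and shapes (not the PR's actual code):

```python
import numpy as np

NUM_HEADS, NUM_KV_HEADS, HEAD_DIM, T = 8, 4, 16, 32

def expand_kv(kv: np.ndarray, num_heads: int) -> np.ndarray:
    """Repeat each KV head so it lines up with its group of query heads."""
    num_kv = kv.shape[0]
    return np.repeat(kv, num_heads // num_kv, axis=0)

rng = np.random.default_rng(1)
q = rng.normal(size=(NUM_HEADS, T, HEAD_DIM))      # 8 query heads
k = rng.normal(size=(NUM_KV_HEADS, T, HEAD_DIM))   # only 4 KV heads stored
k_full = expand_kv(k, NUM_HEADS)                   # (8, T, HEAD_DIM)
scores = q @ k_full.transpose(0, 2, 1)             # (8, T, T) attention logits
```

Storing 4 KV heads instead of 8 halves the KV projection parameters and the KV cache, which matters for a size-capped artifact.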

Novel Contributions

  • First verified single-GPU (1x NVIDIA H100 NVL 94GB) frontier-family run reaching the low-1.2x BPB regime
  • Engineering artifact bridging proxy-only T4 work and future 8xH100 submission-grade runs
  • Exact int8+zlib roundtrip compression to fit under the 16MB artifact cap
  • Longer engineering wallclock cap (1800s) to allow telemetry and validation
  • Transparent publication despite the trailing eval being interrupted by SIGTERM
  • BigramHash with a large vocabulary, plus an MLP multiplier of 3, in the architecture
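The BigramHash technique listed above can be sketched as a small auxiliary embedding table indexed by a hash of each consecutive token pair, using the stated BIGRAM_VOCAB_SIZE=10240 and BIGRAM_DIM=128. The hashing scheme and mixing constant below are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

BIGRAM_VOCAB_SIZE, BIGRAM_DIM = 10240, 128

def bigram_ids(tokens: np.ndarray) -> np.ndarray:
    """Hash each (prev, cur) token pair into [0, BIGRAM_VOCAB_SIZE)."""
    prev = np.concatenate([[0], tokens[:-1]])            # pad first position
    mixed = prev.astype(np.int64) * 1000003 + tokens     # illustrative mix
    return mixed % BIGRAM_VOCAB_SIZE

# Hypothetical bigram embedding table; in a model this would be learned and
# its output added to (or concatenated with) the token embedding.
table = np.random.default_rng(3).normal(size=(BIGRAM_VOCAB_SIZE, BIGRAM_DIM))

tokens = np.array([5, 17, 5, 17, 42])
ids = bigram_ids(tokens)        # same (5, 17) pair hashes to the same id
emb = table[ids]                # (5, BIGRAM_DIM) bigram features
```

A hashed bigram table this small (10240 x 128) adds local context features at a fraction of the parameter cost of a full bigram vocabulary.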