val_bpb: 1.3486
Architecture: Transformer
Optimizer: —
Artifact Size: 14,698,858 bytes
Training Techniques

Architecture
- KV head count: baseline architecture kept fixed at 9 layers, 512 model dim, 8 attention heads, and 4 KV heads
- parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
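With 8 attention heads but only 4 KV heads, the listed parameters imply grouped-query attention. A minimal sketch of what those numbers work out to, assuming the standard GQA layout where each KV head serves `num_heads // num_kv_heads` query heads (the model code itself is not part of this card):

```python
# Derived quantities from the card's architecture parameters
# (illustrative only; variable names are not from the actual codebase).
params = {"layers": 9, "model_dim": 512, "num_heads": 8,
          "num_kv_heads": 4, "mlp_mult": 2}

head_dim = params["model_dim"] // params["num_heads"]       # 512 / 8 = 64
group_size = params["num_heads"] // params["num_kv_heads"]  # 2 query heads share each KV head

# Per-layer attention projection parameter counts under grouped-query attention:
q_proj = params["model_dim"] * params["num_heads"] * head_dim          # full-width Q
kv_proj = 2 * params["model_dim"] * params["num_kv_heads"] * head_dim  # K and V together,
                                                                       # half of what full MHA needs
print(head_dim, group_size, q_proj, kv_proj)
```

Halving the KV heads shrinks the KV projections and the KV cache without touching the query width, which is one common way to save parameters under a tight artifact-size budget.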
LR Schedule
- warmdown
- parameters: {"warmdown_iters":100}
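A sketch of what a warmdown schedule with `warmdown_iters=100` could look like: constant learning rate until the final window, then a linear decay to zero. The exact shape (warmup, decay floor) is not recorded on this card, so this is an assumption, not the run's actual scheduler:

```python
WARMDOWN_ITERS = 100  # from the card's parameters

def lr_scale(step: int, total_iters: int,
             warmdown_iters: int = WARMDOWN_ITERS) -> float:
    """Multiplier on the base LR: 1.0 until the warmdown window,
    then linear decay to 0 over the final `warmdown_iters` steps.

    Hedged sketch; the real scheduler may differ in shape.
    """
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

# A shorter warmdown keeps the LR at full value for longer,
# which matters when slow step times cap the total iteration count:
print(lr_scale(800, total_iters=1000))  # 1.0 (before warmdown)
print(lr_scale(950, total_iters=1000))  # 0.5 (halfway through warmdown)
```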
Quantization
- int8 (bits: 8, scope: all)
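The card records only "int8, bits: 8, scope: all", so the details below are assumptions. A minimal sketch of per-tensor symmetric int8 weight quantization, one common scheme consistent with those fields:

```python
# Assumed scheme: per-tensor symmetric int8 (the card does not specify
# symmetric vs asymmetric or per-tensor vs per-channel scales).
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.003, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print(q, s)  # quantized values span [-127, 127]; s is the per-tensor scale
```

Storing 8-bit weights plus one float scale per tensor cuts the artifact to roughly a quarter of float32 size, which is the main lever for fitting under a 16MB cap.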
Compression
- zlib (level: null)
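A sketch of the final packaging step: zlib-compressing a serialized artifact and checking it against the 16MB cap mentioned in the contributions. "level: null" is read here as zlib's default compression level, and the 16 MiB interpretation of the cap is an assumption (the recorded 14,698,858 bytes fits either reading):

```python
import zlib

CAP_BYTES = 16 * 1024 * 1024  # 16 MiB cap (assumed binary megabytes)

def compress_artifact(raw: bytes) -> bytes:
    # Level omitted -> zlib default (Z_DEFAULT_COMPRESSION),
    # matching the card's "level: null".
    return zlib.compress(raw)

raw = bytes(1024) * 1024          # 1 MiB of zeros as a stand-in payload
blob = compress_artifact(raw)
assert len(blob) <= CAP_BYTES     # the check the cap implies
assert zlib.decompress(blob) == raw  # lossless round trip
print(len(blob), "bytes after compression")
```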
Novel Contributions
- Adjusted the scheduler's warmdown behavior for a single-GPU (1xH100) step-time regime
- Used a shorter warmdown period (WARMDOWN_ITERS=100) so the learning rate does not decay too early on slower 1xH100 runs
- Improved the same-session 1xH100 baseline validation bpb while staying within the 16MB artifact size cap
- Kept the baseline 9x512 sp1024 architecture and data pipeline fixed