val_bpb: 1.3509
Architecture: Transformer
Optimizer: —
Artifact Size: 14301562 bytes

Training Techniques

Architecture
- tied embeddings: Input and output embeddings are tied to reduce parameter count and artifact size. (parameters: null)
- KV head count: Uses fewer key/value heads than attention heads. (parameters: {"num_heads": 8, "num_kv_heads": 4})
- depth/narrow transformer: Uses a deeper but narrower Transformer layout than the naive baseline. (parameters: {"layers": 12, "model_dim": 416})
Quantization
- int8 (bits: 8, scope: model weights)

Compression
- zlib (level: null)
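
The card lists int8 weight quantization followed by zlib compression of the final artifact, but does not record the scale granularity or the zlib level. The sketch below assumes per-tensor symmetric scales and maximum compression; both are assumptions.

```python
# Hedged sketch of the packaging step: per-tensor symmetric int8 quantization
# of the model weights, then zlib over the serialized blob.
import json
import zlib
import numpy as np

def quantize_int8(state_dict):
    tensors = {}
    for name, w in state_dict.items():
        w = w.detach().cpu().float().numpy()
        scale = max(np.abs(w).max() / 127.0, 1e-12)   # symmetric per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        tensors[name] = (float(scale), q)
    return tensors

def save_artifact(state_dict, path):
    header, payload = {}, bytearray()
    for name, (scale, q) in quantize_int8(state_dict).items():
        header[name] = {"scale": scale, "shape": list(q.shape), "offset": len(payload)}
        payload += q.tobytes()
    blob = json.dumps(header).encode("utf-8") + b"\0" + bytes(payload)
    with open(path, "wb") as f:
        f.write(zlib.compress(blob, level=9))         # level 9 is an assumption
```

Loading reverses the pipeline: decompress with zlib, split the header off at the first null byte, then reconstruct each tensor as q * scale.
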
Sequence Length
- sequence_length (train_length: 1024, eval_length: null)
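
The training sequence length is 1024 tokens; the evaluation length is not recorded. A minimal sketch of one assumed data-packing step, since the actual pipeline is not described in the card:

```python
# Assumed data packing (not from the submission): slice a flat token stream
# into 1024-token inputs with next-token targets.
import numpy as np

def make_windows(tokens: np.ndarray, seq_len: int = 1024):
    n = (len(tokens) - 1) // seq_len                      # drop the ragged tail
    x = tokens[: n * seq_len].reshape(n, seq_len)
    y = tokens[1 : n * seq_len + 1].reshape(n, seq_len)   # targets shifted by one
    return x, y
```
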
LR Schedule
- warmdown (parameters: {"warmup_steps": 20, "warmdown_iters": 1200})
Other
- 10-minute wallclock-limited training run on 8xH100 GPUs. (parameters: {"max_wallclock_seconds": 600, "num_gpus": 8})
Novel Contributions
- Deeper/narrower Transformer configuration (12 layers, 416 model dim)
- Reduced KV head count (8 attention heads, 4 KV heads)
- Tied input/output embeddings
- 10-minute 8xH100 training run under the 16MB track limit
- Final artifact quantized to int8 and zlib-compressed