PR #1505 (open)

Non-record: 11L 3x MLP Seq2048 — val_bpb 1.1791 (8xH100 SXM)

by Rohan-Abhilash
val_bpb: 1.1791
Architecture: Transformer
Optimizer:
Artifact Size: 24.5 MB

Training Techniques

Architecture
MLP3x
Increased the MLP expansion factor from 2x to 3x of the hidden dim (1536).
parameters: {"mlp_multiplier":3,"hidden_dim":1536}
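A quick sketch of what the 2x-to-3x MLP widening costs in parameters, assuming a standard two-matrix MLP block (up- and down-projection, no gating; the PR does not state the exact MLP variant):

```python
def mlp_params(hidden_dim: int, multiplier: int) -> int:
    """Parameters in a plain two-matrix MLP block (biases omitted)."""
    inner = multiplier * hidden_dim
    return hidden_dim * inner + inner * hidden_dim  # W_up + W_down

d = 1536
print(mlp_params(d, 2))                          # per-layer params at 2x
print(mlp_params(d, 3))                          # per-layer params at 3x
print(11 * (mlp_params(d, 3) - mlp_params(d, 2)))  # extra params over 11 layers
```

At hidden dim 1536, the move from 2x to 3x adds roughly 4.7M parameters per layer, about 52M across the 11 layers.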
weight tying
Tied input embeddings and output head.
parameters: null
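Weight tying makes the output head reuse the embedding matrix, so the (vocab_size x hidden_dim) head weights are never stored separately. A rough sketch of the artifact-size saving, assuming a byte-level vocabulary of 256 (consistent with reporting bits per byte, but not stated in the PR):

```python
hidden_dim = 1536
vocab_size = 256  # hypothetical byte-level vocab; not stated in the PR

# With tying, the (vocab_size x hidden_dim) output head is not stored.
saved_params = vocab_size * hidden_dim
saved_mb_int8 = saved_params / 2**20  # 1 byte per weight at int8
print(f"params saved: {saved_params:,} (~{saved_mb_int8:.3f} MB at int8)")
```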
KV head count
Used grouped KV heads in the transformer configuration.
parameters: {"num_heads":8,"num_kv_heads":4}
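With grouped KV heads, consecutive query heads share one KV head, shrinking the KV projections and KV cache by num_heads / num_kv_heads. A minimal sketch of the head mapping at the PR's settings:

```python
def kv_head_for(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Grouped-query attention: each group of query heads shares one KV head."""
    group_size = num_heads // num_kv_heads
    return q_head // group_size

num_heads, num_kv_heads = 8, 4
mapping = [kv_head_for(h, num_heads, num_kv_heads) for h in range(num_heads)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(num_heads // num_kv_heads)  # KV cache / KV-weight shrink factor: 2
```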
Transformer
Scaled baseline transformer to 11 layers.
parameters: {"layers":11}
Sequence Length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":2000}
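A warmdown schedule holds the learning rate constant and then decays it over the final steps. The exact shape used in this PR is not stated; the sketch below assumes the common trapezoidal variant (constant, then linear decay to zero over the last 2000 steps), with a hypothetical total step count:

```python
def lr_at(step: int, total_steps: int, warmdown_steps: int, base_lr: float) -> float:
    """Constant LR, then linear warmdown to zero over the final steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 10000  # hypothetical; the PR does not state the total step count
print(lr_at(0, total, 2000, 1e-3))      # 0.001 (constant phase)
print(lr_at(9000, total, 2000, 1e-3))   # 0.0005 (halfway through warmdown)
print(lr_at(10000, total, 2000, 1e-3))  # 0.0 (end of training)
```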
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
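The reported val_bpb is measured after an int8 + zlib roundtrip of the weights. A minimal sketch of that pipeline, assuming symmetric per-tensor int8 quantization (the PR's exact scheme is not stated), using only the standard library:

```python
import random
import struct
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor int8: scale by max |w| / 127, round, clamp."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # toy weight tensor

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

raw = struct.pack(f"{len(q)}b", *q)       # 1 byte per weight
compressed = zlib.compress(raw, level=9)  # lossless stage of the artifact

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(raw), len(compressed))  # sizes before/after zlib
print(max_err <= scale / 2 + 1e-9)  # rounding error bounded by half a step
```

The zlib stage is lossless, so only the int8 rounding affects val_bpb; how much zlib shrinks the int8 bytes depends on how much structure the quantized weights have.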

Novel Contributions

  • Scaled the baseline transformer to 11 layers
  • Increased MLP capacity to 3x width
  • Extended training sequence length to 2048
  • Used a longer warmdown (2000 steps) for better convergence
  • Improved validation BPB to 1.1791 after int8+zlib roundtrip
  • Identified int6 QAT + GPTQ + LZMA as the path to fit under the 16MB limit