PR #1428
Open · Non-record: 9L SwiGLU MLP2 on 8xH100 (val_bpb 1.2370, 15.9MB)
by ntwari-bruce
val_bpb: 1.2370
Architecture: Transformer
Optimizer: —
Artifact Size: 15.9MB
Training Techniques
Architecture
SwiGLU
Replaced the ReLU² MLP activation with SwiGLU, which gates one linear projection with the SiLU of a second projection.
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
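A minimal PyTorch sketch of an MLP block with the SwiGLU activation described above. Module and projection names are illustrative (the PR's actual code is not shown here); `dim=512` and `mlp_mult=2` follow the parameters line, and the 2/3 hidden-width scaling matches the contribution list below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """MLP with SwiGLU gating: down(silu(gate(x)) * up(x)).

    The hidden width is scaled by 2/3 so that the three projections
    hold roughly the same parameter count as a two-projection
    ReLU^2 MLP with hidden width mlp_mult * dim.
    """

    def __init__(self, dim: int = 512, mlp_mult: int = 2):
        super().__init__()
        hidden = int(2 * mlp_mult * dim / 3)  # 2/3 scaling to preserve params
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

For the card's settings (dim=512, mlp_mult=2), the gated hidden width comes out to 682 rather than 1024, which is what keeps the parameter count near the baseline.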
GQA
Used grouped-query attention, explicitly repeating K/V heads up to the query head count for compatibility with older PyTorch versions.
parameters: {"num_heads":8,"num_kv_heads":4}
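The explicit K/V repetition can be sketched as below (a common GQA compatibility pattern, not necessarily the PR's exact code): each of the 4 KV heads is duplicated so the key/value tensors match the 8 query heads before attention.

```python
import torch
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq, head_dim) to
    (batch, num_kv_heads * n_rep, seq, head_dim) by repeating each KV head,
    so shapes line up with the query tensor for standard attention kernels."""
    if n_rep == 1:
        return x
    return torch.repeat_interleave(x, n_rep, dim=1)

# With num_heads=8 and num_kv_heads=4, each KV head is repeated twice:
q = torch.randn(1, 8, 16, 64)   # (batch, num_heads, seq, head_dim)
k = torch.randn(1, 4, 16, 64)   # (batch, num_kv_heads, seq, head_dim)
v = torch.randn(1, 4, 16, 64)
n_rep = 8 // 4
out = F.scaled_dot_product_attention(q, repeat_kv(k, n_rep), repeat_kv(v, n_rep))
```

Repeating K/V trades a little memory for compatibility: older `scaled_dot_product_attention` builds require the head dimensions of q, k, and v to match.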
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null
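A hedged sketch of zlib artifact compression using Python's standard library. The helper names are illustrative; `level: null` above presumably means zlib's default level was used, which maps to `level=-1` (`Z_DEFAULT_COMPRESSION`) in the stdlib API.

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    # level=-1 selects zlib's default compression level,
    # matching a "level: null" (unspecified) configuration.
    return zlib.compress(raw, level)

def decompress_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```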
Novel Contributions
- Switched the MLP activation from ReLU² to SwiGLU
- Scaled the SwiGLU hidden dimension by 2/3 to keep the parameter count matched to the ReLU² baseline
- Applied a GQA compatibility fix by explicitly repeating K/V heads for older PyTorch versions
- Trained a 9-layer model on 8×H100 under the 10-minute wallclock cap