PR #1428

open

Non-record: 9L SwiGLU MLP2 on 8xH100 (val_bpb 1.2370, 15.9MB)

by ntwari-bruce
val_bpb: 1.2370
Architecture: Transformer
Optimizer:
Artifact Size: 15.9MB

Training Techniques

Architecture
SwiGLU
Replaced the ReLU² MLP activation with SwiGLU: a gated MLP with two input projections, where one projection is passed through SiLU and used to gate the other.
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
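The SwiGLU MLP described above can be sketched framework-agnostically in NumPy (the PR itself uses PyTorch). The hidden width 683 is an illustrative ≈2/3 scaling of the baseline ReLU² hidden width `mlp_mult * model_dim = 1024`; the PR's exact rounding may differ.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU: down-project the elementwise product of a SiLU-gated
    # projection and a plain "up" projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

model_dim, hidden = 512, 683  # 683 ≈ 2/3 of the baseline hidden width 1024
rng = np.random.default_rng(0)
x = rng.standard_normal((4, model_dim))
w_gate = rng.standard_normal((model_dim, hidden)) * 0.02
w_up = rng.standard_normal((model_dim, hidden)) * 0.02
w_down = rng.standard_normal((hidden, model_dim)) * 0.02

y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 512): output width matches the input width
```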
GQA
Used grouped query attention, explicitly repeating each K/V head across its query-head group for compatibility with older PyTorch versions.
parameters: {"num_heads":8,"num_kv_heads":4}
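With 8 query heads over 4 K/V heads, each K/V head serves a group of 2 query heads. The compatibility fix described above (explicitly repeating K/V heads so plain multi-head attention math applies) can be sketched in NumPy; in PyTorch this corresponds to `torch.repeat_interleave` along the head dimension. Shapes here are illustrative.

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 64, 16
group = num_heads // num_kv_heads  # 2 query heads per K/V head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Explicitly repeat K/V heads so their head count matches the query heads,
# reducing GQA to ordinary multi-head attention.
k_rep = np.repeat(k, group, axis=0)  # (8, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)  # (8, seq, head_dim)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep
print(out.shape)  # (8, 16, 64)
```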
Sequence Length
train_length: 1024
eval_length: null
Compression
zlib
level: null
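The artifact-size entry above reflects zlib compression of the saved weights (`level: null` presumably meaning the default level). A minimal stdlib sketch, with a hypothetical tiny state dict standing in for the real checkpoint:

```python
import pickle
import zlib

import numpy as np

# Hypothetical tiny "state dict"; the real artifact holds the model weights.
state = {"w": np.zeros((512, 512), dtype=np.float32)}

raw = pickle.dumps(state)
packed = zlib.compress(raw)  # default compression level

# Round-trip: decompress and deserialize to recover the weights.
restored = pickle.loads(zlib.decompress(packed))
print(len(packed) < len(raw))  # True for highly compressible weights
```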

Novel Contributions

  • Switched the MLP activation from ReLU² to SwiGLU
  • Scaled the SwiGLU hidden dimension by a factor of 2/3 to preserve the total parameter count
  • Applied a GQA compatibility fix by explicitly repeating K/V heads for older PyTorch versions
  • Trained a 9-layer model on 8×H100 under the 10-minute wallclock cap
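The 2/3 scaling in the second bullet can be checked with quick arithmetic: a ReLU² MLP has two weight matrices (up and down), while SwiGLU has three (gate, up, down), so shrinking the hidden width by 2/3 keeps the parameter count roughly constant. Widths below are illustrative; the PR's exact rounding may differ.

```python
model_dim, mlp_mult = 512, 2
h = mlp_mult * model_dim            # ReLU² hidden width: 1024

# ReLU² MLP: two matrices (d -> h, h -> d) -> 2 * d * h parameters
relu2_params = 2 * model_dim * h

# SwiGLU MLP: three matrices (gate d -> h', up d -> h', down h' -> d)
h_swiglu = round(2 * h / 3)         # scale hidden width by 2/3 -> 683
swiglu_params = 3 * model_dim * h_swiglu

print(relu2_params, swiglu_params)  # 1048576 1049088 (within ~0.05%)
```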