PR #1428
Open · Non-record: 9L SwiGLU MLP2 on 8xH100 (val_bpb 1.2370, 15.9MB)
by ntwari-bruce
val_bpb: 1.2370
Architecture: Transformer
Optimizer: —
Artifact Size: 15.9MB
Training Techniques
Architecture
SwiGLU
Replaced the ReLU² MLP activation with SwiGLU, which gates one linear projection with the SiLU of a second projection.
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
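A minimal PyTorch sketch of an MLP block with the SwiGLU activation described above. Module and projection names are illustrative (the PR's actual code is not shown here); `dim=512` and `mlp_mult=2` follow the parameters line, and the 2/3 hidden-width scaling matches the contribution list below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """MLP with SwiGLU gating: down(silu(gate(x)) * up(x)).

    The hidden width is scaled by 2/3 so that the three projections
    hold roughly the same parameter count as a two-projection
    ReLU^2 MLP with hidden width mlp_mult * dim.
    """

    def __init__(self, dim: int = 512, mlp_mult: int = 2):
        super().__init__()
        hidden = int(2 * mlp_mult * dim / 3)  # 2/3 scaling to preserve params
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

For the card's settings (dim=512, mlp_mult=2), the gated hidden width comes out to 682 rather than 1024, which is what keeps the parameter count near the baseline.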
GQA
Used grouped-query attention, explicitly repeating K/V heads up to the query head count for compatibility with older PyTorch versions.
parameters: {"num_heads":8,"num_kv_heads":4}
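The explicit K/V repetition can be sketched as below (a common GQA compatibility pattern, not necessarily the PR's exact code): each of the 4 KV heads is duplicated so the key/value tensors match the 8 query heads before attention.

```python
import torch
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq, head_dim) to
    (batch, num_kv_heads * n_rep, seq, head_dim) by repeating each KV head,
    so shapes line up with the query tensor for standard attention kernels."""
    if n_rep == 1:
        return x
    return torch.repeat_interleave(x, n_rep, dim=1)

# With num_heads=8 and num_kv_heads=4, each KV head is repeated twice:
q = torch.randn(1, 8, 16, 64)   # (batch, num_heads, seq, head_dim)
k = torch.randn(1, 4, 16, 64)   # (batch, num_kv_heads, seq, head_dim)
v = torch.randn(1, 4, 16, 64)
n_rep = 8 // 4
out = F.scaled_dot_product_attention(q, repeat_kv(k, n_rep), repeat_kv(v, n_rep))
```

Repeating K/V trades a little memory for compatibility: older `scaled_dot_product_attention` builds require the head dimensions of q, k, and v to match.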
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null
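A hedged sketch of zlib artifact compression using Python's standard library. The helper names are illustrative; `level: null` above presumably means zlib's default level was used, which maps to `level=-1` (`Z_DEFAULT_COMPRESSION`) in the stdlib API.

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    # level=-1 selects zlib's default compression level,
    # matching a "level: null" (unspecified) configuration.
    return zlib.compress(raw, level)

def decompress_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```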
Novel Contributions
- Switched the MLP activation from ReLU² to SwiGLU
- Scaled the SwiGLU hidden dimension by 2/3 to keep the parameter count matched to the ReLU² baseline
- Applied a GQA compatibility fix by explicitly repeating K/V heads for older PyTorch versions
- Trained a 9-layer model on 8×H100 under the 10-minute wallclock cap