| Metric | Value |
| --- | --- |
| val_bpb | 1.4716 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | ~13.6 MB |
Training Techniques
- **Architecture:** SwiGLU. Replaced the baseline ReLU² MLP with a SwiGLU-based MLP that uses SiLU gating. Parameters: `{"MLP_MULT": 1}`.
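As a sketch of what the SwiGLU replacement computes, here is a minimal NumPy version; the actual model presumably uses a deep-learning framework, and the weight names `W_gate`/`W_up`/`W_down` and the dimensions are illustrative assumptions, not taken from the submission:

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU MLP: a SiLU-activated gate branch elementwise-multiplies
    # a linear "up" branch, then a "down" projection maps back to d_model.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 8  # hypothetical sizes; MLP_MULT=1 suggests hidden ≈ model width
x = rng.standard_normal((2, d_model))
W_gate = rng.standard_normal((d_model, d_hidden))
W_up = rng.standard_normal((d_model, d_hidden))
W_down = rng.standard_normal((d_hidden, d_model))
y = swiglu_mlp(x, W_gate, W_up, W_down)
print(y.shape)  # → (2, 8)
```

Note that SwiGLU carries three weight matrices where a ReLU² MLP has two, which is one reason the submission also narrows the MLP width to stay under the size limit.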
- **Regularization:** gradient clipping. Parameters: none.
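Global-norm gradient clipping rescales all gradients by a common factor so their joint L2 norm never exceeds a threshold. A minimal NumPy sketch (the function name and `max_norm` value are illustrative; frameworks typically provide this as a utility, e.g. `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    # Compute the global L2 norm over all gradient tensors, then scale
    # every tensor by the same factor so the joint norm is <= max_norm.
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # no-op when already small
    return [g * scale for g in grads], total

grads = [np.full((3,), 4.0), np.full((4,), 3.0)]
clipped, norm = clip_grad_norm(grads, max_norm=1.0)
print(norm)  # pre-clip global norm, sqrt(84) ≈ 9.165
```

Clipping bounds the size of any single update, which helps both training stability and, plausibly, quantization robustness: it discourages the occasional huge step that would blow up the weight range a per-tensor int8 scale has to cover.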
- **Quantization:** int8 (bits: 8, scope: all).
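The submission does not spell out the quantization scheme; a common choice for weight-only int8 is symmetric per-tensor quantization, sketched below under that assumption (function names are hypothetical):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: map [-max|w|, +max|w|] onto [-127, 127]
    # with a single float scale stored alongside the int8 payload.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights at load time.
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = float(np.max(np.abs(w - w_hat)))  # bounded by about scale / 2
```

At 8 bits per weight this alone cuts a float32 checkpoint roughly 4×, before any entropy coding.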
- **Compression:** zlib (level: null).
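zlib is lossless, so it stacks cleanly on top of quantization: the int8 payload is compressed for storage and must be decompressed (then dequantized) at load time. A round-trip sketch, using synthetic low-entropy int8 data in place of real weights and the default compression level since the submission leaves `level` unset:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a quantized weight tensor: narrow-range int8 values
# compress well because few of the 256 possible byte values occur.
q = rng.integers(-10, 10, size=100_000).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)          # default level (null → library default)
restored = zlib.decompress(packed)   # lossless round-trip
assert restored == raw
print(len(raw), len(packed))
```

How much zlib saves depends on the entropy of the quantized weights; near-uniform int8 values would barely shrink, so the ~13.6 MB figure reflects both the weight distribution and the reduced MLP width.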
Novel Contributions
- Replaced the ReLU² MLP with a SwiGLU-based MLP
- Reduced MLP width to fit under the 16 MB limit
- Added gradient clipping for training stability and quantization robustness
- Used int8 quantization with zlib compression to achieve a ~13.6 MB artifact