PR #1393 (open)

[Submission] SwiGLU MLP (under 16MB)

by Abhinav-Avasarala
val_bpb: 1.4716
Architecture: Transformer
Optimizer:
Artifact Size: ~13.6MB

Training Techniques

Architecture: SwiGLU
Replaced the baseline ReLU² MLP with a SwiGLU-based MLP; see the sketch below.
parameters: {"mlp_mult": 1}

Quantization: int8
bits: 8
scope: all

Compression: zlib
level: null
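
Taken together with the int8 entry above, a minimal sketch of the packaging step, assuming per-tensor symmetric int8 quantization over all weights ("scope: all") followed by zlib at its default level ("level: null"); the container format and helper name are hypothetical:

```python
import io
import zlib
import torch

def quantize_and_compress(state_dict: dict) -> bytes:
    packed = {}
    for name, w in state_dict.items():
        # Per-tensor symmetric scale; floor avoids division by zero.
        scale = max(w.abs().max().item() / 127.0, 1e-12)
        q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
        packed[name] = (q, scale)
    buf = io.BytesIO()
    torch.save(packed, buf)
    # level: null in the metadata -> zlib's default compression level
    return zlib.compress(buf.getvalue())
```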

Regularization: gradient clipping
parameters: null
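
A sketch of where the clipping sits in the training step; the PR records its clipping parameters as null, so max_norm=1.0 below is purely illustrative:

```python
import torch

def training_step(model, optimizer, loss):
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Clip the global gradient norm before the update; this bounds update
    # magnitudes, which also keeps weight ranges friendlier to int8.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```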

Sequence Length: sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Replaced ReLU² MLP with SwiGLU
  • Reduced MLP width to MLP_MULT=1 to fit under the 16MB limit
  • Added gradient clipping for training stability and quantization robustness
  • Used int8 quantization with zlib compression to achieve a ~13.6MB artifact
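
For completeness, the matching load path under the same hypothetical container format as the packaging sketch above: inflate with zlib, then dequantize each tensor back to float before evaluating val_bpb.

```python
import io
import zlib
import torch

def decompress_and_load(blob: bytes) -> dict:
    packed = torch.load(io.BytesIO(zlib.decompress(blob)))
    # Dequantize: int8 tensor times its per-tensor scale.
    return {name: q.to(torch.float32) * scale
            for name, (q, scale) in packed.items()}
```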