| Metric | Value |
| --- | --- |
| val_bpb | 1.4716 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | ~13.6 MB |
Training Techniques
- **Architecture:** SwiGLU. Replaced the baseline ReLU² MLP with a SwiGLU-based MLP that uses SiLU gating. Parameters: `{"MLP_MULT": 1}`.
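As a sketch of what the SwiGLU replacement computes, here is a minimal NumPy version; the actual model presumably uses a deep-learning framework, and the weight names `W_gate`/`W_up`/`W_down` and the dimensions are illustrative assumptions, not taken from the submission:

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU MLP: a SiLU-activated gate branch elementwise-multiplies
    # a linear "up" branch, then a "down" projection maps back to d_model.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 8  # hypothetical sizes; MLP_MULT=1 suggests hidden ≈ model width
x = rng.standard_normal((2, d_model))
W_gate = rng.standard_normal((d_model, d_hidden))
W_up = rng.standard_normal((d_model, d_hidden))
W_down = rng.standard_normal((d_hidden, d_model))
y = swiglu_mlp(x, W_gate, W_up, W_down)
print(y.shape)  # → (2, 8)
```

Note that SwiGLU carries three weight matrices where a ReLU² MLP has two, which is one reason the submission also narrows the MLP width to stay under the size limit.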
- **Regularization:** gradient clipping. Parameters: none.
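Global-norm gradient clipping rescales all gradients by a common factor so their joint L2 norm never exceeds a threshold. A minimal NumPy sketch (the function name and `max_norm` value are illustrative; frameworks typically provide this as a utility, e.g. `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    # Compute the global L2 norm over all gradient tensors, then scale
    # every tensor by the same factor so the joint norm is <= max_norm.
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # no-op when already small
    return [g * scale for g in grads], total

grads = [np.full((3,), 4.0), np.full((4,), 3.0)]
clipped, norm = clip_grad_norm(grads, max_norm=1.0)
print(norm)  # pre-clip global norm, sqrt(84) ≈ 9.165
```

Clipping bounds the size of any single update, which helps both training stability and, plausibly, quantization robustness: it discourages the occasional huge step that would blow up the weight range a per-tensor int8 scale has to cover.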
- **Quantization:** int8 (bits: 8, scope: all).
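The submission does not spell out the quantization scheme; a common choice for weight-only int8 is symmetric per-tensor quantization, sketched below under that assumption (function names are hypothetical):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: map [-max|w|, +max|w|] onto [-127, 127]
    # with a single float scale stored alongside the int8 payload.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights at load time.
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = float(np.max(np.abs(w - w_hat)))  # bounded by about scale / 2
```

At 8 bits per weight this alone cuts a float32 checkpoint roughly 4×, before any entropy coding.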
- **Compression:** zlib (level: null).
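zlib is lossless, so it stacks cleanly on top of quantization: the int8 payload is compressed for storage and must be decompressed (then dequantized) at load time. A round-trip sketch, using synthetic low-entropy int8 data in place of real weights and the default compression level since the submission leaves `level` unset:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a quantized weight tensor: narrow-range int8 values
# compress well because few of the 256 possible byte values occur.
q = rng.integers(-10, 10, size=100_000).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)          # default level (null → library default)
restored = zlib.decompress(packed)   # lossless round-trip
assert restored == raw
print(len(raw), len(packed))
```

How much zlib saves depends on the entropy of the quantized weights; near-uniform int8 values would barely shrink, so the ~13.6 MB figure reflects both the weight distribution and the reduced MLP width.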
Novel Contributions
- Replaced the ReLU² MLP with a SwiGLU-based MLP
- Reduced MLP width to fit under the 16 MB limit
- Added gradient clipping for training stability and quantization robustness
- Used int8 quantization with zlib compression to achieve a ~13.6 MB artifact