val_bpb: 1.3069
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques

Architecture
- LeakyReLU: uses leaky ReLU activation in the model (parameters: {"slope": 0.5})
- ReLU²: uses squared ReLU activation in the model (parameters: null)
- BigramHash: adds bigram hash embeddings (parameters: null)
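The entry names the two activations but does not define them; as a minimal sketch (function names are mine, only the 0.5 slope comes from the entry), they behave as:

```python
import numpy as np

def leaky_relu(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """LeakyReLU with the listed slope of 0.5: negative inputs are scaled, not zeroed."""
    return np.where(x >= 0, x, slope * x)

def relu_squared(x: np.ndarray) -> np.ndarray:
    """ReLU² (squared ReLU): rectify, then square; zero for negative inputs."""
    return np.maximum(x, 0.0) ** 2

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))    # negatives scaled by 0.5
print(relu_squared(x))  # positives squared, negatives zeroed
```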
Quantization
- GPTQ (bits: null, scope: null)
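The entry records GPTQ but leaves both bits and scope unset. GPTQ itself solves a layer-wise reconstruction problem using second-order information; as a much simpler stand-in that only illustrates what the bits and scope (per-channel vs. per-tensor) fields control, here is symmetric round-to-nearest weight quantization (all names are illustrative, not the entry's implementation):

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 4, per_channel: bool = True):
    """Symmetric round-to-nearest quantization (a stand-in for GPTQ's solver).

    `bits` sets the signed integer range; `per_channel` mimics the 'scope' field
    (one scale per output row vs. one scale for the whole tensor)."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit signed
    axis = 1 if per_channel else None
    scale = np.max(np.abs(w), axis=axis, keepdims=per_channel) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.array([[0.5, -1.0], [2.0, 0.25]])
q, s = quantize_weights(w, bits=4, per_channel=True)
w_hat = q * s  # dequantized approximation of w
```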
Weight Averaging
- EMA (parameters: null)
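EMA weight averaging keeps a shadow copy of the parameters, updated after each optimizer step as a decayed running average, and the shadow weights are typically used for evaluation. A minimal sketch (the decay value is illustrative; the entry specifies no parameters):

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights (shadow copy for evaluation)."""

    def __init__(self, params: dict, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params: dict) -> None:
        # shadow <- decay * shadow + (1 - decay) * current, after every step
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

params = {"w": np.zeros(3)}
ema = EMA(params, decay=0.9)
params["w"] += 1.0   # one training step moves the live weights
ema.update(params)   # shadow becomes 0.9 * 0 + 0.1 * 1 = 0.1
```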
Novel Contributions
- LeakyReLU with slope 0.5
- ReLU² activation
- GPTQ quantization
- EMA weight averaging
- BigramHash embeddings
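The BigramHash embeddings listed above come with no parameters; one common construction (a sketch, not necessarily this entry's implementation) hashes each consecutive token pair into a fixed-size auxiliary table and adds the result to the unigram embedding:

```python
import numpy as np

def bigram_hash_embed(tokens, unigram_emb, bigram_emb, num_buckets):
    """Add hashed-bigram embeddings to the regular token embeddings.

    tokens: 1-D array of token ids; unigram_emb: (vocab, d); bigram_emb: (num_buckets, d).
    The bucket for position t hashes the pair (tokens[t-1], tokens[t]);
    position 0 has no predecessor and keeps only its unigram embedding."""
    out = unigram_emb[tokens].copy()
    for t in range(1, len(tokens)):
        # Cheap illustrative pair hash; real implementations may use stronger mixing.
        bucket = (int(tokens[t - 1]) * 1_000_003 + int(tokens[t])) % num_buckets
        out[t] += bigram_emb[bucket]
    return out

rng = np.random.default_rng(0)
uni = rng.normal(size=(100, 8))   # unigram embedding table
bi = rng.normal(size=(512, 8))    # hashed bigram table
x = bigram_hash_embed(np.array([3, 7, 7]), uni, bi, 512)
```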