PR #1136

closed

Trunghiu

by Hieuabssy
val_bpb: 1.3069
Architecture: Transformer
Optimizer
Artifact Size

Training Techniques

Architecture
LeakyReLU
Uses leaky ReLU activation in the model.
parameters: {"slope":0.5}
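A minimal sketch of the leaky ReLU activation with the slope 0.5 listed above (the function name is illustrative; the PR does not show its implementation):

```python
def leaky_relu(x, slope=0.5):
    # Identity for non-negative inputs; negative inputs are scaled
    # by `slope` (0.5 here, per the parameters above) instead of zeroed.
    return x if x >= 0 else slope * x
```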
ReLU²
Uses squared ReLU activation in the model.
parameters: null
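Squared ReLU has a standard definition, max(x, 0)²; a one-line sketch (name chosen for illustration):

```python
def relu_squared(x):
    # Squared ReLU: rectify, then square. Zero for x <= 0,
    # x**2 for x > 0.
    return max(x, 0.0) ** 2
```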
BigramHash
Adds bigram hash embeddings.
parameters: null
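The PR gives no details of the bigram hash embedding, but the usual idea is to hash each adjacent (previous, current) token pair into a fixed-size embedding table. A sketch under that assumption; the hash constant, the sentinel id 0 for the first position, and all names are illustrative:

```python
def bigram_bucket(prev_tok, cur_tok, table_size):
    # Mix the two token ids with a simple multiplicative hash,
    # then reduce modulo the table size to pick an embedding row.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % table_size

def bigram_hash_embedding(tokens, table, table_size):
    # One hashed-bigram embedding per position. Position 0 has no
    # predecessor, so a sentinel id of 0 stands in (an assumption).
    out = []
    prev = 0
    for tok in tokens:
        out.append(table[bigram_bucket(prev, tok, table_size)])
        prev = tok
    return out
```

In a real model these hashed rows would typically be added to the ordinary token embeddings; hashing keeps the table size fixed regardless of vocabulary size, at the cost of collisions.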
Quantization
GPTQ
bits: null
scope: null
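With bits and scope unspecified above, only the general shape can be shown. GPTQ quantizes weights column by column while compensating each column's rounding error using second-order (Hessian) information; the round-to-nearest baseline it improves on can be sketched as (bit width chosen arbitrarily for illustration):

```python
def quantize_rtn(weights, bits=4):
    # Round-to-nearest symmetric quantization of one weight row.
    # This is the naive baseline; GPTQ proper additionally propagates
    # each column's rounding error into the not-yet-quantized columns.
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 levels each side for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    # Return the dequantized weights and the scale used.
    return [v * scale for v in q], scale
```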
Weight Averaging
EMA
parameters: null
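EMA weight averaging maintains a shadow copy of the parameters updated as a decayed running average; with the decay unspecified above, a common default like 0.999 is assumed here for illustration:

```python
def ema_update(avg_params, cur_params, decay=0.999):
    # Exponential moving average of model weights:
    #   avg <- decay * avg + (1 - decay) * current
    # The averaged copy, not the raw weights, is used for evaluation.
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, cur_params)]
```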

Novel Contributions

  • LeakyReLU with slope 0.5
  • ReLU² activation
  • GPTQ quantization
  • EMA weight averaging
  • BigramHash embeddings