val_bpb: 1.3434
Architecture: Transformer
Optimizer: —
Artifact Size: 19.36 MB
Training Techniques
Quantization: mixed int6/int8 (bits: not reported; scope: all)
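As a rough illustration of a mixed int6/int8 scheme, the sketch below symmetrically quantizes each weight tensor to 6 or 8 bits. The per-tensor symmetric scheme and the caller-supplied `int6_names` split are assumptions, not the submission's exact recipe.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization of a float weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                # int6 values still ride in int8 containers

def quantize_state_dict(state_dict, int6_names):
    """Quantize every tensor; names in `int6_names` (an assumed split) get 6 bits, the rest 8."""
    packed = {}
    for name, w in state_dict.items():
        bits = 6 if name in int6_names else 8
        q, scale = quantize_symmetric(w.float(), bits)
        packed[name] = (q, scale, bits)
    return packed
```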
Architecture Components
TrigramHash Embedding: embedding built by hashing token trigrams, used alongside the BigramHash embedding to capture triplet context (parameter count not reported). A shared n-gram hashing sketch follows the BigramHash entry.
BigramHash Embedding: embedding built by hashing token bigrams to capture pairwise context (parameter count not reported).
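A minimal sketch of how both hashed n-gram embeddings could work, assuming one bucket table per n-gram order and a simple multiplicative hash; the mixing constant, bucket count, and the way the streams are combined with the token embedding are assumptions, not the submission's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NgramHashEmbedding(nn.Module):
    """Hash each length-n window of token ids into a fixed bucket table and embed it."""

    def __init__(self, num_buckets: int, dim: int, n: int):
        super().__init__()
        self.n = n
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) int64 token ids; left-pad so position t sees tokens t-n+1 .. t
        padded = F.pad(ids, (self.n - 1, 0), value=0)
        h = torch.zeros_like(ids)
        for k in range(self.n):
            prev = padded[:, k : k + ids.size(1)]
            h = h * 1000003 + prev            # multiplicative mixing; constant is an assumption
        return self.table(h % self.num_buckets)

# Hypothetical usage: sum token, bigram, and trigram streams into the model input.
# x = tok_emb(ids) + NgramHashEmbedding(2**16, dim, 2)(ids) + NgramHashEmbedding(2**16, dim, 3)(ids)
```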
U-Net Skip Gates: sigmoid gating on skip connections between matching encoder and decoder segments (parameter count not reported).
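One way the gated skips might look, assuming a learned per-channel gate that blends a saved encoder-side activation into the matching decoder-side layer; the layer pairing and gate parameterization are assumptions.

```python
import torch
import torch.nn as nn

class SkipGate(nn.Module):
    """Sigmoid-gated U-Net style skip: decoder activation plus a gated encoder activation."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))      # sigmoid(0) = 0.5 at init

    def forward(self, decoder_x: torch.Tensor, encoder_x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)                    # (dim,), broadcasts over batch/seq
        return decoder_x + g * encoder_x

# Hypothetical wiring for an 11-layer stack: a layer in the second half receives the stored
# output of its mirror layer from the first half through its own SkipGate instance.
```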
Star-ReLU: quadratic activation scaling (parameter count not reported).
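Star-ReLU is usually written as scale * ReLU(x)^2 + bias, i.e. a squared ReLU with a rescaling; the sketch below assumes the commonly cited learnable defaults (scale 0.8944, bias -0.4472) rather than the submission's exact settings.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """StarReLU: scale * relu(x)**2 + bias, a quadratic rescaling of ReLU."""

    def __init__(self, scale: float = 0.8944, bias: float = -0.4472, learnable: bool = True):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale), requires_grad=learnable)
        self.bias = nn.Parameter(torch.tensor(bias), requires_grad=learnable)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias
```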
Other: no pruning; the exact 0.0 clamping step was removed so that every weight stays nonzero, preserving full model density (parameter count not reported).
Novel Contributions
- Scaling up to 11 layers to push network capacity
- Using TrigramHash embedding alongside BigramHash embedding
- Introducing U-Net style sigmoid gating between encoder and decoder segments
- Applying Star-ReLU quadratic activation scaling
- Demonstrating that pruning is mandatory to meet the 16MB artifact size limit
- Experimenting with unpruned mixed int6/int8 quantized weights, which left the weights high-entropy and pushed the artifact size over the limit (a rough size estimate follows this list)
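As a back-of-the-envelope illustration of why unpruned mixed int6/int8 weights can land near 19 MB, the helper below computes raw storage for a hypothetical parameter count and int6 fraction; both inputs are illustrative assumptions, not the submission's real figures, and scales, metadata, and compression are ignored.

```python
def artifact_mb(n_params: int, frac_int6: float) -> float:
    """Raw storage in MB for a mixed int6/int8 weight file (no scales, metadata, or compression)."""
    avg_bits = frac_int6 * 6 + (1 - frac_int6) * 8
    return n_params * avg_bits / 8 / 1e6

# Purely illustrative: ~22M parameters with half the tensors at int6 averages 7 bits/weight,
# so artifact_mb(22_000_000, 0.5) is about 19.25 MB, already over a 16 MB budget; because
# unpruned weights are high-entropy, lossless compression cannot shrink this much further.
```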