PR #570 (open)

(Non record) 11L Frontier MixedQuant Trigram

by armmer016
val_bpb: 1.3434
Architecture: Transformer
Optimizer:
Artifact Size: 19.36 MB

Training Techniques

Quantization
mixed int6/int8
bits: null
scope: all
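
No quantization code accompanies this entry, so the following is a minimal sketch of symmetric per-tensor mixed int6/int8 quantization; the 6-vs-8-bit assignment rule and every function name here are assumptions, not the PR's actual scheme.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization onto a signed `bits`-wide integer grid."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale factor
    q = torch.round(w / scale).clamp(-qmax - 1, qmax)
    return q.to(torch.int8), scale                # int6 codes still ride in int8 storage

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

def mixed_quantize(state_dict: dict) -> dict:
    # Hypothetical split: 6 bits for embedding tables, 8 bits everywhere else.
    return {name: quantize_symmetric(w, 6 if "embed" in name else 8)
            for name, w in state_dict.items()}
```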
Architecture
TrigramHash Embedding
Embedding using trigram hashing alongside BigramHash to capture triplet context (see the combined sketch after the BigramHash entry below)
parameters: null
BigramHash Embedding
Embedding using bigram hashing to capture pairwise context
parameters: null
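
A minimal sketch of the two hash embeddings together, assuming PyTorch: each (t-1, t) pair and (t-2, t-1, t) triple is hashed into a fixed-size table whose vectors are added to the plain token embedding. The table size, hash constant, and zero-padding at the sequence start are all assumptions.

```python
import torch
import torch.nn as nn

class HashedNgramEmbedding(nn.Module):
    """Token embedding augmented with hashed bigram and trigram context vectors."""
    def __init__(self, vocab_size: int, dim: int, table_size: int = 4096):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.bi = nn.Embedding(table_size, dim)    # hashed (t-1, t) pairs
        self.tri = nn.Embedding(table_size, dim)   # hashed (t-2, t-1, t) triples
        self.table_size = table_size

    def _hash(self, *parts):
        # Simple multiplicative mixing hash; the constant is an arbitrary choice.
        h = torch.zeros_like(parts[0])
        for p in parts:
            h = (h * 1000003 + p) % self.table_size
        return h

    def forward(self, ids: torch.Tensor):          # ids: (batch, seq), int64
        prev1 = torch.roll(ids, 1, dims=-1)
        prev2 = torch.roll(ids, 2, dims=-1)
        prev1[..., 0] = 0                          # treat out-of-range context as token 0
        prev2[..., :2] = 0
        x = self.tok(ids)
        x = x + self.bi(self._hash(prev1, ids))
        x = x + self.tri(self._hash(prev2, prev1, ids))
        return x
```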
U-Net Skip Gates
Sigmoid gating connecting encoder and decoder segments
parameters: null
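
A minimal sketch of the gating, assuming the 11 layers split into a 5-layer encoder half, one ungated middle layer, and a 5-layer decoder half whose layer i receives its mirror encoder layer's activation through a learnable per-channel sigmoid gate; that pairing and parameterization are assumptions.

```python
import torch
import torch.nn as nn

class GatedUNetStack(nn.Module):
    """Layer stack where decoder-half layers receive sigmoid-gated skip
    connections from their mirrored encoder-half layers."""
    def __init__(self, layers, dim: int):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.n = len(layers)
        self.half = self.n // 2  # e.g. 5 encoder + 1 middle + 5 decoder for 11 layers
        # One learnable per-channel gate per skip; sigmoid(0) = 0.5 at init.
        self.gates = nn.ParameterList(
            [nn.Parameter(torch.zeros(dim)) for _ in range(self.half)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for layer in self.layers[:self.half]:                # encoder half
            x = layer(x)
            skips.append(x)
        for layer in self.layers[self.half:self.n - self.half]:  # middle, ungated
            x = layer(x)
        for i, layer in enumerate(self.layers[self.n - self.half:]):  # decoder half
            g = torch.sigmoid(self.gates[i])
            x = layer(x + g * skips[self.half - 1 - i])      # gated mirror skip
        return x

# Example: 11 toy layers of width 64 (5 encoder, 1 middle, 5 decoder).
stack = GatedUNetStack(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(11)], dim=64)
y = stack(torch.randn(2, 16, 64))
```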
Star-ReLU
Quadratic activation scaling
parameters: null
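
Star-ReLU is commonly defined as a squared ReLU with learnable scale and bias (Yu et al., "MetaFormer Baselines for Vision"); the sketch below uses that parameterization, and whether this PR learns or freezes the two scalars is not stated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StarReLU(nn.Module):
    """StarReLU: s * relu(x)**2 + b, a squared ReLU with learnable scale and bias."""
    def __init__(self, scale: float = 0.8944, bias: float = -0.4472):
        super().__init__()
        # Default init roughly normalizes output variance for N(0, 1) inputs.
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * F.relu(x) ** 2 + self.bias
```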
Other
No pruning
Exact 0.0 weight clamping removed so the weight tensors stay fully dense
parameters: null
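
For contrast, a minimal sketch of the kind of 0.0-clamping pruning step this PR removes; the magnitude threshold is purely illustrative.

```python
import torch

def magnitude_prune(w: torch.Tensor, threshold: float = 1e-2) -> torch.Tensor:
    """Clamp small-magnitude weights to exactly 0.0 so runs of zeros compress
    well in the serialized artifact. This is the step removed in this PR."""
    return torch.where(w.abs() < threshold, torch.zeros_like(w), w)
```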

Novel Contributions

  • Scaling up to 11 layers to push network capacity
  • Using TrigramHash embedding alongside BigramHash embedding
  • Introducing U-Net style sigmoid gating between encoder and decoder segments
  • Applying Star-ReLU quadratic activation scaling
  • Demonstrating that pruning is mandatory to meet the 16 MB artifact size limit
  • Experimenting with unpruned mixed int6/int8 quantized weights, which leaves weight entropy high and the artifact oversized (see the entropy sketch after this list)
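
To make that entropy/size link concrete, here is a sketch under the assumption that the serialized artifact is losslessly compressed: the empirical Shannon entropy of the integer weights lower-bounds the achievable bits per weight, and unpruned weights lack a dominant zero symbol for the compressor to exploit.

```python
import torch

def entropy_bits_per_weight(q: torch.Tensor) -> float:
    """Empirical Shannon entropy (bits/symbol) of an integer weight tensor,
    a lower bound on what lossless compression of the artifact can reach."""
    _, counts = q.unique(return_counts=True)
    p = counts.float() / q.numel()
    return float(-(p * p.log2()).sum())

def estimated_artifact_mb(q: torch.Tensor) -> float:
    return entropy_bits_per_weight(q) * q.numel() / 8 / 2**20

# Unpruned int8-range weights use nearly all 256 symbols roughly evenly, so
# entropy sits near 8 bits/weight and compression recovers almost nothing.
q = torch.randint(-128, 128, (1_000_000,), dtype=torch.int8)
print(f"{entropy_bits_per_weight(q):.2f} bits/weight, "
      f"~{estimated_artifact_mb(q):.2f} MB estimated for 1M weights")
```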