val_bpb: 1.3446
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
persistent memory
Replaces the feed-forward sub-layer in each Transformer block with learned persistent memory vectors, i.e. key/value slots that self-attention attends to alongside the input tokens, following "Augmenting Self-attention with Persistent Memory".
parameters: null
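As a rough illustration of the persistent-memory idea, here is a minimal single-head sketch in NumPy: the learned memory keys/values (`mem_k`, `mem_v`) are concatenated with the context keys/values, so attention over them plays the role the feed-forward layer would otherwise play. All names, shapes, and scales here are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_persistent_memory(x, Wq, Wk, Wv, mem_k, mem_v):
    """Single-head self-attention whose keys/values are augmented with
    learned persistent-memory slots (standing in for the FFN sub-layer)."""
    q = x @ Wq                                    # (T, d) queries
    k = np.concatenate([x @ Wk, mem_k], axis=0)   # (T + M, d) keys
    v = np.concatenate([x @ Wv, mem_v], axis=0)   # (T + M, d) values
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot products
    return softmax(scores) @ v                    # (T, d) outputs

# Toy shapes (assumed): T=4 tokens, model dim d=8, M=16 memory slots.
rng = np.random.default_rng(0)
T, d, M = 4, 8, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
mem_k = rng.standard_normal((M, d)) * 0.1
mem_v = rng.standard_normal((M, d)) * 0.1
out = attention_with_persistent_memory(x, Wq, Wk, Wv, mem_k, mem_v)
```

Note that the output shape is unchanged from plain self-attention, so the block drops in without altering the surrounding residual connections.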
low-rank factorization
Factorizes large square weight matrices as W = W_d W_u, where W_d is d×r and W_u is r×d for some rank r ≪ d, reducing the parameter count from d² to 2dr.
parameters: null
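The parameter savings can be made concrete with a small sketch; the dimension d and rank r below are illustrative assumptions, and the key point is that the product W_d W_u is never materialized as a full d×d matrix.

```python
import numpy as np

d, r = 1024, 64  # assumed full dimension and rank, with r << d
rng = np.random.default_rng(1)
W_d = rng.standard_normal((d, r)) / np.sqrt(d)  # down-projection
W_u = rng.standard_normal((r, d)) / np.sqrt(r)  # up-projection

def apply_factorized(x):
    # Compute x @ (W_d @ W_u) as (x @ W_d) @ W_u:
    # two thin matmuls, never forming the d x d product.
    return (x @ W_d) @ W_u

full_params = d * d          # 1,048,576 for the dense matrix
factored_params = 2 * d * r  # 131,072 for the two factors (8x smaller)
```

At these sizes the factorization stores 8× fewer parameters than the dense matrix, at the cost of constraining W to rank at most r.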
Quantization
int8
bits: 8
scope: tensors with more than 16384 elements; smaller tensors kept in fp16
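A minimal sketch of this mixed-precision rule, assuming simple symmetric per-tensor quantization (the actual scheme, e.g. per-channel scales or asymmetric zero points, is not specified in the source): tensors above the size threshold are rounded to int8 with a single scale, smaller tensors are cast to fp16.

```python
import numpy as np

SIZE_THRESHOLD = 16384  # element-count cutoff from the scope rule above

def quantize_tensor(w):
    """Symmetric per-tensor int8 quantization for large tensors;
    tensors at or below the threshold are kept in fp16 (scale=None)."""
    if w.size <= SIZE_THRESHOLD:
        return w.astype(np.float16), None
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # map max magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp32 values; fp16 tensors pass through."""
    return q if scale is None else q.astype(np.float32) * scale
```

With a single per-tensor scale, the worst-case rounding error is half a quantization step, which is why the threshold matters: small tensors contribute little to artifact size but can be disproportionately sensitive to int8 rounding.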
Novel Contributions
- Replaces Transformer feed-forward layers with persistent memory
- Applies mixed precision quantization with INT8 for large tensors and FP16 for smaller tensors
- Uses low-rank factorization (W = W_d W_u) to reduce the parameter count of large square matrices