PR #226

open

Submission: Low-Rank All-Attention (1.3446 bpb)

by CRouvroy
val_bpb: 1.3446
Architecture: Transformer

Architecture
persistent memory
Replaces the feed-forward network in each Transformer block with persistent memory vectors, following Augmenting Self-Attention with Persistent Memory (Sukhbaatar et al., 2019).
parameters: null
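A minimal sketch of the persistent-memory idea: the attention layer's key/value set is augmented with learned vectors that are shared across all positions, letting the attention sublayer absorb the role of the FFN. This is an illustrative single-head numpy version (no causal mask, no multi-head split); all names here are assumptions, not the submission's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def all_attention(x, Wq, Wk, Wv, mem_k, mem_v):
    """Single-head attention whose keys/values are augmented with
    n_mem learned persistent vectors (standing in for the FFN)."""
    q = x @ Wq                                    # (T, d)
    k = np.concatenate([x @ Wk, mem_k], axis=0)   # (T + n_mem, d)
    v = np.concatenate([x @ Wv, mem_v], axis=0)   # (T + n_mem, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T, T + n_mem)
    return softmax(scores) @ v                    # (T, d)

rng = np.random.default_rng(0)
T, d, n_mem = 4, 8, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
mem_k = rng.standard_normal((n_mem, d))  # learned, input-independent
mem_v = rng.standard_normal((n_mem, d))
y = all_attention(x, Wq, Wk, Wv, mem_k, mem_v)
```

Because `mem_k`/`mem_v` do not depend on the input, they act as a fixed key-value store the model can always attend to, which is how the paper frames removing the FFN.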
low-rank factorization
Factorizes matrices as W = W_d W_u to reduce parameter count for large square matrices.
parameters: null
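The factorization W = W_d W_u can be sketched as follows, using a truncated SVD as one possible initialization (the submission does not specify how the factors are obtained; function and variable names are illustrative):

```python
import numpy as np

def factorize_linear(W, rank):
    """Replace a d x d weight W with the rank-r product W_d @ W_u,
    initialized here from a truncated SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_d = U[:, :rank] * s[:rank]   # (d, r) down-projection, columns scaled by singular values
    W_u = Vt[:rank]                # (r, d) up-projection
    return W_d, W_u

rng = np.random.default_rng(0)
d, r = 256, 32
W = rng.standard_normal((d, d))
W_d, W_u = factorize_linear(W, r)
# Parameter count drops from d*d to 2*d*r (here 65536 -> 16384, a 4x reduction).
full_params, lr_params = d * d, 2 * d * r
```

The saving only pays off when r < d/2; for large square matrices the reduction factor is d/(2r).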
Quantization
int8
bits: 8
scope: tensors with size > 16384; smaller tensors kept in fp16
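The size-thresholded scheme can be sketched as symmetric per-tensor int8 quantization applied only above the 16384-element cutoff, with fp16 passthrough below it. The exact scaling scheme is an assumption (the submission specifies only the bit widths and the threshold):

```python
import numpy as np

SIZE_THRESHOLD = 16384  # per the submission: int8 only for tensors with size > 16384

def quantize_tensor(t):
    """Symmetric per-tensor int8 for large tensors; fp16 passthrough otherwise."""
    if t.size > SIZE_THRESHOLD:
        scale = max(float(np.abs(t).max()) / 127.0, 1e-12)  # avoid divide-by-zero
        q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
        return q, scale
    return t.astype(np.float16), 1.0

def dequantize_tensor(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
big = rng.standard_normal(20000).astype(np.float32)    # > threshold -> int8
small = rng.standard_normal(100).astype(np.float32)    # <= threshold -> fp16
qb, sb = quantize_tensor(big)
qs, ss = quantize_tensor(small)
max_err = np.abs(dequantize_tensor(qb, sb) - big).max()  # bounded by ~scale/2
```

Keeping small tensors (norms, biases, embeddings of small vocab slices) in fp16 avoids the relative error that int8 would introduce where the parameter savings are negligible anyway.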

Novel Contributions

  • Replaces Transformer feed-forward layers with persistent memory
  • Applies mixed-precision quantization: INT8 for large tensors, FP16 for smaller ones
  • Uses low-rank factorization (W = W_d W_u) to reduce the parameter count of large weight matrices