PR #226

open

Submission: Low-Rank All-Attention (1.3446 bpb)

by CRouvroy
val_bpb: 1.3446
Architecture: Transformer

Architecture
persistent memory
Replaces the feed-forward network in each Transformer block with persistent memory vectors, following Augmenting Self-Attention with Persistent Memory (Sukhbaatar et al., 2019).
parameters: null
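A minimal sketch of the persistent-memory idea: the attention layer's key/value set is augmented with learned vectors that are shared across all positions, letting the attention sublayer absorb the role of the FFN. This is an illustrative single-head numpy version (no causal mask, no multi-head split); all names here are assumptions, not the submission's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def all_attention(x, Wq, Wk, Wv, mem_k, mem_v):
    """Single-head attention whose keys/values are augmented with
    n_mem learned persistent vectors (standing in for the FFN)."""
    q = x @ Wq                                    # (T, d)
    k = np.concatenate([x @ Wk, mem_k], axis=0)   # (T + n_mem, d)
    v = np.concatenate([x @ Wv, mem_v], axis=0)   # (T + n_mem, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T, T + n_mem)
    return softmax(scores) @ v                    # (T, d)

rng = np.random.default_rng(0)
T, d, n_mem = 4, 8, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
mem_k = rng.standard_normal((n_mem, d))  # learned, input-independent
mem_v = rng.standard_normal((n_mem, d))
y = all_attention(x, Wq, Wk, Wv, mem_k, mem_v)
```

Because `mem_k`/`mem_v` do not depend on the input, they act as a fixed key-value store the model can always attend to, which is how the paper frames removing the FFN.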
low-rank factorization
Factorizes matrices as W = W_d W_u to reduce parameter count for large square matrices.
parameters: null
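The factorization W = W_d W_u can be sketched as follows, using a truncated SVD as one possible initialization (the submission does not specify how the factors are obtained; function and variable names are illustrative):

```python
import numpy as np

def factorize_linear(W, rank):
    """Replace a d x d weight W with the rank-r product W_d @ W_u,
    initialized here from a truncated SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_d = U[:, :rank] * s[:rank]   # (d, r) down-projection, columns scaled by singular values
    W_u = Vt[:rank]                # (r, d) up-projection
    return W_d, W_u

rng = np.random.default_rng(0)
d, r = 256, 32
W = rng.standard_normal((d, d))
W_d, W_u = factorize_linear(W, r)
# Parameter count drops from d*d to 2*d*r (here 65536 -> 16384, a 4x reduction).
full_params, lr_params = d * d, 2 * d * r
```

The saving only pays off when r < d/2; for large square matrices the reduction factor is d/(2r).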
Quantization
int8
bits: 8
scope: tensors with size > 16384; smaller tensors kept in fp16
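The size-thresholded scheme can be sketched as symmetric per-tensor int8 quantization applied only above the 16384-element cutoff, with fp16 passthrough below it. The exact scaling scheme is an assumption (the submission specifies only the bit widths and the threshold):

```python
import numpy as np

SIZE_THRESHOLD = 16384  # per the submission: int8 only for tensors with size > 16384

def quantize_tensor(t):
    """Symmetric per-tensor int8 for large tensors; fp16 passthrough otherwise."""
    if t.size > SIZE_THRESHOLD:
        scale = max(float(np.abs(t).max()) / 127.0, 1e-12)  # avoid divide-by-zero
        q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
        return q, scale
    return t.astype(np.float16), 1.0

def dequantize_tensor(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
big = rng.standard_normal(20000).astype(np.float32)    # > threshold -> int8
small = rng.standard_normal(100).astype(np.float32)    # <= threshold -> fp16
qb, sb = quantize_tensor(big)
qs, ss = quantize_tensor(small)
max_err = np.abs(dequantize_tensor(qb, sb) - big).max()  # bounded by ~scale/2
```

Keeping small tensors (norms, biases, embeddings of small vocab slices) in fp16 avoids the relative error that int8 would introduce where the parameter savings are negligible anyway.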

Novel Contributions

  • Replaces Transformer feed-forward layers with persistent memory
  • Applies mixed-precision quantization: INT8 for large tensors, FP16 for smaller ones
  • Uses low-rank factorization (W = W_d W_u) to reduce the parameter count of large weight matrices