Minimizing Communication in Linear Algebra
TLDR
This work generalizes a lower bound on the amount of communication needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm to a much wider variety of algorithms, including LU, Cholesky, LDL^T, and QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values.
Abstract:
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
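To illustrate the bandwidth bound, the sketch below (a hypothetical model, not code from the paper) counts the words a standard blocked matrix-multiplication algorithm moves between slow and fast memory. With block size b ≈ √(M/3), so that one block each of A, B, and C fits in fast memory, the total traffic is Θ(n^3/√M), matching the Ω(#arithmetic operations / √M) lower bound up to a constant factor:

```python
import math

def blocked_matmul_traffic(n, M):
    # Pick block size b so that three b-by-b blocks (of A, B, and C)
    # fit simultaneously in a fast memory of size M words.
    b = max(1, int(math.sqrt(M // 3)))
    nb = math.ceil(n / b)  # number of blocks per matrix dimension
    # The blocked algorithm performs nb^3 block multiplications; each one
    # reads a block of A and of B and updates a block of C, moving
    # roughly 3*b^2 words between slow and fast memory.
    return nb ** 3 * 3 * b * b

# Hypothetical sizes: fast memory holds three 64x64 blocks.
n, M = 1024, 3 * 64 * 64
print(blocked_matmul_traffic(n, M))  # → 50331648, i.e. Θ(n^3 / √M)
```

For n much larger than √M this count grows as n^3/√M, so the conventional blocked algorithm already attains the lower bound for matrix multiplication; the paper's contribution is extending the bound itself to the other direct methods listed above.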
Citations
Journal Article
Towards dense linear algebra for hybrid GPU accelerated manycore systems
TL;DR: Motivates the need for new algorithms that split the computation so as to fully exploit the power of each hybrid component, and envisions a DLA library similar to LAPACK but for hybrid manycore/GPU systems.
Posted Content
Communication-optimal parallel and sequential QR and LU factorizations
TL;DR: In this article, the authors present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR.
Book Chapter
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Edgar Solomonik, James Demmel, et al.
TL;DR: Proves a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while keeping c copies of the data can reduce bandwidth, the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency.
Journal Article
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
Aydin Buluc, John R. Gilbert, et al.
TL;DR: It is demonstrated that the parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case.
Journal Article
A massively parallel tensor contraction framework for coupled-cluster computations
TL;DR: A distributed-memory numerical library (Cyclops Tensor Framework) that automatically manages tensor blocking and redistribution to perform any user-specified contractions and enables the expression of massively-parallel coupled-cluster methods via a concise tensor contraction interface.