Minimizing Communication in Linear Algebra
TLDR
This work generalizes a lower bound on the amount of communication needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm to a much wider variety of algorithms, including LU, Cholesky, LDL^T, and QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values.
Abstract:
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
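To illustrate the bandwidth bound, the sketch below (a hypothetical model, not code from the paper) counts the words a standard blocked matrix-multiplication algorithm moves between slow and fast memory. With block size b ≈ √(M/3), so that one block each of A, B, and C fits in fast memory, the total traffic is Θ(n^3/√M), matching the Ω(#arithmetic operations / √M) lower bound up to a constant factor:

```python
import math

def blocked_matmul_traffic(n, M):
    # Pick block size b so that three b-by-b blocks (of A, B, and C)
    # fit simultaneously in a fast memory of size M words.
    b = max(1, int(math.sqrt(M // 3)))
    nb = math.ceil(n / b)  # number of blocks per matrix dimension
    # The blocked algorithm performs nb^3 block multiplications; each one
    # reads a block of A and of B and updates a block of C, moving
    # roughly 3*b^2 words between slow and fast memory.
    return nb ** 3 * 3 * b * b

# Hypothetical sizes: fast memory holds three 64x64 blocks.
n, M = 1024, 3 * 64 * 64
print(blocked_matmul_traffic(n, M))  # → 50331648, i.e. Θ(n^3 / √M)
```

For n much larger than √M this count grows as n^3/√M, so the conventional blocked algorithm already attains the lower bound for matrix multiplication; the paper's contribution is extending the bound itself to the other direct methods listed above.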
Citations
Journal Article
Towards dense linear algebra for hybrid GPU accelerated manycore systems
TL;DR: Motivates the need for new algorithms that split the computation so as to fully exploit the power of each hybrid component, and envisions a DLA library similar to LAPACK but for hybrid manycore/GPU systems.
Posted Content
Communication-optimal parallel and sequential QR and LU factorizations
TL;DR: In this article, the authors present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR.
Book Chapter
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Edgar Solomonik, James Demmel, et al.
TL;DR: Proves a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while keeping c copies of the data can reduce bandwidth, the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency.
Journal Article
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
Aydin Buluc, John R. Gilbert, et al.
TL;DR: It is demonstrated that the parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case.
Journal Article
A massively parallel tensor contraction framework for coupled-cluster computations
TL;DR: A distributed-memory numerical library (Cyclops Tensor Framework) that automatically manages tensor blocking and redistribution to perform any user-specified contractions and enables the expression of massively-parallel coupled-cluster methods via a concise tensor contraction interface.