Open Access Journal Article DOI

Minimizing Communication in Linear Algebra

TLDR
This work generalizes a lower bound on the amount of communication needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values.
Abstract
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue, and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
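To make the bounds concrete, here is a short sketch of how they specialize to dense n-by-n matrix multiplication (the symbol G for the number of arithmetic operations is introduced only for this sketch; M is the fast-memory size as in the abstract):

    bandwidth (words moved): Ω(G / √M) = Ω(n^3 / √M)
    latency (messages):      Ω(G / M^(3/2)) = Ω(n^3 / M^(3/2))

The latency bound follows from the bandwidth bound because a single message can carry at most M words, so the number of messages is at least the number of words moved divided by M.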

Citations
Journal Article DOI

Towards dense linear algebra for hybrid GPU accelerated manycore systems

TL;DR: Motivates the need for new algorithms that split the computation so as to fully exploit the power of each hybrid component, and envisions a DLA library similar to LAPACK but targeting hybrid manycore/GPU systems.
Posted Content

Communication-optimal parallel and sequential QR and LU factorizations

TL;DR: In this article, the authors present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR.
Book Chapter DOI

Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

TL;DR: A novel lower bound on the latency cost of 2.5D and 3D LU factorization is proved, showing that while using c copies of the data can reduce the bandwidth cost, the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency.
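As a rough illustration of this tradeoff (notation introduced here for the sketch: W for words moved per processor, S for messages, P processors, c replicas of the input; precise statements and constants are in the cited paper), the 2.5D LU bounds can be summarized as

    W = Ω(n^2 / √(cP)),   S = Ω(√(cP)),

so increasing c lowers the bandwidth cost by a factor of √c but raises the latency cost by the same factor, and the product W · S stays Ω(n^2) regardless of c.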
Journal Article DOI

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

TL;DR: It is demonstrated that the parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case.
Journal Article DOI

A massively parallel tensor contraction framework for coupled-cluster computations

TL;DR: A distributed-memory numerical library (Cyclops Tensor Framework) is presented that automatically manages tensor blocking and redistribution to perform any user-specified contractions, enabling the expression of massively parallel coupled-cluster methods via a concise tensor contraction interface.
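As an illustration of the kind of operation such a framework distributes (the tensors, index names, and the particular contraction below are chosen for illustration only and are not taken from the cited paper), a typical coupled-cluster-style contraction is

    Z^(ab)_(ij) = Σ_(kl) V^(kl)_(ij) · T^(ab)_(kl),

where the free indices a, b, i, j are blocked and distributed across processors and the summation over k and l is carried out much like a distributed matrix multiplication, with (ab) and (ij) playing the roles of row and column indices.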
References
Book

Matrix Computations

Gene H. Golub
Book

Introduction to Algorithms

TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Book

Iterative Methods for Sparse Linear Systems

Yousef Saad
TL;DR: This chapter discusses methods related to the normal equations of linear algebra, and some of the techniques used in this chapter were derived from previous chapters of this book.
Book

Applied Numerical Linear Algebra

TL;DR: Covers the symmetric eigenproblem, the singular value decomposition, and iterative methods for linear systems.