Communications of the ACM (CACM for short, not the best sounding acronym around) is the ACM’s flagship magazine. Started in 1957, CACM is handy for keeping up to date on current research being carried out across all topics of computer science and real-world applications. CACM has had an illustrious past with many influential pieces of work and debates started within its pages. These include Hoare’s presentation of the Quicksort algorithm; Rivest, Shamir and Adleman’s description of the first public-key cryptosystem RSA; and Dijkstra’s famous letter against the use of GOTO. In addition to the print edition, which is released monthly, there is a fantastic website (http://cacm.acm.org/) that showcases not only the most recent edition but all previous CACM articles as well, readable online as well as downloadable as a PDF. In addition, the website lets you browse for articles by subject, a handy feature if you want to focus on a particular topic. CACM is really essential reading. Pretty much guaranteed to contain content that is interesting to anyone, it keeps tabs on the latest in computer science. It is a valuable asset for us students, who tend to delve deep into a particular area of CS and forget everything that is happening around us.

Daniel Gooch

Undergraduate research is like a box of chocolates: You never know what kind of project you will get. That being said, there are still a few things you should know to get the most out of the experience.

Communications of the ACM

Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads on cloud computing infrastructure. However, efficiently running these applications on shared infrastructure is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations so that we can automatically choose the optimal configuration.

Our insight is that a number of jobs have predictable structure in terms of computation and communication. Thus we can build performance models based on the behavior of the job on small samples of data and then predict its performance on larger datasets and cluster sizes. To minimize the time and resources spent in building a model, we use optimal experiment design, a statistical technique that allows us to collect as few training points as required. We have built Ernest, a performance prediction framework for large scale analytics and our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.

/pdf/ernest-efficient-performance-prediction-for-large-scale-2fxvolhr85.pdf

Ernest: efficient performance prediction for large-scale advanced analytics
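
The modelling idea in the Ernest abstract above (fit a model on small samples, then predict a larger run) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the four model terms, the function names, the hand-picked training grid, and the made-up coefficients in `TRUE_THETA` are all assumptions, and plain least squares via the normal equations stands in for whatever fitting procedure and experiment-design step the real system uses.

```python
import math

def features(scale, machines):
    # Assumed model terms: fixed cost, parallel work, tree-like
    # communication, and per-machine overhead.
    return [1.0, scale / machines, math.log(machines), float(machines)]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit(samples):
    """Least-squares fit; samples is a list of (scale, machines, seconds)."""
    rows = [features(s, m) for s, m, _ in samples]
    times = [t for _, _, t in samples]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve(ata, atb)

def predict(theta, scale, machines):
    return sum(t * f for t, f in zip(theta, features(scale, machines)))

# Synthetic "measurements" on 2-8% of the data and 1-8 machines (made-up numbers).
TRUE_THETA = [5.0, 200.0, 1.5, 0.2]
samples = [(s, m, predict(TRUE_THETA, s, m))
           for s in (0.02, 0.04, 0.08) for m in (1, 2, 4, 8)]
theta = fit(samples)
full_run_estimate = predict(theta, 1.0, 16)  # full dataset on 16 machines
```

Because the synthetic timings are generated from the same model that is being fitted, the estimate matches the true value almost exactly; real measurements would carry noise and model error.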

We highlight the trends leading to the increased appeal of using hybrid multicore+GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.

http://www.netlib.org/utk/people/JackDongarra/PAPERS/lawn210.pdf

Towards dense linear algebra for hybrid GPU accelerated manycore systems

If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems. We illustrate the main ideas with an LU-factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon (2x4 cores @ 2.33 GHz) + NVIDIA GeForce GTX 280 (240 cores @ 1.30 GHz) system.

Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds for the number of multiplications done by “non-Strassen-like” QR, and using these in known communication lower bounds that are proportional to the number of multiplications. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We derive analogous communication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least some of these lower bounds. The sequential and parallel QR algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example, up to 6.7 times over ScaLAPACK. A performance model for the parallel algorithm for general rectangular matrices predicts significant speedups over ScaLAPACK.

/pdf/communication-optimal-parallel-and-sequential-qr-and-lu-1ldw6a0wk4.pdf

Communication-optimal Parallel and Sequential QR and LU Factorizations
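
The tall-and-skinny QR idea mentioned in the abstract above can be illustrated with a small sketch: each row block is factored independently, and the resulting small R factors are combined pairwise in a reduction tree. This is only a sketch under stated assumptions and not the paper's algorithm: the function names `mgs_r` and `tsqr_r` are hypothetical, modified Gram-Schmidt is swapped in for Householder QR to keep the code short, only the R factor is computed, and every block is assumed to have full column rank and at least as many rows as columns.

```python
import math

def mgs_r(rows):
    """R factor (positive diagonal) of a full-column-rank matrix, via modified Gram-Schmidt."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]  # column-major copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / r[j][j] for v in cols[j]]
        for l in range(j + 1, n):
            r[j][l] = sum(a * b for a, b in zip(q, cols[l]))
            cols[l] = [b - r[j][l] * a for a, b in zip(q, cols[l])]
    return r

def tsqr_r(rows, num_blocks):
    """R factor of a tall matrix computed block by block with a pairwise reduction tree."""
    m = len(rows)
    # Step 1: every block factors its own rows; no data is exchanged.
    local_rs = [
        mgs_r(rows[b * m // num_blocks:(b + 1) * m // num_blocks])
        for b in range(num_blocks)
    ]
    # Step 2: stack pairs of small R factors and refactor until one remains.
    while len(local_rs) > 1:
        merged = []
        for i in range(0, len(local_rs), 2):
            stacked = [row for r in local_rs[i:i + 2] for row in r]
            merged.append(mgs_r(stacked))
        local_rs = merged
    return local_rs[0]

# A 16 x 3 example matrix whose 4-row blocks all have full column rank.
example = [[1.0, float(i), float(i * i % 7)] for i in range(16)]
r_tree = tsqr_r(example, 4)
r_direct = mgs_r(example)
```

With a positive diagonal the R factor is unique, so `r_tree` and `r_direct` should agree up to rounding. In a parallel setting only the small n-by-n factors would travel between processors, one tree level per round.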

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. 
We prove optimality by extending known lower bounds on communication bandwidth for sequential and parallel matrix multiplication to provide latency lower bounds, and show these bounds apply to the LU and QR decompositions. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We also point out recent LU algorithms in the literature that attain at least some of these lower bounds.

Communication-optimal parallel and sequential QR and LU factorizations

Krylov subspace methods (KSMs) are iterative algorithms for solving large, sparse linear systems and eigenvalue problems. Current KSMs rely on sparse matrix-vector multiply (SpMV) and vector-vector operations (like dot products and vector sums). All of these operations are communication-bound. Furthermore, data dependencies between them mean that only a small amount of that communication can be hidden. Many important scientific and engineering computations spend much of their time in Krylov methods, so the performance of many codes could be improved by introducing KSMs that communicate less.
Our goal is to take s steps of a KSM for the same communication cost as 1 step, which would be optimal. We call the resulting KSMs "communication-avoiding Krylov methods." This thesis makes the following contributions:

(1) We have fast kernels replacing SpMV that can compute the results of s calls to SpMV for the same communication cost as one call (Section 2.1).

(2) We have fast dense kernels as well, such as Tall Skinny QR (TSQR – Section 2.3) and Block Gram-Schmidt (BGS – Section 2.4), which can do the work of Modified Gram-Schmidt applied to s vectors for a factor of Θ(s^2) fewer messages in parallel, and a factor of Θ(s/W) fewer words transferred between levels of the memory hierarchy (where W is the fast memory capacity in words).

(3) We have new communication-avoiding Block Gram-Schmidt algorithms for orthogonalization in more general inner products (Section 2.5).

(4) We have new communication-avoiding versions of the following Krylov subspace methods for solving linear systems: the Generalized Minimum Residual method (GMRES – Section 3.4), both unpreconditioned and preconditioned, and the Method of Conjugate Gradients (CG), both unpreconditioned (Section 5.4) and left-preconditioned (Section 5.5).

(5) We have new communication-avoiding versions of the following Krylov subspace methods for solving eigenvalue problems, both standard (Ax = λx, for a nonsingular matrix A) and "generalized" (Ax = λMx, for nonsingular matrices A and M): Arnoldi iteration (Section 3.3), and Lanczos iteration, both for Ax = λx (Section 4.2) and Ax = λMx (Section 4.3).

(6) We propose techniques for developing communication-avoiding versions of nonsymmetric Lanczos iteration (for solving nonsymmetric eigenvalue problems Ax = λx) and the Method of Biconjugate Gradients (BiCG) for solving linear systems.

(7) We can combine more stable numerical formulations that use different bases of Krylov subspaces with our techniques for avoiding communication. For a discussion of different bases, see Chapter 7. To see an example of how the choice of basis affects the formulation of the Krylov method, see Section 3.2.2.

(8) We have faster numerical formulations. For example, in our communication-avoiding version of GMRES, CA-GMRES (see Section 3.4), we can pick the restart length r independently of the s-step basis length s. Experiments in Section 3.5.5 show that this ability improves numerical stability. We show in Section 3.6.3 that it also improves performance in practice, resulting in a 2.23× speedup in the CA-GMRES implementation described below.

(9) We combine all of these numerical and performance techniques in a shared-memory parallel implementation of our communication-avoiding version of GMRES, CA-GMRES. Compared to a similarly highly optimized version of standard GMRES, when both are running in parallel on 8 cores of an Intel Clovertown (see Appendix A), CA-GMRES achieves 4.3× speedups over standard GMRES on standard sparse test matrices (described in Appendix B.5). When both are running in parallel on 8 cores of an Intel Nehalem (see Appendix A), CA-GMRES achieves 4.1× speedups. See Section 3.6 for performance results and Section 3.5 for corresponding numerical experiments. We first reported performance results for this implementation on the Intel Clovertown platform in Demmel et al. [78].

(10) We have incorporated preconditioning into our methods. Note that we have not yet developed practical communication-avoiding preconditioners; this is future work. We have accomplished the following:

(a) We show (in Sections 2.2 and 4.3) what the s-step basis should compute in the preconditioned case for many different types of Krylov methods and s-step bases. We explain why this is hard in Section 4.3.

(b) We have identified two different structures that a preconditioner may have, in order to achieve the desired optimal reduction of communication by a factor of s. See Section 2.2 for details.

(Abstract shortened by UMI.)

/pdf/communication-avoiding-krylov-subspace-methods-31e7qrej51.pdf

Communication-avoiding Krylov subspace methods

Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparse-matrix-vector-multiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and reading the matrix A from DRAM to cache just once, instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown gets speedups of up to 4.3x over standard GMRES, without sacrificing convergence rate or numerical stability.

/pdf/minimizing-communication-in-sparse-matrix-solvers-4uq6nkmekp.pdf

Minimizing communication in sparse matrix solvers
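
The message counts quoted in the abstract above, O(k · log P) versus O(log P), can be made concrete with a toy cost model. This is a back-of-the-envelope sketch and not the paper's analysis: it assumes one binary-tree reduction per conventional iteration, a single reduction for the reorganized variant, and constant factors of 1; the function names are hypothetical.

```python
def reduction_rounds(num_procs):
    """Rounds in a binary-tree reduction over num_procs >= 1 processors: ceil(log2 P)."""
    return (num_procs - 1).bit_length()

def rounds_conventional(k, num_procs):
    """k iterations, each assumed to trigger its own tree reduction."""
    return k * reduction_rounds(num_procs)

def rounds_reorganized(k, num_procs):
    """k iterations grouped so that one tree reduction serves all of them."""
    return reduction_rounds(num_procs)

# Example: k = 8 iterations on P = 1024 processors.
conventional = rounds_conventional(8, 1024)
reorganized = rounds_reorganized(8, 1024)
```

Under these assumptions the conventional scheme needs k times as many communication rounds, which is the factor the reorganized algorithm aims to remove.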

The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the "matrix powers kernel" [x, Ax, A^2x, ..., A^kx], and show that by organizing computations around this kernel, we can achieve near-minimal communication costs. We consider communication very broadly as both network communication in parallel code and memory hierarchy access in sequential code. In particular, we introduce a parallel algorithm for which the number of messages (total latency cost) is independent of the power k, and a sequential algorithm that reduces both the number and volume of accesses, so that it is independent of k in both latency and bandwidth costs. This is part of a larger project to develop "communication-avoiding Krylov subspace methods," which also addresses the numerical issues associated with these methods. Our algorithms work for general sparse matrices that "partition well". We introduce parallel performance models of matrices arising from 2D and 3D problems and show predicted speedups over a conventional algorithm of up to 7× on a petaflop-scale machine and up to 22× on computation across the grid. Analogous sequential performance models of the same problems predict speedups over a conventional algorithm of up to 10× on an out-of-core implementation, and up to 2.5× when we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we validate the model on an out-of-core sequential implementation and measure a speedup of over 3×, which is close to the predicted speedup.

/pdf/avoiding-communication-in-sparse-matrix-computations-o8hlxdtdj4.pdf
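
The matrix powers kernel described in the abstract above can be sketched for the simplest possible case, a tridiagonal matrix. Each block fetches its own entries plus k "ghost" entries on either side once, then computes all k powers locally with some redundant work near the block edges. This is an illustrative sketch only: the (-1, 2, -1) stencil, the contiguous block partition, and the function names are assumptions, and the paper itself targets general sparse matrices that partition well.

```python
def spmv_tridiag(x):
    """y = A x for the 1D stencil (-1, 2, -1) with zero values outside the domain."""
    n = len(x)
    return [
        2.0 * x[i]
        - (x[i - 1] if i > 0 else 0.0)
        - (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]

def powers_naive(x, k):
    """[x, Ax, ..., A^k x] using k global products (one neighbour exchange each)."""
    basis = [list(x)]
    for _ in range(k):
        basis.append(spmv_tridiag(basis[-1]))
    return basis

def powers_blocked(x, k, num_blocks):
    """Same basis, but each block reads x only once, including k ghost entries per side."""
    n = len(x)
    basis = [list(x)] + [[0.0] * n for _ in range(k)]
    fetches = 0
    for b in range(num_blocks):
        lo, hi = b * n // num_blocks, (b + 1) * n // num_blocks
        g_lo, g_hi = max(0, lo - k), min(n, hi + k)
        local = {i: x[i] for i in range(g_lo, g_hi)}  # the single fetch
        fetches += 1
        for j in range(1, k + 1):
            # The valid range shrinks by one per step on each side cut by a ghost edge.
            v_lo = g_lo + j if g_lo > 0 else 0
            v_hi = g_hi - j if g_hi < n else n
            local = {
                i: 2.0 * local[i] - local.get(i - 1, 0.0) - local.get(i + 1, 0.0)
                for i in range(v_lo, v_hi)
            }
            for i in range(lo, hi):
                basis[j][i] = local[i]
    return basis, fetches

x0 = [float(i % 5) for i in range(20)]
blocked_basis, fetch_count = powers_blocked(x0, 3, 4)
naive_basis = powers_naive(x0, 3)
```

Both routines should produce the same basis vectors. The blocked version performs one fetch per block regardless of k, whereas a conventional implementation would exchange boundary values before each of the k products.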

Mark Hoemmen

Papers

Communication-optimal Parallel and Sequential QR and LU Factorizations

Communication-optimal parallel and sequential QR and LU factorizations

Communication-avoiding Krylov subspace methods

Minimizing communication in sparse matrix solvers

Avoiding communication in sparse matrix computations