Abstract:
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
TL;DR: The parallel Combinatorial BLAS is described, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications, and an extensible library interface and some guiding principles for future development are provided.
TL;DR: TACO, a C++ library that automatically generates kernels for compound tensor algebra operations on dense and sparse tensors, is presented, with applications in machine learning, data analytics, engineering, and the physical sciences.
TL;DR: CSR5 (Compressed Sparse Row 5), a new storage format that offers high-throughput SpMV on various platforms including CPUs, GPUs, and Xeon Phi, is proposed; because of its low overhead for format conversion, it suits real-world applications such as solvers with only tens of iterations.
TL;DR: A two pronged approach for efficient data reorganization is presented, which combines a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and a mathematical framework that is used to represent and optimize the reorganization operations.
TL;DR: In this article, the performance of the Xeon Phi coprocessor on sparse linear algebra kernels is investigated; the authors discuss the important hardware details and show that Xeon Phi's sparse-kernel performance is very promising, and even better than that of cutting-edge CPUs and GPUs.
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
TL;DR: This chapter discusses methods related to the normal equations of linear algebra, and some of the techniques used in this chapter were derived from previous chapters of this book.
TL;DR: Bjarne Stroustrup makes C++ even more accessible to those new to the language, while adding advanced information and techniques that even expert C++ programmers will find invaluable.
TL;DR: In this article, a language similar to Logo is described, and programs are developed to draw geometric pictures using it.
Q1. What have the authors contributed in "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks" ?
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector.
Q2. How many bits are required for each element in val?
For each element in val, the authors use lg β bits to represent the row index and lg β bits to represent the column index, requiring a total of nnz lg β bits for each of row_ind and col_ind.
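This index layout can be illustrated by packing a block-relative (row, column) pair into a single machine word using lg β bits apiece. The names `pack`, `unpack_row`, `unpack_col` and the choice β = 1024 are assumptions made for this sketch, not the paper's implementation:

```cpp
#include <cstdint>

// Sketch: with block dimension beta (assumed 1024, a power of two),
// each block-relative index fits in lg(beta) = 10 bits, so a (row, col)
// pair packs into one 32-bit word: high bits hold the row, low bits the column.
constexpr uint32_t lg_beta = 10;
constexpr uint32_t beta = 1u << lg_beta;
constexpr uint32_t mask = beta - 1;

uint32_t pack(uint32_t row, uint32_t col) {
    return (row << lg_beta) | col;
}

uint32_t unpack_row(uint32_t p) { return p >> lg_beta; }
uint32_t unpack_col(uint32_t p) { return p & mask; }
```

Since each index ranges only over 0..β−1 within its block, this is how CSB keeps the per-nonzero index cost at 2 lg β bits instead of 2 lg n.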
Q3. How many nonzeros can be distributed in parallel?
If the nonzeros were guaranteed to be distributed evenly among block rows, then the simple blockrow parallelism would yield an efficient algorithm with n/β-way parallelism by simply performing a serial multiplication for each blockrow.
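The blockrow-level decomposition can be sketched as follows. Here `multiply_blockrow` is a hypothetical stand-in for the serial per-blockrow kernel (it merely fills its own slice of y, to make the disjoint writes visible), and the use of `std::thread` rather than the paper's Cilk++ runtime is an assumption of the sketch:

```cpp
#include <thread>
#include <vector>

// Sketch: with block dimension beta, blockrow i owns the output slice
// y[i*beta, (i+1)*beta), so all blockrows can be multiplied concurrently
// with no write conflicts on the output vector.
void multiply_blockrow(int i, int beta, std::vector<double>& y) {
    // Stand-in for the serial SpMV kernel restricted to blockrow i;
    // all writes stay inside blockrow i's slice of y.
    for (int j = i * beta; j < (i + 1) * beta; ++j)
        y[j] = static_cast<double>(i);
}

void spmv_over_blockrows(int num_blockrows, int beta, std::vector<double>& y) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_blockrows; ++i)       // n/beta-way parallelism
        workers.emplace_back(multiply_blockrow, i, beta, std::ref(y));
    for (auto& t : workers) t.join();
}
```

The efficiency claim in the answer above hinges on the nonzeros being spread evenly: if one blockrow holds most of the nonzeros, its thread dominates the running time.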
Q4. What is the Z-Morton ordering on nonzeros in each block?
The Z-Morton ordering on nonzeros in each block is equivalent to first interleaving the bits of row_ind and col_ind, and then sorting the nonzeros using these bit-interleaved values as the keys.
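A minimal sketch of the bit interleaving behind the Z-Morton key, assuming 16-bit block-relative indices; `spread_bits` and `morton_key` are illustrative names, not from the paper. Sorting a block's nonzeros by `morton_key(row, col)` yields the Z-Morton order described above:

```cpp
#include <cstdint>

// Spread the 16 low bits of x apart so a bit of the other index
// can be interleaved between each pair (classic mask-and-shift trick).
uint32_t spread_bits(uint32_t x) {
    x &= 0xFFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

// Bit-interleaved sort key: row bits land in the odd positions,
// column bits in the even positions.
uint32_t morton_key(uint32_t row, uint32_t col) {
    return (spread_bits(row) << 1) | spread_bits(col);
}
```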
Q5. What is the current standard storage format for sparse matrices in scientific computing?
The current standard storage format for sparse matrices in scientific computing, compressed sparse rows (CSR) [32], is more efficient, because it stores only n + nnz indices or pointers.
Q6. What is the format for storing the nonzeros of each matrix row?
The compressed sparse row (CSR) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row.
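A minimal serial sketch of SpMV over this format, assuming the conventional array names: `row_ptr[i]` indexes the first stored nonzero of row i, while `col_ind` and `val` hold the column indices and values in row order:

```cpp
#include <vector>

// Serial CSR sparse matrix-vector multiply: y = A * x.
// row_ptr has n+1 entries; row i's nonzeros occupy positions
// row_ptr[i] .. row_ptr[i+1]-1 of col_ind and val.
std::vector<double> csr_spmv(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_ind,
                             const std::vector<double>& val,
                             const std::vector<double>& x) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += val[k] * x[col_ind[k]];
    return y;
}
```

The outer loop parallelizes trivially, which is why Ax is easy for CSR; Aᵀx, by contrast, scatters each nonzero into y[col_ind[k]], creating the write conflicts that motivate CSB.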
Q7. How do you get the mean/max values for CSB?
For CSB, the reported mean/max values are obtained by setting the block dimension β to be approximately √n, so that they are comparable with statistics from CSC.
Q8. When does the CSB constructor consider a matrix to have balanced blockrows?
In other words, if max(nnz(Ai)) < 2 · mean(nnz(Ai)), then the matrix is considered to have balanced blockrows and the optimization is applied.
Q9. How does the CSB constructor determine the bitmasks?
The bitmasks are determined dynamically by the CSB constructor depending on the input matrix and the data type used for storing matrix indices.
Q10. What is the cost of converting to and from bit-interleaved integers?
Converting to and from bit-interleaved integers, however, is expensive with current hardware support, which would be necessary for the serial base case in lines 29–32.
Q11. Why does this level of parallelization require care to avoid races?
This level of parallelization requires care to avoid races, however, because two blocks in the same blockrow write to the same region within the output vector.
Q12. What is the range of the indices in row_ind and col_ind?
These indices are relative to the block containing the particular element, not the entire matrix, and hence they range from 0 to β−1.
Q13. Which work-stealing schedulers are space efficient?
Although not all work-stealing schedulers are space efficient, those maintaining the busy-leaves property [5] (e.g., as used in the Cilk work-stealing scheduler [4]) are space efficient.