Optimal sparse matrix dense vector multiplication in the I/O-model
References
Gaussian elimination is not optimal
The input/output complexity of sorting and related problems
External memory algorithms and data structures: dealing with massive data
SPARSKIT: A basic tool kit for sparse matrix computations
I/O complexity: The red-blue pebble game
Frequently Asked Questions (12)
Q2. What is the idea used to gain efficiency over plain sorting?
The idea used to gain efficiency over plain sorting is to add partial sums contributing to the same output value as soon as they meet (i.e., reside in memory simultaneously) during the sorting.
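As a toy illustration (not code from the paper; the function name is mine), the merge-with-online-additions idea can be sketched in Python, representing each run as a list of (output index, partial sum) pairs sorted by index:

```python
import heapq

def merge_runs_with_additions(runs):
    """k-way merge of sorted runs of (output_index, partial_sum) pairs.

    Whenever two partial sums for the same output index meet during the
    merge, they are added immediately, so each output index survives as
    a single pair in the merged run.
    """
    merged = []
    for idx, val in heapq.merge(*runs):       # stream pairs in sorted order
        if merged and merged[-1][0] == idx:   # same output value: add now
            merged[-1][1] += val
        else:
            merged.append([idx, val])
    return [(i, v) for i, v in merged]

# Two runs of partial sums keyed by output (row) index:
a = [(0, 1.0), (2, 3.0)]
b = [(0, 2.0), (1, 5.0), (2, 4.0)]
merge_runs_with_additions([a, b])  # → [(0, 3.0), (1, 5.0), (2, 7.0)]
```

Because additions happen online, a merged run never holds more than one pair per output value, which is what keeps run lengths bounded by N later in the algorithm.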
Q3. How many runs are there in the algorithm?
For M > 4B, the number of merging steps until the (average) length of a run is N, i.e., until there are k runs, is O(log_{M/B}(N/(kM))).
Q4. What is the way to compute a matrix-vector product?
For matrices stored in column-major layout, any algorithm computing the product of a sparse matrix with the all-ones vector (i.e., its row sums) can be used, with one additional scan, to compute a matrix-vector product with the same matrix.
Q5. What is the result of Q given by a polynomial q on the input?
Every result of Q is given by a polynomial q on the input, and it equals the multilinear result p of the computation for an open set C of inputs.
Q6. How many runs can be sorted using the adaptive sorting algorithm?
In memory, each group will form a file consisting of N/k sorted runs, which by the cache-oblivious adaptive sorting algorithm of [2] can be sorted using O((n/B) · log_{M/B}(N/k)) I/Os, where n is the number of coefficients in the group.
Q7. What is the definition of a memory hierarchy?
The disk access machine (DAM) model is a two-level abstraction of a memory hierarchy, modeling either cache and main memory or main memory and disk.
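The two basic cost bounds in this model — scanning n items in ceil(n/B) I/Os, and sorting by (M/B)-way merge sort, where each pass costs one scan — can be sketched as back-of-the-envelope helpers (a toy sketch, not part of the paper; function names are mine):

```python
import math

def scan_ios(n, B):
    """I/Os to stream n contiguous items with block size B: ceil(n/B)."""
    return math.ceil(n / B)

def merge_passes(n, M, B):
    """Merge passes of (M/B)-way merge sort: runs start at length M
    (formed in one scan) and grow by a factor M/B per pass until a
    single run of length n remains, i.e., ceil(log_{M/B}(n/M)) passes."""
    passes, run_len = 0, M
    while run_len < n:
        run_len *= M // B
        passes += 1
    return passes

# 10^6 items, M = 10^4, B = 100: each pass costs one scan of the data.
scan_ios(10**6, 100)             # → 10000
merge_passes(10**6, 10**4, 100)  # → 1
```

Total sorting cost is roughly (1 + merge_passes) scans, matching the familiar O((n/B) · log_{M/B}(n/B)) bound up to constants.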
Q8. How does the algorithm finish phase two?
The algorithm finishes phase two by simply merging (again with online additions) each run into the first, at a total I/O cost of O(kN/B) for phase two.
Q9. What are some techniques used to optimize register and cache use?
Examples of techniques include “register blocking” and “cache blocking,” which are designed to optimize register and cache use, respectively.
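As a rough, hypothetical illustration of the blocking idea (shown here for a dense matrix-vector product rather than the tuned sparse kernels the answer refers to), the matrix is traversed in small tiles so that the touched pieces of A, x, and y stay together in fast memory:

```python
def matvec_cache_blocked(A, x, block=2):
    """Toy cache-blocked dense mat-vec: process A in block x block tiles
    so each tile of A plus the matching slices of x and y can reside in
    cache (or registers, for very small tiles) while being reused."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    for ii in range(0, n, block):          # tile rows
        for jj in range(0, m, block):      # tile columns
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, m)):
                    y[i] += A[i][j] * x[j]
    return y

A = [[1.0, 2.0], [3.0, 4.0]]
matvec_cache_blocked(A, [1.0, 1.0])  # → [3.0, 7.0]
```

Register blocking applies the same idea at tile sizes of a few scalars, small enough for the tile to live entirely in CPU registers.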
Q10. What does Theorem 3 assume about computing row sums of k-regular N×N matrices?
Theorem 3. Assume an algorithm computes the row sums for all k-regular N×N matrices stored in column-major layout in the algebraic I/O-model, using only canonical partial results, with at most ℓ(k, N) I/Os.
Q11. Why can't a run become longer than N?
Due to the merging, no run can ever become longer than N, as N is the number of output values; hence, at the start of phase two, the authors have at most k runs of length at most N.
Q12. What is the lower bound on ℓ(k, N) for k-regular N×N matrices?
For B > 2, M ≥ 4B, and k ≤ N^(1/ε) with 0 < ε < 1, there is the lower bound ℓ(k, N) ≥ min{ κ · (kN/B) · log_{M/B}(N / max{k, M}), (1/8) · (ε/(2−ε)) · kN } for κ = min{ ε/3, (1−ε)²/2, 1/16 }.