Optimal sparse matrix dense vector multiplication in the I/O-model

Abstract
We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O-model. The task of SpMV is to compute y := Ax, where A is a sparse N x N matrix and x and y are vectors. Here, sparsity is expressed by the parameter k that states that A has a total of at most kN nonzeros, i.e., an average number of k nonzeros per column. The extreme choices for parameter k are well studied special cases, namely for k=1 permuting and for k=N dense matrix-vector multiplication. We study the worst-case complexity of this computational task, i.e., what is the best possible upper bound on the number of I/Os depending on k and N only. We determine this complexity up to a constant factor for large ranges of the parameters. By our arguments, we find that most matrices with kN nonzeros require this number of I/Os, even if the program may depend on the structure of the matrix. The model of computation for the lower bound is a combination of the I/O-models of Aggarwal and Vitter, and of Hong and Kung. We study two variants of the problem, depending on the memory layout of A. If A is stored in column major layout, SpMV has I/O complexity Θ(min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}) for k ≤ N^{1-ε} and any constant 0 < ε < 1. If the algorithm can choose the memory layout, the I/O complexity of SpMV is Θ(min{(kN/B)(1 + log_{M/B}(N/(kM))), kN}) for k ≤ N^{1/3}. In the cache oblivious setting with tall cache assumption M ≥ B^{1+ε}, the I/O complexity is O((kN/B)(1 + log_{M/B}(N/k))) for A in column major layout.


ETH Library
Optimal sparse matrix dense vector multiplication in the I/O-model
Report
Author(s):
Bender, Michael A.; Brodal, Gerth Stølting; Fagerberg, Rolf; Jacob, Riko; Vicari, Elias
Publication date:
2006
Permanent link:
https://doi.org/10.3929/ethz-a-006781087
Rights / license:
In Copyright - Non-Commercial Use Permitted
Originally published in:
Technical Report / ETH Zurich, Department of Computer Science 523
This page was generated automatically upon download from the ETH Zurich Research Collection.
For more information, please consult the Terms of use.

Optimal Sparse Matrix Dense Vector Multiplication in the
I/O-Model
ETH Technical Report 523
Michael A. Bender
Gerth Stølting Brodal
Rolf Fagerberg
Riko Jacob§
Elias Vicari§
June 28, 2006
Abstract
We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O model. In the SpMV, the objective is to compute y = Ax, where A is a sparse matrix and x and y are vectors. We give tight upper and lower bounds on the number of block transfers as a function of the sparsity k, the number of nonzeros in a column of A.
Parameter k is a knob that bridges the problems of permuting (k = 1) and dense matrix multiplication (k = N). When the nonzero elements of A are stored in column-major order, SpMV takes O(min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}) memory transfers and has a lower bound of min{κ(ε)(kN/B)(1 + log_{M/B}(N/max{M,k})), κ'(ε)kN}, for k ≤ N^ε, 0 < ε < 1. If N ≤ M the problem is trivially Θ(N/B). Thus, these bounds are tight. When A's layout can be optimized, SpMV takes O(min{(kN/B)(1 + log_{M/B}(N/(kM))), kN}) memory transfers and has a lower bound of min{(kN/B)(1 + log_{M/B}(N/(kM))), kN} memory transfers, for k ≤ N^{1/3}. As before, these bounds are tight.
Department of Computer Science, Stony Brook, NY 11794, USA. E-mail: bender@cs.sunysb.edu. Supported in part by NSF Grant CCR-0208670.
BRICS, Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation, University of Aarhus, Aarhus, Denmark. gerth@daimi.au.dk. Partially supported by the Danish Research Agency.
Department of Mathematics and Computer Science, University of Southern Denmark, DK-5230 Odense M, Denmark. Supported in part by the Danish Natural Science Research Council (SNF). E-mail: rolf@imada.sdu.dk.
§ETH Zurich, Institute of Theoretical Computer Science, 8092 Zurich, Switzerland. E-mail: {rjacob,vicariel}@inf.ethz.ch.

1 Introduction
Sparse-matrix dense-vector multiplication (SpMV) is one of the core operations in the computational sciences.
The idea of SpMV is to compute y = Ax, where A is a sparse matrix (most of its entries are zero) and
x is a vector. Applications abound in scientific computing, computer science, and engineering, including
iterative linear-system solvers, least-squares problems, eigenvalue problems, data mining, and web search
(e.g., computing page rank). In these and other applications, the same sparse matrix is used repeatedly;
only the vector x changes.
It has long been known that SpMV is memory limited, typically exhibiting poor cache (and I/O) perfor-
mance. According to [16], untuned code runs at below 10% of machine peak, and tuned code runs at a somewhat
higher percentage of machine peak. Untuned code is likely to run even more inefficiently, as memory hier-
archies grow “steeper.” In contrast, dense matrix-vector multiplication (e.g., in dense linear-system solvers)
does not suffer from this memory bottleneck. Because the same matrix is used repeatedly in these SpMV
applications, it is worth the effort to lay out and encode the matrices to optimize performance. Examples
of techniques include “register blocking” and “cache blocking,” which are designed to optimize register and
cache use, respectively. See, e.g., [3, 16] for excellent surveys of the dozens of papers on this topic; sparse
matrix libraries include [4, 7, 10, 11, 15, 16]. In these papers, the metric is the running time on test instances
and current hardware.
In this paper we analyze the SpMV problem in the disk access machine [1] and cache-oblivious [5] models.
Our objective is to analyze worst-case instances of matrices and to gain asymptotic insight into running
times on current and future hardware. The disk access machine (DAM) model is a two-level abstraction of a
memory hierarchy, modeling either cache and main memory or main memory and disk. The small memory
level has size M, the large level is unbounded, and the block-transfer size is B. The objective is to minimize
the number of block transfers between the two levels. The cache-oblivious (CO) model enables one to reason
about a two-level model but proves results about an unknown, multilevel memory hierarchy. The CO model
is essentially the DAM model, except that the block size B and main memory size M are unknown to the
coder or algorithm designer. The main idea of the CO model is that if it can be proved that some algorithm
performs a nearly optimal number of memory transfers in the DAM model without parameterizing by B
and M, then the algorithm also performs a nearly optimal number of memory transfers on any unknown,
multilevel memory hierarchy.
We give upper and lower bounds on the I/O complexity of the SpMV in both the DAM and CO models.
Our analyses are parameterized by the degree of sparsity of the matrix. Specifically, we let parameter k be
the number of nonzeros in each column of the N by N matrix. We show how the upper bound depends on
matrix layout and we give lower bounds that apply for all layouts. Our results apply to worst-case instances.
To the best of our knowledge, our results represent the first upper and lower I/O bounds for this important
computational problem.
One appealing aspect of the SpMV with k as the sparsity parameter is that the problem acts as a bridge between two well studied problems in the DAM model, dense matrix multiplication and permuting. When k = Θ(N), the matrix is dense. In this case, matrix multiplication requires Θ(N^2/B) memory transfers. This analysis is tight since it matches the scan bound, i.e., the cost to scan the matrix. When k = 1, i.e., the matrix is sparse, the SpMV requires Θ(min{(N/B) log_{M/B}(N/B), N}) memory transfers [1]. This is because when k = 1, SpMV is a minor generalization of external-memory permuting.
We now explain this permutation bound and its connection to the SpMV; see, e.g., [14]. Permutation matrices are a particular kind of sparse matrix for k = 1: the non-zeros are 1's and there is exactly one 1 in each row and column. To permute N elements, either sort by final destination, which requires Θ((N/B) log_{M/B}(N/B)) memory transfers, or put each element in its final destination, which requires O(N) memory transfers. The permutation cost is the minimum of these two quantities. A counting argument shows that this bound is tight, and holds for any layout of the permutation matrix and vectors. The same strategy applies for any sparse matrix with k = 1. For nonzeros that are not 1's, multiply the elements of the source vector before permuting. If some column or row has two nonzeros, then one element of the source vector x may have several final destinations or one element of the target vector y may be composed of several elements of x.
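To make the access pattern concrete, the following small sketch (ours, not from the paper; the function name and the triple representation are illustrative only) carries out the "direct" strategy for a matrix stored in column-major order as (row, column, value) triples. The vector x is read sequentially, column by column, while every product is added straight into its destination y[i]; in the I/O-model each such update may touch an arbitrary block of y, which is why this strategy costs O(kN) memory transfers in the worst case.

def spmv_column_major_direct(triples, x, n):
    """Direct SpMV for nonzeros given in column-major order as (i, j, a_ij) triples:
    sequential access to x (grouped by column j), random access to y (row i)."""
    y = [0.0] * n
    for i, j, a in triples:      # triples are sorted by column j, as in column-major layout
        y[i] += a * x[j]         # each update may land in an arbitrary block of y
    return y

# Example: a 3 x 3 matrix with k = 2 nonzeros per column, in column-major order.
triples = [(0, 0, 1.0), (2, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 2, 6.0)]
print(spmv_column_major_direct(triples, x=[1.0, 1.0, 1.0], n=3))   # [3.0, 8.0, 10.0]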
Results. In this paper we give upper and lower bounds parameterized by k on the number of memory
transfers to solve the SpMV. Our bounds show how the permutation bound gradually transforms into the
dense matrix-multiplication bound as k increases. Specifically, we prove the following:
We give an upper bound parameterized by k on the cost for the SpMV when the (nonzero) elements of the matrices are stored in column-major order. Specifically, the cost for the SpMV is O(min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}).
This bound generalizes the permutation bound above, where the first term measures a generalization of sorting by destination, and the second term measures moving each element directly to its final destination. (A small numeric illustration of how this bound interpolates between the permuting and dense bounds follows this list of results.)
We also give an upper bound parameterized by k on the cost for the SpMV when the (nonzero) elements of the matrices can be stored in arbitrary order. The cost for the SpMV now reduces to O(min{(kN/B)(1 + log_{M/B}(N/(kM))), kN}).
We next define a model of computation to prove lower bounds on SpMV.
We give a lower bound parameterized by k on the cost for the SpMV when the nonzero elements of the matrices are stored in column-major order. Thus, the cost for the SpMV is min{κ(ε)(kN/B)(1 + log_{M/B}(N/max{k, M})), κ'(ε)kN}.
This result applies for k ≤ N^ε, 0 < ε < 1 (and the trivial conditions that B > 2, M ≥ 4B). This shows that our algorithm is optimal up to a constant factor.
We conclude with a lower bound parameterized by k on the cost for the SpMV when the nonzero elements of the matrices can be stored in any order, and, for k > 21, even if the layout of the input and output vector can be chosen by the algorithm. Thus, the cost for the SpMV is min{(kN/B)(1 + log_{M/B}(N/(kM))), kN} for the former case. This result applies for k ≤ N^{1/3} (and the trivial conditions that B > 6, M ≥ 3B).
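To see concretely how these expressions bridge the permuting bound and the dense scan bound, the following numeric sketch (ours; the function name and the sample parameters are illustrative only, and constant factors are ignored) evaluates the column-major upper bound for a few values of k:

import math

def spmv_io_bound_column_major(N, k, M, B):
    """Evaluate, up to constant factors, min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}."""
    sort_like = (k * N / B) * (1 + max(0.0, math.log(N / max(M, k), M / B)))
    direct = k * N
    return min(sort_like, direct)

# Example: N = 2**20, M = 2**16, B = 2**10; k sweeps from permuting-like to denser matrices.
for k in (1, 8, 64, 512, 4096):
    print(k, round(spmv_io_bound_column_major(2**20, k, 2**16, 2**10)))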
Map. This paper is organized as follows: In Section 3 we present upper bounds on the SpMV. We prove the upper bound for column-major layout, and then for free layout. We conclude with a description of an upper bound in the cache-oblivious model. In Section 2 we describe the computational model in which our lower bounds hold. Section 4 presents our lower bound for column-major layouts. Section 5 presents our lower bound for free layouts.
We use the established ℓ = O(f(N, k, M, B)) notation, which here means that there exists a constant c > 0 such that ℓ ≤ c · f(N, k, M, B) for all N, k, M, B, unless otherwise stated.
1.1 Roadmap of the Lower Bound
Our lower bound is the main technical contribution of this paper, and we here give a high-level outline of it. The overall reasoning is to prove that there are many different inputs that all require different program executions, and then bound the number of essentially different program executions with at most ℓ I/Os, thereby yielding a lower bound on ℓ. For this, we first normalize and simplify the program execution. Then, we bound the number of program executions by encoding the program with a short string over a small alphabet.
In general (for upper and lower bounds), we consider N × N matrices with kN non-zero entries. For the lower bound, we in particular focus on a special case of sparse matrices, namely the k-regular matrices, characterized by exactly k non-zero entries in each column. We define the conformation of a matrix as the locations of its non-zero entries.
We observe that in the machine model we use (defined in Section 2), we may assume that the algorithm is normalized in the sense that only "canonical" intermediate results can be produced. This again allows us to uniquely assign to every intermediate result either the one variable x_j it is representing, or the result c_i it contributes to. Using this, we then define three traces of the computation. These are the compact encodings describing the actions of the normalized algorithm. There is a time-forward trace of the movement of the x_j, and a time-backward trace of the movement of partial results. Additionally, based on these two traces describing the movements, there is a third trace that gives a compact representation of the algebraic operations. We show that these traces uniquely determine the normalized algorithm, and hence the conformation of the matrix it is computing.
Finally, we calculate the number of different traces of computations with ℓ I/Os, and we compare it to the number of different k-regular conformations of N × N matrices. This shows in particular that there are many sparse matrices that require many I/Os.
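To convey the flavor of this counting argument, here is an informal back-of-the-envelope version (ours; constants and side conditions are suppressed, the precise statements appear in the lower bound sections). The number of k-regular conformations of an N × N matrix is C(N,k)^N ≥ (N/k)^{kN}, since every column independently chooses k of the N rows for its non-zero entries. If every I/O of a normalized program can be encoded with roughly B log(M/B) + O(log ℓ) bits, then a whole trace with ℓ I/Os is described by about ℓ(B log(M/B) + O(log ℓ)) bits. Since different conformations force different traces, this quantity must be at least on the order of kN log(N/k), which already yields the sorting-type term ℓ = Ω((kN/B) log_{M/B}(N/k)); the max{M, k} refinement and the constants κ(ε), κ'(ε) come out of the more careful encoding.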
2 Model of Computation
Our aim is to analyze the I/O cost of computing a matrix-vector product. I/Os are generated by movement
of data, so our real object of study is the dataflow of matrix-vector product algorithms, and the interaction
of this dataflow with the memory hierarchy. Hence, we need a model of computation defining the allowed
forms of intermediate results and their possible movements in the memory hierarchy.
In this section, we define our model. Intuitively, it contains all algorithms which compute the values c_i = Σ_j a_{ij} x_j of the output vector through multiplications and additions, starting from the coefficients and variables of the input matrix and vector. The model encompasses all proposed algorithms that we are aware of. We discuss the model further in Section 2.1.
Our model is based on the notion of a commutative semiring S, i.e., a set of numbers with addition and multiplication, where both operations are assumed to be associative and commutative, and multiplication distributes over addition. There is a neutral element 0 for addition, 1 for multiplication, and multiplication with 0 yields 0. In contrast to a field, there are no inverse elements guaranteed, neither for addition nor for multiplication. One well investigated example of such a semiring (actually having multiplicative inverses) is the max-plus algebra (tropical algebra), where matrix multiplication can be used, for example, to compute shortest paths with negative edge lengths. Another semiring obviously is R with the usual addition and multiplication. The semiring model is established in the context of matrix multiplication, see Section 2.1.
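Purely as an illustration of this algebraic setting (our sketch, not part of the paper's formalism; the names Semiring, REAL, MAX_PLUS, and spmv are ours), a commutative semiring and the SpMV computation over it can be written as follows:

from dataclasses import dataclass
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (S, +, x, 0, 1); no additive or multiplicative inverses assumed."""
    add: Callable[[Any, Any], Any]   # associative and commutative
    mul: Callable[[Any, Any], Any]   # associative, commutative, distributes over add
    zero: Any                        # neutral element of add, absorbing for mul
    one: Any                         # neutral element of mul

# The reals with the usual addition and multiplication.
REAL = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0, one=1.0)

# The max-plus (tropical) semiring: "addition" is max, "multiplication" is +.
MAX_PLUS = Semiring(add=max, mul=lambda a, b: a + b, zero=-math.inf, one=0.0)

def spmv(nonzeros, x, n, S):
    """c_i = semiring sum over j of a_ij (x) x_j, with nonzeros given as (i, j, a_ij) triples."""
    c = [S.zero] * n
    for i, j, a in nonzeros:
        c[i] = S.add(c[i], S.mul(a, x[j]))
    return c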
We now define our model, which we call the semiring I/O-machine. It has an infinite size disk D, organized in tracks of B numbers each, and main memory containing M numbers. Accordingly, a configuration can be described by a vector of M numbers M = (m_1, . . . , m_M), and an infinite sequence D of tracks modeled by vectors t_i ∈ S^B. A step of the computation leads to a new configuration according to the following allowed operations:
Computation on numbers in main memory: algebraic evaluation m_i := m_j + m_k, m_i := m_j × m_k, setting m_i := 0, setting m_i := 1, and assigning m_i := m_j,
Input operations, each of which moves an arbitrary track of the disk into the first B cells of memory, (m_1, . . . , m_B) := t_i, t_i := 0.
Output operations, each of which copies the first B cells of memory to an arbitrary track, t_i := (m_1, . . . , m_B), assuming t_i = 0 beforehand.
Input and output operations are collectively called I/O operations. Note that in an input operation we move a track. This may cause at most a factor two loss with respect to a copy operation, but it is important as it preserves a time symmetry, as will become clear later.
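The configuration and the allowed steps can likewise be mirrored in a small toy simulator (again our sketch, not the paper's formalism; it reuses the Semiring class from the sketch above, and the class name and the I/O counter are our own additions):

class SemiringIOMachine:
    """Toy semiring I/O-machine: a main memory of M cells and a disk of tracks of B cells each."""

    def __init__(self, M, B, S):
        assert M >= B
        self.M, self.B, self.S = M, B, S
        self.mem = [S.zero] * M      # main memory m_1, ..., m_M
        self.disk = {}               # track index -> list of B numbers (absent track = all zeros)
        self.io_count = 0            # number of I/O operations performed so far

    # Computation steps on numbers in main memory (free of charge in the I/O-model).
    def add(self, i, j, k):  self.mem[i] = self.S.add(self.mem[j], self.mem[k])
    def mul(self, i, j, k):  self.mem[i] = self.S.mul(self.mem[j], self.mem[k])
    def set_zero(self, i):   self.mem[i] = self.S.zero
    def set_one(self, i):    self.mem[i] = self.S.one
    def assign(self, i, j):  self.mem[i] = self.mem[j]

    # I/O steps.
    def input(self, t):
        """Move track t into the first B memory cells; the track becomes empty (all zeros)."""
        self.mem[:self.B] = self.disk.pop(t, [self.S.zero] * self.B)
        self.io_count += 1

    def output(self, t):
        """Copy the first B memory cells to track t, which the model assumes to be empty."""
        assert t not in self.disk    # corresponds to the assumption t_i = 0 before an output
        self.disk[t] = list(self.mem[:self.B])
        self.io_count += 1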
A program is a finite sequence of operations allowed in the model, and an algorithm is a set of programs.
For the sparse matrix-vector multiplication, we allow the algorithm to choose the program based on N and

References

Gaussian elimination is not optimal. Gives an algorithm that computes the product of two square matrices of order n with fewer than 4.7 n^{log 7} arithmetical operations.

The input/output complexity of sorting and related problems. Provides tight upper and lower bounds on the number of I/Os between internal memory and secondary storage for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.

External memory algorithms and data structures: dealing with massive data. Surveys the state of the art in the design and analysis of external memory algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs.

SPARSKIT: A basic tool kit for sparse matrix computations. Presents the main features of a tool package for manipulating and working with sparse matrices, intended to provide basic tools to facilitate the exchange of software and data between researchers in sparse matrix computations.

I/O complexity: The red-blue pebble game. Uses the red-blue pebble game formulation to prove a number of lower bound results for the I/O requirement, providing insight into the task of balancing I/O and computation in special-purpose system designs.
Frequently Asked Questions (12)
Q1. What contributions have the authors mentioned in the paper "Optimal sparse matrix dense vector multiplication in the I/O-model"?

The authors analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O model.

The idea used here to gain efficiency over plain sorting is to add partial sums used for the same output value as soon as they meet (reside in memory simultaneously) during the sorting. 
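As a rough illustration of this idea (our sketch, not the paper's algorithm; run lengths, buffering, and the block structure of real I/O are ignored, and the function name is ours), merging runs of (row, partial sum) pairs sorted by destination row while adding colliding partial sums could look as follows:

import heapq

def merge_with_online_additions(runs):
    """Merge runs of (row, partial_sum) pairs, each sorted by row index; partial sums destined
    for the same output row are added as soon as they meet in the merge."""
    merged = []
    for row, val in heapq.merge(*runs, key=lambda p: p[0]):
        if merged and merged[-1][0] == row:
            merged[-1] = (row, merged[-1][1] + val)   # combine partial sums on the fly
        else:
            merged.append((row, val))
    return merged

# Example: two sorted runs of partial sums for output rows 0..2.
print(merge_with_online_additions([[(0, 1.0), (2, 5.0)], [(0, 2.0), (1, 3.0), (2, 1.0)]]))
# [(0, 3.0), (1, 3.0), (2, 6.0)]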

For M > 4B, the number of merging steps until the (average) length of a run is N, i.e., until there are k runs, is O(log_{M/B}(N/(kM))).

For matrices stored in column-major layout, any algorithm computing the product of the all ones vector with a sparse matrix can be used (with one additional scan) to compute a matrix-vector product with the same matrix. 

Every result of Q is given by a polynomial q on the input and it is equal to the multilinear result p of the computation for an open set C of inputs. 

In memory, each group will form a file consisting of N/k sorted runs, which by the cache-oblivious adaptive sorting algorithm of [2] can be sorted using O((n/B) log_{M/B}(N/k)) I/Os, where n is the number of coefficients in the group.

The algorithm finishes phase two by simply merging (again with online additions) each run into the first, at a total I/O cost of O(kN/B) for phase two.

Theorem 3. Assume an algorithm computes the row sums for all k-regular N×N matrices stored in column major layout in the algebraic I/O-model using only canonical partial results with at most ℓ(k, N) I/Os.

Due to the merging, no run can ever become longer than N, as this is the number of output values, so at the start of phase two, the authors have at most k runs of length at most N.

For B > 2, M ≥ 4B, and k ≤ N^{1-ε}, 0 < ε < 1, there is the lower bound ℓ(k, N) ≥ min{κ · (kN/B) log_{M/B}(N/max{k, M}), (1/8) · (ε/(2−ε)) · kN} for κ = min{ε/3, (1−ε)^2/2, 1/16}.