Optimal sparse matrix dense vector multiplication in the I/O-model

Abstract
We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O-model. The task of SpMV is to compute y := Ax, where A is a sparse N x N matrix and x and y are vectors. Here, sparsity is expressed by the parameter k that states that A has a total of at most kN nonzeros, i.e., an average number of k nonzeros per column. The extreme choices for parameter k are well studied special cases, namely for k=1 permuting and for k=N dense matrix-vector multiplication. We study the worst-case complexity of this computational task, i.e., what is the best possible upper bound on the number of I/Os depending on k and N only. We determine this complexity up to a constant factor for large ranges of the parameters. By our arguments, we find that most matrices with kN nonzeros require this number of I/Os, even if the program may depend on the structure of the matrix. The model of computation for the lower bound is a combination of the I/O-models of Aggarwal and Vitter, and of Hong and Kung. We study two variants of the problem, depending on the memory layout of A. If A is stored in column major layout, SpMV has I/O complexity Θ(min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}) for k ≤ N^{1-ε} and any constant 0 < ε < 1. If the algorithm can choose the memory layout, the I/O complexity of SpMV is Θ(min{(kN/B)(1 + log_{M/B}(N/(kM))), kN}) for k ≤ N^{1/3}. In the cache oblivious setting with tall cache assumption M ≥ B^{1+ε}, the I/O complexity is O((kN/B)(1 + log_{M/B}(N/k))) for A in column major layout.


ETH Library
Optimal sparse matrix dense vector multiplication in the I/O-model
Report
Author(s):
Bender, Michael A.; Brodal, Gerth Stølting; Fagerberg, Rolf; Jacob, Riko; Vicari, Elias
Publication date:
2006
Permanent link:
https://doi.org/10.3929/ethz-a-006781087
Rights / license:
In Copyright - Non-Commercial Use Permitted
Originally published in:
Technical Report / ETH Zurich, Department of Computer Science 523
This page was generated automatically upon download from the ETH Zurich Research Collection.
For more information, please consult the Terms of use.

Optimal Sparse Matrix Dense Vector Multiplication in the
I/O-Model
ETH Technical Report 523
Michael A. Bender
Gerth Stølting Brodal
Rolf Fagerberg
Riko Jacob§
Elias Vicari§
June 28, 2006
Abstract
We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O model. In the SpMV, the objective is to compute y = Ax, where A is a sparse matrix and x and y are vectors. We give tight upper and lower bounds on the number of block transfers as a function of the sparsity k, the number of nonzeros in a column of A.
Parameter k is a knob that bridges the problems of permuting (k = 1) and dense matrix multiplication (k = N). When the nonzero elements of A are stored in column-major order, SpMV takes O(min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}) memory transfers and has a lower bound of min{κ(ε)(kN/B)(1 + log_{M/B}(N/max{M,k})), κ'(ε)kN}, for k ≤ N^ε, 0 < ε < 1. If N ≤ M the problem is trivially Θ(N/B). Thus, these bounds are tight. When A's layout can be optimized, SpMV takes O(min{(kN/B)(1 + log_{M/B}(N/(kM))), kN}) memory transfers and has a lower bound of min{(kN/B)(1 + log_{M/B}(N/(kM))), kN} memory transfers, for k ≤ N^{1/3}. As before, these bounds are tight.
Department of Computer Science, Stony Brook, NY 11794, USA. E-mail: bender@cs.sunysb.edu. Supported in part by NSF Grant CCR-0208670.
BRICS, Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation, University of Aarhus, Aarhus, Denmark. gerth@daimi.au.dk. Partially supported by the Danish Research Agency.
Department of Mathematics and Computer Science, University of Southern Denmark, DK-5230 Odense M, Denmark. Supported in part by the Danish Natural Science Research Council (SNF). E-mail: rolf@imada.sdu.dk.
§ETH Zurich, Institute of Theoretical Computer Science, 8092 Zurich, Switzerland. E-mail: {rjacob,vicariel}@inf.ethz.ch.

1 Introduction
Sparse-matrix dense-vector multiplication (SpMV) is one of the core operations in the computational sciences.
The idea of SpMV is to compute y = Ax, where A is a sparse matrix (most of its entries are zero) and
x is a vector. Applications abound in scientific computing, computer science, and engineering, including
iterative linear-system solvers, least-squares problems, eigenvalue problems, data mining, and web search
(e.g., computing page rank). In these and other applications, the same sparse matrix is used repeatedly;
only the vector x changes.
It has long been known that SpMV is memory limited, typically exhibiting poor cache (and I/O) perfor-
mance. According to [16], untuned code runs at below 10% of machine peak, and tuned code runs at a somewhat
higher percentage of machine peak. Untuned code is likely to run even more inefficiently, as memory hier-
archies grow “steeper.” In contrast, dense matrix-vector multiplication (e.g., in dense linear-system solvers)
does not suffer from this memory bottleneck. Because the same matrix is used repeatedly in these SpMV
applications, it is worth the effort to lay out and encode the matrices to optimize performance. Examples
of techniques include “register blocking” and “cache blocking,” which are designed to optimize register and
cache use, respectively. See, e.g., [3, 16] for excellent surveys of the dozens of papers on this topic; sparse
matrix libraries include [4, 7, 10, 11, 15, 16]. In these papers, the metric is the running time on test instances
and current hardware.
In this paper we analyze the SpMV problem in the disk access machine [1] and cache-oblivious [5] models.
Our objective is to analyze worst-case instances of matrices and to gain asymptotic insight into running
times on current and future hardware. The disk access machine (DAM) model is a two-level abstraction of a
memory hierarchy, modeling either cache and main memory or main memory and disk. The small memory
level has size M, the large level is unbounded, and the block-transfer size is B. The objective is to minimize
the number of block transfers between the two levels. The cache-oblivious (CO) model enables one to reason
about a two-level model but proves results about an unknown, multilevel memory hierarchy. The CO model
is essentially the DAM model, except that the block size B and main memory size M are unknown to the
coder or algorithm designer. The main idea of the CO model is that if it can be proved that some algorithm
performs a nearly optimal number of memory transfers in the DAM model without parameterizing by B
and M, then the algorithm also performs a nearly optimal number of memory transfers on any unknown,
multilevel memory hierarchy.
We give upper and lower bounds on the I/O complexity of the SpMV in both the DAM and CO models.
Our analyses are parameterized by the degree of sparsity of the matrix. Specifically, we let parameter k be
the number of nonzeros in each column of the N by N matrix. We show how the upper bound depends on
matrix layout and we give lower bounds that apply for all layouts. Our results apply to worst-case instances.
To the best of our knowledge, our results represent the first upper and lower I/O bounds for this important
computational problem.
One appealing aspect of the SpMV with k as the sparsity parameter is that the problem acts as a bridge between two well studied problems in the DAM model, dense matrix multiplication and permuting. When k = Θ(N), the matrix is dense. In this case, matrix multiplication requires Θ(N^2/B) memory transfers. This analysis is tight since it matches the scan bound, i.e., the cost to scan the matrix. When k = 1, i.e., the matrix is sparse, the SpMV requires Θ(min{(N/B) log_{M/B}(N/B), N}) memory transfers [1]. This is because when k = 1, SpMV is a minor generalization of external-memory permuting.
We now explain this permutation bound and its connection to the SpMV; see, e.g., [14]. Permutation matrices are a particular kind of sparse matrix for k = 1: the non-zeros are 1's and there is exactly one 1 in each row and column. To permute N elements, either sort by final destination, which requires Θ((N/B) log_{M/B}(N/B)) memory transfers, or put each element in its final destination, which requires O(N) memory transfers. The permutation cost is the minimum of these two quantities. A counting argument shows that this bound is tight, and holds for any layout of the permutation matrix and vectors. The same strategy applies for any sparse matrix with k = 1. For nonzeros that are not 1's, multiply the elements of the source vector before permuting. If some column or row has two nonzeros, then one element of the source vector x may have several final destinations or one element of the target vector y may be composed of several elements of x.
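To make the access pattern concrete, the following small sketch (ours, not from the paper; the function name and the triple representation are illustrative only) carries out the "direct" strategy for a matrix stored in column-major order as (row, column, value) triples. The vector x is read sequentially, column by column, while every product is added straight into its destination y[i]; in the I/O-model each such update may touch an arbitrary block of y, which is why this strategy costs O(kN) memory transfers in the worst case.

def spmv_column_major_direct(triples, x, n):
    """Direct SpMV for nonzeros given in column-major order as (i, j, a_ij) triples:
    sequential access to x (grouped by column j), random access to y (row i)."""
    y = [0.0] * n
    for i, j, a in triples:      # triples are sorted by column j, as in column-major layout
        y[i] += a * x[j]         # each update may land in an arbitrary block of y
    return y

# Example: a 3 x 3 matrix with k = 2 nonzeros per column, in column-major order.
triples = [(0, 0, 1.0), (2, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 2, 6.0)]
print(spmv_column_major_direct(triples, x=[1.0, 1.0, 1.0], n=3))   # [3.0, 8.0, 10.0]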
Results. In this paper we give upper and lower bounds parameterized by k on the number of memory
transfers to solve the SpMV. Our bounds show how the permutation bound gradually transforms into the
dense matrix-multiplication bound as k increases. Specifically, we prove the following:
We give an upper bound parameterized by k on the cost for the SpMV when the (nonzero) elements of the matrices are stored in column-major order. Specifically, the cost for the SpMV is O(min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}).
This bound generalizes the permutation bound above, where the first term measures a generalization of sorting by destination, and the second term measures moving each element directly to its final destination. (A small numeric illustration of how this bound interpolates between the permuting and dense bounds follows this list of results.)
We also give an upper bound parameterized by k on the cost for the SpMV when the (nonzero) elements of the matrices can be stored in arbitrary order. The cost for the SpMV now reduces to O(min{(kN/B)(1 + log_{M/B}(N/(kM))), kN}).
We next define a model of computation to prove lower bounds on SpMV.
We give a lower bound parameterized by k on the cost for the SpMV when the nonzero elements of the matrices are stored in column-major order. Thus, the cost for the SpMV is min{κ(ε)(kN/B)(1 + log_{M/B}(N/max{k, M})), κ'(ε)kN}.
This result applies for k ≤ N^ε, 0 < ε < 1 (and the trivial conditions that B > 2, M ≥ 4B). This shows that our algorithm is optimal up to a constant factor.
We conclude with a lower bound parameterized by k on the cost for the SpMV when the nonzero elements of the matrices can be stored in any order, and, for k > 21, even if the layout of the input and output vector can be chosen by the algorithm. Thus, the cost for the SpMV is min{(kN/B)(1 + log_{M/B}(N/(kM))), kN} for the former case. This result applies for k ≤ N^{1/3} (and the trivial conditions that B > 6, M ≥ 3B).
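To see concretely how these expressions bridge the permuting bound and the dense scan bound, the following numeric sketch (ours; the function name and the sample parameters are illustrative only, and constant factors are ignored) evaluates the column-major upper bound for a few values of k:

import math

def spmv_io_bound_column_major(N, k, M, B):
    """Evaluate, up to constant factors, min{(kN/B)(1 + log_{M/B}(N/max{M,k})), kN}."""
    sort_like = (k * N / B) * (1 + max(0.0, math.log(N / max(M, k), M / B)))
    direct = k * N
    return min(sort_like, direct)

# Example: N = 2**20, M = 2**16, B = 2**10; k sweeps from permuting-like to denser matrices.
for k in (1, 8, 64, 512, 4096):
    print(k, round(spmv_io_bound_column_major(2**20, k, 2**16, 2**10)))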
Map. This paper is organized as follows: In Section 3 we present upper bounds on the SpMV. We prove the upper bound for column-major layout, and then for free layout. We conclude with a description of an upper bound in the cache-oblivious model. In Section 2 we describe the computational model in which our lower bounds hold. Section 4 presents our lower bound for column-major layouts. Section 5 presents our lower bound for free layouts.
We use the established ℓ = O(f(N, k, M, B)) notation, which here means that there exists a constant c > 0 such that ℓ ≤ c · f(N, k, M, B) for all N, k, M, B, unless otherwise stated.
1.1 Roadmap of the Lower Bound
Our lower bound is the main technical contribution of this paper, and we here give a high-level outline of it. The overall reasoning is to prove that there are many different inputs that all require different program executions, and then bound the number of essentially different program executions with at most ℓ I/Os, thereby yielding a lower bound on ℓ. For this, we first normalize and simplify the program execution. Then, we bound the number of program executions by encoding the program with a short string over a small alphabet.
In general (for upper and lower bounds), we consider N × N matrices with kN non-zero entries. For the lower bound, we in particular focus on a special case of sparse matrices, namely the k-regular matrices, characterized by exactly k non-zero entries in each column. We define the conformation of a matrix as the locations of its non-zero entries.
We observe that in the machine model we use (defined in Section 2), we may assume that the algorithm is normalized in the sense that only "canonical" intermediate results can be produced. This again allows us to uniquely assign to every intermediate result either the one variable x_j it is representing, or the result c_i it contributes to. Using this, we then define three traces of the computation. These are the compact encodings describing the actions of the normalized algorithm. There is a time-forward trace of the movement of the x_j, and a time-backward trace of the movement of partial results. Additionally, based on these two traces describing the movements, there is a third trace that gives a compact representation of the algebraic operations. We show that these traces uniquely determine the normalized algorithm, and hence the conformation of the matrix it is computing.
Finally, we calculate the number of different traces of computations with ℓ I/Os, and we compare it to the number of different k-regular conformations of N × N matrices. This shows in particular that there are many sparse matrices that require many I/Os.
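To convey the flavor of this counting argument, here is an informal back-of-the-envelope version (ours; constants and side conditions are suppressed, the precise statements appear in the lower bound sections). The number of k-regular conformations of an N × N matrix is C(N,k)^N ≥ (N/k)^{kN}, since every column independently chooses k of the N rows for its non-zero entries. If every I/O of a normalized program can be encoded with roughly B log(M/B) + O(log ℓ) bits, then a whole trace with ℓ I/Os is described by about ℓ(B log(M/B) + O(log ℓ)) bits. Since different conformations force different traces, this quantity must be at least on the order of kN log(N/k), which already yields the sorting-type term ℓ = Ω((kN/B) log_{M/B}(N/k)); the max{M, k} refinement and the constants κ(ε), κ'(ε) come out of the more careful encoding.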
2 Model of Computation
Our aim is to analyze the I/O cost of computing a matrix-vector product. I/Os are generated by movement
of data, so our real object of study is the dataflow of matrix-vector product algorithms, and the interaction
of this dataflow with the memory hierarchy. Hence, we need a model of computation defining the allowed
forms of intermediate results and their possible movements in the memory hierarchy.
In this section, we define our model. Intuitively, it contains all algorithms which compute the values c_i = Σ_j a_{ij} x_j of the output vector through multiplications and additions, starting from the coefficients and variables of the input matrix and vector. The model encompasses all proposed algorithms that we are aware of. We discuss the model further in Section 2.1.
Our model is based on the notion of a commutative semiring S, i.e., a set of numbers with addition and multiplication, where both operations are assumed to be associative and commutative, and multiplication distributes over addition. There is a neutral element 0 for addition, 1 for multiplication, and multiplication with 0 yields 0. In contrast to a field, there are no inverse elements guaranteed, neither for addition nor for multiplication. One well investigated example of such a semiring (actually having multiplicative inverses) is the max-plus algebra (tropical algebra), where matrix multiplication can be used, for example, to compute shortest paths with negative edge lengths. Another semiring obviously is R with the usual addition and multiplication. The semiring model is established in the context of matrix multiplication, see Section 2.1.
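Purely as an illustration of this algebraic setting (our sketch, not part of the paper's formalism; the names Semiring, REAL, MAX_PLUS, and spmv are ours), a commutative semiring and the SpMV computation over it can be written as follows:

from dataclasses import dataclass
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (S, +, x, 0, 1); no additive or multiplicative inverses assumed."""
    add: Callable[[Any, Any], Any]   # associative and commutative
    mul: Callable[[Any, Any], Any]   # associative, commutative, distributes over add
    zero: Any                        # neutral element of add, absorbing for mul
    one: Any                         # neutral element of mul

# The reals with the usual addition and multiplication.
REAL = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0, one=1.0)

# The max-plus (tropical) semiring: "addition" is max, "multiplication" is +.
MAX_PLUS = Semiring(add=max, mul=lambda a, b: a + b, zero=-math.inf, one=0.0)

def spmv(nonzeros, x, n, S):
    """c_i = semiring sum over j of a_ij (x) x_j, with nonzeros given as (i, j, a_ij) triples."""
    c = [S.zero] * n
    for i, j, a in nonzeros:
        c[i] = S.add(c[i], S.mul(a, x[j]))
    return c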
We now define our model, which we call the semiring I/O-machine. It has an infinite size disk D, organized in tracks of B numbers each, and main memory containing M numbers. Accordingly, a configuration can be described by a vector of M numbers M = (m_1, . . . , m_M), and an infinite sequence D of tracks modeled by vectors t_i ∈ S^B. A step of the computation leads to a new configuration according to the following allowed operations:
Computation on numbers in main memory: algebraic evaluation m_i := m_j + m_k, m_i := m_j × m_k, setting m_i := 0, setting m_i := 1, and assigning m_i := m_j,
Input operations, each of which moves an arbitrary track of the disk into the first B cells of memory, (m_1, . . . , m_B) := t_i, t_i := 0.
Output operations, each of which copies the first B cells of memory to an arbitrary track, t_i := (m_1, . . . , m_B), assuming t_i = 0 beforehand.
Input and output operations are collectively called I/O operations. Note that in an input operation we move a track. This may cause at most a factor two loss with respect to a copy operation, but it is important as it preserves a time symmetry, as will become clear later.
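The configuration and the allowed steps can likewise be mirrored in a small toy simulator (again our sketch, not the paper's formalism; it reuses the Semiring class from the sketch above, and the class name and the I/O counter are our own additions):

class SemiringIOMachine:
    """Toy semiring I/O-machine: a main memory of M cells and a disk of tracks of B cells each."""

    def __init__(self, M, B, S):
        assert M >= B
        self.M, self.B, self.S = M, B, S
        self.mem = [S.zero] * M      # main memory m_1, ..., m_M
        self.disk = {}               # track index -> list of B numbers (absent track = all zeros)
        self.io_count = 0            # number of I/O operations performed so far

    # Computation steps on numbers in main memory (free of charge in the I/O-model).
    def add(self, i, j, k):  self.mem[i] = self.S.add(self.mem[j], self.mem[k])
    def mul(self, i, j, k):  self.mem[i] = self.S.mul(self.mem[j], self.mem[k])
    def set_zero(self, i):   self.mem[i] = self.S.zero
    def set_one(self, i):    self.mem[i] = self.S.one
    def assign(self, i, j):  self.mem[i] = self.mem[j]

    # I/O steps.
    def input(self, t):
        """Move track t into the first B memory cells; the track becomes empty (all zeros)."""
        self.mem[:self.B] = self.disk.pop(t, [self.S.zero] * self.B)
        self.io_count += 1

    def output(self, t):
        """Copy the first B memory cells to track t, which the model assumes to be empty."""
        assert t not in self.disk    # corresponds to the assumption t_i = 0 before an output
        self.disk[t] = list(self.mem[:self.B])
        self.io_count += 1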
A program is a finite sequence of operations allowed in the model, and an algorithm is a set of programs.
For the sparse matrix-vector multiplication, we allow the algorithm to choose the program based on N and

References

Gaussian elimination is not optimal. Gives an algorithm that computes the product of two square matrices of order n with fewer than 4.7 n^{log 7} arithmetical operations.

The input/output complexity of sorting and related problems. Provides tight upper and lower bounds on the number of I/Os between internal memory and secondary storage for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.

External memory algorithms and data structures: dealing with massive data. Surveys the state of the art in the design and analysis of external memory algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs.

SPARSKIT: A basic tool kit for sparse matrix computations. Presents the main features of a tool package for manipulating and working with sparse matrices, intended to provide basic tools to facilitate the exchange of software and data between researchers in sparse matrix computations.

I/O complexity: The red-blue pebble game. Uses the red-blue pebble game formulation to prove a number of lower bound results for the I/O requirement, providing insight into the task of balancing I/O and computation in special-purpose system designs.
Frequently Asked Questions (12)
Q1. What contributions have the authors mentioned in the paper "Optimal sparse matrix dense vector multiplication in the I/O-model"?

The authors analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O model.

The idea used here to gain efficiency over plain sorting is to add partial sums used for the same output value as soon as they meet (reside in memory simultaneously) during the sorting. 
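As a rough illustration of this idea (our sketch, not the paper's algorithm; run lengths, buffering, and the block structure of real I/O are ignored, and the function name is ours), merging runs of (row, partial sum) pairs sorted by destination row while adding colliding partial sums could look as follows:

import heapq

def merge_with_online_additions(runs):
    """Merge runs of (row, partial_sum) pairs, each sorted by row index; partial sums destined
    for the same output row are added as soon as they meet in the merge."""
    merged = []
    for row, val in heapq.merge(*runs, key=lambda p: p[0]):
        if merged and merged[-1][0] == row:
            merged[-1] = (row, merged[-1][1] + val)   # combine partial sums on the fly
        else:
            merged.append((row, val))
    return merged

# Example: two sorted runs of partial sums for output rows 0..2.
print(merge_with_online_additions([[(0, 1.0), (2, 5.0)], [(0, 2.0), (1, 3.0), (2, 1.0)]]))
# [(0, 3.0), (1, 3.0), (2, 6.0)]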

For M > 4B, the number of merging steps until the (average) length of a run is N, i.e., until there are k runs, is O(log_{M/B}(N/(kM))).

For matrices stored in column-major layout, any algorithm computing the product of the all ones vector with a sparse matrix can be used (with one additional scan) to compute a matrix-vector product with the same matrix. 

Every result of Q is given by a polynomial q on the input and it is equal to the multilinear result p of the computation for an open set C of inputs. 

In memory, each group will form a file consisting of N/k sorted runs, which by the cache-oblivious adaptive sorting algorithm of [2] can be sorted using O((n/B) log_{M/B}(N/k)) I/Os, where n is the number of coefficients in the group.

The algorithm finishes phase two by simply merging (again with online additions) each run into the first, at a total I/O cost of O(kN/B) for phase two.

Theorem 3. Assume an algorithm computes the row sums for all k-regular N×N matrices stored in column major layout in the algebraic I/O-model using only canonical partial results with at most ℓ(k, N) I/Os.

Due to the merging, no run can ever become longer than N, as this is the number of output values, so at the start of phase two, the authors have at most k runs of length at most N.

For B > 2, M ≥ 4B, and k ≤ N^{1-ε}, 0 < ε < 1, there is the lower bound ℓ(k, N) ≥ min{κ · (kN/B) log_{M/B}(N/max{k, M}), (1/8) · (ε/(2−ε)) · kN} for κ = min{ε/3, (1−ε)^2/2, 1/16}.