Maximizing the Spread of Influence through a Social
Network
David Kempe
Dept. of Computer Science
Cornell University, Ithaca NY
kempe@cs.cornell.edu
Jon Kleinberg
Dept. of Computer Science
Cornell University, Ithaca NY
kleinber@cs.cornell.edu
Éva Tardos
Dept. of Computer Science
Cornell University, Ithaca NY
eva@cs.cornell.edu
ABSTRACT
Models for the processes by which ideas and influence propagate
through a social network have been studied in a number of do-
mains, including the diffusion of medical and technological innova-
tions, the sudden and widespread adoption of various strategies in
game-theoretic settings, and the effects of “word of mouth” in the
promotion of new products. Recently, motivated by the design of
viral marketing strategies, Domingos and Richardson posed a fun-
damental algorithmic problem for such social network processes:
if we can try to convince a subset of individuals to adopt a new
product or innovation, and the goal is to trigger a large cascade of
further adoptions, which set of individuals should we target?
We consider this problem in several of the most widely studied
models in social network analysis. The optimization problem of
selecting the most influential nodes is NP-hard here, and we pro-
vide the first provable approximation guarantees for efficient algo-
rithms. Using an analysis framework based on submodular func-
tions, we show that a natural greedy strategy obtains a solution that
is provably within 63% of optimal for several classes of models;
our framework suggests a general approach for reasoning about the
performance guarantees of algorithms for these types of influence
problems in social networks.
We also provide computational experiments on large collabora-
tion networks, showing that in addition to their provable guaran-
tees, our approximation algorithms significantly out-perform node-
selection heuristics based on the well-studied notions of degree
centrality and distance centrality from the field of social networks.
Categories and Subject Descriptors
F.2.2 [Analysis of Algorithms and Problem Complexity]: Non-
numerical Algorithms and Problems
Supported by an Intel Graduate Fellowship and an NSF Graduate
Research Fellowship.
Supported in part by a David and Lucile Packard Foundation Fel-
lowship and NSF ITR/IM Grant IIS-0081334.
Supported in part by NSF ITR grant CCR-011337, and ONR grant
N00014-98-1-0589.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGKDD ’03 Washington, DC, USA
Copyright 2003 ACM 1-58113-737-0/03/0008 ...$5.00.
Keywords
approximation algorithms, social networks, viral marketing,
diffusion of innovations
1. INTRODUCTION
A social network, the graph of relationships and interactions
within a group of individuals, plays a fundamental role as a
medium for the spread of information, ideas, and influence among
its members. An idea or innovation will appear (for example, the
use of cell phones among college students, the adoption of a new
drug within the medical profession, or the rise of a political
movement in an unstable society), and it can either die out quickly
or make significant inroads into the population. If we want to un-
derstand the extent to which such ideas are adopted, it can be im-
portant to understand how the dynamics of adoption are likely to
unfold within the underlying social network: the extent to which
people are likely to be affected by decisions of their friends and
colleagues, or the extent to which “word-of-mouth” effects will
take hold. Such network diffusion processes have a long history
of study in the social sciences. Some of the earliest systematic
investigations focused on data pertaining to the adoption of medi-
cal and agricultural innovations in both developed and developing
parts of the world [8, 27, 29]; in other contexts, research has inves-
tigated diffusion processes for “word-of-mouth” and “viral market-
ing” effects in the success of new products [4, 7, 10, 13, 14, 20, 26],
the sudden and widespread adoption of various strategies in game-
theoretic settings [6, 12, 21, 32, 33], and the problem of cascading
failures in power systems [2, 3].
In recent work, motivated by applications to marketing, Domin-
gos and Richardson posed a fundamental algorithmic problem for
such systems [10, 26]. Suppose that we have data on a social
network, with estimates for the extent to which individuals influ-
ence one another, and we would like to market a new product that
we hope will be adopted by a large fraction of the network. The
premise of viral marketing is that by initially targeting a few “influ-
ential” members of the network (say, by giving them free samples
of the product), we can trigger a cascade of influence by which
friends will recommend the product to other friends, and many in-
dividuals will ultimately try it. But how should we choose the few
key individuals to use for seeding this process? In [10, 26], this
question was considered in a probabilistic model of interaction;
heuristics were given for choosing customers with a large overall
effect on the network, and methods were also developed to infer the
influence data necessary for posing these types of problems.
In this paper, we consider the issue of choosing influential sets of
individuals as a problem in discrete optimization. The optimization
problem is NP-hard for most models that have been studied, including
the model of [10]. The framework proposed in [26], on the other

hand, is based on a simple linear model where the solution to the
optimization problem can be obtained by solving a system of linear
equations. Here we focus on a collection of related, NP-hard mod-
els that have been extensively studied in the social networks com-
munity, and obtain the first provable approximation guarantees for
efficient algorithms in a number of general cases. The generality
of the models we consider lies between that of the polynomial-time
solvable model of [26] and the very general model of [10], where
the optimization problem cannot even be approximated to within a
non-trivial factor.
We begin by departing somewhat from the Domingos-Richardson
framework in the following sense: where their models are essen-
tially descriptive, specifying a joint distribution over all nodes’ be-
havior in a global sense, we focus on more operational models
from mathematical sociology [15, 28] and interacting particle sys-
tems [11, 17] that explicitly represent the step-by-step dynamics
of adoption. We show that approximation algorithms for maximiz-
ing the spread of influence in these models can be developed in
a general framework based on submodular functions [9, 23]. We
also provide computational experiments on large collaboration net-
works, showing that in addition to their provable guarantees, our al-
gorithms significantly out-perform node-selection heuristics based
on the well-studied notions of degree centrality and distance cen-
trality [30] from the field of social network analysis.
Two Basic Diffusion Models. In considering operational models
for the spread of an idea or innovation through a social network
G, represented by a directed graph, we will speak of each indi-
vidual node as being either active (an adopter of the innovation)
or inactive. We will focus on settings, guided by the motivation
discussed above, in which each node’s tendency to become active
increases monotonically as more of its neighbors become active.
Also, we will focus for now on the progressive case in which nodes
can switch from being inactive to being active, but do not switch in
the other direction; it turns out that this assumption can easily be
lifted later. Thus, the process will look roughly as follows from the
perspective of an initially inactive node v: as time unfolds, more
and more of v’s neighbors become active; at some point, this may
cause v to become active, and v’s decision may in turn trigger fur-
ther decisions by nodes to which v is connected.
Granovetter and Schelling were among the first to propose mod-
els that capture such a process; their approach was based on the use
of node-specific thresholds [15, 28]. Many models of this flavor
have since been investigated (see e.g. [5, 15, 18, 19, 21, 25, 28, 29,
31, 32, 33]), but the following Linear Threshold Model lies at the
core of most subsequent generalizations. In this model, a node v is
influenced by each neighbor w according to a weight b_{v,w} such
that Σ_{w neighbor of v} b_{v,w} ≤ 1. The dynamics of the process
then proceed as follows. Each node v chooses a threshold θ_v
uniformly at random from the interval [0, 1]; this represents the
weighted fraction of v’s neighbors that must become active in order
for v to become active. Given a random choice of thresholds, and an
initial set of active nodes A_0 (with all other nodes inactive), the
diffusion process unfolds deterministically in discrete steps: in
step t, all nodes that were active in step t − 1 remain active, and
we activate any node v for which the total weight of its active
neighbors is at least θ_v:

    Σ_{w active neighbor of v} b_{v,w} ≥ θ_v.
Thus, the thresholds θ_v intuitively represent the different latent
tendencies of nodes to adopt the innovation when their neighbors do;
the fact that these are randomly selected is intended to model our
lack of knowledge of their values: we are in effect averaging
over possible threshold values for all the nodes. (Another class of
approaches hard-wires all thresholds at a known value like 1/2; see
for example work by Berger [5], Morris [21], and Peleg [25].)
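The threshold dynamics just described can be sketched in a few lines of code. The following is a minimal illustration, not code from the paper; the dictionary encoding of the weights b_{v,w} and the function name are our own choices.

```python
import random

def run_linear_threshold(weights, seed_set, rng=random.Random(0)):
    """One run of the Linear Threshold process.

    weights[v] maps each neighbor w of v to b_{v,w}, with
    sum(weights[v].values()) <= 1.  Each threshold theta_v is drawn
    uniformly from [0, 1]; an inactive node activates once the total
    weight of its active neighbors reaches theta_v.  Activation is
    progressive: nodes never switch back to inactive.
    """
    theta = {v: rng.random() for v in weights}
    active = set(seed_set)
    changed = True
    while changed:
        changed = False
        for v in weights:
            if v in active:
                continue
            total = sum(b for w, b in weights[v].items() if w in active)
            if total >= theta[v]:
                active.add(v)
                changed = True
    return active

# Toy 3-node chain: b listens only to a, c listens only to b.
# With weight 1.0 edges, activation propagates for any thresholds.
weights = {"a": {}, "b": {"a": 1.0}, "c": {"b": 1.0}}
print(run_linear_threshold(weights, {"a"}))
```

Because the toy weights are 1.0 and thresholds lie in [0, 1), the whole chain activates regardless of the random thresholds; with fractional weights the outcome would vary run to run.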
Based on work in interacting particle systems [11, 17] from prob-
ability theory, we can also consider dynamic cascade models for
diffusion processes. The conceptually simplest model of this type
is what one could call the Independent Cascade Model, investi-
gated recently in the context of marketing by Goldenberg, Libai,
and Muller [13, 14]. We again start with an initial set of active
nodes A_0, and the process unfolds in discrete steps according to
the following randomized rule. When node v first becomes active
in step t, it is given a single chance to activate each currently inac-
tive neighbor w; it succeeds with a probability p_{v,w} (a parameter
of the system), independently of the history thus far. (If w has
multiple newly activated neighbors, their attempts are sequenced in
an arbitrary order.) If v succeeds, then w will become active in step
t + 1; but whether or not v succeeds, it cannot make any further at-
tempts to activate w in subsequent rounds. Again, the process runs
until no more activations are possible.
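The cascade rule can likewise be sketched in code. This is an illustrative simulation, not code from the paper; the edge-probability encoding and names are our own.

```python
import random

def run_independent_cascade(prob, seed_set, rng=random.Random(1)):
    """One run of the Independent Cascade process.

    prob[(v, w)] is the activation probability p_{v,w} on the directed
    edge (v, w).  Each node, on first becoming active, gets a single
    chance to activate each currently inactive out-neighbor.
    """
    active = set(seed_set)
    frontier = list(seed_set)          # nodes activated in the last step
    while frontier:
        next_frontier = []
        for v in frontier:
            for (src, w), p in prob.items():
                if src == v and w not in active and rng.random() < p:
                    active.add(w)      # w becomes active in step t + 1
                    next_frontier.append(w)
        frontier = next_frontier
    return active

# Deterministic toy instance: the first two edges always fire,
# the last never does, so the cascade from {a} reaches exactly a, b, c.
prob = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 0.0}
print(run_independent_cascade(prob, {"a"}))
```

Scanning all edges per frontier node is quadratic but keeps the sketch short; an adjacency list would be the natural optimization.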
The Linear Threshold and Independent Cascade Models are two
of the most basic and widely-studied diffusion models, but of course
many extensions can be considered. We will turn to this issue later
in the paper, proposing a general framework that simultaneously
includes both of these models as special cases. For the sake of con-
creteness in the introduction, we will discuss our results in terms of
these two models in particular.
Approximation Algorithms for Influence Maximization. We are
now in a position to formally express the Domingos-Richardson
style of optimization problem (choosing a good initial set of
nodes to target) in the context of the above models. Both the
Linear Threshold and Independent Cascade Models (as well as the
generalizations to follow) involve an initial set of active nodes A_0
that start the diffusion process. We define the influence of a set
of nodes A, denoted σ(A), to be the expected number of active
nodes at the end of the process, given that A is this initial active
set A_0. The influence maximization problem asks, for a parameter
k, to find a k-node set of maximum influence. (When dealing with
algorithms for this problem, we will say that the chosen set A of
k initial active nodes has been targeted for activation by the algo-
rithm.) For the models we consider, it is NP-hard to determine the
optimum for influence maximization, as we will show later.
Our first main result is that the optimal solution for influence
maximization can be efficiently approximated to within a factor
of (1 − 1/e − ε), in both the Linear Threshold and Independent
Cascade models; here e is the base of the natural logarithm and
ε is any positive real number. (Thus, this is a performance guar-
antee slightly better than 63%.) The algorithm that achieves this
performance guarantee is a natural greedy hill-climbing strategy
related to the approach considered in [10], and so the main con-
tent of this result is the analysis framework needed for obtaining a
provable performance guarantee, and the fairly surprising fact that
hill-climbing is always within a factor of at least 63% of optimal
for this problem. We prove this result in Section 2 using techniques
from the theory of submodular functions [9, 23], which we describe
in detail below, and which turn out to provide a natural context for
reasoning about both models and algorithms for influence maxi-
mization.
In fact, this analysis framework allows us to design and prove
guarantees for approximation algorithms in much richer and more
realistic models of the processes by which we market to nodes. The

deterministic activation of individual nodes is a highly simplified
model; an issue also considered in [10, 26] is that we may in reality
have a large number of different marketing actions available, each
of which may influence nodes in different ways. The available bud-
get can be divided arbitrarily between these actions. We show how
to extend the analysis to this substantially more general framework.
Our main result here is that a generalization of the hill-climbing al-
gorithm still provides approximation guarantees arbitrarily close to
(1 − 1/e).
It is worth briefly considering the general issue of performance
guarantees for algorithms in these settings. For both the Linear
Threshold and the Independent Cascade models, the influence max-
imization problem is NP-complete, but it can be approximated well.
In the linear model of Richardson and Domingos [26], on the other
hand, both the propagation of influence as well as the effect of the
initial targeting are linear. Initial marketing decisions here are thus
limited in their effect on node activations; each node’s probability
of activation is obtained as a linear combination of the effect of tar-
geting and the effect of the neighbors. In this fully linear model,
the influence can be maximized by solving a system of linear equa-
tions. In contrast, we can show that general models like that of
Domingos and Richardson [10], and even simple models that build
in a fixed threshold (like 1/2) at all nodes [5, 21, 25], lead to influ-
ence maximization problems that cannot be approximated to within
any non-trivial factor, assuming P ≠ NP. Our analysis of approx-
imability thus suggests a way of tracing out a more delicate bound-
ary of tractability through the set of possible models, by helping to
distinguish among those for which simple heuristics provide strong
performance guarantees and those for which they can be arbitrarily
far from optimal. This in turn can suggest the development of both
more powerful algorithms, and the design of accurate models that
simultaneously allow for tractable optimization.
Following the approximation and NP-hardness results, we de-
scribe in Section 3 the results of computational experiments with
both the Linear Threshold and Independent Cascade Models, show-
ing that the hill-climbing algorithm significantly out-performs strate-
gies based on targeting high-degree or “central” nodes [30]. In Sec-
tion 4 we then develop a general model of diffusion processes in
social networks that simultaneously generalizes the Linear Thresh-
old and Independent Cascade Models, as well as a number of other
natural cases, and we show how to obtain approximation guaran-
tees for a large sub-class of these models. In Sections 5 and 6, we
also consider extensions of our approximation algorithms to mod-
els with more realistic scenarios in mind: more complex market-
ing actions as discussed above, and non-progressive processes, in
which active nodes may become inactive in subsequent steps.
2. APPROXIMATION GUARANTEES IN THE
INDEPENDENT CASCADE AND LINEAR
THRESHOLD MODELS
The overall approach. We begin by describing our strategy for
proving approximation guarantees. Consider an arbitrary function
f(·) that maps subsets of a finite ground set U to non-negative real
numbers.¹ We say that f is submodular if it satisfies a natural “di-
minishing returns” property: the marginal gain from adding an ele-
ment to a set S is at least as high as the marginal gain from adding
the same element to a superset of S. Formally, a submodular func-
tion satisfies

    f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T),

for all elements v and all pairs of sets S ⊆ T.

¹Note that the influence function σ(·) defined above has this form;
it maps each subset A of the nodes of the social network to a real
number denoting the expected size of the activated set if A is tar-
geted for initial activation.
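The diminishing-returns inequality can be checked exhaustively on a small example. Coverage functions, which count the size of the union of a family of sets, are a standard monotone submodular family; the family FAMILY below is a made-up illustration, not data from the paper.

```python
from itertools import combinations

# f(S) = |union of the sets indexed by S| is monotone submodular.
FAMILY = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6}}

def cover(S):
    covered = set()
    for name in S:
        covered |= FAMILY[name]
    return len(covered)

ground = set(FAMILY)
subsets = [set(c) for r in range(len(ground) + 1)
           for c in combinations(sorted(ground), r)]

# Check f(S + v) - f(S) >= f(T + v) - f(T) for every S subset of T
# and every element v outside T.
for S in subsets:
    for T in subsets:
        if S <= T:
            for v in ground - T:
                gain_S = cover(S | {v}) - cover(S)
                gain_T = cover(T | {v}) - cover(T)
                assert gain_S >= gain_T
print("diminishing returns verified on this coverage function")
```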
Submodular functions have a number of very nice tractability
properties; the one that is relevant to us here is the following. Sup-
pose we have a function f that is submodular, takes only non-
negative values, and is monotone in the sense that adding an ele-
ment to a set cannot cause f to decrease: f(S ∪ {v}) ≥ f(S)
for all elements v and sets S. We wish to find a k-element set S
for which f(S) is maximized. This is an NP-hard optimization
problem (it can be shown to contain the Hitting Set problem as a
simple special case), but a result of Nemhauser, Wolsey, and Fisher
[9, 23] shows that the following greedy hill-climbing algorithm ap-
proximates the optimum to within a factor of (1 − 1/e) (where e
is the base of the natural logarithm): start with the empty set, and
repeatedly add an element that gives the maximum marginal gain.
THEOREM 2.1. [9, 23] For a non-negative, monotone submod-
ular function f, let S be a set of size k obtained by selecting ele-
ments one at a time, each time choosing an element that provides
the largest marginal increase in the function value. Let S* be a
set that maximizes the value of f over all k-element sets. Then
f(S) ≥ (1 − 1/e) · f(S*); in other words, S provides a (1 − 1/e)-
approximation.
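The greedy rule in Theorem 2.1 is only a few lines of code. The sketch below applies it to a hypothetical coverage objective (coverage functions are monotone submodular, so the guarantee applies); the objective and all names are our own illustration.

```python
def greedy_max(ground, f, k):
    """Greedy hill-climbing: start from the empty set and repeatedly
    add the element with the largest marginal gain in f."""
    S = set()
    for _ in range(k):
        # sorted() only to make tie-breaking deterministic
        best = max(sorted(ground - S), key=lambda v: f(S | {v}) - f(S))
        S.add(best)
    return S

# Hypothetical coverage objective over four candidate sets.
SETS = {"s1": {1, 2, 3}, "s2": {2, 3}, "s3": {4, 5}, "s4": {5, 6}}
f = lambda S: len(set().union(*[SETS[v] for v in S]))

chosen = greedy_max(set(SETS), f, 2)
print(chosen, f(chosen))
```

Each of the k rounds evaluates f once per remaining element; for influence maximization those evaluations are themselves only estimated, which is exactly the subtlety the text addresses next.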
Due to its generality, this result has found applications in a num-
ber of areas of discrete optimization (see e.g. [22]); the only direct
use of it that we are aware of in the databases and data mining lit-
erature is in a context very different from ours, for the problem of
selecting database views to materialize [16].
Our strategy will be to show that for the models we are consid-
ering, the resulting influence function σ(·) is submodular. A subtle
difficulty lies in the fact that the result of Nemhauser et al. assumes
that the greedy algorithm can evaluate the underlying function ex-
actly, which may not be the case for the influence function σ(A).
However, by simulating the diffusion process and sampling the re-
sulting active sets, we are able to obtain arbitrarily close approxi-
mations to σ(A), with high probability. Furthermore, one can ex-
tend the result of Nemhauser et al. to show that for any ε > 0, there
is a γ > 0 such that by using (1 + γ)-approximate values for the
function to be optimized, we obtain a (1 − 1/e − ε)-approximation.
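The sampling idea can be sketched as follows: estimate σ(A) by averaging the final active-set size over repeated simulations of the cascade. The instance, run count, and seed below are illustrative assumptions, not values from the paper.

```python
import random

def estimate_sigma(prob, A, runs=2000, rng=random.Random(42)):
    """Monte Carlo estimate of sigma(A) for the Independent Cascade
    Model: average the number of activated nodes over many runs."""
    def one_run():
        active = set(A)
        frontier = list(A)
        while frontier:
            nxt = []
            for v in frontier:
                for (src, w), p in prob.items():
                    if src == v and w not in active and rng.random() < p:
                        active.add(w)
                        nxt.append(w)
            frontier = nxt
        return len(active)
    return sum(one_run() for _ in range(runs)) / runs

# Toy instance: single edge a -> b with p = 0.5, so the true value is
# sigma({a}) = 1 + 0.5 = 1.5; the estimate converges to it.
est = estimate_sigma({("a", "b"): 0.5}, {"a"})
print(est)
```

The number of runs controls the (1 + γ) accuracy in the extended Nemhauser et al. bound: more samples tighten the estimate with high probability.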
As mentioned in the introduction, we can extend this analysis
to a general model with more complex marketing actions that can
have a probabilistic effect on the initial activation of nodes. We
show in Section 6 how, with a more careful hill-climbing algorithm
and a generalization of Theorem 2.1, we can obtain comparable
approximation guarantees in this setting.
A further extension is to assume that each node v has an asso-
ciated non-negative weight w_v, capturing how important it is that
v be activated in the final outcome. (For instance, if we are mar-
keting textbooks to college teachers, then the weight could be the
number of students in the teacher’s class, resulting in a larger or
smaller number of sales.) If we let B denote the (random) set ac-
tivated by the process with initial activation A, then we can define
the weighted influence function σ_w(A) to be the expected value
over outcomes B of the quantity Σ_{v∈B} w_v. The influence func-
tion studied above is the special case obtained by setting w_v = 1
for all nodes v. The objective function with weights is submodular
whenever the unweighted version is, so we can still use the greedy
algorithm for obtaining a (1 − 1/e − ε)-approximation. Note, how-
ever, that a sampling algorithm to approximately choose the next
element may need time that depends on the sizes of the weights.

Independent Cascade
In view of the above discussion, an approximation guarantee for
influence maximization in the Independent Cascade Model will be
a consequence of the following.

THEOREM 2.2. For an arbitrary instance of the Independent
Cascade Model, the resulting influence function σ(·) is submodu-
lar.
In order to establish this result, we need to look, implicitly or
explicitly, at the expression σ(A ∪ {v}) − σ(A), for arbitrary sets
A and elements v. In other words, what increase do we get in the
expected number of overall activations when we add v to the set
A? This increase is very difficult to analyze directly, because it is
hard to work with quantities of the form σ(A). For example, the
Independent Cascade process is underspecified, since we have not
prescribed the order in which newly activated nodes in a given step
t will attempt to activate their neighbors. Thus, it is not initially
obvious that the process is even well-defined, in the sense that it
yields the same distribution over outcomes regardless of how we
schedule the attempted activations.
Our proof deals with these difficulties by formulating an equiv-
alent view of the process, which makes it easier to see that there
is an order-independent outcome, and which provides an alternate
way to reason about the submodularity property.
Consider a point in the cascade process when node v has just be-
come active, and it attempts to activate its neighbor w, succeeding
with probability p_{v,w}. We can view the outcome of this random
event as being determined by flipping a coin of bias p_{v,w}. From
the point of view of the process, it clearly does not matter whether
the coin was flipped at the moment that v became active, or whether
it was flipped at the very beginning of the whole process and is only
being revealed now. Continuing this reasoning, we can in fact as-
sume that for each pair of neighbors (v, w) in the graph, a coin of
bias p_{v,w} is flipped at the very beginning of the process (indepen-
dently of the coins for all other pairs of neighbors), and the result is
stored so that it can be later checked in the event that v is activated
while w is still inactive.
With all the coins flipped in advance, the process can be viewed
as follows. The edges in G for which the coin flip indicated an
activation will be successful are declared to be live; the remaining
edges are declared to be blocked. If we fix the outcomes of the coin
flips and then initially activate a set A, it is clear how to determine
the full set of active nodes at the end of the cascade process:
CLAIM 2.3. A node x ends up active if and only if there is a
path from some node in A to x consisting entirely of live edges.
(We will call such a path a live-edge path.)
Consider the probability space in which each sample point spec-
ifies one possible set of outcomes for all the coin flips on the edges.
Let X denote one sample point in this space, and define σ_X(A) to
be the total number of nodes activated by the process when A is
the set initially targeted, and X is the set of outcomes of all coin
flips on edges. Because we have fixed a choice for X, σ_X(A) is in
fact a deterministic quantity, and there is a natural way to express
its value, as follows. Let R(v, X) denote the set of all nodes that
can be reached from v on a path consisting entirely of live edges.
By Claim 2.3, σ_X(A) is the number of nodes that can be reached
on live-edge paths from any node in A, and so it is equal to the
cardinality of the union ∪_{v∈A} R(v, X).
Proof of Theorem 2.2. First, we claim that for each fixed out-
come X, the function σ_X(·) is submodular. To see this, let S and
T be two sets of nodes such that S ⊆ T, and consider the quantity
σ_X(S ∪ {v}) − σ_X(S). This is the number of elements in R(v, X)
that are not already in the union ∪_{u∈S} R(u, X); it is at least as
large as the number of elements in R(v, X) that are not in the
(bigger) union ∪_{u∈T} R(u, X). It follows that

    σ_X(S ∪ {v}) − σ_X(S) ≥ σ_X(T ∪ {v}) − σ_X(T),

which is the defining inequality for submodularity. Finally, we have

    σ(A) = Σ_{outcomes X} Prob[X] · σ_X(A),

since the expected number of nodes activated is just the weighted
average over all outcomes. But a non-negative linear combination
of submodular functions is also submodular, and hence σ(·) is sub-
modular, which concludes the proof.
Next we show the hardness of influence maximization.
THEOREM 2.4. The influence maximization problem is NP-hard
for the Independent Cascade Model.
Proof. Consider an instance of the NP-complete Set Cover prob-
lem, defined by a collection of subsets S_1, S_2, ..., S_m of a ground
set U = {u_1, u_2, ..., u_n}; we wish to know whether there exist
k of the subsets whose union is equal to U. (We can assume that
k < n < m.) We show that this can be viewed as a special case of
the influence maximization problem.
Given an arbitrary instance of the Set Cover problem, we define
a corresponding directed bipartite graph with n + m nodes: there
is a node i corresponding to each set S_i, a node j corresponding
to each element u_j, and a directed edge (i, j) with activation prob-
ability p_{i,j} = 1 whenever u_j ∈ S_i. The Set Cover problem is
equivalent to deciding if there is a set A of k nodes in this graph
with σ(A) ≥ n + k. Note that for the instance we have defined,
activation is a deterministic process, as all probabilities are 0 or
1. Initially activating the k nodes corresponding to sets in a Set
Cover solution results in activating all n nodes corresponding to
the ground set U, and if any set A of k nodes has σ(A) ≥ n + k,
then the Set Cover problem must be solvable.
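The reduction can be written out directly. This sketch uses our own node-labeling scheme; since every probability is 0 or 1, the influence of a seed set is just deterministic reachability.

```python
def set_cover_to_ic(sets):
    """Build the reduction's bipartite instance: a node per set, a
    node per element, and an edge (set node, element node) with
    p = 1 whenever the element belongs to the set."""
    prob = {}
    for name, members in sets.items():
        for u in members:
            prob[(("set", name), ("elt", u))] = 1.0
    return prob

def sigma_deterministic(prob, A):
    # With all probabilities 0 or 1, influence is plain reachability.
    active = set(A)
    changed = True
    while changed:
        changed = False
        for (v, w), p in prob.items():
            if p == 1.0 and v in active and w not in active:
                active.add(w)
                changed = True
    return len(active)

sets = {"S1": {1, 2}, "S2": {2, 3}, "S3": {3, 4}}
prob = set_cover_to_ic(sets)
# {S1, S3} covers the universe {1, 2, 3, 4}, so targeting those k = 2
# set nodes activates all n = 4 element nodes: sigma = n + k = 6.
A = {("set", "S1"), ("set", "S3")}
print(sigma_deterministic(prob, A))
```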
Linear Thresholds
We now prove an analogous result for the Linear Threshold Model.
THEOREM 2.5. For an arbitrary instance of the Linear Thresh-
old Model, the resulting influence function σ(·) is submodular.
Proof. The analysis is a bit more intricate than in the proof of The-
orem 2.2, but the overall argument has a similar structure. In the
proof of Theorem 2.2, we constructed an equivalent process by ini-
tially resolving the outcomes of some random choices, considering
each outcome in isolation, and then averaging over all outcomes.
For the Linear Threshold Model, the simplest analogue would be to
consider the behavior of the process after all node thresholds have
been chosen. Unfortunately, for a fixed choice of thresholds, the
number of activated nodes is not in general a submodular function
of the targeted set; this fact necessitates a more subtle analysis.
Recall that each node v has an influence weight b_{v,w} ≥ 0 from
each of its neighbors w, subject to the constraint that Σ_w b_{v,w} ≤ 1.
(We can extend the notation by writing b_{v,w} = 0 when w is not a
neighbor of v.) Suppose that v picks at most one of its incoming
edges at random, selecting the edge from w with probability b_{v,w}
and selecting no edge with probability 1 − Σ_w b_{v,w}. The selected
edge is declared to be “live,” and all other edges are declared to
be “blocked.” (Note the contrast with the proof of Theorem 2.2:
there, we determined whether an edge was live independently of
there, we determined whether an edge was live independently of

the decision for each other edge; here, we negatively correlate the
decisions so that at most one live edge enters each node.)
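This correlated edge-selection step can be sketched as follows; the names and graph encoding are our own. The code asserts the property the construction guarantees: at most one live edge enters each node.

```python
import random

def pick_live_edges(weights, rng=random.Random(7)):
    """Each node v selects at most one incoming edge: the edge from
    neighbor w with probability b_{v,w}, and no edge with probability
    1 - sum_w b_{v,w}.  (Contrast with the independent per-edge coin
    flips used for the Independent Cascade Model.)"""
    live = set()
    for v, incoming in weights.items():
        r = rng.random()
        cumulative = 0.0
        for w, b in incoming.items():
            cumulative += b
            if r < cumulative:
                live.add((w, v))   # the live edge enters v from w
                break              # at most one incoming edge chosen
    return live

weights = {"a": {}, "b": {"a": 0.4, "c": 0.3}, "c": {"a": 0.8}}
live = pick_live_edges(weights)

# Each node is the target of at most one live edge.
targets = [v for (_, v) in live]
assert len(targets) == len(set(targets))
print(live)
```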
The crux of the proof lies in establishing Claim 2.6 below, which
asserts that the Linear Threshold Model is equivalent to reachabil-
ity via live-edge paths as defined above. Once that equivalence is
established, submodularity follows exactly as in the proof of The-
orem 2.2. We can define R(v, X) as before to be the set of all
nodes reachable from v on live-edge paths, subject to a choice X
of live/blocked designations for all edges; it follows that σ_X(A) is
the cardinality of the union ∪_{v∈A} R(v, X), and hence a submodu-
lar function of A; finally, the function σ(·) is a non-negative linear
combination of the functions σ_X(·) and hence also submodular.
CLAIM 2.6. For a given targeted set A, the following two dis-
tributions over sets of nodes are the same:
(i) The distribution over active sets obtained by running the Lin-
ear Threshold process to completion starting from A; and
(ii) The distribution over sets reachable from A via live-edge paths,
under the random selection of live edges defined above.
Proof. We need to prove that reachability under our random choice
of live and blocked edges defines a process equivalent to that of
the Linear Threshold Model. To obtain intuition about this equiv-
alence, it is useful to first analyze the special case in which the
underlying graph G is directed and acyclic. In this case, we can
fix a topological ordering of the nodes v_1, v_2, ..., v_n (so that all
edges go from earlier nodes to later nodes in the order), and build
up the distribution of active sets by following this order. For each
node v_i, suppose we already have determined the distribution over
active subsets of its neighbors. Then under the Linear Threshold
process, the probability that v_i will become active, given that a sub-
set S_i of its neighbors is active, is Σ_{w∈S_i} b_{v_i,w}. This is pre-
cisely the probability that the live incoming edge selected by v_i lies
in S_i, and so inductively we see that the two processes define the
same distribution over active sets.
To prove the claim generally, consider a graph G that is not
acyclic. It becomes trickier to show the equivalence, because there
is no natural ordering of the nodes over which to perform induction.
Instead, we argue by induction over the iterations of the Linear
Threshold process. We define A_t to be the set of active nodes
at the end of iteration t, for t = 0, 1, 2, … (note that A_0 is the set
initially targeted). If node v has not become active by the end of
iteration t, then the probability that it becomes active in iteration
t + 1 is equal to the chance that the influence weights in A_t \ A_{t−1}
push it over its threshold, given that its threshold was not exceeded
already; this probability is

    Σ_{u∈A_t\A_{t−1}} b_{v,u} / (1 − Σ_{u∈A_{t−1}} b_{v,u}).
On the other hand, we can run the live-edge process by revealing
the identities of the live edges gradually as follows. We start with
the targeted set A. For each node v with at least one edge from the
set A, we determine whether v's live edge comes from A. If so,
then v is reachable; but if not, we keep the source of v's live edge
unknown, subject to the condition that it comes from outside A.
Having now exposed a new set of reachable nodes A_1 in the first
stage, we proceed to identify further reachable nodes by performing
the same process on edges from A_1, and in this way produce
sets A_2, A_3, …. If node v has not been determined to be reachable
by the end of stage t, then the probability that it is determined to
be reachable in stage t + 1 is equal to the chance that its live edge
comes from A_t \ A_{t−1}, given that its live edge has not come from
any of the earlier sets. But this is

    Σ_{u∈A_t\A_{t−1}} b_{v,u} / (1 − Σ_{u∈A_{t−1}} b_{v,u}),

which is the same as in the Linear Threshold process of the previous
paragraph. Thus, by induction over these stages, we see that the
live-edge process produces the same distribution over active sets
as the Linear Threshold process.
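The equivalence in Claim 2.6 can be checked empirically on a small instance. The sketch below, under assumed toy data (a three-node path with edge weights 0.5), runs both processes by Monte Carlo and compares the resulting expected spread; both should approach 1 + 0.5 + 0.25 = 1.75:

```python
import random

def run_lt(targets, incoming, rng):
    # One run of the Linear Threshold process: each node draws a threshold
    # uniformly at random, then activates once the total weight of its
    # active neighbors reaches that threshold.
    nodes = set(incoming)
    for ws in incoming.values():
        nodes.update(ws)
    theta = {v: rng.random() for v in sorted(nodes)}
    active, changed = set(targets), True
    while changed:
        changed = False
        for v in sorted(nodes - active):
            if sum(b for u, b in incoming.get(v, {}).items() if u in active) >= theta[v]:
                active.add(v)
                changed = True
    return active

def run_live_edge(targets, incoming, rng):
    # Equivalent live-edge run: each node keeps at most one incoming edge,
    # chosen with probability b_vu (no edge with the residual probability);
    # the active set is everything reachable from the targets on live edges.
    live_in = {}
    for v in sorted(incoming):
        r, total = rng.random(), 0.0
        for u, b in incoming[v].items():
            total += b
            if r < total:
                live_in[v] = u
                break
    active, changed = set(targets), True
    while changed:
        changed = False
        for v, u in live_in.items():
            if u in active and v not in active:
                active.add(v)
                changed = True
    return active

# Hypothetical toy instance: edges a -> b and b -> c, each with weight 0.5.
incoming = {'b': {'a': 0.5}, 'c': {'b': 0.5}}
rng = random.Random(0)
trials = 20000
est_lt = sum(len(run_lt({'a'}, incoming, rng)) for _ in range(trials)) / trials
est_le = sum(len(run_live_edge({'a'}, incoming, rng)) for _ in range(trials)) / trials
# Both estimates should be close to the exact expected spread of 1.75.
```

This is only a sanity check on one instance, not a substitute for the inductive proof above, but it makes the distributional equivalence concrete.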
Influence maximization is hard in this model as well.
THEOREM 2.7. The influence maximization problem is NP-hard
for the Linear Threshold model.
Proof. Consider an instance of the NP-complete Vertex Cover problem
defined by an undirected n-node graph G = (V, E) and an integer
k; we want to know if there is a set S of k nodes in G so that
every edge has at least one endpoint in S. We show that this can be
viewed as a special case of the influence maximization problem.
Given an instance of the Vertex Cover problem involving a graph
G, we define a corresponding instance of the influence maximization
problem by directing all edges of G in both directions. If there
is a vertex cover S of size k in G, then one can deterministically
make σ(A) = n by targeting the nodes in the set A = S; conversely,
this is the only way to get a set A with σ(A) = n.
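The excerpt does not spell out the edge weights used in the reduction. One assignment consistent with the argument (an assumption here, not quoted from the paper) is b_{v,u} = 1/deg(v): then a node's incoming weights sum to exactly 1, so it is certain to activate precisely when all of its neighbors are active, and since every node outside a vertex cover S has all its neighbors in S, targeting S activates everything.

```python
from collections import defaultdict

def vertex_cover_to_lt_instance(edges):
    """Map a Vertex Cover instance to a Linear Threshold instance.

    edges: iterable of undirected pairs (u, v). Each edge is directed both
    ways, and the weight b_vu = 1/deg(v) is an assumed choice: with it, a
    node is guaranteed to activate exactly when all its neighbors are active.
    Returns incoming weights as {v: {u: b_vu}}.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {v: {u: 1.0 / len(ns) for u in ns} for v, ns in neighbors.items()}
```

On a triangle, for instance, every node gets two incoming edges of weight 1/2, and targeting any cover of size 2 activates the remaining node with certainty.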
In the proofs of both the approximation theorems in this section,
we established submodularity by considering an equivalent process
in which each node “hard-wired” certain of its incident edges as
transmitting influence from neighbors. This turns out to be a proof
technique that can be formulated in general terms, and directly ap-
plied to give approximability results for other models as well. We
discuss this further in the context of the general framework pre-
sented in Section 4.
3. EXPERIMENTS
In addition to obtaining worst-case guarantees on the perfor-
mance of our approximation algorithm, we are interested in under-
standing its behavior in practice, and comparing its performance
to other heuristics for identifying influential individuals. We find
that our greedy algorithm achieves significant performance gains
over several widely-used structural measures of influence in social
networks [30].
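The greedy strategy evaluated here can be sketched in a few lines. In the sketch below, `spread_estimate` stands in for a Monte Carlo estimate of σ(·); the `influence` dictionary and `cover` function are illustrative stand-ins, not the paper's actual data or code:

```python
def greedy_max_influence(nodes, spread_estimate, k):
    # Greedy hill-climbing: repeatedly add the node with the largest
    # marginal gain in (estimated) spread, until k nodes are chosen.
    chosen = []
    for _ in range(k):
        base = spread_estimate(chosen)
        best_node, best_gain = None, float('-inf')
        for v in nodes:
            if v in chosen:
                continue
            gain = spread_estimate(chosen + [v]) - base
            if gain > best_gain:
                best_node, best_gain = v, gain
        chosen.append(best_node)
    return chosen

# Toy stand-in for sigma: coverage of hypothetical "influence sets".
influence = {'a': {1, 2}, 'b': {2, 3}, 'c': {4}}
cover = lambda A: len(set().union(*(influence[v] for v in A)) if A else set())
```

For monotone submodular functions such as σ(·), this procedure carries the (1 − 1/e) guarantee discussed earlier; with noisy Monte Carlo estimates the guarantee degrades gracefully, as noted in the paper's extension of the Nemhauser et al. result.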
The Network Data. For evaluation, it is desirable to use a network
dataset that exhibits many of the structural features of large-scale
social networks. At the same time, we do not address the issue
of inferring actual influence parameters from network observations
(see e.g. [10, 26]). Thus, for our testbed, we employ a collabo-
ration graph obtained from co-authorships in physics publications,
with simple settings of the influence parameters. It has been argued
extensively that co-authorship networks capture many of the key
features of social networks more generally [24]. The co-authorship
data was compiled from the complete list of papers in the high-
energy physics theory section of the e-print arXiv (www.arxiv.org).²
The collaboration graph contains a node for each researcher who
has at least one paper with co-author(s) in the arXiv database. For
each paper with two or more authors, we inserted an edge for each
pair of authors (single-author papers were ignored). Notice that
this results in parallel edges when two researchers have co-authored
multiple papers; we kept these parallel edges, as they can be
interpreted to indicate stronger social ties between the researchers
involved. The resulting graph has 10748 nodes, and edges between
about 53000 pairs of nodes.
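The construction just described (one edge per co-authored paper, parallel edges preserved, single-author papers dropped) can be sketched as a multigraph builder; the input format, a list of author lists, is an assumption for illustration:

```python
from collections import Counter
from itertools import combinations

def build_coauthor_multigraph(papers):
    """Build the collaboration multigraph described in the text.

    papers: iterable of author lists (a hypothetical input format).
    Returns a Counter mapping each unordered author pair to the number of
    parallel edges between them, i.e. the number of co-authored papers.
    Single-author papers contribute nothing.
    """
    edge_count = Counter()
    for authors in papers:
        distinct = sorted(set(authors))
        if len(distinct) < 2:
            continue  # single-author papers are ignored
        for u, v in combinations(distinct, 2):
            edge_count[(u, v)] += 1
    return edge_count
```

The multiplicity stored for each pair is what lets parallel edges stand in for tie strength in the influence models applied to this graph.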
² We also ran experiments on the co-authorship graphs induced by
theoretical computer science papers. We do not report on the results
here, as they are very similar to the ones for high-energy physics.
