Piggybacking on Social Networks
Aristides Gionis
Aalto University and HIIT
Espoo, Finland
aristides.gionis@aalto.fi
Flavio Junqueira
Microsoft Research
Cambridge, UK
fpj@microsoft.com
Vincent Leroy
Univ. of Grenoble CNRS
Grenoble, France
vincent.leroy@imag.fr
Marco Serafini
QCRI
Doha, Qatar
mserafini@qf.org.qa
Ingmar Weber
QCRI
Doha, Qatar
ingmarweber@acm.org
ABSTRACT
The popularity of social-networking sites has increased rapidly over
the last decade. A basic functionality of social-networking sites is
to present users with streams of events shared by their friends. At a
systems level, materialized per-user views are a common way to as-
semble and deliver such event streams on-line and with low latency.
Access to the data stores, which keep the user views, is a major bot-
tleneck of social-networking systems. We propose to improve the
throughput of these systems by using social piggybacking, which
consists of processing the requests of two friends by querying and
updating the view of a third common friend. By using one such
hub view, the system can serve requests of the first friend with-
out querying or updating the view of the second. We show that,
given a social graph, social piggybacking can minimize the overall
number of requests, but computing the optimal set of hubs is an
NP-hard problem. We propose an O(log n) approximation algo-
rithm and a heuristic to solve the problem, and evaluate them using
the full Twitter and Flickr social graphs, which have up to billions
of edges. Compared to existing approaches, using social piggy-
backing results in similar throughput in systems with few servers,
but enables substantial throughput improvements as the size of the
system grows, reaching up to a 2-factor increase. We also evaluate
our algorithms on a real social networking system prototype and
we show that the actual increase in throughput corresponds nicely
to the gain anticipated by our cost function.
1. INTRODUCTION
Social networking sites have become highly popular in the past
few years. An increasing number of people use social network-
ing applications as a primary medium of finding new and inter-
esting information. Some of the most popular social networking
applications include services like Facebook, Twitter, Tumblr or Ya-
hoo! News Activity. In these applications, users establish connec-
tions with other users and share events: short text messages, URLs,
Work conducted while the authors were with Yahoo! Research.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 6
Copyright 2013 VLDB Endowment 2150-8097/13/04... $ 10.00.
photos, news stories, videos, and so on. Users can browse event
streams, real-time lists of recent events shared by their contacts,
on most social networking sites. A key peculiarity of social net-
working applications compared to traditional Web sites is that the
process of information dissemination is taking place in a many-
to-many fashion instead of the traditional few-to-many paradigm,
posing new system scalability challenges.
In this paper, we study the problem of assembling event streams,
which is the predominant workload of many social networking ap-
plications, e.g., 70% of the page views of Tumblr.¹ Assembling
event streams needs to be done on-line, to include the latest events for
every user, and very fast, as users expect the resulting event streams
to load in fractions of a second.
To put our work in context and to motivate our problem def-
inition, we describe the typical architecture of social networking
systems, and we discuss the process of assembling event streams.
We consider a system similar to the one depicted in Figure 1. In
such a system, information about users, the social graph, and events
shared by users are stored in back-end data stores. Users send re-
quests, such as sharing new events or receiving updates on their
event stream, to the social networking system through their browsers
or mobile apps.
A large social network with a very large number of active users
generates a massive workload. To handle this query workload and
optimize performance, the system uses materialized views. Views
are typically formed on a per-user basis, since each user sees a
different event stream. Views can contain events from a user’s
contacts and from the user itself. Our discussion is independent
of the implementation of the data stores; they could be relational
databases, key-value stores, or other data stores.
The throughput of the system is proportional to the data trans-
ferred to and from the data stores; therefore, increasing the data-
store throughput is a key problem in social networking systems.²
In this paper, we propose optimization algorithms to reduce the
load induced on data stores—the thick red arrows in Figure 1. Our
algorithms make it possible to run the application using fewer data-
store servers or, equivalently, to increase throughput with the same
number of data-store servers.
Commercial social networking systems already use strategies to
send fewer requests to the data-store servers. A system can group
the views of the contacts of a user in two user-specific sets: the
push set, containing contact views that are updated by the data-
¹ http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
² http://www.facebook.com/note.php?note_id=39391378919

Figure 1: Simplified request flow for handling event streams in
a social networking system. We focus on reducing the through-
put cost of the most complex step: querying and updating data
stores (shown with thick red arrows).
store clients when the user shares a new event, and the pull set, con-
taining contact views that are queried to assemble the user’s event
stream. The collection of push and pull sets for each user of the sys-
tem is called the request schedule, and it has a strong impact on perfor-
mance. Two standard request schedules are push-all and pull-all.
In push-all schedules, the push set contains all of the user's contacts,
while the pull set contains only the user's own view. This schedule
is efficient in read-dominated workloads because each query gen-
erates only one request. Pull-all schedules are the mirror image, and are
better suited for write-dominated workloads. More efficient sched-
ules can be identified by using a hybrid approach between pull- and
push-all, as proposed by Silberstein et al. [11]: for each pair of con-
tacts, choose between push and pull depending on how frequently
the two contacts share events and request event streams. This ap-
proach has been adopted, for example, by Tumblr.
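The per-edge choice of the hybrid approach can be sketched as follows. This is only an illustration: the function, the rates, and the toy users are ours, not Silberstein et al.'s implementation. Pushing over an edge u → v costs the producer rate r_p(u) per unit time, pulling costs the consumer rate r_c(v), so the cheaper of the two is chosen for each edge.

```python
# A minimal sketch of the hybrid per-edge choice (illustrative names/rates).
# Pushing over u -> v costs r_p(u) per unit time; pulling costs r_c(v).

def hybrid_schedule(edges, r_p, r_c):
    """Pick, for every edge, the cheaper of push and pull."""
    push, pull = set(), set()
    for (u, v) in edges:
        if r_p[u] <= r_c[v]:   # producer shares rarely: cheaper to push
            push.add((u, v))
        else:                  # consumer reads rarely: cheaper to pull
            pull.add((u, v))
    return push, pull

# Toy example: Art shares often, Charlie rarely; Billie follows both.
edges = [("art", "billie"), ("charlie", "billie")]
r_p = {"art": 2.0, "charlie": 0.5}
r_c = {"billie": 1.0}
print(hybrid_schedule(edges, r_p, r_c))
```

In the toy example, Art's high production rate makes the edge to Billie a pull edge, while Charlie's low rate makes his edge a push edge.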
In this paper we propose strictly cheaper schedules based on so-
cial piggybacking: the main idea is to process the requests of two
contacts by querying and updating the view of a third common con-
tact. Consider the example shown in Figure 2. For generality, we
model a social graph as a directed graph where a user may follow
another user, but the follow relationship is not necessarily symmet-
ric. In the example, Charlie’s view is in Art’s push set, so clients
insert every new event by Art into Charlie’s view. Consider now
that Billie follows both Art and Charlie. When Billie requests an
event stream, social piggybacking lets clients serving this request
pull Art’s updates from Charlie’s view, and so Charlie’s view acts
as a hub. Our main observation is that the high clustering coeffi-
cient of social networks implies the presence of many hubs, making
hub-based schedules very efficient [10].
Social piggybacking generates fewer data-store requests than ap-
proaches based on push-all, pull-all, or hybrid schedules. With a
push-all schedule, the system pushes new events by Art to Billie’s
view—the dashed thick red arrow in Figure 2(b). With a pull-all
schedule, the system queries events from Art’s view whenever Bil-
lie requests a new event stream—the dashed double green arrow
in Figure 2(b). With a hybrid schedule, the system executes the
cheaper of these two operations. With social piggybacking, the
system does not execute any of them.
Using hubs in existing social networking architectures is very
simple: it just requires a careful configuration of push and pull sets.
In this paper, we tackle the problem of calculating this configura-
tion, or in other words, the request schedule. The objective is to
minimize the overall rate of requests sent to views. We call this
problem the social-dissemination problem.
Our contribution is a comprehensive study of the problem of
social-dissemination. We first show that optimal solutions of the
social-dissemination problem either use hubs (as Charlie in Fig-
ure 2) or, when efficient hubs are not available, make pairs of users
exchange events by sending requests to their view directly. This
result reduces significantly the space of solutions that need to be
explored, simplifying the analysis.
We show that computing optimal request schedules using hubs is
NP-hard, and we propose an approximation algorithm, which we
call CHITCHAT. The hardness of our problem comes from the set-
cover problem, and naturally, our approximation algorithm is based
on a greedy strategy and achieves an O(log n) guarantee. Apply-
ing the greedy strategy, however, is non-trivial, as the iterative step
of selecting the most cost-effective subset is itself an interesting op-
timization problem, which we solve by mapping it to the weighted
densest-subgraph problem.
We then develop a heuristic, named PARALLELNOSY, which can
be used for very large social networks. PARALLELNOSY does not
have the approximation guarantee of CHITCHAT, but it is a parallel
algorithm that can be implemented as a MapReduce job and thus
scales to real-size social graphs.
CHITCHAT and PARALLELNOSY assume that the graph is static;
however, using a simple incremental technique, request schedules
can be efficiently adapted when the social graph is modified. We
show that even if the social graph is dynamic, executing an initial
optimization pays off even after adding a large number of edges to
the graph, so it is not necessary to optimize the schedule frequently.
Evaluation on the full Twitter and Flickr graphs, which have bil-
lions of edges, shows that PARALLELNOSY schedules can improve
predicted throughput by a factor of up to 2 compared to the state-
of-the-art scheduling approach of Silberstein et al. [11].
Using a social networking system prototype, we show that the
actual throughput improvement using PARALLELNOSY schedules
compared to hybrid scheduling is significant and matches very well
our predicted improvement. In small systems with few servers the
throughput is similar, but the throughput improvement grows with
the size of the system, becoming particularly significant for large
social networking systems that use hundreds of servers to serve
millions, or even billions, of requests.³ With 500 servers, PARAL-
LELNOSY increases the throughput of the prototype by about 20%;
with 1000 servers, the increase is about 35%; eventually, as the
number of servers grows, the improvement approaches the predicted
2-factor increase previously discussed. In absolute terms, this may
mean processing millions of additional requests per second.
We also compare the performance of CHITCHAT and PARAL-
LELNOSY on large samples of the actual Twitter and Flickr graphs.
CHITCHAT significantly outperforms PARALLELNOSY, showing
that there is potential for further improvements by making more
complex social piggybacking algorithms scalable.
Overall, we make the following contributions:
- Introducing the concept of social piggybacking, formalizing the social-dissemination problem, and showing its NP-hardness;
- Presenting the CHITCHAT approximation algorithm and showing its O(log n) approximation bound;
- Presenting the PARALLELNOSY heuristic, which can be parallelized and scaled to very large graphs;
- Evaluating the predicted throughput of PARALLELNOSY schedules on the full Twitter and Flickr graphs;
- Measuring actual throughput on a social networking system prototype;
- Comparing CHITCHAT and PARALLELNOSY on samples of the Twitter and Flickr graphs to explore possible further gains.
³ For an example, see: http://gigaom.com/2011/04/07/facebook-this-is-what-webscale-looks-like/

Figure 2: Example of social piggybacking. Pushes are thick red
arrows, pulls double green ones. (a) The edge from Art to Bil-
lie can be served through Charlie if Art pushes to Charlie and
Billie pulls from Charlie. (b) Charlie’s view is a hub. Existing
approaches unnecessarily issue one of the dashed requests.
Roadmap. In Section 2 we discuss our model and present a formal
statement of the problem we consider. In Section 3 we present our
algorithms, which we evaluate in Section 4. We discuss the related
work in Section 5, and Section 6 concludes the work.
2. SOCIAL DISSEMINATION PROBLEM
We formalize the social-dissemination problem as a problem of
propagating events on a social graph. The goal is to efficiently
broadcast information from a user to its neighbors. Dissemination
must satisfy bounded staleness, a property modeling the require-
ment that event streams shall show events almost in real time. We
then show that the only request schedules satisfying bounded stal-
eness let each pair of users communicate either using direct push,
or direct pull, or social piggybacking. Finally, we analyze the com-
plexity of the social-dissemination problem and show that our re-
sults extend to more complex system models with active stores.
2.1 System model
We model the social graph as a directed graph G = (V, E). The
presence of an edge u → v in the social graph indicates that the
user v subscribes to the events produced by u. We will call u a
producer and v a consumer. Symmetric social relationships can be
modeled with two directed edges u → v and v → u.
A user can issue two types of requests: sharing an event, such as
a text message or a picture, and requesting an updated event stream,
a real-time list of recent events shared by the producers of the user.
For the purpose of our analysis, we do not distinguish between
nodes in the graph, the corresponding users, and their materialized
views. There is one view per user. A user view contains events
from the user itself and from the other users it subscribed to; send-
ing events to uninterested users results in unnecessary additional
throughput cost, which is the metric we want to minimize.
Definition 1 (View) A view is a set of events such that if an event
produced by user u is in the view of user v, then u = v or u → v ∈ E.
Event streams and views consist of a finite list of events, filtered
according to application-specific relevance criteria. Different filter-
ing criteria can be easily adapted in our framework; however, for
generality purposes, we do not explicitly consider filtering criteria
but instead assume that all necessary past events are stored in views
and returned by queries.
A fundamental requirement for any feasible solution is that event
streams have bounded staleness: each event stream assembled for a
user u must contain every recent event shared by any producer of
u; the only events that are allowed to be missing are those shared
at most Θ time units ago. The specific value of the parameter Θ
may depend on various system parameters, such as the speed of
networks, CPUs, and external-memories, but it may also be a func-
tion of the current load of the system. The underlying motivation
of bounded staleness is that typical social applications must present
near real-time event streams, but small delays may be acceptable.
Definition 2 (Bounded staleness) There exists a finite time bound Θ
such that, for each edge u → v ∈ E, any query action of v issued
at any time t in any execution returns every event posted by u in the
same execution at time t − Θ or before.
Note that the staleness of event streams is different from request
latency: a system might assemble event streams very quickly, but
they might contain very old events. Our work addresses the prob-
lem of request latency indirectly: improving throughput makes it
more likely to serve event streams with low latency.
In the system of Figure 2, the request schedule determines which
edges of the social graph are included in the push and pull sets of
any user. In our formal model, we consider two global pusH and
pulL sets, called H and L respectively, both subsets of the set of
edges E of the social graph. If a node u pushes events to a node
v in the model, this corresponds, in an actual system like the one
shown in Figure 2, to data-store clients updating the view of the
user v with all new events shared by user u whenever u shares them.
Similarly, if a node v pulls events from a node u, this corresponds
to data-store clients sending a query request to the view of the user
u whenever v requests its event stream. For simplicity, we assume
that users always access their own view with updates and queries.
Definition 3 (Request schedule) A request schedule is a pair
(H, L) of sets, with a push set H ⊆ E and a pull set L ⊆ E.
If v is in the push set of u, we say that u → v ∈ H. If u is in the
pull set of v, we say that u → v ∈ L.
It is important to note that all existing push-all, pull-all, and hy-
brid schedules described in Section 1 are sub-classes of the request
schedule class defined above.
The goal of social dissemination is to obtain a request schedule
that minimizes the throughput cost induced by a workload on a
social networking system. We characterize the throughput cost of a
workload as the overall rate of queries and updates it induces on data-
store servers. The workload is characterized by the production rate
r_p(u) and the consumption rate r_c(u) of each user u. These rates
indicate the average frequency with which users share new events
and request event streams, respectively. Given an edge u → v, the
cost incurred if u → v ∈ H is r_p(u), because every time u shares
a new event, an update is sent to the view of v; similarly, the cost
incurred if u → v ∈ L is r_c(v), because every event stream request
from v generates a query to the view of u.
The cost of the request schedule (H, L) is thus:

c(H, L) = Σ_{u→v ∈ H} r_p(u) + Σ_{u→v ∈ L} r_c(v).
This expression does not explicitly consider differences in the
cost of push and pull operations, modeling situations where the
messages generated by updates and queries are very small and have
similar cost. In order to model scenarios where the cost of a pull
operation is k times the cost of a push, independent of the specific
throughput metric we want to minimize (e.g., number of messages,
number of bytes transferred), it is sufficient to multiply all con-
sumption rates by a factor k. Similarly, multiplying all production
rates by a factor k models systems where a push is more expensive
than a pull. Note that the cost of updating and querying a user’s own
view is not represented in the cost metric because it is implicit.
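As an illustration, the cost c(H, L) just defined can be computed directly: each push edge u → v contributes r_p(u) and each pull edge contributes r_c(v). The rates and the tiny schedule below are made up for the example, following the Art/Charlie/Billie scenario of Figure 2.

```python
# Illustrative sketch of the schedule cost c(H, L): each push edge u -> v
# contributes r_p(u), each pull edge contributes r_c(v). Rates are made up.

def schedule_cost(H, L, r_p, r_c):
    return sum(r_p[u] for (u, v) in H) + sum(r_c[v] for (u, v) in L)

H = {("art", "charlie")}      # Art pushes to Charlie's view
L = {("charlie", "billie")}   # Billie pulls from Charlie's view
r_p = {"art": 2.0, "charlie": 0.5}
r_c = {"billie": 1.0}
print(schedule_cost(H, L, r_p, r_c))  # 2.0 + 1.0 = 3.0
```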
2.2 Problem definition
We now define the problem that we address in this paper.
Problem 1 (DISSEMINATION) Given a graph G = (V, E), and a
workload with production and consumption rates r_p(u) and r_c(u)
for each node u ∈ V , find a request schedule (H, L) that guaran-
tees bounded staleness, while minimizing the cost c(H, L).
In this paper, we propose solving the DISSEMINATION problem
using social piggybacking, that is, making two nodes communicate
through a third common contact, called hub. Social piggybacking
is formally defined as follows.
Definition 4 (Piggybacking) An edge u → v of a graph G = (V, E)
is covered by piggybacking if there exists a hub node w ∈ V such
that u → w ∈ E, w → v ∈ E, u → w ∈ H, and w → v ∈ L.
Let Δ be the upper bound on the time it takes for a system to
serve a user request. Piggybacking guarantees bounded staleness
with Θ = 2Δ. In fact, it turns out that admissible schedules trans-
mit events over a social graph edge u → v only by pushing to v,
pulling from u, or using social piggybacking over a hub.
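These three admissible cases can be sketched as a coverage check over edge sets. The graph, schedule, and helper function below are a toy illustration (not from the paper), with edges modeled as Python tuples.

```python
# Sketch of the three admissible ways to serve an edge u -> v: direct push,
# direct pull, or piggybacking through a hub w (Definition 4).
# The graph and schedule below are toy examples.

def is_covered(u, v, H, L, E):
    if (u, v) in H or (u, v) in L:          # served directly
        return True
    nodes = {x for e in E for x in e}
    return any((u, w) in E and (w, v) in E  # w is a common contact
               and (u, w) in H and (w, v) in L
               for w in nodes)

E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
H = {("art", "charlie")}                    # push: Art -> Charlie
L = {("charlie", "billie")}                 # pull: Billie <- Charlie
print(is_covered("art", "billie", H, L, E))  # True: Charlie acts as a hub
```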
Theorem 1 Let (H, L) be a request schedule that guarantees
bounded staleness on a social graph G = (V, E). Then for each
edge u → v ∈ E, it holds that either (i) u → v ∈ H, or (ii)
u → v ∈ L, or (iii) u → v is covered by piggybacking through a
hub w ∈ V .
PROOF. As we already discussed, all three operations satisfy the
guarantee of bounded-time delivery. We will now argue that they
are the only three such operations.
Assume that the edge u → v is not served directly, but via a
path p = u → w_1 → · · · → w_k → v. If the length of the
path p is 2, i.e., if k = 1, then simple enumeration of all cases for
paths of length 2 shows that social piggybacking is the only case
that satisfies bounded staleness in each execution. For example,
assume that both the edges u → w_1 and w_1 → v are push edges.
Then, delivery of an event requires that user w_1 will take some
action within a certain time bound. However, since the user w_1
may remain idle for an arbitrarily long time, we cannot guarantee
bounded staleness.
For longer paths a similar argument holds. In particular, for paths
such that k > 1, the information has to propagate along some
edge w_i → w_{i+1}. The information cannot propagate along the
edge w_i → w_{i+1} without one of the users w_i or w_{i+1} taking an ac-
tion, and clearly we can assume that there exist executions in which
both w_i and w_{i+1} remain idle after u has posted an event and before
the next query of v.
Even considering only the solution space restricted by Theo-
rem 1, Problem 1 is NP-hard. The proof, which uses a reduction
from the SETCOVER problem, is omitted due to lack of space.
Theorem 2 The DISSEMINATION problem is NP-hard.
So far we have considered systems where data-store servers react
only to client operations. We call data stores that only react to
user requests passive stores. Some data-store middleware enables
data-store servers to propagate information among each other too.
We generalize our result by considering a more general class of
systems called active stores, where request schedules include not
only push and pull sets, but also propagation sets, defined as follows:
Definition 5 (Propagation sets) Each edge w → u is associated
with a propagation set P_u(w) ⊆ V , which contains users who are
common subscribers of u and w. If the view of u stores for the first
time an event e produced by w, the data-store server pushes e to
the view of every user v ∈ P_u(w).
We restrict the propagation of events to their subscribers to guar-
antee that a view only contains events from friends of the corre-
sponding user. We only consider active policies where data stores
take actions synchronously, when they receive requests. Some data
stores can push events asynchronously and periodically: all up-
dates received over the same period are accumulated and consid-
ered as a single update. Such schedules can be modeled as syn-
chronous schedules having an upper bound on the production rates,
determined based on the accumulation period and the communica-
tion latency between servers. Longer accumulation periods reduce
throughput cost but also increase staleness, which can be problem-
atic for highly interactive social networking applications.
The only difference between active and passive schedules is that
the former can determine chains of pushes u → w_1 → · · · → w_k.
However, a chain of this form can be simulated in passive stores
by adding each edge u → w_i to H, resulting in lower or equal
latency and equal cost. This is formally shown by the following
equivalence result. The proof is omitted for lack of space.
Theorem 3 Any schedule of an active-propagation policy can be
simulated by a schedule of a passive-propagation policy with no
greater cost.
This result implies that we do not need to consider active propa-
gation in our analysis.
3. ALGORITHMS
This section introduces two algorithms to solve the DISSEMINA-
TION problem. We have shown that the problem is NP-hard, so
we propose an approximation algorithm, called CHITCHAT, and a
more scalable parallel heuristic, called PARALLELNOSY.
3.1 The CHITCHAT approximation algorithm
In this section we describe our approximation algorithm for the
DISSEMINATION problem, which we name CHITCHAT. Not sur-
prisingly, since the DISSEMINATION problem asks to find a sched-
ule that covers all the edges in the network, our algorithm is based
on the solution used for the SETCOVER problem.
For completeness we recall the SETCOVER problem: We are
given a ground set T and a collection C = {A_1, . . . , A_m} of sub-
sets of T, called candidates, such that ∪_i A_i = T. Each set A
in C is associated with a cost c(A). The goal is to select a sub-
collection S ⊆ C that covers all the elements in the ground set,
i.e., ∪_{A∈S} A = T, such that the total cost Σ_{A∈S} c(A) of the sets in the
collection S is minimized.
Figure 3: A hub-graph used in the mapping of DISSEMINATION
to SETCOVER problem. Solid edges must be served with a push
(if they point to w) or a pull (if they point from w). Dashed
edges are covered indirectly.

For the SETCOVER problem, the following simple greedy algo-
rithm is folklore [5]: Initialize S = ∅ to keep the iteratively grow-
ing solution, and Z = T to keep the uncovered elements of T.
Then, as long as Z is not empty, select the set A ∈ C that mini-
mizes the cost per uncovered element c(A)/|A ∩ Z|, add the set A to the
solution (S ← S ∪ {A}) and update the set of uncovered ele-
ments (Z ← Z \ A). It can be shown [5] that this greedy algorithm
achieves a solution with approximation guarantee O(log Δ), where
Δ = max{|A|} is the size of the largest set in the collection C. At
the same time, this logarithmic guarantee is essentially the best one
can hope for, since Feige showed that the problem is not approx-
imable within (1 − o(1)) ln n, unless NP has quasi-polynomial
time algorithms [7].
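The generic greedy step just recalled can be sketched as follows. This is only a toy illustration of the folklore algorithm, not CHITCHAT itself (which, as discussed below, never materializes the candidate collection C).

```python
# Toy sketch of the folklore greedy set-cover algorithm: repeatedly pick
# the candidate with minimum cost per newly covered element.

def greedy_set_cover(ground, candidates, cost):
    uncovered, solution = set(ground), []
    while uncovered:
        best = min((A for A in candidates if A & uncovered),
                   key=lambda A: cost[A] / len(A & uncovered))
        solution.append(best)
        uncovered -= best
    return solution

T = frozenset(range(5))
C = [frozenset({0, 1, 2}), frozenset({2, 3}),
     frozenset({3, 4}), frozenset({0, 4})]
cost = {A: 1.0 for A in C}   # unit costs for the example
print(greedy_set_cover(T, C, cost))
```

With unit costs, the greedy step simply maximizes the number of newly covered elements per selected set.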
The goal of our SETCOVER variant is to identify request sched-
ules that optimize the DISSEMINATION problem. The ground set
to be covered consists of all edges in the social graph. The solution
space we identified in Section 2 indicates that the collection C con-
tains two kinds of subsets: edges that are served directly, and edges
that are served through a hub. Serving an edge u → v ∈ E directly
through a push or a pull corresponds to covering it using a singleton
subset {u → v} ∈ C. The algorithm chooses between push and
pull according to the hybrid strategy of Silberstein et al. [11]. A
hub like the one of Figure 2(a) is a subset that covers three edges
using a push and a pull; the third edge is served indirectly. Every
time the algorithm selects a candidate from C, it adds the required
push and pull edges to the solution, the request schedule (H, L).
A straightforward application of the greedy algorithm described
above has exponential time complexity. The iterative step of the al-
gorithm must select a candidate from C, which has exponential car-
dinality because it contains all possible hubs. To our rescue comes a
well-known property about applying the greedy algorithm for solv-
ing the SETCOVER problem: a sufficient condition for applying the
greedy algorithm on SETCOVER is to have a polynomial-time or-
acle for selecting the set with the minimum cost-per-element. The
oracle can be invoked at every iterative step in order to find an (ap-
proximate) solution of the SETCOVER problem without materializ-
ing all elements of C. This makes the cardinality of C irrelevant.
The algorithmic challenge of CHITCHAT is finding a polynomial
time oracle for the DISSEMINATION problem. One key idea of
CHITCHAT is to split the oracle problem into two sub-problems, both
to be solved in polynomial time.
The first sub-problem is adding to C, for each node w, the hub-
graph centered on w that covers the largest number of edges for the
lowest cost. A hub-graph centered on w is a generalization of the
sub-graph of Figure 2(a), as depicted in Figure 3. It is a sub-graph
of the social graph where X is a set of nodes that w subscribes to, and
Y is a set of nodes that subscribe to w. We refer to such hub-graphs
using the notation G(X, w, Y ).
The second sub-problem is selecting the best candidate of C.
This is now simple since C contains a linear number of hub-graph
elements and a quadratic number of singleton edges. If a hub-graph
is selected, the edges from all nodes in X to w are set to be push,
and the edges from w to all nodes in Y are set to be pull. All edges
between nodes of X and Y are covered indirectly.
The first sub-problem, finding the hub-graph centered on a given
node that covers the most edges at the lowest cost, is an interesting
optimization problem in itself. In order to define the sub-problem,
we associate to each node $u$ of a hub-graph a weight $g(u)$ reflecting
the cost of $u$. We set $g(x) = r_p(x)$ for all $x \in X$; that is,
the cost of a push operation from $x$ to $w$ is associated to node $x$.
Similarly, we associate the weight $g(y) = r_c(y)$ to each $y \in Y$.
For the hub node $w$, we set $g(w) = 0$. Let $W$ and $E(W)$ be
the set of nodes and edges of the hub-graph, respectively, and let
$g(W) = \sum_{u \in W} g(u)$. The cost-per-element of the hub-graph is:

$$p(W) = \frac{g(W)}{|E(W)|}. \qquad (1)$$

The sub-problem can thus be formulated as finding, for each node
$w$ of the social graph, the hub-graph $(W, E(W))$ centered on $w$
that minimizes $p(W)$.
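As a toy illustration with hypothetical unit costs (not taken from the paper), consider a hub-graph $G(X, w, Y)$ with $X = \{x_1, x_2\}$ and $Y = \{y_1\}$, assuming every producer-consumer edge $x_i \to y_1$ is present in the social graph:

```latex
% Worked example with hypothetical unit costs r_p = r_c = 1.
\[
  |E(W)| = \underbrace{2}_{x_i \to w} + \underbrace{1}_{w \to y_1}
         + \underbrace{2}_{x_i \to y_1} = 5,
  \qquad
  g(W) = r_p(x_1) + r_p(x_2) + r_c(y_1) + \underbrace{0}_{g(w)} = 3,
\]
\[
  p(W) = \frac{g(W)}{|E(W)|} = \frac{3}{5}.
\]
```

Note that the two cross edges $x_i \to y_1$ contribute to $|E(W)|$ but not to $g(W)$, since they are covered indirectly through the hub; this is exactly why dense hub-graphs achieve a low cost-per-element.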
Careful inspection of Equation (1) motivates us to consider the
following problem.
Problem 2 (DENSESTSUBGRAPH) Let $G = (V, E)$ be a graph.
For a set $S \subseteq V$, let $E(S)$ denote the set of edges of $G$ between
nodes of $S$. The DENSESTSUBGRAPH problem asks to find the
subset $S$ that maximizes the density function $d(S) = \frac{|E(S)|}{|S|}$.
If we weight the nodes of $S$ using the $g$ function defined above,
we obtain a weighted variant of this problem by replacing the
density function $d(S)$ with $d_w(S) = |E(S)|/g(S)$.
Let $G_w$ be the largest hub-graph centered on a node $w$, the one
where $X$ and $Y$ include all producers and consumers of $w$, respectively.
Any subgraph $(S, E(S))$ of $G_w$ that maximizes $d_w(S)$ minimizes
$p(S)$. Therefore, any solution of the weighted version of
DENSESTSUBGRAPH gives us the hub-graph centered on $w$ to
be included in C.
Interestingly, although many variants of dense-subgraph problems
are NP-hard, Problem 2 can be solved exactly in polynomial
time. Given that we are looking for a solution of the SETCOVER
problem with a logarithmic approximation factor, we settle for the
simple greedy algorithm analyzed by Asahiro et al. [1] and later
by Charikar [3]. This algorithm gives a factor-2 approximation for
Problem 2, and its running time is linear in the number of edges
of the graph. The algorithm is the following. Start with the whole
graph. Until left with an empty graph, iteratively remove the node
with the lowest degree (breaking ties arbitrarily), together with all its
incident edges. Among all subgraphs considered during the execution
of the algorithm, return the one with maximum density.
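The peeling algorithm just described can be sketched as follows; the adjacency-set representation is an assumption for illustration.

```python
# Sketch of the greedy peeling algorithm of Asahiro et al. / Charikar
# for DENSESTSUBGRAPH (factor-2 approximation). The adjacency-set
# representation is an illustrative assumption.

def densest_subgraph(adj):
    """adj: dict mapping node -> set of neighbours (undirected graph).
    Returns the node set of the densest subgraph found by peeling."""
    adj = {u: set(vs) for u, vs in adj.items()}       # work on a copy
    edges = sum(len(vs) for vs in adj.values()) // 2  # each edge counted twice
    best_set, best_density = set(adj), edges / max(len(adj), 1)
    while adj:
        u = min(adj, key=lambda v: len(adj[v]))       # lowest-degree node
        for v in adj[u]:                              # drop incident edges
            adj[v].discard(u)
        edges -= len(adj[u])
        del adj[u]
        if adj and edges / len(adj) > best_density:   # track densest prefix
            best_density = edges / len(adj)
            best_set = set(adj)
    return best_set
```

A naive implementation like this one re-scans for the minimum-degree node in each round; with degree buckets the whole peeling runs in time linear in the number of edges, matching the bound cited in the text.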
The above algorithm works when the density of a subgraph
is $d(S)$. In our case we want to maximize the weighted
density function $d_w(S)$. We thus modify the greedy algorithm of
Asahiro et al. and Charikar as follows. In each iteration, instead of
deleting the node with the lowest degree, we delete the node that
minimizes a notion of weighted degree, defined as $d_g(u) = \frac{d(u)}{g(u)}$,
where $d(u)$ is the ordinary degree of node $u$. We can show
that this modified algorithm yields a factor-2 approximation for the
weighted version of the DENSESTSUBGRAPH problem.
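A minimal sketch of this weighted variant, under the same illustrative adjacency-set representation, with a node-weight map `g`. Since the hub node $w$ has $g(w) = 0$, its weighted degree is treated as infinite, so it is peeled last and remains in every candidate subgraph.

```python
# Sketch of the weighted peeling variant: each round removes the node
# minimizing d(u)/g(u). Representation is an illustrative assumption.

def weighted_densest_subgraph(adj, g):
    """adj: dict node -> set of neighbours; g: dict node -> weight >= 0.
    Returns the node set maximizing |E(S)|/g(S) found by peeling."""
    adj = {u: set(vs) for u, vs in adj.items()}
    edges = sum(len(vs) for vs in adj.values()) // 2

    def gsum(nodes):
        return sum(g[u] for u in nodes)

    best_set, best = set(adj), edges / max(gsum(adj), 1e-12)
    while adj:
        # Weighted degree d(u)/g(u); nodes with g(u) = 0 (the hub) are
        # treated as infinitely heavy and peeled last.
        u = min(adj, key=lambda v: len(adj[v]) / g[v] if g[v] > 0
                else float('inf'))
        for v in adj[u]:
            adj[v].discard(u)
        edges -= len(adj[u])
        del adj[u]
        if adj and edges / max(gsum(adj), 1e-12) > best:
            best = edges / gsum(adj)
            best_set = set(adj)
    return best_set
```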
Lemma 1 Given a graph $G_w = (S, E(S))$, there exists a linear-time
algorithm solving the weighted variant of the DENSESTSUBGRAPH
problem within an approximation factor of 2.
PROOF. We prove the lemma by modifying the analysis of Charikar [3].
Let $f(S) = \frac{|E(S)|}{g(S)}$ be the objective function to optimize,