Journal ArticleDOI

Piggybacking on social networks

01 Apr 2013 - Proceedings of the VLDB Endowment, Vol. 6, Iss. 6, pp. 409-420
TL;DR: It is shown that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem, and an O(log n) approximation algorithm and a heuristic are proposed to solve the problem.
Abstract: The popularity of social-networking sites has increased rapidly over the last decade. A basic functionality of social-networking sites is to present users with streams of events shared by their friends. At a systems level, materialized per-user views are a common way to assemble and deliver such event streams on-line and with low latency. Access to the data stores, which keep the user views, is a major bottleneck of social-networking systems. We propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. By using one such hub view, the system can serve requests of the first friend without querying or updating the view of the second. We show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. We propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. Compared to existing approaches, using social piggybacking results in similar throughput in systems with few servers, but enables substantial throughput improvements as the size of the system grows, reaching up to a 2-factor increase. We also evaluate our algorithms on a real social networking system prototype and we show that the actual increase in throughput corresponds nicely to the gain anticipated by our cost function.

Summary (3 min read)

1. INTRODUCTION

  • Social networking sites have become highly popular in the past few years.
  • To put their work in context and to motivate their problem definition, the authors describe the typical architecture of social networking systems, and they discuss the process of assembling event streams.
  • The collection of push and pull sets for each user of the system is called the request schedule, and it has a strong impact on performance.
  • CHITCHAT and PARALLELNOSY assume that the graph is static; however, using a simple incremental technique, request schedules can be efficiently adapted when the social graph is modified.

2. SOCIAL DISSEMINATION PROBLEM

  • Dissemination must satisfy bounded staleness, a property modeling the requirement that event streams shall show events almost in real time.
  • The authors then show that the only request schedules satisfying bounded staleness let each pair of users communicate either using direct push, or direct pull, or social piggybacking.
  • Finally, the authors analyze the complexity of the social-dissemination problem and show that their results extend to more complex system models with active stores.

2.1 System model

  • For the purpose of their analysis, the authors do not distinguish between nodes in the graph, the corresponding users, and their materialized views.
  • Event streams and views consist of a finite list of events, filtered according to application-specific relevance criteria.
  • In the system of Figure 2, the request schedule determines which edges of the social graph are included in the push and pull sets of any user.
  • The workload is characterized by the production rate r_p(u) and the consumption rate r_c(u) of each user u.
  • Note that the cost of updating and querying a user’s own view is not represented in the cost metric because it is implicit.

2.2 Problem definition

  • The authors now define the problem that they address in this paper.
  • The authors propose solving the DISSEMINATION problem using social piggybacking, that is, making two nodes communicate through a third common contact, called hub.
  • Since the user w_1 may remain idle for an arbitrarily long time, one cannot guarantee bounded staleness.
  • The authors call data stores that only react to user requests passive stores.
  • This is formally shown by the following equivalence result.

3. ALGORITHMS

  • This section introduces two algorithms to solve the DISSEMINATION problem.
  • Every time the algorithm selects a candidate from C, it adds the required push and pull edges to the solution, the request schedule (H,L).
  • Given that the authors are looking for a solution of the SETCOVER problem with a logarithmic approximation factor, they settle for the simple greedy algorithm analyzed by Asahiro et al. [1] and later by Charikar [3].
  • The authors can show that this modified algorithm yields a factor-2 approximation for the weighted version of the DENSESTSUBGRAPH problem.
  • In the edge locking phase, each candidate hub-graph tries to lock its edges.
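A toy, single-machine sketch of the lock-arbitration idea behind that edge-locking phase is shown below: candidate hub-graphs request locks on the edges they want to cover, and each contested edge is granted to a single candidate. The scoring rule and all names are illustrative assumptions, not the paper's actual MapReduce implementation.

```python
from collections import defaultdict

def edge_locking(candidates):
    """candidates: dict hub_id -> (score, set of directed edges (u, v))."""
    # "Map" step: every candidate hub-graph requests a lock on each of its edges.
    requests = defaultdict(list)               # edge -> [(score, hub_id), ...]
    for hub_id, (score, edges) in candidates.items():
        for edge in edges:
            requests[edge].append((score, hub_id))

    # "Reduce" step: each edge grants its lock to exactly one candidate
    # (here: the highest score, ties broken by id).
    granted = {edge: max(reqs)[1] for edge, reqs in requests.items()}

    # A candidate is applied only if it obtained every lock it asked for.
    return [hub_id for hub_id, (_, edges) in candidates.items()
            if all(granted[e] == hub_id for e in edges)]

if __name__ == "__main__":
    cands = {
        "hub_w": (3.0, {("a", "w"), ("w", "b"), ("a", "b")}),
        "hub_z": (1.5, {("a", "z"), ("z", "b"), ("a", "b")}),  # conflicts on (a, b)
    }
    print(edge_locking(cands))   # ['hub_w']
```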

3.3 Incremental updates

  • PARALLELNOSY and CHITCHAT optimize a static social graph.
  • Over time, graph updates gradually degrade the quality of the dissemination schedule, so their algorithms can be executed periodically to re-optimize cost.
  • The experimental evaluation of Section 4 indicates that their algorithm does not need to be re-executed frequently.
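The summary does not spell out the incremental technique, so the following is only one plausible rule, consistent with the request-schedule model used in the paper but not necessarily the authors' method: a newly added edge is left alone if an existing hub already covers it, and is otherwise served directly by the cheaper of push and pull. All names are hypothetical.

```python
def add_edge(u, v, H, L, out_neighbors, rp, rc):
    """Incrementally handle a new social edge u -> v.

    H, L: sets of (src, dst) push/pull edges; out_neighbors: adjacency sets;
    rp, rc: callables returning production/consumption rates.
    """
    out_neighbors.setdefault(u, set()).add(v)

    # Already covered by piggybacking: u pushes to some hub w, and v pulls from w.
    for w in out_neighbors.get(u, ()):
        if (u, w) in H and (w, v) in L:
            return H, L

    # Otherwise fall back to the cheaper direct option (the hybrid rule).
    (H if rp(u) <= rc(v) else L).add((u, v))
    return H, L

if __name__ == "__main__":
    H, L = {("art", "charlie")}, {("charlie", "billie")}
    adj = {"art": {"charlie"}, "charlie": {"billie"}}
    # Art -> Billie is already covered through the hub Charlie: nothing to add.
    print(add_edge("art", "billie", H, L, adj, lambda u: 1.0, lambda v: 2.0))
```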

4. EVALUATION

  • The authors evaluate the throughput performance of the proposed algorithm, contrasting it against the best available scheduling algorithm, the hybrid policy of Silberstein et al. [11].
  • The authors' evaluation is both analytical, considering their cost metric of Section 2.1, and experimental, using measurements on a social networking system prototype.
  • The authors show that the PARALLELNOSY heuristic scales to real-world social graphs and doubles the throughput of social networking systems compared to hybrid schedules.
  • On a real prototype, PARALLELNOSY provides throughput similar to hybrid schedules when the system is composed of few servers; as the system grows, the throughput improvement becomes more evident, approaching the 2-factor analytical improvement.
  • The authors also evaluate the relative performance of the two proposed algorithms PARALLELNOSY and CHITCHAT.

4.1 Input data

  • The authors obtain datasets from two social graphs: flickr, as of April 2008, and twitter, as of August 2009.
  • The twitter graph has been made available by Cha et al. [2].
  • The authors' algorithms also require input workloads: production and consumption rates for all the nodes in the network.
  • It has been observed by Huberman et al. that nodes with many followers tend to have a higher production rate, and nodes following many other nodes tend to have a higher consumption rate [8].

4.2 Social piggybacking on large social graphs

  • The authors run their MapReduce implementation of the PARALLELNOSY heuristic on the full twitter and flickr graphs.
  • As discussed in Section 3.2, very large social graphs may contain millions of cross-edges for a single hub-graph.
  • The authors quantify the performance of their algorithms by measuring their throughput compared against a baseline.
  • For both social graphs, the throughput of the PARALLELNOSY schedule increases sharply during the first iterations and it quickly stabilizes.
  • The larger stabilization time for twitter is due to the incremental detection of cross-edges at every cycle, as discussed before.

4.3 Prototype performance

  • In the previous section the authors evaluated their algorithms in terms of the predicted cost function that the algorithms optimize.
  • When processing a user query, application servers send at most one query per data-store server s, which replies with a list of events filtered from all relevant views stored by s; the authors call this batching (see the sketch after this list).
  • Using data partitioning information as input of the DISSEMINATION problem is attractive, but has two main drawbacks.
  • The authors found that, if the network does not become a bottleneck, the overall throughput using n clients and n servers is about n times the per-client throughput with n servers.
  • Note that, since the y axis is logarithmic, the divergence between the algorithms and the error bars on the right side of the graph are magnified.
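The batching mentioned in the list above can be illustrated with a small sketch: the views in a user's pull set are grouped by the data-store server that hosts them, and at most one (batched) query is sent per server. `server_of` and `query_server` are hypothetical placeholders, not the prototype's API.

```python
from collections import defaultdict

def assemble_stream(pull_views, server_of, query_server):
    """Return the events for one stream request using one query per server."""
    batches = defaultdict(list)                 # server -> views hosted there
    for view in pull_views:
        batches[server_of(view)].append(view)

    events = []
    for server, views in batches.items():       # at most one request per server
        events.extend(query_server(server, views))
    return sorted(events, reverse=True)         # e.g. newest timestamp first

if __name__ == "__main__":
    store = {"s1": {"alice": [3, 1]}, "s2": {"bob": [2]}}
    print(assemble_stream(
        ["alice", "bob"],
        server_of=lambda v: "s1" if v == "alice" else "s2",
        query_server=lambda s, vs: [e for v in vs for e in store[s][v]],
    ))   # [3, 2, 1]
```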

4.4 The potential of social piggybacking

  • The previous experiments show that PARALLELNOSY is an effective heuristic for real-world large-scale social networking systems.
  • In the experiments discussed below the authors use five graph samples; the plots report averages.
  • As for random-walk sampling, existing work has pointed out that it preserves certain clustering metrics; more precisely, in both the original and sampled graphs, nodes with the same degree have a similar ratio of actual to potential edges between their neighbors [9].
  • This reduces the relative gain of social piggybacking since the hybrid schedule of Silberstein et al. (our baseline) uses per-edge optimizations that do not depend on the degree of nodes.

6. CONCLUSION

  • Assembling and delivering event streams is a major feature of social networking systems and imposes a heavy load on back-end data stores.
  • The authors proposed two algorithms to compute request schedules that leverage social piggybacking.
  • The CHITCHAT algorithm is an approximation algorithm that uses a novel combination of the SETCOVER and DENSESTSUBGRAPH problems, and it has an approximation factor of O(ln n).
  • The PARALLELNOSY heuristic is a parallel algorithm that can scale to large social graphs.
  • In small systems, the authors obtained throughput similar to existing hybrid approaches, but as the size of the system grows beyond a few hundred servers, the throughput gain becomes significant, approaching the 2-factor limit.


Piggybacking on Social Networks
Aristides Gionis
Aalto University and HIIT
Espoo, Finland
aristides.gionis@aalto.fi
Flavio Junqueira
Microsoft Research
Cambridge, UK
fpj@microsoft.com
Vincent Leroy
Univ. of Grenoble CNRS
Grenoble, France
vincent.leroy@imag.fr
Marco Serafini
QCRI
Doha, Qatar
mserafini@qf.org.qa
Ingmar Weber
QCRI
Doha, Qatar
ingmarweber@acm.org
ABSTRACT
The popularity of social-networking sites has increased rapidly over
the last decade. A basic functionality of social-networking sites is
to present users with streams of events shared by their friends. At a
systems level, materialized per-user views are a common way to as-
semble and deliver such event streams on-line and with low latency.
Access to the data stores, which keep the user views, is a major bot-
tleneck of social-networking systems. We propose to improve the
throughput of these systems by using social piggybacking, which
consists of processing the requests of two friends by querying and
updating the view of a third common friend. By using one such
hub view, the system can serve requests of the first friend with-
out querying or updating the view of the second. We show that,
given a social graph, social piggybacking can minimize the overall
number of requests, but computing the optimal set of hubs is an
NP-hard problem. We propose an O(log n) approximation algo-
rithm and a heuristic to solve the problem, and evaluate them using
the full Twitter and Flickr social graphs, which have up to billions
of edges. Compared to existing approaches, using social piggy-
backing results in similar throughput in systems with few servers,
but enables substantial throughput improvements as the size of the
system grows, reaching up to a 2-factor increase. We also evaluate
our algorithms on a real social networking system prototype and
we show that the actual increase in throughput corresponds nicely
to the gain anticipated by our cost function.
1. INTRODUCTION
Social networking sites have become highly popular in the past
few years. An increasing number of people use social network-
ing applications as a primary medium of finding new and inter-
esting information. Some of the most popular social networking
applications include services like Facebook, Twitter, Tumblr or Ya-
hoo! News Activity. In these applications, users establish connec-
tions with other users and share events: short text messages, URLs,
photos, news stories, videos, and so on. Users can browse event
streams, real-time lists of recent events shared by their contacts,
on most social networking sites. A key peculiarity of social net-
working applications compared to traditional Web sites is that the
process of information dissemination is taking place in a many-
to-many fashion instead of the traditional few-to-many paradigm,
posing new system scalability challenges.
In this paper, we study the problem of assembling event streams,
which is the predominant workload of many social networking ap-
plications, e.g., 70% of the page views of Tumblr.^1 Assembling
of event streams needs to be on-line, to include the latest events for
every user, and very fast, as users expect the resulting event streams
to load in fractions of a second.
To put our work in context and to motivate our problem def-
inition, we describe the typical architecture of social networking
systems, and we discuss the process of assembling event streams.
We consider a system similar to the one depicted in Figure 1. In
such a system, information about users, the social graph, and events
shared by users are stored in back-end data stores. Users send re-
quests, such as sharing new events or receiving updates on their
event stream, to the social networking system through their browsers
or mobile apps.
A large social network with a very large number of active users
generates a massive workload. To handle this query workload and
optimize performance, the system uses materialized views. Views
are typically formed on a per-user basis, since each user sees a
different event stream. Views can contain events from a user’s
contacts and from the user itself. Our discussion is independent
of the implementation of the data stores; they could be relational
databases, key-value stores, or other data stores.
The throughput of the system is proportional to the data trans-
ferred to and from the data stores; therefore, increasing the data-
store throughput is a key problem in social networking systems.^2
In this paper, we propose optimization algorithms to reduce the
load induced on data stores—the thick red arrows in Figure 1. Our
algorithms make it possible to run the application using fewer data-
store servers or, equivalently, to increase throughput with the same
number of data-store servers.
Commercial social networking systems already use strategies to
send fewer requests to the data-store servers. A system can group
the views of the contacts of a user in two user-specific sets: the
push set, containing contact views that are updated by the data-
^1 http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
^2 http://www.facebook.com/note.php?note id=39391378919

Figure 1: Simplified request flow for handling event streams in
a social networking system. We focus on reducing the through-
put cost of the most complex step: querying and updating data
stores (shown with thick red arrows).
store clients when the user shares a new event, and the pull set, con-
taining contact views that are queried to assemble the user’s event
stream. The collection of push and pull sets for each user of the sys-
tem is called request schedule, and it has strong impact on perfor-
mance. Two standard request schedules are push-all and pull-all.
In push-all schedules, the push set contains all of the user's contacts,
while the pull set contains only the user’s own view. This schedule
is efficient in read-dominated workloads because each query gen-
erates only one request. Pull-all schedules are the mirror image, and are
better suited for write-dominated workloads. More efficient sched-
ules can be identified by using a hybrid approach between pull- and
push-all, as proposed by Silberstein et al. [11]: for each pair of con-
tacts, choose between push and pull depending on how frequently
the two contacts share events and request event streams. This ap-
proach has been adopted, for example, by Tumblr.
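As a concrete illustration of the per-edge hybrid rule, the sketch below chooses, for each edge u → v, the cheaper of a push (which costs the producer rate r_p(u) per shared event) and a pull (which costs the consumer rate r_c(v) per stream request). The rates are made-up values, not measurements from the paper.

```python
def hybrid_schedule(edges, rp, rc):
    """Return a (push set H, pull set L) pair using the per-edge hybrid rule."""
    H, L = set(), set()
    for u, v in edges:
        if rp[u] <= rc[v]:
            H.add((u, v))           # cheaper to push u's events into v's view
        else:
            L.add((u, v))           # cheaper to pull from u's view on demand
    return H, L

if __name__ == "__main__":
    edges = [("art", "billie"), ("celebrity", "billie")]
    rp = {"art": 1.0, "celebrity": 50.0}      # events shared per hour (illustrative)
    rc = {"billie": 10.0}                     # stream requests per hour (illustrative)
    print(hybrid_schedule(edges, rp, rc))
    # ({('art', 'billie')}, {('celebrity', 'billie')})
```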
In this paper we propose strictly cheaper schedules based on so-
cial piggybacking: the main idea is to process the requests of two
contacts by querying and updating the view of a third common con-
tact. Consider the example shown in Figure 2. For generality, we
model a social graph as a directed graph where a user may follow
another user, but the follow relationship is not necessarily symmet-
ric. In the example, Charlie’s view is in Art’s push set, so clients
insert every new event by Art into Charlie’s view. Consider now
that Billie follows both Art and Charlie. When Billie requests an
event stream, social piggybacking lets clients serving this request
pull Art’s updates from Charlie’s view, and so Charlie’s view acts
as a hub. Our main observation is that the high clustering coeffi-
cient of social networks implies the presence of many hubs, making
hub-based schedules very efficient [10].
Social piggybacking generates fewer data-store requests than ap-
proaches based on push-all, pull-all, or hybrid schedules. With a
push-all schedule, the system pushes new events by Art to Billie’s
view—the dashed thick red arrow in Figure 2(b). With a pull-all
schedule, the system queries events from Art’s view whenever Bil-
lie requests a new event stream—the dashed double green arrow
in Figure 2(b). With a hybrid schedule, the system executes the
cheaper of these two operations. With social piggybacking, the
system does not execute any of them.
Using hubs in existing social networking architectures is very
simple: it just requires a careful configuration of push and pull sets.
In this paper, we tackle the problem of calculating this configura-
tion, or in other words, the request schedule. The objective is to
minimize the overall rate of requests sent to views. We call this
problem the social-dissemination problem.
Our contribution is a comprehensive study of the problem of
social-dissemination. We first show that optimal solutions of the
social-dissemination problem either use hubs (as Charlie in Fig-
ure 2) or, when efficient hubs are not available, make pairs of users
exchange events by sending requests to their view directly. This
result reduces significantly the space of solutions that need to be
explored, simplifying the analysis.
We show that computing optimal request schedules using hubs is
NP-hard, and we propose an approximation algorithm, which we
call CHITCHAT. The hardness of our problem comes from the set-
cover problem, and naturally, our approximation algorithm is based
on a greedy strategy and achieves an O(log n) guarantee. Apply-
ing the greedy strategy, however, is non-trivial, as the iterative step
of selecting the most cost-effective subset is itself an interesting op-
timization problem, which we solve by mapping it to the weighted
densest-subgraph problem.
We then develop a heuristic, named PARALLELNOSY, which can
be used for very large social networks. PARALLELNOSY does not
have the approximation guarantee of CHITCHAT, but it is a parallel
algorithm that can be implemented as a MapReduce job and thus
scales to real-size social graphs.
CHITCHAT and PARALLELNOSY assume that the graph is static;
however, using a simple incremental technique, request schedules
can be efficiently adapted when the social graph is modified. We
show that even if the social graph is dynamic, executing an initial
optimization pays off even after adding a large number of edges to
the graph, so it is not necessary to optimize the schedule frequently.
Evaluation on the full Twitter and Flickr graphs, which have bil-
lions of edges, shows that PARALLELNOSY schedules can improve
predicted throughput by a factor of up to 2 compared to the state-
of-the-art scheduling approach of Silberstein et al. [11].
Using a social networking system prototype, we show that the
actual throughput improvement using PARALLELNOSY schedules
compared to hybrid scheduling is significant and matches very well
our predicted improvement. In small systems with few servers the
throughput is similar, but the throughput improvement grows with
the size of the system, becoming particularly significant for large
social networking systems that use hundreds of servers to serve
millions, or even billions, of requests.^3 With 500 servers, PARAL-
LELNOSY increases the throughput of the prototype by about 20%;
with 1000 servers, the increase is about 35%; eventually, as the
number of servers grows, the improvement approaches the predicted
2-factor increase previously discussed. In absolute terms, this may
mean processing millions of additional requests per second.
We also compare the performance of CHITCHAT and PARAL-
LELNOSY on large samples of the actual Twitter and Flickr graphs.
CHITCHAT significantly outperforms PARALLELNOSY, showing
that there is potential for further improvements by making more
complex social piggybacking algorithms scalable.
Overall, we make the following contributions:
  • Introducing the concept of social piggybacking, formalizing the social dissemination problem, and showing its NP-hardness;
  • Presenting the CHITCHAT approximation algorithm and showing its O(log n) approximation bound;
  • Presenting the PARALLELNOSY heuristic, which can be parallelized and scaled to very large graphs;
  • Evaluating the predicted throughput of PARALLELNOSY schedules on full Twitter and Flickr graphs;
  • Measuring actual throughput on a social networking system prototype;
  • Comparing CHITCHAT and PARALLELNOSY on samples of the Twitter and Flickr graphs to explore possible further gains.
^3 For an example, see: http://gigaom.com/2011/04/07/facebook-this-is-what-webscale-looks-like/

Figure 2: Example of social piggybacking. Pushes are thick red
arrows, pulls double green ones. (a) The edge from Art to Bil-
lie can be served through Charlie if Art pushes to Charlie and
Billie pulls from Charlie. (b) Charlie’s view is a hub. Existing
approaches unnecessarily issue one of the dashed requests.
Roadmap. In Section 2 we discuss our model and present a formal
statement of the problem we consider. In Section 3 we present our
algorithms, which we evaluate in Section 4. We discuss the related
work in Section 5, and Section 6 concludes the work.
2. SOCIAL DISSEMINATION PROBLEM
We formalize the social-dissemination problem as a problem of
propagating events on a social graph. The goal is to efficiently
broadcast information from a user to its neighbors. Dissemination
must satisfy bounded staleness, a property modeling the require-
ment that event streams shall show events almost in real time. We
then show that the only request schedules satisfying bounded stal-
eness let each pair of users communicate either using direct push,
or direct pull, or social piggybacking. Finally, we analyze the com-
plexity of the social-dissemination problem and show that our re-
sults extend to more complex system models with active stores.
2.1 System model
We model the social graph as a directed graph G = (V, E). The
presence of an edge u → v in the social graph indicates that the
user v subscribes to the events produced by u. We will call u a
producer and v a consumer. Symmetric social relationships can be
modeled with two directed edges u → v and v → u.
A user can issue two types of requests: sharing an event, such as
a text message or a picture, and requesting an updated event stream,
a real-time list of recent events shared by the producers of the user.
For the purpose of our analysis, we do not distinguish between
nodes in the graph, the corresponding users, and their materialized
views. There is one view per user. A user view contains events
from the user itself and from the other users it subscribed to; send-
ing events to uninterested users results in unnecessary additional
throughput cost, which is the metric we want to minimize.
Definition 1 (View) A view is a set of events such that if an event
produced by user u is in the view of user v, then u = v or u → v ∈ E.
Event streams and views consist of a finite list of events, filtered
according to application-specific relevance criteria. Different filter-
ing criteria can be easily adapted in our framework; however, for
generality purposes, we do not explicitly consider filtering criteria
but instead assume that all necessary past events are stored in views
and returned by queries.
A fundamental requirement for any feasible solution is that event
streams have bounded staleness: each event stream assembled for a
user u must contain every recent event shared by any producers of
u; the only events that are allowed to be missing are those shared
at most Θ time units ago. The specific value of the parameter Θ
may depend on various system parameters, such as the speed of
networks, CPUs, and external-memories, but it may also be a func-
tion of the current load of the system. The underlying motivation
of bounded staleness is that typical social applications must present
near real-time event streams, but small delays may be acceptable.
Definition 2 (Bounded staleness) There exists a finite time bound Θ
such that, for each edge u → v ∈ E, any query action of v issued
at any time t in any execution returns every event posted by u in the
same execution at time t − Θ or before.
Note that the staleness of event streams is different from request
latency: a system might assemble event streams very quickly, but
they might contain very old events. Our work addresses the prob-
lem of request latency indirectly: improving throughput makes it
more likely to serve event streams with low latency.
In the system of Figure 2, the request schedule determines which
edges of the social graph are included in the push and pull sets of
any user. In our formal model, we consider two global pusH and
pulL sets, called H and L respectively, both subsets of the set of
edges E of the social graph. If a node u pushes events to a node
v in the model, this corresponds, in an actual system like the one
shown in Figure 2, to data-store clients updating the view of the
user v with all new events shared by user u whenever u shares them.
Similarly, if a node v pulls events from a node u, this corresponds
to data-store clients sending a query request to the view of the user
u whenever v requests its event stream. For simplicity, we assume
that users always access their own view with updates and queries.
Definition 3 (Request schedule) A request schedule is a pair
(H, L) of sets, with a push set H ⊆ E and a pull set L ⊆ E.
If v is in the push set of u, we say that u → v ∈ H. If u is in the
pull set of v, we say that u → v ∈ L.
It is important to note that all existing push-all, pull-all, and hy-
brid schedules described in Section 1 are sub-classes of the request
schedule class defined above.
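To illustrate that remark, the sketch below represents a request schedule simply as a pair of edge sets (H, L); push-all and pull-all then correspond to the two extreme assignments. This is a didactic representation only, not the data structure used in the paper.

```python
def push_all(E):
    """Push-all schedule: every social edge is a push edge."""
    return set(E), set()

def pull_all(E):
    """Pull-all schedule: every social edge is a pull edge."""
    return set(), set(E)

if __name__ == "__main__":
    E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
    print(push_all(E))   # (H = E, L = empty)
    print(pull_all(E))   # (H = empty, L = E)
```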
The goal of social dissemination is to obtain a request schedule
that minimizes the throughput cost induced by a workload on a
social networking system. We characterize the throughput cost of a
workload as the overall rate of queries and updates it induces on data-
store servers. The workload is characterized by the production rate
r_p(u) and the consumption rate r_c(u) of each user u. These rates
indicate the average frequency with which users share new events
and request event streams, respectively. Given an edge u → v, the
cost incurred if u → v ∈ H is r_p(u), because every time u shares
a new event, an update is sent to the view of v; similarly, the cost
incurred if u → v ∈ L is r_c(v), because every event stream request
from v generates a query to the view of u.
The cost of the request schedule (H, L) is thus:

c(H, L) = Σ_{u→v ∈ H} r_p(u) + Σ_{u→v ∈ L} r_c(v).
This expression does not explicitly consider differences in the
cost of push and pull operations, modeling situations where the
messages generated by updates and queries are very small and have
similar cost. In order to model scenarios where the cost of a pull
operation is k times the cost of a push, independent of the specific
throughput metric we want to minimize (e.g., number of messages,
number of bytes transferred), it is sufficient to multiply all con-
sumption rates by a factor k. Similarly, multiplying all production

rates by a factor k models systems where a push is more expensive
than a pull. Note that the cost of updating and querying a user’s own
view is not represented in the cost metric because it is implicit.
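A minimal sketch of the cost metric c(H, L) just defined, evaluated on the Art/Charlie/Billie scenario of Figure 2; the rates are illustrative numbers.

```python
def schedule_cost(H, L, rp, rc):
    """c(H, L): each push edge u -> v costs r_p(u), each pull edge costs r_c(v)."""
    return sum(rp[u] for u, v in H) + sum(rc[v] for u, v in L)

if __name__ == "__main__":
    rp = {"art": 2.0, "charlie": 1.0}          # production rates (illustrative)
    rc = {"billie": 3.0, "charlie": 3.0}       # consumption rates (illustrative)
    # Piggybacking over the hub Charlie: Art pushes to Charlie's view and
    # Billie pulls from it, so the edge Art -> Billie is covered at no extra cost.
    H = {("art", "charlie")}
    L = {("charlie", "billie")}
    print(schedule_cost(H, L, rp, rc))         # 2.0 + 3.0 = 5.0
```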
2.2 Problem definition
We now define the problem that we address in this paper.
Problem 1 (DISSEMINATION) Given a graph G = (V, E), and a
workload with production and consumption rates r_p(u) and r_c(u)
for each node u ∈ V, find a request schedule (H, L) that guaran-
tees bounded staleness, while minimizing the cost c(H, L).
In this paper, we propose solving the DISSEMINATION problem
using social piggybacking, that is, making two nodes communicate
through a third common contact, called hub. Social piggybacking
is formally defined as follows.
Definition 4 (Piggybacking) An edge u → v of a graph G(V, E)
is covered by piggybacking through a hub w ∈ V if there exists a
node w such that u → w ∈ E, w → v ∈ E, u → w ∈ H, and
w → v ∈ L.
Let ∆ be the upper bound on the time it takes for a system to
serve a user request. Piggybacking guarantees bounded staleness
with Θ = 2∆. In fact, it turns out that admissible schedules trans-
mit events over a social graph edge u → v only by pushing to v,
pulling from u, or using social piggybacking over a hub.
Theorem 1 Let (H, L) be a request schedule that guarantees
bounded staleness on a social graph G = (V, E). Then for each
edge u → v ∈ E, it holds that either (i) u → v ∈ H, or (ii)
u → v ∈ L, or (iii) u → v is covered by piggybacking through a
hub w ∈ V.
PROOF. As we already discussed, all three operations satisfy the
guarantee of bounded-time delivery. We will now argue that they
are the only three such operations.
Assume that the edge u → v is not served directly, but via a
path p = u → w_1 → . . . → w_k → v. If the length of the
path p is 2, i.e., if k = 1, then simple enumeration of all cases for
paths of length 2 shows that social piggybacking is the only case
that satisfies bounded staleness in each execution. For example,
assume that both the edges u → w_1 and w_1 → v are push edges.
Then, delivery of an event requires that user w_1 will take some
action within a certain time bound. However, since the user w_1
may remain idle for an arbitrarily long time, we cannot guarantee
bounded staleness.
For longer paths a similar argument holds. In particular, for paths
such that k > 1, the information has to propagate along some
edge w_i → w_{i+1}. The information cannot propagate along the
edge w_i → w_{i+1} without one of the users w_i or w_{i+1} taking an
action, and clearly we can assume that there exist executions in which
both w_i and w_{i+1} remain idle after u has posted an event and before
the next query of v.
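The admissibility condition of Theorem 1 can be checked mechanically; the sketch below verifies that every social edge is a direct push, a direct pull, or piggybacked through some hub. It is a didactic checker written for this summary, not part of the paper's algorithms.

```python
def satisfies_theorem1(E, H, L):
    """True iff every edge of E is covered as in Theorem 1 by the schedule (H, L)."""
    def covered(u, v):
        if (u, v) in H or (u, v) in L:
            return True
        hubs = {w for (a, w) in E if a == u}        # candidate hubs: followers of u
        return any((u, w) in H and (w, v) in L for w in hubs)
    return all(covered(u, v) for u, v in E)

if __name__ == "__main__":
    E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
    H = {("art", "charlie")}
    L = {("charlie", "billie")}
    print(satisfies_theorem1(E, H, L))   # True: Art -> Billie goes through Charlie
```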
Even considering only the solution space restricted by Theo-
rem 1, Problem 1 is NP-hard. The proof, which uses a reduction
from the SETCOVER problem, is omitted due to lack of space.
Theorem 2 The DISSEMINATION problem is NP-hard.
So far we have considered systems where data-store servers react
only to client operations. We can call data stores that only react to
user requests passive stores. Some data-store middleware enables
data-store servers to propagate information among each other too.
We generalize our result by considering a more general class of
systems called active stores, where request schedules do not only
include push and pull sets, but also propagation sets that are defined
as follows:
Definition 5 (Propagation sets) Each edge w → u is associated
with a propagation set P_u(w) ⊆ V, which contains users who are
common subscribers of u and w. If the view of u stores for the first
time an event e produced by w, the data-store server pushes e to
the view of every user v ∈ P_u(w).
We restrict the propagation of events to their subscribers to guar-
antee that a view only contains events from friends of the corre-
sponding user. We only consider active policies where data stores
take actions synchronously, when they receive requests. Some data
stores can push events asynchronously and periodically: all up-
dates received over the same period are accumulated and consid-
ered as a single update. Such schedules can be modeled as syn-
chronous schedules having an upper bound on the production rates,
determined based on the accumulation period and the communica-
tion latency between servers. Longer accumulation periods reduce
throughput cost but also increase staleness, which can be problem-
atic for highly interactive social networking applications.
The only difference between active and passive schedules is that
the former can determine chains of pushes u → w_1 → . . . → w_k.
However, a chain of this form can be simulated in passive stores
by adding each edge u → w_i to H, resulting in lower or equal
latency and equal cost. This is formally shown by the following
equivalence result. The proof is omitted for lack of space.
Theorem 3 Any schedule of an active-propagation policy can be
simulated by a schedule of a passive-propagation policy with no
greater cost.
This result implies that we do not need to consider active propa-
gation in our analysis.
3. ALGORITHMS
This section introduces two algorithms to solve the DISSEMINA-
TION problem. We have shown that the problem is NP-hard, so
we propose an approximation algorithm, called CHITCHAT, and a
more scalable parallel heuristic, called PARALLELNOSY.
3.1 The CHITCHAT approximation algorithm
In this section we describe our approximation algorithm for the
DISSEMINATION problem, which we name CHITCHAT. Not sur-
prisingly, since the DISSEMINATION problem asks to find a sched-
ule that covers all the edges in the network, our algorithm is based
on the solution used for the SETCOVER problem.
For completeness we recall the SETCOVER problem: We are
given a ground set T and a collection C = {A_1, . . . , A_m} of sub-
sets of T, called candidates, such that ∪_i A_i = T. Each set A
in C is associated with a cost c(A). The goal is to select a sub-
collection S ⊆ C that covers all the elements in the ground set,
i.e., ∪_{A ∈ S} A = T, and the total cost Σ_{A ∈ S} c(A) of the sets in the
collection S is minimized.
For the SETCOVER problem, the following simple greedy algo-
rithm is folklore [5]: Initialize S = ∅ to keep the iteratively grow-
ing solution, and Z = T to keep the uncovered elements of T.
Then as long as Z is not empty, select the set A ∈ C that mini-
mizes the cost per uncovered element c(A)/|A ∩ Z|, add the set A to
the solution (S ← S ∪ {A}) and update the set of uncovered
elements (Z ← Z \ A). It can be shown [5] that this greedy algorithm
achieves a solution with approximation guarantee O(log ∆), where
∆ = max{|A|} is the size of the largest set in the collection C. At
the same time, this logarithmic guarantee is essentially the best one
can hope for, since Feige showed that the problem is not approx-
imable within (1 − o(1)) ln n, unless NP has quasi-polynomial
time algorithms [7].

Figure 3: A hub-graph G(X, w, Y) used in the mapping of DISSEMINATION
to SETCOVER problem. Solid edges must be served with a push
(if they point to w) or a pull (if they point from w). Dashed
edges are covered indirectly.
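For reference, the folklore greedy procedure described above looks as follows in executable form: repeatedly pick the candidate with the smallest cost per still-uncovered element. In CHITCHAT the candidate family is never materialized; an oracle produces the best candidate at every step, but the outer loop has this shape. The toy instance at the bottom is made up.

```python
def greedy_set_cover(ground_set, candidates, cost):
    """Greedy SETCOVER; assumes the candidates jointly cover the ground set."""
    Z = set(ground_set)                  # still-uncovered elements
    S = []                               # chosen candidates
    while Z:
        best = min((A for A in candidates if A & Z),
                   key=lambda A: cost[A] / len(A & Z))
        S.append(best)
        Z -= best
    return S

if __name__ == "__main__":
    T = {1, 2, 3, 4}
    A1, A2, A3 = frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({4})
    print(greedy_set_cover(T, [A1, A2, A3], {A1: 3.0, A2: 1.0, A3: 0.5}))
    # [frozenset({3, 4}), frozenset({1, 2, 3})]
```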
The goal of our SETCOVER variant is to identify request sched-
ules that optimize the DISSEMINATION problem. The ground set
to be covered consists of all edges in the social graph. The solution
space we identified in Section 2 indicates that the collection C con-
tains two kinds of subsets: edges that are served directly, and edges
that are served through a hub. Serving an edge u → v ∈ E directly
through a push or a pull corresponds to covering it using a singleton
subset {u → v} ∈ C. The algorithm chooses between push and
pull according to the hybrid strategy of Silberstein et al. [11]. A
hub like the one of Figure 2(a) is a subset that covers three edges
using a push and a pull; the third edge is served indirectly. Every
time the algorithm selects a candidate from C, it adds the required
push and pull edges to the solution, the request schedule (H, L).
A straightforward application of the greedy algorithm described
above has exponential time complexity. The iterative step of the al-
gorithm must select a candidate from C, which has exponential car-
dinality because it contains all possible hubs. To our rescue comes a
well-known property about applying the greedy algorithm for solv-
ing the SETCOVER problem: a sufficient condition for applying the
greedy algorithm on SETCOVER is to have a polynomial-time or-
acle for selecting the set with the minimum cost-per-element. The
oracle can be invoked at every iterative step in order to find an (ap-
proximate) solution of the SETCOVER problem without materializ-
ing all elements of C. This makes the cardinality of C irrelevant.
The algorithmic challenge of CHITCHAT is finding a polynomial
time oracle for the DISSEMINATION problem. One key idea of
CHITCHAT is to split the oracle problem in two sub-problems, both
to be solved in polynomial time.
The first sub-problem is adding to C, for each node w, the hub-
graph centered on w that covers the largest number of edges for the
lowest cost. A hub-graph centered on w is a generalization of the
sub-graph of Figure 2(a), as depicted in Figure 3. It is a sub-graph
of the social graph where X is a set of nodes that w subscribes to, and
Y is a set of nodes that subscribe to w. We refer to such hub-graphs
using the notation G(X, w, Y ).
The second sub-problem is selecting the best candidate of C.
This is now simple since C contains a linear number of hub-graph
elements and a quadratic number of singleton edges. If a hub-graph
is selected, the edges from all nodes in X to w are set to be push,
and the edges from w to all nodes in Y are set to be pull. All edges
between nodes of X and Y are covered indirectly.
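The sketch below spells out, following the description above, what selecting a hub-graph G(X, w, Y) contributes to the schedule: push edges from every node of X into the hub w, pull edges from w to every node of Y, and indirect coverage of the social edges going from X to Y. Names are illustrative.

```python
def apply_hub_graph(X, w, Y, E, H, L):
    """Add the push/pull edges of the hub-graph G(X, w, Y) and report covered edges."""
    H |= {(x, w) for x in X}                     # pushes into the hub view
    L |= {(w, y) for y in Y}                     # pulls out of the hub view
    covered = {(x, w) for x in X} | {(w, y) for y in Y}
    covered |= {(x, y) for x in X for y in Y if (x, y) in E}   # served indirectly
    return H, L, covered

if __name__ == "__main__":
    E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
    H, L, covered = apply_hub_graph({"art"}, "charlie", {"billie"}, E, set(), set())
    print(sorted(covered))   # all three edges of E are covered
```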
The first sub-problem, finding the hub-graph centered in a given
node that covers most edges with lowest cost, is an interesting op-
timization problem in itself. In order to define the sub-problem,
we associate to each node u of a hub-graph a weight g(u) reflect-
ing the cost of u. We set g(x) = r_p(x) for all x ∈ X, that is,
the cost of a push operation from x to w is associated to node x.
Similarly we associate the weight g(y) = r_c(y) for each y ∈ Y.
For the hub node w, we set g(w) = 0. Let W and E(W) be
the set of nodes and edges of the hub-graph, respectively, and let
g(W) = Σ_{u ∈ W} g(u). The cost-per-element of the hub-graph is:

p(W) = g(W) / |E(W)|.   (1)
The sub-problem can thus be formulated as finding, for each node
w of the social graph, the hub-graph (W, E(W )) centered on w
that minimizes p(W ).
Careful inspection of Equation (1) motivates us to consider the
following problem.
Problem 2 (DENSESTSUBGRAPH) Let G = (V, E) be a graph.
For a set S ⊆ V, E(S) denotes the set of edges of G between
nodes of S. The DENSESTSUBGRAPH problem asks to find the
subset S that maximizes the density function d(S) = |E(S)| / |S|.
If we weight the nodes of S using the g function defined above,
we can obtain a weighted variant of this problem by replacing the
density function d(S) with d_w(S) = |E(S)| / g(S).
Let G_w be the largest hub-graph centered in a node w, the one
where X and Y include all producers and consumers of w, respec-
tively. Any subgraph (S, E(S)) of G_w that maximizes d_w(S) min-
imizes p(S). Therefore, any solution of the weighted version of
DENSESTSUBGRAPH will give us the hub-graph centered on w to
be included in C.
Interestingly, although many variants of dense-subgraph prob-
lems are NP-hard, Problem 2 can be solved exactly in polynomial
time. Given that we are looking for a solution of the SETCOVER
problem with a logarithmic approximation factor, we settle for the
simple greedy algorithm analyzed by Asahiro et al. [1] and later
by Charikar [3]. This algorithm gives a 2-factor approximation for
Problem 2, and its running time is linear in the number of edges
in the graph. The algorithm is the following. Start with the whole
graph. Until left with an empty graph, iteratively remove the node
with the lowest degree (breaking ties arbitrarily) and all its incident
edges. Among all subgraphs considered during the execution of the
algorithm return the one with the maximum density.
The above algorithm works for the case that the density of a sub-
graph is d(S). In our case we want to maximize the weighted-
density function d_w(S). Thus we modify the greedy algorithm of
Asahiro et al. and Charikar as follows. In each iteration, instead of
deleting the node with the lowest degree, we delete the node that
minimizes a notion of weighted degree, defined as d_g(u) = d(u) / g(u),
where d(u) is the normal notion of degree of node u. We can show
that this modified algorithm yields a factor-2 approximation for the
weighted version of the DENSESTSUBGRAPH problem.
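Below is a compact sketch of the modified peeling heuristic just described: repeatedly delete the node with the smallest weighted degree d(u)/g(u) and remember the intermediate subgraph with the best weighted density |E(S)|/g(S). The lazy heap and the tiny positive weight standing in for g(w) = 0 are conveniences of this sketch, not details taken from the paper.

```python
import heapq

def weighted_densest_subgraph(nodes, edges, g):
    """Greedy peeling for the weighted density |E(S)| / g(S) (edges taken as undirected)."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive, m = set(nodes), len(edges)
    best, best_density = set(alive), m / sum(g[u] for u in alive)

    heap = [(len(adj[u]) / g[u], u) for u in alive]   # weighted degrees d(u)/g(u)
    heapq.heapify(heap)
    while len(alive) > 1:
        _, u = heapq.heappop(heap)
        if u not in alive:
            continue                                   # stale heap entry
        alive.discard(u)
        m -= len(adj[u] & alive)
        for v in adj[u] & alive:                       # refresh affected degrees
            heapq.heappush(heap, (len(adj[v] & alive) / g[v], v))
        gw = sum(g[x] for x in alive)
        if gw > 0 and m / gw > best_density:
            best, best_density = set(alive), m / gw
    return best, best_density

if __name__ == "__main__":
    nodes = ["w", "a", "b", "c"]
    edges = [("a", "w"), ("w", "b"), ("a", "b"), ("w", "c")]
    g = {"w": 1e-9, "a": 1.0, "b": 1.0, "c": 5.0}   # hub weight ~0, as in CHITCHAT
    print(weighted_densest_subgraph(nodes, edges, g))  # keeps {w, a, b}, drops the costly c
```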
Lemma 1 Given a graph G_w = (S, E(S)), there exists a linear-
time algorithm solving the weighted variant of the DENSESTSUB-
GRAPH problem within an approximation factor of 2.
PROOF. We prove the lemma by modifying the analysis of Cha-
rikar [3]. Let f(S) = |E(S)| / g(S) be the objective function to optimize,

Citations
Posted Content
TL;DR: This work introduces the vertex interplay (VI) and edge interplay (EI) plots to characterize the interplay between core and truss decompositions, and devises CORE-TRUSSDD, an anomaly detection algorithm that provides an efficient solution to retrieve the outliers in the networks, which correspond to the two anomalous behaviors.
Abstract: Finding the dense regions in a graph is an important problem in network analysis. Core decomposition and truss decomposition address this problem from two different perspectives. The former is a vertex-driven approach that assigns density indicators for vertices whereas the latter is an edge-driven technique that put density quantifiers on edges. Despite the algorithmic similarity between these two approaches, it is not clear how core and truss decompositions in a network are related. In this work, we introduce the vertex interplay (VI) and edge interplay (EI) plots to characterize the interplay between core and truss decompositions. Based on our observations, we devise CORE-TRUSSDD, an anomaly detection algorithm to identify the discrepancies between core and truss decompositions. We analyze a large and diverse set of real-world networks, and demonstrate how our approaches can be effective tools to characterize the patterns and anomalies in the networks. Through VI and EI plots, we observe distinct behaviors for graphs from different domains, and identify two anomalous behaviors driven by specific real-world structures. Our algorithm provides an efficient solution to retrieve the outliers in the networks, which correspond to the two anomalous behaviors. We believe that investigating the interplay between core and truss decompositions is important and can yield surprising insights regarding the dense subgraph structure of real-world networks.

2 citations


Cites methods from "Piggybacking on social networks"

  • ...Dense regions are also used to improve the efficiency of computation heavy tasks like distance query computation [7] and materialized per-user view creation [8]....

    [...]

Journal ArticleDOI
TL;DR: This tutorial helps researchers have a better understanding of existing densest subgraph models and solutions, but also provides them insights for future study.
Abstract: As one of the most fundamental problems in graph data mining, the densest subgraph discovery (DSD) problem has found a broad spectrum of real applications, such as social network community detection, graph index construction, regulatory motif discovery in DNA, fake follower detection, and so on. Theoretically, DSD closely relates to other fundamental graph problems, such as network flow and bipartite matching. Triggered by these applications and connections, DSD has garnered much attention from the database, data mining, theory, and network communities. In this tutorial, we first highlight the importance of DSD in various applications and the unique challenges that need to be addressed. Subsequently, we classify existing DSD solutions into several groups, which cover around 50 research papers published in many well-known venues (e.g., SIGMOD, PVLDB, TODS, WWW), and conduct a thorough review of these solutions in each group. Afterwards, we analyze and compare the models and solutions in these works. Finally, we point out a list of promising future research directions. We believe that this tutorial not only helps researchers have a better understanding of existing densest subgraph models and solutions, but also provides them insights for future study.

1 citation

Proceedings ArticleDOI
20 May 2018
TL;DR: This work proposes a novel scheme by exploiting the social community for event stream dissemination by fully exploiting the proposed hub-structure, based on the observation of high cluster coefficient in OSNs.
Abstract: In large-scale Online Social Network (OSN) systems, event stream dissemination incurs costly inter-server communication due to the per-user view data storage. To solve the problem, existing schemes commonly leverage the social graph structures to save redundant inter-server traffics across social links. The state-of-the-art social piggyback scheme reduces the inter-server traffics by fully exploiting the proposed hub-structure, based on the observation of high cluster coefficient in OSNs. In order to find the best hub-structure, however, such a scheme needs to identify the global densest sub-graph by iteratively removing the node with the minimum weighted degree. Such a process causes a worst computation cost of O(n²), making the social piggyback scheme unscalable for real-world large-scale OSN graphs. In this work, we propose a novel scheme by exploiting the social community for event stream dissemination. We first detect the social communities in a social graph by using an efficient community detection algorithm based on distance dynamics. For each community, we then design a heuristics algorithm to fully leverage the hub-structure. The heuristics algorithm explores the hub-structure center on the node with maximum degree in each iteration. We collect large-scale datasets from DBLP and Facebook and conduct comprehensive experiments to evaluate our design. The results show that our design significantly reduces the communication overhead and computing time by 40.71% and 81.21% compared to existing schemes, respectively.

Cites background or methods from "Piggybacking on social networks"

  • ...Gionis et al. demonstrate that finding the best piggyback assignment is NP-hard [7], due to the large solution space of assigning the push, the pull, and the piggyback strategy to all links in a large-scale social graph....

    [...]

  • ...We compare our algorithm with the state-of-the-art CHITCHAT algorithm [7] and the hybrid scheme [15]....

    [...]

  • ...[7], which takes full advantage of piggyback structure to save the communica-...

    [...]

  • ...The overall communication overhead [7] of the schedule (H,L) can be computed by Eq....

    [...]

  • ...[7] further propose the CHITCHAT algorithm to find the densest hub-structure with as many piggyback structures as...

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of finding the node set that is most likely to induce a densest subgraph in an uncertain graph and proposed sampling-based efficient algorithms to compute the MPDS.
Abstract: Computing the densest subgraph is a primitive graph operation with critical applications in detecting communities, events, and anomalies in biological, social, Web, and financial networks. In this paper, we study the novel problem of Most Probable Densest Subgraph (MPDS) discovery in uncertain graphs: Find the node set that is the most likely to induce a densest subgraph in an uncertain graph. We further extend our problem by considering various notions of density, e.g., clique and pattern densities, studying the top-k MPDSs, and finding the node set with the largest containment probability within densest subgraphs. We show that it is #P-hard to compute the probability of a node set inducing a densest subgraph. We then devise sampling-based efficient algorithms, with end-to-end accuracy guarantees, to compute the MPDS. Our thorough experimental results and real-world case studies on brain and social networks validate the effectiveness, efficiency, and usefulness of our solution.
Journal ArticleDOI
TL;DR: In this article, the authors examined the impact of social learning on consumption and production decisions in a societal context, and emphasized the need for a more rational and informed decision-making process in promoting a sustainable future.
Abstract: This study examines the impact of social learning on consumption and production decisions in a societal context. Individuals learn the actual value of nature through information and subsequent network communication, which is illustrated using the Directed Graph theory and DeGroot social learning process. In this context, individuals with greater access to private information are called "neighbours." Results suggest that in a perfectly rational scenario, individuals have high confidence in their abilities and base their decisions on a combination of personal experience, perception, and intellect; thus, society is expected to converge towards making responsible consumption choices $ {\mathrm{R}}_{\mathrm{c}}^{\mathrm{*}} $. However, when individuals are bounded or irrational, they exhibit persuasion bias or stubbornness, and diversity, independence, and decentralization are lacking. It leads to a situation where the consumption network lacks wisdom and may never result in responsible consumption choices. Thus finite, uniformly conspicuous neighbours will swiftly converge towards the opinion of the group. When a large proportion of individuals consume excessively (extravagance) or below the optimal level (misery), the consumption network is dominated by unwise decision-makers, leading to a society that prevents promoting sustainability. In conclusion, this study emphasizes the need for a more rational and informed decision-making process in promoting a sustainable future.
References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations


"Piggybacking on social networks" refers methods in this paper

  • ...PARALLELNOSY does not have the approximation guarantee of CHITCHAT, but it is a parallel algorithm that can be implemented as a MapReduce job and thus scales to real-size social graphs....

    [...]

  • ...Phase 2 is executed by the reduce phase of MapReduce, where each reducer receives all lock requests for a given edge u → v....

    [...]

  • ...We now describe in more detail the issues pertaining to the MapReduce implementation; we assume that the reader is familiar with the MapReduce architecture....

    [...]

  • ...Therefore, our implementation uses a pull approach and two MapReduce jobs: in the first job, hub-graphs having u → v as cross-edge send a notification to the hub-graphs centered in u and v saying that they are interested in updates to u → v. Updates for the edge are propagated only if they are indeed available....

    [...]

  • ...For the twitter graph, the amount of memory used by individual MapReduce workers exceeds in some cases the RAM capacity allocated to these workers, which is 1GB....

    [...]

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
TL;DR: Developments in this field are reviewed, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
Abstract: Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.

17,647 citations


"Piggybacking on social networks" refers background in this paper

  • ...Our main observation is that the high clustering coefficient of social networks implies the presence of many hubs, making hub-based schedules very efficient [10]....

    [...]

Proceedings Article
16 May 2010
TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveals very little about the influence of a user.
Abstract: Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others — a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveals very little about the influence of a user.

3,041 citations


Additional excerpts

  • ...[2]....

    [...]

Journal ArticleDOI
TL;DR: It is proved that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless NP has slightly superpolynomial time algorithms.
Abstract: Given a collection ℱ of subsets of S = {1,…,n}, set cover is the problem of selecting as few as possible subsets from ℱ such that their union covers S, and max k-cover is the problem of selecting k subsets from ℱ such that their union has maximum cardinality. Both these problems are NP-hard. We prove that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless NP has slightly superpolynomial time algorithms. This closes the gap (up to low-order terms) between the ratio of approximation achievable by the greedy algorithm (which is (1 - o(1)) ln n), and previous results of Lund and Yannakakis, that showed hardness of approximation within a ratio of (log₂ n)/2 ≃ 0.72 ln n. For max k-cover, we show an approximation threshold of (1 - 1/e) (up to low-order terms), under the assumption that P ≠ NP.

2,941 citations


"Piggybacking on social networks" refers background in this paper

  • ...At the same time, this logarithmic guarantee is essentially the best one can hope for, since Feige showed that the problem is not approximable within (1 − o(1)) lnn, unless NP has quasi-polynomial time algorithms [7]....

    [...]

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Piggybacking on social networks"?

The authors propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. The authors show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. The authors propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. The authors also evaluate their algorithms on a real social networking system prototype and they show that the actual increase in throughput corresponds nicely to the gain anticipated by their cost function.