Journal ArticleDOI

Piggybacking on social networks

01 Apr 2013 - Proceedings of the VLDB Endowment, Vol. 6, Iss. 6, pp. 409-420
TL;DR: It is shown that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem, and an O(log n) approximation algorithm and a heuristic are proposed to solve the problem.
Abstract: The popularity of social-networking sites has increased rapidly over the last decade. A basic functionality of social-networking sites is to present users with streams of events shared by their friends. At a systems level, materialized per-user views are a common way to assemble and deliver such event streams on-line and with low latency. Access to the data stores, which keep the user views, is a major bottleneck of social-networking systems. We propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. By using one such hub view, the system can serve requests of the first friend without querying or updating the view of the second. We show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. We propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. Compared to existing approaches, using social piggybacking results in similar throughput in systems with few servers, but enables substantial throughput improvements as the size of the system grows, reaching up to a 2-factor increase. We also evaluate our algorithms on a real social networking system prototype and we show that the actual increase in throughput corresponds nicely to the gain anticipated by our cost function.

Summary (3 min read)

1. INTRODUCTION

  • Social networking sites have become highly popular in the past few years.
  • To put their work in context and to motivate their problem definition, the authors describe the typical architecture of social networking systems, and they discuss the process of assembling event streams.
  • The collection of push and pull sets for each user of the system is called the request schedule, and it has a strong impact on performance.
  • CHITCHAT and PARALLELNOSY assume that the graph is static; however, using a simple incremental technique, request schedules can be efficiently adapted when the social graph is modified.

2. SOCIAL DISSEMINATION PROBLEM

  • Dissemination must satisfy bounded staleness, a property modeling the requirement that event streams shall show events almost in real time.
  • The authors then show that the only request schedules satisfying bounded staleness let each pair of users communicate either using direct push, or direct pull, or social piggybacking.
  • Finally, the authors analyze the complexity of the social-dissemination problem and show that their results extend to more complex system models with active stores.

2.1 System model

  • For the purpose of their analysis, the authors do not distinguish between nodes in the graph, the corresponding users, and their materialized views.
  • Event streams and views consist of a finite list of events, filtered according to application-specific relevance criteria.
  • In the system of Figure 2, the request schedule determines which edges of the social graph are included in the push and pull sets of any user.
  • The workload is characterized by the production rate r_p(u) and the consumption rate r_c(u) of each user u.
  • Note that the cost of updating and querying a user’s own view is not represented in the cost metric because it is implicit.

2.2 Problem definition

  • The authors now define the problem that they address in this paper.
  • The authors propose solving the DISSEMINATION problem using social piggybacking, that is, making two nodes communicate through a third common contact, called hub.
  • Since the user w_1 may remain idle for an arbitrarily long time, one cannot guarantee bounded staleness.
  • The authors call data stores that only react to user requests passive stores.
  • This is formally shown by the following equivalence result.

3. ALGORITHMS

  • This section introduces two algorithms to solve the DISSEMINATION problem.
  • Every time the algorithm selects a candidate from C, it adds the required push and pull edges to the solution, the request schedule (H,L).
  • Given that the authors are looking for a solution of the SETCOVER problem with a logarithmic approximation factor, they settle for the simple greedy algorithm analyzed by Asahiro et al. [1] and later by Charikar [3].
  • The authors can show that this modified algorithm yields a factor-2 approximation for the weighted version of the DENSESTSUBGRAPH problem.
  • In the edge locking phase, each candidate hub-graph tries to lock its edges.
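A toy, single-machine sketch of the lock-arbitration idea behind that edge-locking phase is shown below: candidate hub-graphs request locks on the edges they want to cover, and each contested edge is granted to a single candidate. The scoring rule and all names are illustrative assumptions, not the paper's actual MapReduce implementation.

```python
from collections import defaultdict

def edge_locking(candidates):
    """candidates: dict hub_id -> (score, set of directed edges (u, v))."""
    # "Map" step: every candidate hub-graph requests a lock on each of its edges.
    requests = defaultdict(list)               # edge -> [(score, hub_id), ...]
    for hub_id, (score, edges) in candidates.items():
        for edge in edges:
            requests[edge].append((score, hub_id))

    # "Reduce" step: each edge grants its lock to exactly one candidate
    # (here: the highest score, ties broken by id).
    granted = {edge: max(reqs)[1] for edge, reqs in requests.items()}

    # A candidate is applied only if it obtained every lock it asked for.
    return [hub_id for hub_id, (_, edges) in candidates.items()
            if all(granted[e] == hub_id for e in edges)]

if __name__ == "__main__":
    cands = {
        "hub_w": (3.0, {("a", "w"), ("w", "b"), ("a", "b")}),
        "hub_z": (1.5, {("a", "z"), ("z", "b"), ("a", "b")}),  # conflicts on (a, b)
    }
    print(edge_locking(cands))   # ['hub_w']
```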

3.3 Incremental updates

  • PARALLELNOSY and CHITCHAT optimize a static social graph.
  • Over time, graph updates gradually degrade the quality of the dissemination schedule, so their algorithms can be executed periodically to re-optimize cost.
  • The experimental evaluation of Section 4 indicates that their algorithm does not need to be re-executed frequently.
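The summary does not spell out the incremental technique, so the following is only one plausible rule, consistent with the request-schedule model used in the paper but not necessarily the authors' method: a newly added edge is left alone if an existing hub already covers it, and is otherwise served directly by the cheaper of push and pull. All names are hypothetical.

```python
def add_edge(u, v, H, L, out_neighbors, rp, rc):
    """Incrementally handle a new social edge u -> v.

    H, L: sets of (src, dst) push/pull edges; out_neighbors: adjacency sets;
    rp, rc: callables returning production/consumption rates.
    """
    out_neighbors.setdefault(u, set()).add(v)

    # Already covered by piggybacking: u pushes to some hub w, and v pulls from w.
    for w in out_neighbors.get(u, ()):
        if (u, w) in H and (w, v) in L:
            return H, L

    # Otherwise fall back to the cheaper direct option (the hybrid rule).
    (H if rp(u) <= rc(v) else L).add((u, v))
    return H, L

if __name__ == "__main__":
    H, L = {("art", "charlie")}, {("charlie", "billie")}
    adj = {"art": {"charlie"}, "charlie": {"billie"}}
    # Art -> Billie is already covered through the hub Charlie: nothing to add.
    print(add_edge("art", "billie", H, L, adj, lambda u: 1.0, lambda v: 2.0))
```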

4. EVALUATION

  • The authors evaluate the throughput performance of the proposed algorithm, contrasting it against the best available scheduling algorithm, the hybrid policy of Silberstein et al. [11].
  • The authors' evaluation is both analytical, considering their cost metric of Section 2.1, and experimental, using measurements on a social networking system prototype.
  • The authors show that the PARALLELNOSY heuristic scales to real-world social graphs and doubles the throughput of social networking systems compared to hybrid schedules.
  • On a real prototype, PARALLELNOSY provides throughput similar to hybrid schedules when the system is composed of few servers; as the system grows, the throughput improvement becomes more evident, approaching the 2-factor analytical improvement.
  • The authors also evaluate the relative performance of the two proposed algorithms PARALLELNOSY and CHITCHAT.

4.1 Input data

  • The authors obtain datasets from two social graphs: flickr, as of April 2008, and twitter, as of August 2009.
  • The twitter graph has been made available by Cha et al. [2].
  • The authors' algorithms also require input workloads: production and consumption rates for all the nodes in the network.
  • It has been observed by Huberman et al. that nodes with many followers tend to have a higher production rate, and nodes following many other nodes tend to have a higher consumption rate [8].

4.2 Social piggybacking on large social graphs

  • The authors run their MapReduce implementation of the PARALLELNOSY heuristic on the full twitter and flickr graphs.
  • As discussed in Section 3.2, very large social graphs may contain millions of cross-edges for a single hub-graph.
  • The authors quantify the performance of their algorithms by measuring their throughput compared against a baseline.
  • For both social graphs, the throughput of the PARALLELNOSY schedule increases sharply during the first iterations and it quickly stabilizes.
  • The larger stabilization time for twitter is due to the incremental detection of cross-edges at every cycle, as discussed before.

4.3 Prototype performance

  • In the previous section the authors evaluated their algorithms in terms of the predicted cost function that the algorithms optimize.
  • When processing a user query, application servers send at most one query per data-store server s, which replies with a list of events filtered from all relevant views stored by s; the authors call this batching (see the sketch after this list).
  • Using data partitioning information as input of the DISSEMINATION problem is attractive, but has two main drawbacks.
  • The authors found that, if the network does not become a bottleneck, the overall throughput using n clients and n servers is about n times the per-client throughput with n servers.
  • Note that, since the y axis is logarithmic, the divergence between the algorithms and the error bars on the right side of the graph are magnified.
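The batching mentioned in the list above can be illustrated with a small sketch: the views in a user's pull set are grouped by the data-store server that hosts them, and at most one (batched) query is sent per server. `server_of` and `query_server` are hypothetical placeholders, not the prototype's API.

```python
from collections import defaultdict

def assemble_stream(pull_views, server_of, query_server):
    """Return the events for one stream request using one query per server."""
    batches = defaultdict(list)                 # server -> views hosted there
    for view in pull_views:
        batches[server_of(view)].append(view)

    events = []
    for server, views in batches.items():       # at most one request per server
        events.extend(query_server(server, views))
    return sorted(events, reverse=True)         # e.g. newest timestamp first

if __name__ == "__main__":
    store = {"s1": {"alice": [3, 1]}, "s2": {"bob": [2]}}
    print(assemble_stream(
        ["alice", "bob"],
        server_of=lambda v: "s1" if v == "alice" else "s2",
        query_server=lambda s, vs: [e for v in vs for e in store[s][v]],
    ))   # [3, 2, 1]
```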

4.4 The potential of social piggybacking

  • The previous experiments show that PARALLELNOSY is an effective heuristic for real-world large-scale social networking systems.
  • In the experiments discussed below the authors use five graph samples; the plots report averages.
  • As for random-walk sampling, existing work has pointed out that it preserves certain clustering metrics; more precisely, in both the original and sampled graphs, nodes with the same degree have a similar ratio of actual to potential edges between their neighbors [9].
  • This reduces the relative gain of social piggybacking since the hybrid schedule of Silberstein et al. (our baseline) uses per-edge optimizations that do not depend on the degree of nodes.

6. CONCLUSION

  • Assembling and delivering event streams is a major feature of social networking systems and imposes a heavy load on back-end data stores.
  • The authors proposed two algorithms to compute request schedules that leverage social piggybacking.
  • The CHITCHAT algorithm is an approximation algorithm that uses a novel combination of the SETCOVER and DENSESTSUBGRAPH problems, and it has an approximation factor of O(ln n).
  • The PARALLELNOSY heuristic is a parallel algorithm that can scale to large social graphs.
  • In small systems, the authors obtained throughput similar to existing hybrid approaches, but as the size of the system grows beyond a few hundred servers, the throughput gain becomes significant, approaching the 2-factor limit.


Piggybacking on Social Networks
Aristides Gionis
Aalto University and HIIT
Espoo, Finland
aristides.gionis@aalto.fi
Flavio Junqueira
Microsoft Research
Cambridge, UK
fpj@microsoft.com
Vincent Leroy
Univ. of Grenoble CNRS
Grenoble, France
vincent.leroy@imag.fr
Marco Serafini
QCRI
Doha, Qatar
mserafini@qf.org.qa
Ingmar Weber
QCRI
Doha, Qatar
ingmarweber@acm.org
ABSTRACT
The popularity of social-networking sites has increased rapidly over
the last decade. A basic functionality of social-networking sites is
to present users with streams of events shared by their friends. At a
systems level, materialized per-user views are a common way to as-
semble and deliver such event streams on-line and with low latency.
Access to the data stores, which keep the user views, is a major bot-
tleneck of social-networking systems. We propose to improve the
throughput of these systems by using social piggybacking, which
consists of processing the requests of two friends by querying and
updating the view of a third common friend. By using one such
hub view, the system can serve requests of the first friend with-
out querying or updating the view of the second. We show that,
given a social graph, social piggybacking can minimize the overall
number of requests, but computing the optimal set of hubs is an
NP-hard problem. We propose an O(log n) approximation algo-
rithm and a heuristic to solve the problem, and evaluate them using
the full Twitter and Flickr social graphs, which have up to billions
of edges. Compared to existing approaches, using social piggy-
backing results in similar throughput in systems with few servers,
but enables substantial throughput improvements as the size of the
system grows, reaching up to a 2-factor increase. We also evaluate
our algorithms on a real social networking system prototype and
we show that the actual increase in throughput corresponds nicely
to the gain anticipated by our cost function.
1. INTRODUCTION
Social networking sites have become highly popular in the past
few years. An increasing number of people use social network-
ing applications as a primary medium of finding new and inter-
esting information. Some of the most popular social networking
applications include services like Facebook, Twitter, Tumblr or Ya-
hoo! News Activity. In these applications, users establish connec-
tions with other users and share events: short text messages, URLs,
photos, news stories, videos, and so on. Users can browse event
streams, real-time lists of recent events shared by their contacts,
on most social networking sites. A key peculiarity of social net-
working applications compared to traditional Web sites is that the
process of information dissemination is taking place in a many-
to-many fashion instead of the traditional few-to-many paradigm,
posing new system scalability challenges.
In this paper, we study the problem of assembling event streams,
which is the predominant workload of many social networking ap-
plications, e.g., 70% of the page views of Tumblr.^1 Assembling
of event streams needs to be on-line, to include the latest events for
every user, and very fast, as users expect the resulting event streams
to load in fractions of a second.
To put our work in context and to motivate our problem def-
inition, we describe the typical architecture of social networking
systems, and we discuss the process of assembling event streams.
We consider a system similar to the one depicted in Figure 1. In
such a system, information about users, the social graph, and events
shared by users are stored in back-end data stores. Users send re-
quests, such as sharing new events or receiving updates on their
event stream, to the social networking system through their browsers
or mobile apps.
A large social network with a very large number of active users
generates a massive workload. To handle this query workload and
optimize performance, the system uses materialized views. Views
are typically formed on a per-user basis, since each user sees a
different event stream. Views can contain events from a user’s
contacts and from the user itself. Our discussion is independent
of the implementation of the data stores; they could be relational
databases, key-value stores, or other data stores.
The throughput of the system is proportional to the data trans-
ferred to and from the data stores; therefore, increasing the data-
store throughput is a key problem in social networking systems.^2
In this paper, we propose optimization algorithms to reduce the
load induced on data stores—the thick red arrows in Figure 1. Our
algorithms make it possible to run the application using fewer data-
store servers or, equivalently, to increase throughput with the same
number of data-store servers.
Commercial social networking systems already use strategies to
send fewer requests to the data-store servers. A system can group
the views of the contacts of a user in two user-specific sets: the
push set, containing contact views that are updated by the data-
^1 http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
^2 http://www.facebook.com/note.php?note id=39391378919

Figure 1: Simplified request flow for handling event streams in
a social networking system. We focus on reducing the through-
put cost of the most complex step: querying and updating data
stores (shown with thick red arrows).
store clients when the user shares a new event, and the pull set, con-
taining contact views that are queried to assemble the user’s event
stream. The collection of push and pull sets for each user of the sys-
tem is called request schedule, and it has strong impact on perfor-
mance. Two standard request schedules are push-all and pull-all.
In push-all schedules, the push set contains all of the user's contacts,
while the pull set contains only the user’s own view. This schedule
is efficient in read-dominated workloads because each query gen-
erates only one request. Pull-all schedules are the mirror image, and are
better suited for write-dominated workloads. More efficient sched-
ules can be identified by using a hybrid approach between pull- and
push-all, as proposed by Silberstein et al. [11]: for each pair of con-
tacts, choose between push and pull depending on how frequently
the two contacts share events and request event streams. This ap-
proach has been adopted, for example, by Tumblr.
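As a concrete illustration of the per-edge hybrid rule, the sketch below chooses, for each edge u → v, the cheaper of a push (which costs the producer rate r_p(u) per shared event) and a pull (which costs the consumer rate r_c(v) per stream request). The rates are made-up values, not measurements from the paper.

```python
def hybrid_schedule(edges, rp, rc):
    """Return a (push set H, pull set L) pair using the per-edge hybrid rule."""
    H, L = set(), set()
    for u, v in edges:
        if rp[u] <= rc[v]:
            H.add((u, v))           # cheaper to push u's events into v's view
        else:
            L.add((u, v))           # cheaper to pull from u's view on demand
    return H, L

if __name__ == "__main__":
    edges = [("art", "billie"), ("celebrity", "billie")]
    rp = {"art": 1.0, "celebrity": 50.0}      # events shared per hour (illustrative)
    rc = {"billie": 10.0}                     # stream requests per hour (illustrative)
    print(hybrid_schedule(edges, rp, rc))
    # ({('art', 'billie')}, {('celebrity', 'billie')})
```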
In this paper we propose strictly cheaper schedules based on so-
cial piggybacking: the main idea is to process the requests of two
contacts by querying and updating the view of a third common con-
tact. Consider the example shown in Figure 2. For generality, we
model a social graph as a directed graph where a user may follow
another user, but the follow relationship is not necessarily symmet-
ric. In the example, Charlie’s view is in Art’s push set, so clients
insert every new event by Art into Charlie’s view. Consider now
that Billie follows both Art and Charlie. When Billie requests an
event stream, social piggybacking lets clients serving this request
pull Art’s updates from Charlie’s view, and so Charlie’s view acts
as a hub. Our main observation is that the high clustering coeffi-
cient of social networks implies the presence of many hubs, making
hub-based schedules very efficient [10].
Social piggybacking generates fewer data-store requests than ap-
proaches based on push-all, pull-all, or hybrid schedules. With a
push-all schedule, the system pushes new events by Art to Billie’s
view—the dashed thick red arrow in Figure 2(b). With a pull-all
schedule, the system queries events from Art’s view whenever Bil-
lie requests a new event stream—the dashed double green arrow
in Figure 2(b). With a hybrid schedule, the system executes the
cheaper of these two operations. With social piggybacking, the
system does not execute any of them.
Using hubs in existing social networking architectures is very
simple: it just requires a careful configuration of push and pull sets.
In this paper, we tackle the problem of calculating this configura-
tion, or in other words, the request schedule. The objective is to
minimize the overall rate of requests sent to views. We call this
problem the social-dissemination problem.
Our contribution is a comprehensive study of the problem of
social-dissemination. We first show that optimal solutions of the
social-dissemination problem either use hubs (as Charlie in Fig-
ure 2) or, when efficient hubs are not available, make pairs of users
exchange events by sending requests to their view directly. This
result reduces significantly the space of solutions that need to be
explored, simplifying the analysis.
We show that computing optimal request schedules using hubs is
NP-hard, and we propose an approximation algorithm, which we
call CHITCHAT. The hardness of our problem comes from the set-
cover problem, and naturally, our approximation algorithm is based
on a greedy strategy and achieves an O(log n) guarantee. Apply-
ing the greedy strategy, however, is non-trivial, as the iterative step
of selecting the most cost-effective subset is itself an interesting op-
timization problem, which we solve by mapping it to the weighted
densest-subgraph problem.
We then develop a heuristic, named PARALLELNOSY, which can
be used for very large social networks. PARALLELNOSY does not
have the approximation guarantee of CHITCHAT, but it is a parallel
algorithm that can be implemented as a MapReduce job and thus
scales to real-size social graphs.
CHITCHAT and PARALLELNOSY assume that the graph is static;
however, using a simple incremental technique, request schedules
can be efficiently adapted when the social graph is modified. We
show that even if the social graph is dynamic, executing an initial
optimization pays off even after adding a large number of edges to
the graph, so it is not necessary to optimize the schedule frequently.
Evaluation on the full Twitter and Flickr graphs, which have bil-
lions of edges, shows that PARALLELNOSY schedules can improve
predicted throughput by a factor of up to 2 compared to the state-
of-the-art scheduling approach of Silberstein et al. [11].
Using a social networking system prototype, we show that the
actual throughput improvement using PARALLELNOSY schedules
compared to hybrid scheduling is significant and matches very well
our predicted improvement. In small systems with few servers the
throughput is similar, but the throughput improvement grows with
the size of the system, becoming particularly significant for large
social networking systems that use hundreds of servers to serve
millions, or even billions, of requests.^3 With 500 servers, PARAL-
LELNOSY increases the throughput of the prototype by about 20%;
with 1000 servers, the increase is about 35%; eventually, as the
number of servers grows, the improvement approaches the predicted
2-factor increase previously discussed. In absolute terms, this may
mean processing millions of additional requests per second.
We also compare the performance of CHITCHAT and PARAL-
LELNOSY on large samples of the actual Twitter and Flickr graphs.
CHITCHAT significantly outperforms PARALLELNOSY, showing
that there is potential for further improvements by making more
complex social piggybacking algorithms scalable.
Overall, we make the following contributions:
  • Introducing the concept of social piggybacking, formalizing the social dissemination problem, and showing its NP-hardness;
  • Presenting the CHITCHAT approximation algorithm and showing its O(log n) approximation bound;
  • Presenting the PARALLELNOSY heuristic, which can be parallelized and scaled to very large graphs;
  • Evaluating the predicted throughput of PARALLELNOSY schedules on full Twitter and Flickr graphs;
  • Measuring actual throughput on a social networking system prototype;
  • Comparing CHITCHAT and PARALLELNOSY on samples of the Twitter and Flickr graphs to explore possible further gains.
^3 For an example, see: http://gigaom.com/2011/04/07/facebook-this-is-what-webscale-looks-like/

Figure 2: Example of social piggybacking. Pushes are thick red
arrows, pulls double green ones. (a) The edge from Art to Bil-
lie can be served through Charlie if Art pushes to Charlie and
Billie pulls from Charlie. (b) Charlie’s view is a hub. Existing
approaches unnecessarily issue one of the dashed requests.
Roadmap. In Section 2 we discuss our model and present a formal
statement of the problem we consider. In Section 3 we present our
algorithms, which we evaluate in Section 4. We discuss the related
work in Section 5, and Section 6 concludes the work.
2. SOCIAL DISSEMINATION PROBLEM
We formalize the social-dissemination problem as a problem of
propagating events on a social graph. The goal is to efficiently
broadcast information from a user to its neighbors. Dissemination
must satisfy bounded staleness, a property modeling the require-
ment that event streams shall show events almost in real time. We
then show that the only request schedules satisfying bounded stal-
eness let each pair of users communicate either using direct push,
or direct pull, or social piggybacking. Finally, we analyze the com-
plexity of the social-dissemination problem and show that our re-
sults extend to more complex system models with active stores.
2.1 System model
We model the social graph as a directed graph G = (V, E). The
presence of an edge u → v in the social graph indicates that the
user v subscribes to the events produced by u. We will call u a
producer and v a consumer. Symmetric social relationships can be
modeled with two directed edges u → v and v → u.
A user can issue two types of requests: sharing an event, such as
a text message or a picture, and requesting an updated event stream,
a real-time list of recent events shared by the producers of the user.
For the purpose of our analysis, we do not distinguish between
nodes in the graph, the corresponding users, and their materialized
views. There is one view per user. A user view contains events
from the user itself and from the other users it subscribed to; send-
ing events to uninterested users results in unnecessary additional
throughput cost, which is the metric we want to minimize.
Definition 1 (View) A view is a set of events such that if an event
produced by user u is in the view of user v, then u = v or u → v ∈ E.
Event streams and views consist of a finite list of events, filtered
according to application-specific relevance criteria. Different filter-
ing criteria can be easily adapted in our framework; however, for
generality purposes, we do not explicitly consider filtering criteria
but instead assume that all necessary past events are stored in views
and returned by queries.
A fundamental requirement for any feasible solution is that event
streams have bounded staleness: each event stream assembled for a
user u must contain every recent event shared by any producers of
u; the only events that are allowed to be missing are those shared
at most Θ time units ago. The specific value of the parameter Θ
may depend on various system parameters, such as the speed of
networks, CPUs, and external-memories, but it may also be a func-
tion of the current load of the system. The underlying motivation
of bounded staleness is that typical social applications must present
near real-time event streams, but small delays may be acceptable.
Definition 2 (Bounded staleness) There exists a finite time bound Θ
such that, for each edge u → v ∈ E, any query action of v issued
at any time t in any execution returns every event posted by u in the
same execution at time t − Θ or before.
Note that the staleness of event streams is different from request
latency: a system might assemble event streams very quickly, but
they might contain very old events. Our work addresses the prob-
lem of request latency indirectly: improving throughput makes it
more likely to serve event streams with low latency.
In the system of Figure 2, the request schedule determines which
edges of the social graph are included in the push and pull sets of
any user. In our formal model, we consider two global pusH and
pulL sets, called H and L respectively, both subsets of the set of
edges E of the social graph. If a node u pushes events to a node
v in the model, this corresponds, in an actual system like the one
shown in Figure 2, to data-store clients updating the view of the
user v with all new events shared by user u whenever u shares them.
Similarly, if a node v pulls events from a node u, this corresponds
to data-store clients sending a query request to the view of the user
u whenever v requests its event stream. For simplicity, we assume
that users always access their own view with updates and queries.
Definition 3 (Request schedule) A request schedule is a pair
(H, L) of sets, with a push set H ⊆ E and a pull set L ⊆ E.
If v is in the push set of u, we say that u → v ∈ H. If u is in the
pull set of v, we say that u → v ∈ L.
It is important to note that all existing push-all, pull-all, and hy-
brid schedules described in Section 1 are sub-classes of the request
schedule class defined above.
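To illustrate that remark, the sketch below represents a request schedule simply as a pair of edge sets (H, L); push-all and pull-all then correspond to the two extreme assignments. This is a didactic representation only, not the data structure used in the paper.

```python
def push_all(E):
    """Push-all schedule: every social edge is a push edge."""
    return set(E), set()

def pull_all(E):
    """Pull-all schedule: every social edge is a pull edge."""
    return set(), set(E)

if __name__ == "__main__":
    E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
    print(push_all(E))   # (H = E, L = empty)
    print(pull_all(E))   # (H = empty, L = E)
```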
The goal of social dissemination is to obtain a request schedule
that minimizes the throughput cost induced by a workload on a
social networking system. We characterize the throughput cost of a
workload as the overall rate of queries and updates it induces on data-
store servers. The workload is characterized by the production rate
r_p(u) and the consumption rate r_c(u) of each user u. These rates
indicate the average frequency with which users share new events
and request event streams, respectively. Given an edge u → v, the
cost incurred if u → v ∈ H is r_p(u), because every time u shares
a new event, an update is sent to the view of v; similarly, the cost
incurred if u → v ∈ L is r_c(v), because every event stream request
from v generates a query to the view of u.
The cost of the request schedule (H, L) is thus:

c(H, L) = Σ_{u→v ∈ H} r_p(u) + Σ_{u→v ∈ L} r_c(v).
This expression does not explicitly consider differences in the
cost of push and pull operations, modeling situations where the
messages generated by updates and queries are very small and have
similar cost. In order to model scenarios where the cost of a pull
operation is k times the cost of a push, independent of the specific
throughput metric we want to minimize (e.g., number of messages,
number of bytes transferred), it is sufficient to multiply all con-
sumption rates by a factor k. Similarly, multiplying all production

rates by a factor k models systems where a push is more expensive
than a pull. Note that the cost of updating and querying a user’s own
view is not represented in the cost metric because it is implicit.
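A minimal sketch of the cost metric c(H, L) just defined, evaluated on the Art/Charlie/Billie scenario of Figure 2; the rates are illustrative numbers.

```python
def schedule_cost(H, L, rp, rc):
    """c(H, L): each push edge u -> v costs r_p(u), each pull edge costs r_c(v)."""
    return sum(rp[u] for u, v in H) + sum(rc[v] for u, v in L)

if __name__ == "__main__":
    rp = {"art": 2.0, "charlie": 1.0}          # production rates (illustrative)
    rc = {"billie": 3.0, "charlie": 3.0}       # consumption rates (illustrative)
    # Piggybacking over the hub Charlie: Art pushes to Charlie's view and
    # Billie pulls from it, so the edge Art -> Billie is covered at no extra cost.
    H = {("art", "charlie")}
    L = {("charlie", "billie")}
    print(schedule_cost(H, L, rp, rc))         # 2.0 + 3.0 = 5.0
```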
2.2 Problem definition
We now define the problem that we address in this paper.
Problem 1 (DISSEMINATION) Given a graph G = (V, E), and a
workload with production and consumption rates r_p(u) and r_c(u)
for each node u ∈ V, find a request schedule (H, L) that guaran-
tees bounded staleness, while minimizing the cost c(H, L).
In this paper, we propose solving the DISSEMINATION problem
using social piggybacking, that is, making two nodes communicate
through a third common contact, called hub. Social piggybacking
is formally defined as follows.
Definition 4 (Piggybacking) An edge u → v of a graph G(V, E)
is covered by piggybacking through a hub w ∈ V if there exists a
node w such that u → w ∈ E, w → v ∈ E, u → w ∈ H, and
w → v ∈ L.
Let ∆ be the upper bound on the time it takes for a system to
serve a user request. Piggybacking guarantees bounded staleness
with Θ = 2∆. In fact, it turns out that admissible schedules trans-
mit events over a social graph edge u → v only by pushing to v,
pulling from u, or using social piggybacking over a hub.
Theorem 1 Let (H, L) be a request schedule that guarantees
bounded staleness on a social graph G = (V, E). Then for each
edge u → v ∈ E, it holds that either (i) u → v ∈ H, or (ii)
u → v ∈ L, or (iii) u → v is covered by piggybacking through a
hub w ∈ V.
PROOF. As we already discussed, all three operations satisfy the
guarantee of bounded-time delivery. We will now argue that they
are the only three such operations.
Assume that the edge u → v is not served directly, but via a
path p = u → w_1 → . . . → w_k → v. If the length of the
path p is 2, i.e., if k = 1, then simple enumeration of all cases for
paths of length 2 shows that social piggybacking is the only case
that satisfies bounded staleness in each execution. For example,
assume that both the edges u → w_1 and w_1 → v are push edges.
Then, delivery of an event requires that user w_1 will take some
action within a certain time bound. However, since the user w_1
may remain idle for an arbitrarily long time, we cannot guarantee
bounded staleness.
For longer paths a similar argument holds. In particular, for paths
such that k > 1, the information has to propagate along some
edge w_i → w_{i+1}. The information cannot propagate along the
edge w_i → w_{i+1} without one of the users w_i or w_{i+1} taking an
action, and clearly we can assume that there exist executions in which
both w_i and w_{i+1} remain idle after u has posted an event and before
the next query of v.
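The admissibility condition of Theorem 1 can be checked mechanically; the sketch below verifies that every social edge is a direct push, a direct pull, or piggybacked through some hub. It is a didactic checker written for this summary, not part of the paper's algorithms.

```python
def satisfies_theorem1(E, H, L):
    """True iff every edge of E is covered as in Theorem 1 by the schedule (H, L)."""
    def covered(u, v):
        if (u, v) in H or (u, v) in L:
            return True
        hubs = {w for (a, w) in E if a == u}        # candidate hubs: followers of u
        return any((u, w) in H and (w, v) in L for w in hubs)
    return all(covered(u, v) for u, v in E)

if __name__ == "__main__":
    E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
    H = {("art", "charlie")}
    L = {("charlie", "billie")}
    print(satisfies_theorem1(E, H, L))   # True: Art -> Billie goes through Charlie
```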
Even considering only the solution space restricted by Theo-
rem 1, Problem 1 is NP-hard. The proof, which uses a reduction
from the SETCOVER problem, is omitted due to lack of space.
Theorem 2 The DISSEMINATION problem is NP-hard.
So far we have considered systems where data-store servers react
only to client operations. We can call data stores that only react to
user requests passive stores. Some data-store middleware enables
data-store servers to propagate information among each other too.
We generalize our result by considering a more general class of
systems called active stores, where request schedules do not only
include push and pull sets, but also propagation sets that are defined
as follows:
Definition 5 (Propagation sets) Each edge w → u is associated
with a propagation set P_u(w) ⊆ V, which contains users who are
common subscribers of u and w. If the view of u stores for the first
time an event e produced by w, the data-store server pushes e to
the view of every user v ∈ P_u(w).
We restrict the propagation of events to their subscribers to guar-
antee that a view only contains events from friends of the corre-
sponding user. We only consider active policies where data stores
take actions synchronously, when they receive requests. Some data
stores can push events asynchronously and periodically: all up-
dates received over the same period are accumulated and consid-
ered as a single update. Such schedules can be modeled as syn-
chronous schedules having an upper bound on the production rates,
determined based on the accumulation period and the communica-
tion latency between servers. Longer accumulation periods reduce
throughput cost but also increase staleness, which can be problem-
atic for highly interactive social networking applications.
The only difference between active and passive schedules is that
the former can determine chains of pushes u → w_1 → . . . → w_k.
However, a chain of this form can be simulated in passive stores
by adding each edge u → w_i to H, resulting in lower or equal
latency and equal cost. This is formally shown by the following
equivalence result. The proof is omitted for lack of space.
Theorem 3 Any schedule of an active-propagation policy can be
simulated by a schedule of a passive-propagation policy with no
greater cost.
This result implies that we do not need to consider active propa-
gation in our analysis.
3. ALGORITHMS
This section introduces two algorithms to solve the DISSEMINA-
TION problem. We have shown that the problem is NP-hard, so
we propose an approximation algorithm, called CHITCHAT, and a
more scalable parallel heuristic, called PARALLELNOSY.
3.1 The CHITCHAT approximation algorithm
In this section we describe our approximation algorithm for the
DISSEMINATION problem, which we name CHITCHAT. Not sur-
prisingly, since the DISSEMINATION problem asks to find a sched-
ule that covers all the edges in the network, our algorithm is based
on the solution used for the SETCOVER problem.
For completeness we recall the SETCOVER problem: We are
given a ground set T and a collection C = {A_1, . . . , A_m} of sub-
sets of T, called candidates, such that ∪_i A_i = T. Each set A
in C is associated with a cost c(A). The goal is to select a sub-
collection S ⊆ C that covers all the elements in the ground set,
i.e., ∪_{A ∈ S} A = T, and the total cost Σ_{A ∈ S} c(A) of the sets in the
collection S is minimized.
For the SETCOVER problem, the following simple greedy algo-
rithm is folklore [5]: Initialize S = ∅ to keep the iteratively grow-
ing solution, and Z = T to keep the uncovered elements of T.
Then as long as Z is not empty, select the set A ∈ C that mini-
mizes the cost per uncovered element c(A)/|A ∩ Z|, add the set A to
the solution (S ← S ∪ {A}) and update the set of uncovered
elements (Z ← Z \ A). It can be shown [5] that this greedy algorithm
achieves a solution with approximation guarantee O(log ∆), where
∆ = max{|A|} is the size of the largest set in the collection C. At
the same time, this logarithmic guarantee is essentially the best one
can hope for, since Feige showed that the problem is not approx-
imable within (1 − o(1)) ln n, unless NP has quasi-polynomial
time algorithms [7].

Figure 3: A hub-graph G(X, w, Y) used in the mapping of DISSEMINATION
to SETCOVER problem. Solid edges must be served with a push
(if they point to w) or a pull (if they point from w). Dashed
edges are covered indirectly.
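For reference, the folklore greedy procedure described above looks as follows in executable form: repeatedly pick the candidate with the smallest cost per still-uncovered element. In CHITCHAT the candidate family is never materialized; an oracle produces the best candidate at every step, but the outer loop has this shape. The toy instance at the bottom is made up.

```python
def greedy_set_cover(ground_set, candidates, cost):
    """Greedy SETCOVER; assumes the candidates jointly cover the ground set."""
    Z = set(ground_set)                  # still-uncovered elements
    S = []                               # chosen candidates
    while Z:
        best = min((A for A in candidates if A & Z),
                   key=lambda A: cost[A] / len(A & Z))
        S.append(best)
        Z -= best
    return S

if __name__ == "__main__":
    T = {1, 2, 3, 4}
    A1, A2, A3 = frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({4})
    print(greedy_set_cover(T, [A1, A2, A3], {A1: 3.0, A2: 1.0, A3: 0.5}))
    # [frozenset({3, 4}), frozenset({1, 2, 3})]
```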
The goal of our SETCOVER variant is to identify request sched-
ules that optimize the DISSEMINATION problem. The ground set
to be covered consists of all edges in the social graph. The solution
space we identified in Section 2 indicates that the collection C con-
tains two kinds of subsets: edges that are served directly, and edges
that are served through a hub. Serving an edge u → v ∈ E directly
through a push or a pull corresponds to covering it using a singleton
subset {u → v} ∈ C. The algorithm chooses between push and
pull according to the hybrid strategy of Silberstein et al. [11]. A
hub like the one of Figure 2(a) is a subset that covers three edges
using a push and a pull; the third edge is served indirectly. Every
time the algorithm selects a candidate from C, it adds the required
push and pull edges to the solution, the request schedule (H, L).
A straightforward application of the greedy algorithm described
above has exponential time complexity. The iterative step of the al-
gorithm must select a candidate from C, which has exponential car-
dinality because it contains all possible hubs. To our rescue comes a
well-known property about applying the greedy algorithm for solv-
ing the SETCOVER problem: a sufficient condition for applying the
greedy algorithm on SETCOVER is to have a polynomial-time or-
acle for selecting the set with the minimum cost-per-element. The
oracle can be invoked at every iterative step in order to find an (ap-
proximate) solution of the SETCOVER problem without materializ-
ing all elements of C. This makes the cardinality of C irrelevant.
The algorithmic challenge of CHITCHAT is finding a polynomial
time oracle for the DISSEMINATION problem. One key idea of
CHITCHAT is to split the oracle problem in two sub-problems, both
to be solved in polynomial time.
The first sub-problem is adding to C, for each node w, the hub-
graph centered on w that covers the largest number of edges for the
lowest cost. A hub-graph centered on w is a generalization of the
sub-graph of Figure 2(a), as depicted in Figure 3. It is a sub-graph
of the social graph where X is a set of nodes that w subscribes to, and
Y is a set of nodes that subscribe to w. We refer to such hub-graphs
using the notation G(X, w, Y ).
The second sub-problem is selecting the best candidate of C.
This is now simple since C contains a linear number of hub-graph
elements and a quadratic number of singleton edges. If a hub-graph
is selected, the edges from all nodes in X to w are set to be push,
and the edges from w to all nodes in Y are set to be pull. All edges
between nodes of X and Y are covered indirectly.
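The sketch below spells out, following the description above, what selecting a hub-graph G(X, w, Y) contributes to the schedule: push edges from every node of X into the hub w, pull edges from w to every node of Y, and indirect coverage of the social edges going from X to Y. Names are illustrative.

```python
def apply_hub_graph(X, w, Y, E, H, L):
    """Add the push/pull edges of the hub-graph G(X, w, Y) and report covered edges."""
    H |= {(x, w) for x in X}                     # pushes into the hub view
    L |= {(w, y) for y in Y}                     # pulls out of the hub view
    covered = {(x, w) for x in X} | {(w, y) for y in Y}
    covered |= {(x, y) for x in X for y in Y if (x, y) in E}   # served indirectly
    return H, L, covered

if __name__ == "__main__":
    E = {("art", "charlie"), ("charlie", "billie"), ("art", "billie")}
    H, L, covered = apply_hub_graph({"art"}, "charlie", {"billie"}, E, set(), set())
    print(sorted(covered))   # all three edges of E are covered
```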
The first sub-problem, finding the hub-graph centered in a given
node that covers most edges with lowest cost, is an interesting op-
timization problem in itself. In order to define the sub-problem,
we associate to each node u of a hub-graph a weight g(u) reflect-
ing the cost of u. We set g(x) = r_p(x) for all x ∈ X, that is,
the cost of a push operation from x to w is associated to node x.
Similarly we associate the weight g(y) = r_c(y) for each y ∈ Y.
For the hub node w, we set g(w) = 0. Let W and E(W) be
the set of nodes and edges of the hub-graph, respectively, and let
g(W) = Σ_{u ∈ W} g(u). The cost-per-element of the hub-graph is:

p(W) = g(W) / |E(W)|.   (1)
The sub-problem can thus be formulated as finding, for each node
w of the social graph, the hub-graph (W, E(W )) centered on w
that minimizes p(W ).
Careful inspection of Equation (1) motivates us to consider the
following problem.
Problem 2 (DENSESTSUBGRAPH) Let G = (V, E) be a graph.
For a set S ⊆ V, E(S) denotes the set of edges of G between
nodes of S. The DENSESTSUBGRAPH problem asks to find the
subset S that maximizes the density function d(S) = |E(S)| / |S|.
If we weight the nodes of S using the g function defined above,
we can obtain a weighted variant of this problem by replacing the
density function d(S) with d_w(S) = |E(S)| / g(S).
Let G_w be the largest hub-graph centered in a node w, the one
where X and Y include all producers and consumers of w, respec-
tively. Any subgraph (S, E(S)) of G_w that maximizes d_w(S) min-
imizes p(S). Therefore, any solution of the weighted version of
DENSESTSUBGRAPH will give us the hub-graph centered on w to
be included in C.
Interestingly, although many variants of dense-subgraph prob-
lems are NP-hard, Problem 2 can be solved exactly in polynomial
time. Given that we are looking for a solution of the SETCOVER
problem with a logarithmic approximation factor, we settle for the
simple greedy algorithm analyzed by Asahiro et al. [1] and later
by Charikar [3]. This algorithm gives a 2-factor approximation for
Problem 2, and its running time is linear in the number of edges
in the graph. The algorithm is the following. Start with the whole
graph. Until left with an empty graph, iteratively remove the node
with the lowest degree (breaking ties arbitrarily) and all its incident
edges. Among all subgraphs considered during the execution of the
algorithm return the one with the maximum density.
The above algorithm works for the case that the density of a sub-
graph is d(S). In our case we want to maximize the weighted-
density function d_w(S). Thus we modify the greedy algorithm of
Asahiro et al. and Charikar as follows. In each iteration, instead of
deleting the node with the lowest degree, we delete the node that
minimizes a notion of weighted degree, defined as d_g(u) = d(u) / g(u),
where d(u) is the normal notion of degree of node u. We can show
that this modified algorithm yields a factor-2 approximation for the
weighted version of the DENSESTSUBGRAPH problem.
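Below is a compact sketch of the modified peeling heuristic just described: repeatedly delete the node with the smallest weighted degree d(u)/g(u) and remember the intermediate subgraph with the best weighted density |E(S)|/g(S). The lazy heap and the tiny positive weight standing in for g(w) = 0 are conveniences of this sketch, not details taken from the paper.

```python
import heapq

def weighted_densest_subgraph(nodes, edges, g):
    """Greedy peeling for the weighted density |E(S)| / g(S) (edges taken as undirected)."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive, m = set(nodes), len(edges)
    best, best_density = set(alive), m / sum(g[u] for u in alive)

    heap = [(len(adj[u]) / g[u], u) for u in alive]   # weighted degrees d(u)/g(u)
    heapq.heapify(heap)
    while len(alive) > 1:
        _, u = heapq.heappop(heap)
        if u not in alive:
            continue                                   # stale heap entry
        alive.discard(u)
        m -= len(adj[u] & alive)
        for v in adj[u] & alive:                       # refresh affected degrees
            heapq.heappush(heap, (len(adj[v] & alive) / g[v], v))
        gw = sum(g[x] for x in alive)
        if gw > 0 and m / gw > best_density:
            best, best_density = set(alive), m / gw
    return best, best_density

if __name__ == "__main__":
    nodes = ["w", "a", "b", "c"]
    edges = [("a", "w"), ("w", "b"), ("a", "b"), ("w", "c")]
    g = {"w": 1e-9, "a": 1.0, "b": 1.0, "c": 5.0}   # hub weight ~0, as in CHITCHAT
    print(weighted_densest_subgraph(nodes, edges, g))  # keeps {w, a, b}, drops the costly c
```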
Lemma 1 Given a graph G_w = (S, E(S)), there exists a linear-
time algorithm solving the weighted variant of the DENSESTSUB-
GRAPH problem within an approximation factor of 2.
PROOF. We prove the lemma by modifying the analysis of Cha-
rikar [3]. Let f(S) = |E(S)| / g(S) be the objective function to optimize,

Citations
Posted Content
TL;DR: This work introduces the vertex interplay (VI) and edge interplay (EI) plots to characterize the interplay between core and truss decompositions, and devises CORE-TRUSSDD, an anomaly detection algorithm that provides an efficient solution to retrieve the outliers in the networks, which correspond to the two anomalous behaviors.
Abstract: Finding the dense regions in a graph is an important problem in network analysis. Core decomposition and truss decomposition address this problem from two different perspectives. The former is a vertex-driven approach that assigns density indicators for vertices whereas the latter is an edge-driven technique that put density quantifiers on edges. Despite the algorithmic similarity between these two approaches, it is not clear how core and truss decompositions in a network are related. In this work, we introduce the vertex interplay (VI) and edge interplay (EI) plots to characterize the interplay between core and truss decompositions. Based on our observations, we devise CORE-TRUSSDD, an anomaly detection algorithm to identify the discrepancies between core and truss decompositions. We analyze a large and diverse set of real-world networks, and demonstrate how our approaches can be effective tools to characterize the patterns and anomalies in the networks. Through VI and EI plots, we observe distinct behaviors for graphs from different domains, and identify two anomalous behaviors driven by specific real-world structures. Our algorithm provides an efficient solution to retrieve the outliers in the networks, which correspond to the two anomalous behaviors. We believe that investigating the interplay between core and truss decompositions is important and can yield surprising insights regarding the dense subgraph structure of real-world networks.

2 citations


Cites methods from "Piggybacking on social networks"

  • ...Dense regions are also used to improve the efficiency of computation heavy tasks like distance query computation [7] and materialized per-user view creation [8]....

    [...]

Journal ArticleDOI
TL;DR: This tutorial helps researchers have a better understanding of existing densest subgraph models and solutions, but also provides them insights for future study.
Abstract: As one of the most fundamental problems in graph data mining, the densest subgraph discovery (DSD) problem has found a broad spectrum of real applications, such as social network community detection, graph index construction, regulatory motif discovery in DNA, fake follower detection, and so on. Theoretically, DSD closely relates to other fundamental graph problems, such as network flow and bipartite matching. Triggered by these applications and connections, DSD has garnered much attention from the database, data mining, theory, and network communities. In this tutorial, we first highlight the importance of DSD in various applications and the unique challenges that need to be addressed. Subsequently, we classify existing DSD solutions into several groups, which cover around 50 research papers published in many well-known venues (e.g., SIGMOD, PVLDB, TODS, WWW), and conduct a thorough review of these solutions in each group. Afterwards, we analyze and compare the models and solutions in these works. Finally, we point out a list of promising future research directions. We believe that this tutorial not only helps researchers have a better understanding of existing densest subgraph models and solutions, but also provides them insights for future study.

1 citation

Proceedings ArticleDOI
20 May 2018
TL;DR: This work proposes a novel scheme by exploiting the social community for event stream dissemination by fully exploiting the proposed hub-structure, based on the observation of high cluster coefficient in OSNs.
Abstract: In large-scale Online Social Network (OSN) systems, event stream dissemination incurs costly inter-server communication due to the per-user view data storage. To solve the problem, existing schemes commonly leverage the social graph structures to save redundant inter-server traffics across social links. The state-of-the-art social piggyback scheme reduces the inter-server traffics by fully exploiting the proposed hub-structure, based on the observation of high cluster coefficient in OSNs. In order to find the best hub-structure, however, such a scheme needs to identify the global densest sub-graph by iteratively removing the node with the minimum weighted degree. Such a process causes a worst computation cost of O(n²), making the social piggyback scheme unscalable for real-world large-scale OSN graphs. In this work, we propose a novel scheme by exploiting the social community for event stream dissemination. We first detect the social communities in a social graph by using an efficient community detection algorithm based on distance dynamics. For each community, we then design a heuristics algorithm to fully leverage the hub-structure. The heuristics algorithm explores the hub-structure center on the node with maximum degree in each iteration. We collect large-scale datasets from DBLP and Facebook and conduct comprehensive experiments to evaluate our design. The results show that our design significantly reduces the communication overhead and computing time by 40.71% and 81.21% compared to existing schemes, respectively.

Cites background or methods from "Piggybacking on social networks"

  • ...Gionis et al. demonstrate that finding the best piggyback assignment is NP-hard [7], due to the large solution space of assigning the push, the pull, and the piggyback strategy to all links in a large-scale social graph....

    [...]

  • ...We compare our algorithm with the state-of-the-art CHITCHAT algorithm [7] and the hybrid scheme [15]....

    [...]

  • ...[7], which takes full advantage of piggyback structure to save the communica-...

    [...]

  • ...The overall communication overhead [7] of the schedule (H,L) can be computed by Eq....

    [...]

  • ...[7] further propose the CHITCHAT algorithm to find the densest hub-structure with as many piggyback structures as...

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of finding the node set that is most likely to induce a densest subgraph in an uncertain graph and proposed sampling-based efficient algorithms to compute the MPDS.
Abstract: Computing the densest subgraph is a primitive graph operation with critical applications in detecting communities, events, and anomalies in biological, social, Web, and financial networks. In this paper, we study the novel problem of Most Probable Densest Subgraph (MPDS) discovery in uncertain graphs: Find the node set that is the most likely to induce a densest subgraph in an uncertain graph. We further extend our problem by considering various notions of density, e.g., clique and pattern densities, studying the top-k MPDSs, and finding the node set with the largest containment probability within densest subgraphs. We show that it is #P-hard to compute the probability of a node set inducing a densest subgraph. We then devise sampling-based efficient algorithms, with end-to-end accuracy guarantees, to compute the MPDS. Our thorough experimental results and real-world case studies on brain and social networks validate the effectiveness, efficiency, and usefulness of our solution.
Journal ArticleDOI
TL;DR: In this article, the authors examined the impact of social learning on consumption and production decisions in a societal context, and emphasized the need for a more rational and informed decision-making process in promoting a sustainable future.
Abstract: This study examines the impact of social learning on consumption and production decisions in a societal context. Individuals learn the actual value of nature through information and subsequent network communication, which is illustrated using the Directed Graph theory and DeGroot social learning process. In this context, individuals with greater access to private information are called "neighbours." Results suggest that in a perfectly rational scenario, individuals have high confidence in their abilities and base their decisions on a combination of personal experience, perception, and intellect; thus, society is expected to converge towards making responsible consumption choices $ {\mathrm{R}}_{\mathrm{c}}^{\mathrm{*}} $. However, when individuals are bounded or irrational, they exhibit persuasion bias or stubbornness, and diversity, independence, and decentralization are lacking. It leads to a situation where the consumption network lacks wisdom and may never result in responsible consumption choices. Thus finite, uniformly conspicuous neighbours will swiftly converge towards the opinion of the group. When a large proportion of individuals consume excessively (extravagance) or below the optimal level (misery), the consumption network is dominated by unwise decision-makers, leading to a society that prevents promoting sustainability. In conclusion, this study emphasizes the need for a more rational and informed decision-making process in promoting a sustainable future.
References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations


"Piggybacking on social networks" refers methods in this paper

  • ...PARALLELNOSY does not have the approximation guarantee of CHITCHAT, but it is a parallel algorithm that can be implemented as a MapReduce job and thus scales to real-size social graphs....

    [...]

  • ...Phase 2 is executed by the reduce phase of MapReduce, where each reducer receives all lock requests for a given edge u → v....

    [...]

  • ...We now describe in more detail the issues pertaining to the MapReduce implementation; we assume that the reader is familiar with the MapReduce architecture....

    [...]

  • ...Therefore, our implementation uses a pull approach and two MapReduce jobs: in the first job, hub-graphs having u → v as cross-edge send a notification to the hub-graphs centered in u and v saying that they are interested in updates to u → v. Updates for the edge are propagated only if they are indeed available....

    [...]

  • ...For the twitter graph, the amount of memory used by individual MapReduce workers exceeds in some cases the RAM capacity allocated to these workers, which is 1GB....

    [...]

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
TL;DR: Developments in this field are reviewed, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
Abstract: Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.

17,647 citations


"Piggybacking on social networks" refers background in this paper

  • ...Our main observation is that the high clustering coefficient of social networks implies the presence of many hubs, making hub-based schedules very efficient [10]....

    [...]

Proceedings Article
16 May 2010
TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveals very little about the influence of a user.
Abstract: Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others — a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveals very little about the influence of a user.

3,041 citations


Additional excerpts

  • ...[2]....

    [...]

Journal ArticleDOI
TL;DR: It is proved that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless NP has slightly superpolynomial time algorithms.
Abstract: Given a collection ℱ of subsets of S = {1,…,n}, set cover is the problem of selecting as few as possible subsets from ℱ such that their union covers S, and max k-cover is the problem of selecting k subsets from ℱ such that their union has maximum cardinality. Both these problems are NP-hard. We prove that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless NP has slightly superpolynomial time algorithms. This closes the gap (up to low-order terms) between the ratio of approximation achievable by the greedy algorithm (which is (1 - o(1)) ln n), and previous results of Lund and Yannakakis, that showed hardness of approximation within a ratio of (log₂ n)/2 ≃ 0.72 ln n. For max k-cover, we show an approximation threshold of (1 - 1/e) (up to low-order terms), under the assumption that P ≠ NP.

2,941 citations


"Piggybacking on social networks" refers background in this paper

  • ...At the same time, this logarithmic guarantee is essentially the best one can hope for, since Feige showed that the problem is not approximable within (1 − o(1)) lnn, unless NP has quasi-polynomial time algorithms [7]....

    [...]

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Piggybacking on social networks"?

The authors propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. The authors show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. The authors propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. The authors also evaluate their algorithms on a real social networking system prototype and they show that the actual increase in throughput corresponds nicely to the gain anticipated by their cost function.