TL;DR: It is shown that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem, and an O(log n) approximation algorithm and a heuristic are proposed to solve the problem.
Abstract: The popularity of social-networking sites has increased rapidly over the last decade. A basic functionality of social-networking sites is to present users with streams of events shared by their friends. At a systems level, materialized per-user views are a common way to assemble and deliver such event streams on-line and with low latency. Access to the data stores, which keep the user views, is a major bottleneck of social-networking systems. We propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. By using one such hub view, the system can serve requests of the first friend without querying or updating the view of the second. We show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. We propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. Compared to existing approaches, using social piggybacking results in similar throughput in systems with few servers, but enables substantial throughput improvements as the size of the system grows, reaching up to a 2-factor increase. We also evaluate our algorithms on a real social networking system prototype and we show that the actual increase in throughput corresponds nicely to the gain anticipated by our cost function.
Social networking sites have become highly popular in the past few years.
To put their work in context and to motivate their problem definition, the authors describe the typical architecture of social networking systems, and they discuss the process of assembling event streams.
The collection of push and pull sets for each user of the system is called a request schedule, and it has a strong impact on performance.
CHITCHAT and PARALLELNOSY assume that the graph is static; however, using a simple incremental technique, request schedules can be efficiently adapted when the social graph is modified.
2. SOCIAL DISSEMINATION PROBLEM
Dissemination must satisfy bounded staleness, a property modeling the requirement that event streams shall show events almost in real time.
The authors then show that the only request schedules satisfying bounded staleness let each pair of users communicate either using direct push, or direct pull, or social piggybacking.
Finally, the authors analyze the complexity of the social-dissemination problem and show that their results extend to more complex system models with active stores.
2.1 System model
For the purpose of their analysis, the authors do not distinguish between nodes in the graph, the corresponding users, and their materialized views.
Event streams and views consist of a finite list of events, filtered according to application-specific relevance criteria.
In the system of Figure 2, the request schedule determines which edges of the social graph are included in the push and pull sets of any user.
The workload is characterized by the production rate rp(u) and the consumption rate rc(u) of each user u.
Note that the cost of updating and querying a user’s own view is not represented in the cost metric because it is implicit.
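To make this cost metric concrete, the following sketch (illustrative code, not the authors' implementation; function and variable names are assumptions) computes the cost of a request schedule from per-edge push/pull assignments and the per-user rates.

```python
# Minimal sketch (not the authors' code; names are illustrative assumptions) of
# the per-edge cost model: an edge u -> v assigned to the push set costs rp(u)
# view updates per time unit, while an edge assigned to the pull set costs
# rc(v) view queries per time unit. The cost of a user's own view is omitted,
# as noted in the text above.

def schedule_cost(push_edges, pull_edges, rp, rc):
    """push_edges, pull_edges: iterables of (producer, consumer) pairs.
    rp, rc: dicts mapping each user to its production / consumption rate."""
    push_cost = sum(rp[u] for (u, v) in push_edges)  # updates to consumers' views
    pull_cost = sum(rc[v] for (u, v) in pull_edges)  # queries to producers' views
    return push_cost + pull_cost

# Toy example: a single edge a -> b served by push versus pull.
rp = {"a": 0.2, "b": 1.0}
rc = {"a": 5.0, "b": 3.0}
print(schedule_cost([("a", "b")], [], rp, rc))  # push: costs rp[a] = 0.2
print(schedule_cost([], [("a", "b")], rp, rc))  # pull: costs rc[b] = 3.0
```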
2.2 Problem definition
The authors now define the problem that they address in this paper.
The authors propose solving the DISSEMINATION problem using social piggybacking, that is, making two nodes communicate through a third common contact, called a hub.
Since the user w1 may remain idle for an arbitrarily long time, one cannot guarantee bounded staleness.
The authors call data stores that only react to user requests passive stores.
This is formally shown by the following equivalence result.
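To picture the piggybacking idea, the sketch below (an illustration with assumed names, not the paper's code) compares the direct cost of serving an edge u → v with serving it through a common hub h that u already pushes to and v already pulls from.

```python
# Illustrative sketch (assumed names, not the paper's code): the saving from
# serving an edge u -> v through a common hub h. Serving the edge directly
# costs the cheaper of a push (rp of the producer) or a pull (rc of the
# consumer); if u already pushes to h and v already pulls from h, the edge
# piggybacks on those existing requests at no extra cost.

def direct_edge_cost(u, v, rp, rc):
    return min(rp[u], rc[v])

def piggyback_saving(u, v, h, rp, rc, push_edges, pull_edges):
    """Saving of routing u -> v via hub h, given current push/pull edge sets."""
    if (u, h) in push_edges and (h, v) in pull_edges:
        return direct_edge_cost(u, v, rp, rc)  # the whole direct cost is saved
    return 0.0

rp = {"u": 0.5, "h": 0.1, "v": 2.0}
rc = {"u": 1.0, "h": 0.2, "v": 4.0}
push_edges, pull_edges = {("u", "h")}, {("h", "v")}
print(piggyback_saving("u", "v", "h", rp, rc, push_edges, pull_edges))  # 0.5
```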
3. ALGORITHMS
This section introduces two algorithms to solve the DISSEMINATION problem.
Every time the algorithm selects a candidate from C, it adds the required push and pull edges to the solution, the request schedule (H,L).
Given that the authors are looking for a solution of the SETCOVER problem with a logarithmic approximation factor, they settle for the simple greedy algorithm analyzed by Asahiro et al. [1] and later by Charikar [3].
The authors can show that this modified algorithm yields a factor-2 approximation for the weighted version of the DENSESTSUBGRAPH problem.
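The greedy peeling strategy referenced here is well documented in the densest-subgraph literature; the following is a plausible rendering of its weighted variant (illustrative code, not the authors' implementation).

```python
import heapq

# Plausible sketch (not the authors' implementation) of greedy peeling for the
# weighted densest subgraph problem: repeatedly remove the vertex with the
# smallest weighted degree and remember the intermediate subgraph with the
# highest density (total edge weight divided by number of vertices). The
# unweighted version is the classic factor-2 approximation of Charikar.

def weighted_densest_subgraph(edges):
    """edges: dict mapping frozenset({u, v}) -> positive weight."""
    adj = {}
    for e, w in edges.items():
        u, v = tuple(e)
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    deg = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    total_w = sum(edges.values())
    alive = set(adj)
    heap = [(d, u) for u, d in deg.items()]
    heapq.heapify(heap)

    best_set = set(alive)
    best_density = total_w / len(alive) if alive else 0.0
    while alive:
        d, u = heapq.heappop(heap)
        if u not in alive:
            continue                      # stale heap entry
        alive.remove(u)
        for v, w in adj[u].items():       # drop u and its incident weight
            if v in alive:
                total_w -= w
                deg[v] -= w
                heapq.heappush(heap, (deg[v], v))
        if alive and total_w / len(alive) > best_density:
            best_density, best_set = total_w / len(alive), set(alive)
    return best_set, best_density

edges = {frozenset({"a", "b"}): 3.0, frozenset({"b", "c"}): 1.0,
         frozenset({"a", "c"}): 2.0, frozenset({"c", "d"}): 0.5}
print(weighted_densest_subgraph(edges))  # ({'a', 'b', 'c'}, 2.0)
```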
In the edge locking phase, each candidate hub-graph tries to lock its edges.
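The edge-locking phase can be pictured with the single-process simulation below; the lock-granting rule (highest gain wins) is an assumption, not necessarily the rule used in the paper's MapReduce implementation.

```python
from collections import defaultdict

# Minimal sketch of the edge-locking phase, simulated without a MapReduce
# cluster. Map step: each candidate hub-graph emits a lock request for every
# edge it wants to cover. Reduce step: for each edge, all requests are
# collected and a single winner is granted the lock (here: the candidate with
# the highest gain, an assumed tie-breaking rule). A candidate is applied only
# if it wins the locks on all of its edges.

def map_phase(candidates):
    """candidates: dict hub_id -> {'edges': set of (u, v), 'gain': float}."""
    for hub_id, cand in candidates.items():
        for edge in cand["edges"]:
            yield edge, (cand["gain"], hub_id)      # key: edge, value: request

def reduce_phase(requests):
    locks = defaultdict(list)
    for edge, req in requests:
        locks[edge].append(req)
    return {edge: max(reqs)[1] for edge, reqs in locks.items()}  # winner per edge

def winning_candidates(candidates):
    winners = reduce_phase(map_phase(candidates))
    return [h for h, c in candidates.items()
            if all(winners[e] == h for e in c["edges"])]

candidates = {
    "hub_a": {"edges": {("u", "a"), ("a", "v")}, "gain": 3.0},
    "hub_b": {"edges": {("u", "b"), ("a", "v")}, "gain": 2.0},  # conflicts on (a, v)
}
print(winning_candidates(candidates))  # ['hub_a']
```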
3.3 Incremental updates
PARALLELNOSY and CHITCHAT optimize a static social graph.
Over time, graph updates cause the quality of the dissemination schedule to degrade, so their algorithms can be executed periodically to re-optimize the cost.
The experimental evaluation of Section 4 indicates that their algorithm does not need to be re-executed frequently.
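One plausible incremental rule is sketched below purely as an assumption (the paper's exact technique is not reproduced here): a newly added edge first looks for an existing hub and otherwise falls back to the cheaper of a direct push or pull.

```python
# Illustrative sketch of one plausible incremental rule (an assumption, not
# necessarily the authors' technique): when a new edge u -> v appears, look for
# an existing hub h such that u already pushes to h and v already pulls from h;
# if one exists, the edge piggybacks for free, otherwise fall back to the
# cheaper of a direct push or a direct pull.

def add_edge(u, v, rp, rc, push_edges, pull_edges):
    for (x, h) in push_edges:
        if x == u and (h, v) in pull_edges:
            return ("piggyback", h, 0.0)          # served via existing hub h
    if rp[u] <= rc[v]:
        push_edges.add((u, v))
        return ("push", None, rp[u])
    pull_edges.add((u, v))
    return ("pull", None, rc[v])

rp = {"u": 0.3, "h": 0.1, "v": 2.0}
rc = {"u": 1.0, "h": 0.5, "v": 4.0}
push, pull = {("u", "h")}, {("h", "v")}
print(add_edge("u", "v", rp, rc, push, pull))   # ('piggyback', 'h', 0.0)
```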
4. EVALUATION
The authors evaluate the throughput performance of the proposed algorithm, contrasting it against the best available scheduling algorithm, the hybrid policy of Silberstein et al. [11].
The authors' evaluation is both analytical, considering their cost metric of Section 2.1, and experimental, using measurements on a social networking system prototype.
The authors show that the PARALLELNOSY heuristic scales to real-world social graphs and doubles the throughput of social networking systems compared to hybrid schedules.
On a real prototype, PARALLELNOSY provides throughput similar to that of hybrid schedules when the system is composed of a few servers; as the system grows, the throughput improvement becomes more evident, approaching the 2-factor analytical improvement.
The authors also evaluate the relative performance of the two proposed algorithms PARALLELNOSY and CHITCHAT.
4.1 Input data
The authors obtain datasets from two social graphs: flickr, as of April 2008, and twitter, as of August 2009.
The twitter graph has been made available by Cha et al. [2].
The authors' algorithms also require input workloads: production and consumption rates for all the nodes in the network.
It has been observed by Huberman et al. that nodes with many followers tend to have a higher production rate, and nodes following many other nodes tend to have a higher consumption rate [8].
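A simple way to mirror this observation in a synthetic workload is sketched below; the proportional form and the constants are assumptions, not the paper's exact model.

```python
# Illustrative sketch (the proportional form and constants are assumptions):
# derive per-node production and consumption rates from the follow graph, so
# that production grows with the number of followers and consumption grows
# with the number of followees.

def degree_based_rates(follows, alpha=0.01, beta=0.05):
    """follows: iterable of (follower, followee) pairs."""
    followers, followees = {}, {}
    for follower, followee in follows:
        followers[followee] = followers.get(followee, 0) + 1
        followees[follower] = followees.get(follower, 0) + 1
    users = set(followers) | set(followees)
    rp = {u: alpha * followers.get(u, 0) for u in users}  # production rate
    rc = {u: beta * followees.get(u, 0) for u in users}   # consumption rate
    return rp, rc

rp, rc = degree_based_rates([("a", "b"), ("c", "b"), ("b", "a")])
print(rp, rc)
```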
4.2 Social piggybacking on large social graphs
The authors run their MapReduce implementation of the PARALLELNOSY heuristic on the full twitter and flickr graphs.
As discussed in Section 3.2, very large social graphs may contain millions of cross-edges for a single hub-graph.
The authors quantify the performance of their algorithms by measuring their throughput compared against a baseline.
For both social graphs, the throughput of the PARALLELNOSY schedule increases sharply during the first iterations and it quickly stabilizes.
The larger stabilization time for twitter is due to the incremental detection of cross-edges at every cycle, as discussed before.
4.3 Prototype performance
In the previous section the authors evaluated their algorithms in terms of the predicted cost function that the algorithms optimize.
When processing a user query, application servers send at most one query per data store server s, which replies with a list of events filtered from all views v ∈ l stored by s; the authors call this technique batching.
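The batching step can be pictured with the following sketch (illustrative names, not the prototype's code): the views to read are grouped by the data-store server that hosts them, so each server receives a single query.

```python
from collections import defaultdict

# Minimal sketch of query batching: instead of querying each view separately,
# the application server groups the views it needs by the data-store server
# that hosts them and sends one batched query per server.

def batched_queries(views_to_read, view_to_server):
    """views_to_read: views needed to assemble one event stream.
    view_to_server: dict mapping view id -> data-store server id."""
    batches = defaultdict(list)
    for view in views_to_read:
        batches[view_to_server[view]].append(view)
    return dict(batches)          # one query per server, listing its views

placement = {"v1": "s1", "v2": "s1", "v3": "s2"}
print(batched_queries(["v1", "v2", "v3"], placement))
# {'s1': ['v1', 'v2'], 's2': ['v3']}  -> 2 queries instead of 3
```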
Using data partitioning information as input of the DISSEMINATION problem is attractive, but has two main drawbacks.
The authors found that, if the network does not become a bottleneck, the overall throughput using n clients and n servers is about n times the per-client throughput with n servers.
Note that, since the y axis is logarithmic, the divergence between the algorithms and the error bars on the right side of the graph are magnified.
4.4 The potential of social piggybacking
The previous experiments show that PARALLELNOSY is an effective heuristic for real-world large-scale social networking systems.
In the experiments discussed below the authors use five graph samples; the plots report averages.
As for random-walk sampling, existing work has pointed out that it preserves certain clustering metrics; more precisely, in both the original and sampled graphs, nodes with the same degree have a similar ratio of actual to potential edges between their neighbors [9].
This reduces the relative gain of social piggybacking since the hybrid schedule of Silberstein et al. (our baseline) uses per-edge optimizations that do not depend on the degree of nodes.
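For reference, a generic random-walk sampler looks roughly like the sketch below (a standard formulation with assumed parameters, not necessarily the exact procedure used in the paper).

```python
import random

# Generic random-walk sampling sketch: walk the undirected graph from a random
# start node, occasionally restarting, and keep the subgraph induced by the
# visited nodes.

def random_walk_sample(adj, sample_size, restart_prob=0.15, seed=0,
                       max_steps=1_000_000):
    """adj: dict node -> list of neighbours (undirected)."""
    rng = random.Random(seed)
    start = current = rng.choice(list(adj))
    visited, steps = {start}, 0
    while len(visited) < sample_size and steps < max_steps:
        steps += 1
        if rng.random() < restart_prob or not adj[current]:
            current = start                       # restart the walk
        else:
            current = rng.choice(adj[current])    # follow a random edge
        visited.add(current)
    # return the subgraph induced by the visited nodes
    return {u: [v for v in adj[u] if v in visited] for u in visited}

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(random_walk_sample(adj, sample_size=3))
```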
5. RELATED WORK
Similar to the MIN-COST problem of Silberstein et al. [11], their DISSEMINATION problem takes as input the consumption and production rates of users, together with the social network, and uses these rates in the definition of the cost function.
This enables taking advantage of the high clustering coefficient of social graphs and leads to substantial gains, as shown by their evaluation.
Users contact only their own views for queries.
In terms of throughput cost, SPAR uses a push-all schedule (see Section 1), which, as shown in [11], is never more efficient than the hybrid schedule the authors used as their baseline.
In their case, the social graph is given, the mapping of users to physical servers is not known, the authors minimize cost based on scheduling decisions and production and consumption rates, and they consider the additional bounded staleness constraint.
6. CONCLUSION
Assembling and delivering event streams is a major feature of social networking systems and imposes a heavy load on back-end data stores.
The authors proposed two algorithms to compute request schedules that leverage social piggybacking.
The CHITCHAT algorithm is an approximation algorithm that uses a novel combination of the SETCOVER and DENSESTSUBGRAPH problems and has an approximation factor of O(ln n).
The PARALLELNOSY heuristic is a parallel algorithm that can scale to large social graphs.
In small systems, the authors obtained throughput similar to that of existing hybrid approaches, but as the size of the system grows beyond a few hundred servers, the throughput improvement becomes significant, approaching a 2-factor gain.
TL;DR: An interesting consequence of this work is that triangle counting, a well-studied computational problem in the context of social network analysis, can be used to detect large near-cliques.
Abstract: Numerous graph mining applications rely on detecting subgraphs which are large near-cliques. Since formulations that are geared towards finding large near-cliques are hard and frequently inapproximable due to connections with the Maximum Clique problem, the poly-time solvable densest subgraph problem, which maximizes the average degree over all possible subgraphs, "lies at the core of large scale data mining" [10]. However, the densest subgraph problem frequently fails to detect large near-cliques in networks. In this work, we introduce the k-clique densest subgraph problem, k ≥ 2. This generalizes the well-studied densest subgraph problem, which is obtained as a special case for k=2. For k=3 we obtain a novel formulation which we refer to as the triangle densest subgraph problem: given a graph G(V,E), find a subset of vertices S* such that τ(S*) = max_{S ⊆ V} t(S)/|S|, where t(S) is the number of triangles induced by the set S. On the theory side, we prove that for any constant k, there exists an exact polynomial-time algorithm for the k-clique densest subgraph problem. Furthermore, we propose an efficient 1/k-approximation algorithm which generalizes the greedy peeling algorithm of Asahiro and Charikar [8,18] for k=2. Finally, we show how to implement this peeling framework efficiently on MapReduce for any k ≥ 3, generalizing the work of Bahmani, Kumar and Vassilvitskii for the case k=2 [10]. On the empirical side, our two main findings are that (i) the triangle densest subgraph is consistently closer to being a large near-clique compared to the densest subgraph and (ii) the peeling approximation algorithms for both k=2 and k=3 achieve, on real-world networks, approximation ratios closer to 1 than the pessimistic 1/k guarantee. An interesting consequence of our work is that triangle counting, a well-studied computational problem in the context of social network analysis, can be used to detect large near-cliques. Finally, we evaluate our proposed method on a popular graph mining application.
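For k = 3, the peeling algorithm described in this abstract amounts to repeatedly removing the vertex contained in the fewest remaining triangles; a minimal sketch follows (an assumed implementation, not the authors' code).

```python
from itertools import combinations

# Minimal sketch of triangle-density peeling for k = 3: enumerate all triangles
# once, then repeatedly remove the vertex contained in the fewest remaining
# triangles, remembering the intermediate subgraph with the highest triangle
# density t(S)/|S|.

def triangle_densest_peeling(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops)."""
    triangles = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                triangles.add(frozenset({u, v, w}))

    tri_of = {u: {t for t in triangles if u in t} for u in adj}
    alive, live_tri = set(adj), set(triangles)
    best_set = set(alive)
    best_density = len(live_tri) / len(alive) if alive else 0.0

    while alive:
        u = min(alive, key=lambda x: len(tri_of[x] & live_tri))
        alive.remove(u)
        live_tri -= tri_of[u]          # triangles touching u disappear
        if alive and len(live_tri) / len(alive) > best_density:
            best_density, best_set = len(live_tri) / len(alive), set(alive)
    return best_set, best_density

adj = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
       "d": {"a", "b", "c", "e"}, "e": {"d"}}
print(triangle_densest_peeling(adj))  # the 4-clique {a, b, c, d}, density 1.0
```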
218 citations
Cites background from "Piggybacking on social networks"
...The DS-Problem is a powerful primitive for many graph applications including social piggybacking [27], reachability and distance query indexing [19, 37]....
TL;DR: The nucleus decomposition of a graph is defined, which represents the graph as a forest of nuclei, and provably efficient algorithms for nucleus decompositions are given, and empirically evaluate their behavior in a variety of real graphs.
Abstract: Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
89 citations
Cites methods from "Piggybacking on social networks"
...It has been used for finding communities and spam link farms in web graphs [29, 20, 13], graph visualization [2], real-time story identification [4], DNA motif detection in biological networks [18], finding correlated genes [49], epilepsy prediction [26], finding price value motifs in financial data [14], graph compression [8], distance query indexing [27], and increasing the throughput of social networking site servers [21]....
TL;DR: A new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships is presented and it is shown that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
Abstract: Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these applications exceed the capabilities of a single server, and thus their database has to be deployed in a distributed DBMS. The key factor affecting such a system's performance is how the database is partitioned. If the database is partitioned incorrectly, the number of distributed transactions can be high. These transactions have to synchronize their operations over the network, which is considerably slower and leads to poor performance. Previous work on elastic database repartitioning has focused on a certain class of applications whose database schema can be represented in a hierarchical tree structure. But many applications cannot be partitioned in this manner, and thus are subject to distributed transactions that impede their performance and scalability. In this paper, we present a new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships. Clay dynamically creates blocks of tuples to migrate among servers during repartitioning, placing no constraints on the schema but taking care to balance load and reduce the amount of data migrated. Clay achieves this goal by including in each block a set of hot tuples and other tuples co-accessed with these hot tuples. To evaluate our approach, we integrate Clay in a distributed, main-memory DBMS and show that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
82 citations
Cites result from "Piggybacking on social networks"
...The fact that a user’s frequency of activity in a social network is related to their degrees has been observed in previous work [13, 10]....
TL;DR: In this article, a new solution paradigm was proposed to find the densest subgraphs through a k-core (a kind of dense subgraph of a graph) with theoretical guarantees.
Abstract: Densest subgraph discovery (DSD) is a fundamental problem in graph mining. It has been studied for decades, and is widely used in various areas, including network science, biological analysis, and graph databases. Given a graph G, DSD aims to find a subgraph D of G with the highest density (e.g., the number of edges over the number of vertices in D). Because DSD is difficult to solve, we propose a new solution paradigm in this paper. Our main observation is that the densest subgraph can be accurately found through a k-core (a kind of dense subgraph of G), with theoretical guarantees. Based on this intuition, we develop efficient exact and approximation solutions for DSD. Moreover, our solutions are able to find the densest subgraphs for a wide range of graph density definitions, including clique-based- and general pattern-based density. We have performed extensive experimental evaluation on both real and synthetic datasets. Our results show that our algorithms are up to four orders of magnitude faster than existing approaches.
TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
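The programming model can be pictured with a toy single-process simulation such as the one below (illustrative only; it omits partitioning, scheduling, and fault tolerance).

```python
from collections import defaultdict
from itertools import chain

# Toy single-process simulation of the MapReduce programming model: map emits
# intermediate (key, value) pairs, a shuffle step groups values by key, and
# reduce merges the values for each key. The classic word-count example.

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)                       # shuffle: group by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

def word_count_map(line):
    for word in line.split():
        yield word, 1

def word_count_reduce(word, counts):
    return sum(counts)

docs = ["to be or not to be", "to do is to be"]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
```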
20,309 citations
"Piggybacking on social networks" refers methods in this paper
...PARALLELNOSY does not have the approximation guarantee of CHITCHAT, but it is a parallel algorithm that can be implemented as a MapReduce job and thus scales to real-size social graphs....
[...]
...Phase 2 is executed by the reduce phase of MapReduce, where each reducer receives all lock requests for a given edge u → v....
[...]
...We now describe in more detail the issues pertaining to the MapReduce implementation; we assume that the reader is familiar with the MapReduce architecture....
[...]
...Therefore, our implementation uses a pull approach and two MapReduce jobs: in the first job, hub-graphs having u → v as cross-edge send a notification to the hub-graphs centered in u and v saying that they are interested in updates to u → v. Updates for the edge are propagated only if they are indeed available....
[...]
...For the twitter graph, the amount of memory used by individual MapReduce workers exceeds in some cases the RAM capacity allocated to these workers, which is 1GB....
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
TL;DR: Developments in this field are reviewed, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
Abstract: Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
17,647 citations
"Piggybacking on social networks" refers background in this paper
...Our main observation is that the high clustering coefficient of social networks implies the presence of many hubs, making hub-based schedules very efficient [10]....
TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveal very little about the influence of a user.
Abstract: Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others — a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveals very little about the influence of a user.
TL;DR: It is proved that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless NP has slightly superpolynomial time algorithms.
Abstract: Given a collection ℱ of subsets of S = {1,…,n}, set cover is the problem of selecting as few subsets as possible from ℱ such that their union covers S, and max k-cover is the problem of selecting k subsets from ℱ such that their union has maximum cardinality. Both these problems are NP-hard. We prove that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless NP has slightly superpolynomial time algorithms. This closes the gap (up to low-order terms) between the ratio of approximation achievable by the greedy algorithm (which is (1 - o(1)) ln n), and previous results of Lund and Yannakakis, which showed hardness of approximation within a ratio of (log₂ n)/2 ≃ 0.72 ln n. For max k-cover, we show an approximation threshold of (1 - 1/e) (up to low-order terms), under the assumption that P ≠ NP.
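For reference, the greedy algorithm whose approximation ratio this abstract refers to is sketched below (a textbook formulation, not code from the paper): at each step it picks the subset covering the most still-uncovered elements.

```python
# Minimal sketch of the classic greedy algorithm for SET COVER, which attains
# an approximation factor of at most 1 + ln n; by the threshold result above,
# this is essentially the best achievable in polynomial time.

def greedy_set_cover(universe, subsets):
    """universe: set of elements; subsets: dict name -> set of elements."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(subsets, key=lambda s: len(subsets[s] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

U = set(range(1, 8))
S = {"A": {1, 2, 3, 4}, "B": {4, 5, 6}, "C": {6, 7}, "D": {1, 5, 7}}
print(greedy_set_cover(U, S))  # e.g. ['A', 'B', 'C'], which covers 1..7
```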
2,941 citations
"Piggybacking on social networks" refers background in this paper
...At the same time, this logarithmic guarantee is essentially the best one can hope for, since Feige showed that the problem is not approximable within (1 − o(1)) lnn, unless NP has quasi-polynomial time algorithms [7]....
Q1. What contributions have the authors mentioned in the paper "Piggybacking on social networks"?
The authors propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. The authors show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. The authors propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. The authors also evaluate their algorithms on a real social networking system prototype and they show that the actual increase in throughput corresponds nicely to the gain anticipated by their cost function.