Piggybacking on social networks
Summary (3 min read)
- Social networking sites have become highly popular in the past few years.
- To put their work in context and to motivate their problem definition, the authors describe the typical architecture of social networking systems, and they discuss the process of assembling event streams.
- The collection of push and pull sets for each user of the system is called the request schedule, and it has a strong impact on performance.
- CHITCHAT and PARALLELNOSY assume that the graph is static; however, using a simple incremental technique, request schedules can be efficiently adapted when the social graph is modified.
2. SOCIAL DISSEMINATION PROBLEM
- Dissemination must satisfy bounded staleness, a property modeling the requirement that event streams should show events almost in real time.
- The authors then show that the only request schedules satisfying bounded staleness let each pair of users communicate either using direct push, or direct pull, or social piggybacking.
- Finally, the authors analyze the complexity of the social-dissemination problem and show that their results extend to more complex system models with active stores.
2.1 System model
- For the purpose of their analysis, the authors do not distinguish between nodes in the graph, the corresponding users, and their materialized views.
- Event streams and views consist of a finite list of events, filtered according to application-specific relevance criteria.
- In the system of Figure 2, the request schedule determines which edges of the social graph are included in the push and pull sets of any user.
- The workload is characterized by the production rate rp(u) and the consumption rate rc(u) of each user u.
- Note that the cost of updating and querying a user’s own view is not represented in the cost metric because it is implicit.
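The cost metric of Section 2.1 can be sketched as a small function. The dictionary shapes and the linear cost form below are our own assumptions for illustration, following the summary's description of push sets, pull sets, and per-user rates:

```python
def schedule_cost(push, pull, rp, rc):
    """Sketch of the Section 2.1 cost metric (our own simplification).

    push[u] -- set of views u updates on each event it produces (push set)
    pull[u] -- set of views u queries on each consumption (pull set)
    rp, rc  -- production and consumption rates per user
    """
    cost = 0.0
    for u, targets in push.items():
        cost += rp.get(u, 0.0) * len(targets)   # updates sent by u
    for u, sources in pull.items():
        cost += rc.get(u, 0.0) * len(sources)   # queries issued by u
    return cost

# a produces at rate 2 and pushes to hub h; b consumes at rate 3 and pulls from h
cost = schedule_cost({"a": {"h"}}, {"b": {"h"}}, {"a": 2.0}, {"b": 3.0})
```

Consistent with the note above, updating and querying a user's own view adds nothing here; only edges in the push and pull sets are charged.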
2.2 Problem definition
- The authors now define the problem that they address in this paper.
- The authors propose solving the DISSEMINATION problem using social piggybacking, that is, making two nodes communicate through a third common contact, called a hub.
- Since the user w1 may remain idle for an arbitrarily long time, one cannot guarantee bounded staleness.
- The authors call data stores that only react to user requests passive stores.
- This is formally shown by the following equivalence result.
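As a toy illustration of social piggybacking through a passive hub, the sketch below (names and data shapes are invented for the example) routes an event from a producer to a consumer via the materialized view of a common contact:

```python
# materialized views: each user's data store holds a list of events
views = {"producer": [], "hub": [], "consumer": []}

def push(event, producer, targets):
    # the producer's request updates each target view (push edges)
    for t in targets:
        views[t].append((producer, event))

def pull(consumer, sources):
    # the consumer's request queries each source view (pull edges)
    return [ev for s in sources for ev in views[s]]

# piggybacking: producer -> hub is a push edge, hub -> consumer is a pull edge;
# the hub stays passive and never issues a request of its own
push("new photo", "producer", ["hub"])
stream = pull("consumer", ["hub"])
```

The hub's view is written by the producer's request and read by the consumer's request, so neither endpoint needs a direct edge to the other.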
- This section introduces two algorithms to solve the DISSEMINATION problem.
- Every time the algorithm selects a candidate from C, it adds the required push and pull edges to the solution, the request schedule (H,L).
- Given that the authors are looking for a solution of the SETCOVER problem with a logarithmic approximation factor, they settle for the simple greedy algorithm analyzed by Asahiro et al. and later by Charikar.
- The authors can show that this modified algorithm yields a factor-2 approximation for the weighted version of the DENSESTSUBGRAPH problem.
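The factor-2 result builds on the classic greedy peeling scheme for DENSESTSUBGRAPH: repeatedly remove a minimum-degree vertex and keep the densest intermediate subgraph. Below is a minimal unweighted sketch of that scheme; the paper's modified algorithm handles the weighted version, which this example does not attempt:

```python
from collections import defaultdict

def peel_densest(edges):
    # Charikar-style greedy peeling: repeatedly drop a minimum-degree node
    # and remember the subgraph with the best edges/nodes density.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    cur_edges = len(edges)
    best_density, best_set = 0.0, set(nodes)
    while nodes:
        density = cur_edges / len(nodes)
        if density > best_density:
            best_density, best_set = density, set(nodes)
        u = min(nodes, key=lambda x: len(adj[x]))  # minimum-degree node
        cur_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.remove(u)
    return best_set, best_density

# a 4-clique with one pendant node: the clique (density 6/4 = 1.5) is densest
best, dens = peel_densest([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)])
```

Peeling visits each node once, which is what makes the approach attractive before parallelizing it for large graphs.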
- In the edge locking phase, each candidate hub-graph tries to lock its edges.
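One simple way to resolve lock conflicts between candidate hub-graphs (an illustrative policy of our own; the paper's reduce-side rule may differ) is to grant each edge to the highest-gain requester and accept only candidates that win every lock they asked for:

```python
def resolve_locks(candidates):
    # candidates: list of (gain, candidate_id, edge_list); a candidate
    # hub-graph is applied only if it wins the lock on all of its edges
    owner = {}
    for gain, cid, edges in candidates:
        for e in edges:
            if e not in owner or (gain, cid) > owner[e]:
                owner[e] = (gain, cid)     # ties broken by candidate id
    return [cid for gain, cid, edges in candidates
            if all(owner[e] == (gain, cid) for e in edges)]

winners = resolve_locks([
    (5.0, "c1", [("u", "v"), ("v", "w")]),
    (3.0, "c2", [("v", "w"), ("w", "x")]),  # loses the contended edge v -> w
])
```

Because a candidate either acquires all its edges or none take effect, two accepted candidates can never modify the same edge in the same round.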
3.3 Incremental updates
- PARALLELNOSY and CHITCHAT optimize a static social graph.
- Over time, graph updates cause the quality of the dissemination schedule to degrade, so the algorithms can be executed periodically to re-optimize cost.
- The experimental evaluation of Section 4 indicates that their algorithm does not need to be re-executed frequently.
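The incremental idea can be sketched as follows (a simplified rule of our own, not the paper's exact procedure): when an edge u → v is added, it is already served if some hub both receives u's pushes and answers v's pulls; otherwise a direct edge is added until the next full re-optimization.

```python
def add_edge(u, v, push_sets, pull_sets):
    # if a hub h is in both u's push set and v's pull set, events from u
    # already reach v through h and no schedule change is needed
    if push_sets.get(u, set()) & pull_sets.get(v, set()):
        return "covered"
    push_sets.setdefault(u, set()).add(v)  # simple fallback: direct push
    return "direct-push"

push_sets = {"u": {"h"}}
pull_sets = {"v": {"h"}, "w": set()}
r1 = add_edge("u", "v", push_sets, pull_sets)  # served via hub h
r2 = add_edge("u", "w", push_sets, pull_sets)  # needs a direct push edge
```

A cheap per-edge check like this is what lets the expensive optimization run only occasionally, matching the observation that the schedule degrades slowly.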
- The authors evaluate the throughput performance of the proposed algorithm, contrasting it against the best available scheduling algorithm, the hybrid policy of Silberstein et al.
- The authors' evaluation is both analytical, considering their cost metric of Section 2.1, and experimental, using measurements on a social networking system prototype.
- The authors show that the PARALLELNOSY heuristic scales to real-world social graphs and doubles the throughput of social networking systems compared to hybrid schedules.
- On a real prototype, PARALLELNOSY provides similar throughput as hybrid schedules when the system is composed of a few servers; as the system grows, the throughput improvement becomes more evident, approaching the 2-factor analytical improvement.
- The authors also evaluate the relative performance of the two proposed algorithms PARALLELNOSY and CHITCHAT.
4.1 Input data
- The authors obtain datasets from two social graphs: flickr, as of April 2008, and twitter, as of August 2009.
- The twitter graph has been made available by Cha et al.
- The authors' algorithms also require input workloads: production and consumption rates for all the nodes in the network.
- It has been observed by Huberman et al. that nodes with many followers tend to have a higher production rate, and nodes following many other nodes tend to have a higher consumption rate.
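A workload generator following that observation might look like this; the linear form and the `base` constant are invented for illustration, and the paper instead derives its rates from the measurements cited above:

```python
def degree_based_rates(num_followers, num_followees, base=0.01):
    # hypothetical model: production rate grows with follower count,
    # consumption rate with followee count (Huberman et al.'s observation)
    rp = {u: base * (1 + n) for u, n in num_followers.items()}
    rc = {u: base * (1 + n) for u, n in num_followees.items()}
    return rp, rc

rp, rc = degree_based_rates({"a": 99, "b": 0}, {"a": 9, "b": 199})
```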
4.2 Social piggybacking on large social graphs
- The authors run their MapReduce implementation of the PARALLELNOSY heuristic on the full twitter and flickr graphs.
- As discussed in Section 3.2, very large social graphs may contain millions of cross-edges for a single hub-graph.
- The authors quantify the performance of their algorithms by measuring their throughput compared against a baseline.
- For both social graphs, the throughput of the PARALLELNOSY schedule increases sharply during the first iterations and then quickly stabilizes.
- The larger stabilization time for twitter is due to the incremental detection of cross-edges at every cycle, as discussed before.
4.3 Prototype performance
- In the previous section the authors evaluated their algorithms in terms of the predicted cost function that the algorithms optimize.
- When processing a user query, application servers send at most one query per data-store server s, which replies with a list of events filtered from all views v ∈ l stored by s; the authors use batching.
- Using data partitioning information as input of the DISSEMINATION problem is attractive, but has two main drawbacks.
- The authors found that, if the network does not become a bottleneck, the overall throughput using n clients and n servers is about n times the per-client throughput with n servers.
- Note that, since the y axis is logarithmic, the divergence between the algorithms and the error bars on the right side of the graph are magnified.
4.4 The potential of social piggybacking
- The previous experiments show that PARALLELNOSY is an effective heuristic for real-world large-scale social networking systems.
- In the experiments discussed below the authors use five graph samples; the plots report averages.
- As for random-walk sampling, existing work has pointed out that it preserves certain clustering metrics; more precisely, in both the original and sampled graphs, nodes with the same degree have a similar ratio of actual to potential edges between their neighbors.
- This reduces the relative gain of social piggybacking since the hybrid schedule of Silberstein et al. (our baseline) uses per-edge optimizations that do not depend on the degree of nodes.
5. RELATED WORK
- Similar to the MIN-COST problem of Silberstein et al., their DISSEMINATION problem takes as input the consumption and production rates of users, together with the social network, and uses these rates in the definition of the cost function.
- This enables taking advantage of the high clustering coefficient of social graphs and leads to substantial gains, as shown by their evaluation.
- Users contact only their own views for queries.
- In terms of throughput cost, SPAR uses a push-all schedule (see Section 1), which is never more efficient than the hybrid schedule the authors used as their baseline.
- In the authors' setting, the social graph is given and the mapping of users to physical servers is not known; they minimize cost based on scheduling decisions and on production and consumption rates, and they consider the additional bounded-staleness constraint.
- Assembling and delivering event streams is a major feature of social networking systems and imposes a heavy load on back-end data stores.
- The authors proposed two algorithms to compute request schedules that leverage social piggybacking.
- The CHITCHAT algorithm is an approximation algorithm that uses a novel combination of the SETCOVER and DENSESTSUBGRAPH problems and has an approximation factor of O(ln n).
- The PARALLELNOSY heuristic is a parallel algorithm that can scale to large social graphs.
- In small systems, the authors obtained throughput similar to existing hybrid approaches, but as the size of the system grows beyond a few hundred servers, the throughput gain becomes significant, approaching a 2-factor improvement.
Cites background from "Piggybacking on social networks"
...The DS-Problem is a powerful primitive for many graph applications including social piggybacking, reachability and distance query indexing [19, 37]....
Cites methods from "Piggybacking on social networks"
...It has been used for finding communities and spam link farms in web graphs [29, 20, 13], graph visualization, real-time story identification, DNA motif detection in biological networks, finding correlated genes, epilepsy prediction, finding price value motifs in financial data, graph compression, distance query indexing, and increasing the throughput of social networking site servers....
Cites result from "Piggybacking on social networks"
...The fact that a user’s frequency of activity in a social network is related to their degrees has been observed in previous work [13, 10]....
"Piggybacking on social networks" refers methods in this paper
...PARALLELNOSY does not have the approximation guarantee of CHITCHAT, but it is a parallel algorithm that can be implemented as a MapReduce job and thus scales to real-size social graphs....
...Phase 2 is executed by the reduce phase of MapReduce, where each reducer receives all lock requests for a given edge u → v....
...We now describe in more detail the issues pertaining to the MapReduce implementation; we assume that the reader is familiar with the MapReduce architecture....
...Therefore, our implementation uses a pull approach and two MapReduce jobs: in the first job, hub-graphs having u → v as cross-edge send a notification to the hub-graphs centered in u and v saying that they are interested in updates to u → v. Updates for the edge are propagated only if they are indeed available....
...For the twitter graph, the amount of memory used by individual MapReduce workers exceeds in some cases the RAM capacity allocated to these workers, which is 1GB....
"Piggybacking on social networks" refers background in this paper
...Our main observation is that the high clustering coefficient of social networks implies the presence of many hubs, making hub-based schedules very efficient ....
"Piggybacking on social networks" refers background in this paper
...At the same time, this logarithmic guarantee is essentially the best one can hope for, since Feige showed that the problem is not approximable within (1 − o(1)) lnn, unless NP has quasi-polynomial time algorithms ....
Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Piggybacking on social networks" ?
The authors propose to improve the throughput of these systems by using social piggybacking, which consists of processing the requests of two friends by querying and updating the view of a third common friend. The authors show that, given a social graph, social piggybacking can minimize the overall number of requests, but computing the optimal set of hubs is an NP-hard problem. The authors propose an O(log n) approximation algorithm and a heuristic to solve the problem, and evaluate them using the full Twitter and Flickr social graphs, which have up to billions of edges. The authors also evaluate their algorithms on a real social networking system prototype and show that the actual increase in throughput corresponds nicely to the gain anticipated by their cost function.