Showing papers in "arXiv: Social and Information Networks in 2012"


Journal Article
TL;DR: It is proved that pseudo-likelihood provides consistent estimates of the communities under a mild condition on the starting value, for the case of a block model with two communities.
Abstract: Many algorithms have been proposed for fitting network models with communities, but most of them do not scale well to large networks, and often fail on sparse networks. Here we propose a new fast pseudo-likelihood method for fitting the stochastic block model for networks, as well as a variant that allows for an arbitrary degree distribution by conditioning on degrees. We show that the algorithms perform well under a range of settings, including on very sparse networks, and illustrate on the example of a network of political blogs. We also propose spectral clustering with perturbations, a method of independent interest, which works well on sparse networks where regular spectral clustering fails, and use it to provide an initial value for pseudo-likelihood. We prove that pseudo-likelihood provides consistent estimates of the communities under a mild condition on the starting value, for the case of a block model with two communities.

269 citations


Posted Content
TL;DR: In this paper, a measure of signal strength between two nodes in the network by fusing their link strength with content similarity is introduced, based on whether the link is likely (with high probability) to reside within a community.
Abstract: In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables one to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks. Specifically, we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient. We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering. Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches, several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically, we always find a qualitative benefit when combining content with link analysis. Additionally, our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.

266 citations
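
The abstract above names cosine similarity and the Jaccard coefficient as its content-similarity measures. A minimal sketch of both follows, assuming each node's content is available as a list of tokens. The function names and the bag-of-words representation are illustrative assumptions, and the paper's fusion with link strength is not reproduced here.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient: shared distinct tokens over all distinct tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between raw term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical content attached to two linked nodes.
node_u = ["graph", "community", "detection"]
node_v = ["community", "detection", "sampling"]
print(jaccard(node_u, node_v), cosine(node_u, node_v))
```

Jaccard ignores how often a token repeats, while cosine on term frequencies does not, so the two can rank the same pair of nodes differently.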


Posted Content
TL;DR: This paper develops a new method to investigate the meso-scale feature known as core-periphery structure, which entails identifying densely connected core nodes and sparsely connected peripheral nodes in a network.
Abstract: Intermediate-scale (or 'meso-scale') structures in networks have received considerable attention, as the algorithmic detection of such structures makes it possible to discover network features that are not apparent either at the local scale of nodes and edges or at the global scale of summary statistics. Numerous types of meso-scale structures can occur in networks, but investigations of such features have focused predominantly on the identification and study of community structure. In this paper, we develop a new method to investigate the meso-scale feature known as core-periphery structure, which entails identifying densely-connected core nodes and sparsely-connected periphery nodes. In contrast to communities, the nodes in a core are also reasonably well-connected to those in the periphery. Our new method of computing core-periphery structure can identify multiple cores in a network and takes different possible cores into account. We illustrate the differences between our method and several existing methods for identifying which nodes belong to a core, and we use our technique to examine core-periphery structure in examples of friendship, collaboration, transportation, and voting networks.

239 citations


Posted Content
TL;DR: It is shown that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information, suggesting that weak ties may play a more dominant role in the dissemination of information online than currently believed.
Abstract: Online social networking technologies enable individuals to simultaneously share information with any number of peers. Quantifying the causal effect of these technologies on the dissemination of information requires not only identification of who influences whom, but also of whether individuals would still propagate information in the absence of social signals about that information. We examine the role of social networks in online information diffusion with a large-scale field experiment that randomizes exposure to signals about friends' information sharing among 253 million subjects in situ. Those who are exposed are significantly more likely to spread information, and do so sooner than those who are not exposed. We further examine the relative role of strong and weak ties in information propagation. We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information. This suggests that weak ties may play a more dominant role in the dissemination of information online than currently believed.

234 citations


Posted Content
TL;DR: An on-line algorithm that relies on stochastic convex optimization to efficiently solve the dynamic network inference problem and studies the evolution of information pathways in the online media space.
Abstract: Diffusion of information, spread of rumors and infectious diseases are all instances of stochastic processes that occur over the edges of an underlying network. Many times networks over which contagions spread are unobserved, and such networks are often dynamic and change over time. In this paper, we investigate the problem of inferring dynamic networks based on information diffusion data. We assume there is an unobserved dynamic network that changes over time, while we observe the results of a dynamic process spreading over the edges of the network. The task then is to infer the edges and the dynamics of the underlying network. We develop an on-line algorithm that relies on stochastic convex optimization to efficiently solve the dynamic network inference problem. We apply our algorithm to information diffusion among 3.3 million mainstream media and blog sites and experiment with more than 179 million different pieces of information spreading over the network in a one-year period. We study the evolution of information pathways in the online media space and find interesting insights. Information pathways for general recurrent topics are more stable across time than for on-going news events. Clusters of news media sites and blogs often emerge and vanish in a matter of days for on-going news events. Major social movements and events involving the civil population, such as the Libyan civil war or the Syrian uprising, lead to an increased number of information pathways among blogs as well as an overall increase in the network centrality of blogs and social media sites.

230 citations


Posted Content
TL;DR: It is found that deceptive opinion spam is a growing problem overall, but with different growth rates across communities, and a theoretical model of online reviews based on economic signaling theory, in which consumer reviews diminish the inherent information asymmetry between consumers and producers, by acting as a signal to a product's true, unknown quality.
Abstract: Consumers' purchase decisions are increasingly influenced by user-generated online reviews. Accordingly, there has been growing concern about the potential for posting "deceptive opinion spam" -- fictitious reviews that have been deliberately written to sound authentic, to deceive the reader. But while this practice has received considerable public attention and concern, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still about the factors that influence it. We propose a generative model of deception which, in conjunction with a deception classifier, we use to explore the prevalence of deception in six popular online review communities: Expedia, this http URL, Orbitz, Priceline, TripAdvisor, and Yelp. We additionally propose a theoretical model of online reviews based on economic signaling theory, in which consumer reviews diminish the inherent information asymmetry between consumers and producers, by acting as a signal to a product's true, unknown quality. We find that deceptive opinion spam is a growing problem overall, but with different growth rates across communities. These rates, we argue, are driven by the different signaling costs associated with deception for each review community, e.g., posting requirements. When measures are taken to increase signaling cost, e.g., filtering reviews written by first-time reviewers, deception prevalence is effectively reduced.

215 citations


Posted Content
TL;DR: In this paper, a sample path based approach is proposed to find the information source based on the snapshot and the network topology, where the estimator of the source is chosen to be the root node associated with the sample path that most likely leads to the observed snapshot.
Abstract: This paper studies the problem of detecting the information source in a network in which the spread of information follows the popular Susceptible-Infected-Recovered (SIR) model. We assume all nodes in the network are in the susceptible state initially except the information source, which is in the infected state. Susceptible nodes may then be infected by infected nodes, and infected nodes may recover and will not be infected again after recovery. Given a snapshot of the network, from which we know all infected nodes but cannot distinguish susceptible nodes and recovered nodes, the problem is to find the information source based on the snapshot and the network topology. We develop a sample path based approach where the estimator of the information source is chosen to be the root node associated with the sample path that most likely leads to the observed snapshot. We prove that for infinite trees, the estimator is a node that minimizes the maximum distance to the infected nodes. A reverse-infection algorithm is proposed to find such an estimator in general graphs. We prove that for $g$-regular trees such that $gq>1$, where $g$ is the node degree and $q$ is the infection probability, the estimator is within a constant distance from the actual source with high probability, independent of the number of infected nodes and the time the snapshot is taken. Our simulation results show that for tree networks, the estimator produced by the reverse-infection algorithm is closer to the actual source than the one identified by the closeness centrality heuristic. We then further evaluate the performance of the reverse-infection algorithm on several real-world networks.

181 citations
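
The abstract above characterizes the source estimator as a node that minimizes the maximum distance to the infected nodes. The sketch below finds such a node by brute force, running a breadth-first search from every candidate. It is not the paper's reverse-infection algorithm; the adjacency-dictionary representation and the function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def min_max_distance_node(adj, infected):
    """Return (node, radius) minimizing the largest distance to any infected node.

    Brute force: one BFS per candidate node. Ties go to the smallest node id.
    """
    if not infected:
        raise ValueError("need at least one infected node")
    best_node, best_radius = None, float("inf")
    for node in sorted(adj):
        dist = bfs_distances(adj, node)
        if any(i not in dist for i in infected):
            continue  # this candidate cannot reach every infected node
        radius = max(dist[i] for i in infected)
        if radius < best_radius:
            best_node, best_radius = node, radius
    return best_node, best_radius


# Path graph 0-1-2-3-4 with infected nodes observed at both ends.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(min_max_distance_node(path, {0, 4}))
```

Note that the selected node need not be infected in the snapshot, which matches the setting where recovered nodes are indistinguishable from susceptible ones.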


Proceedings Article
TL;DR: A static greedy algorithm is proposed, named StaticGreedy, to strictly guarantee the submodularity of influence spread function during the seed selection process, which makes the computational expense dramatically reduced by two orders of magnitude without loss of accuracy.
Abstract: Influence maximization, defined as a problem of finding a set of seed nodes to trigger a maximized spread of influence, is crucial to viral marketing on social networks. For practical viral marketing on large scale social networks, it is required that influence maximization algorithms should have both guaranteed accuracy and high scalability. However, existing algorithms suffer a scalability-accuracy dilemma: conventional greedy algorithms guarantee the accuracy with expensive computation, while the scalable heuristic algorithms suffer from unstable accuracy. In this paper, we focus on solving this scalability-accuracy dilemma. We point out that the essential reason of the dilemma is the surprising fact that the submodularity, a key requirement of the objective function for a greedy algorithm to approximate the optimum, is not guaranteed in all conventional greedy algorithms in the literature of influence maximization. Therefore a greedy algorithm has to afford a huge number of Monte Carlo simulations to reduce the pain caused by unguaranteed submodularity. Motivated by this critical finding, we propose a static greedy algorithm, named StaticGreedy, to strictly guarantee the submodularity of influence spread function during the seed selection process. The proposed algorithm makes the computational expense dramatically reduced by two orders of magnitude without loss of accuracy. Moreover, we propose a dynamical update strategy which can speed up the StaticGreedy algorithm by 2-7 times on large scale social networks.

158 citations


Journal Article
TL;DR: To enable the analysis of group evolution, a change indicator, the inclusion measure, was proposed and used in a new method for exploring the evolution of social groups, called Group Evolution Discovery (GED).
Abstract: The continuous interest in the social network area contributes to the fast development of this field. The new possibilities of obtaining and storing data facilitate deeper analysis of the entire network, extracted social groups and single individuals as well. One of the most interesting research topics is the dynamics of social groups, that is, the analysis of group evolution over time. Having appropriate knowledge and methods for dynamic analysis, one may attempt to predict the future of the group, and then manage it properly in order to achieve or change this predicted future according to specific needs. Such an ability would be a powerful tool in the hands of human resource managers, personnel recruitment, marketing, etc. The social group evolution consists of individual events, and seven types of such changes have been identified in the paper: continuing, shrinking, growing, splitting, merging, dissolving and forming. To enable the analysis of group evolution, a change indicator, the inclusion measure, was proposed. It has been used in a new method for exploring the evolution of social groups, called Group Evolution Discovery (GED). The experimental results of its use, together with a comparison to two well-known algorithms in terms of accuracy, execution time, flexibility and ease of implementation, are also described in the paper.

143 citations


Posted Content
TL;DR: In this article, the authors investigate how such teams can be formed on a social network and design efficient approximate algorithms for finding near optimum teams with provable guarantees, which achieve significant improvement in finding effective teams, as compared to naive strategies.
Abstract: In a team formation problem, one is required to find a group of users that can match the requirements of a collaborative task. Examples of such collaborative tasks abound, ranging from software product development to various participatory sensing tasks in knowledge creation. Due to the nature of the task, team members are often required to work on a co-operative basis. Previous studies have indicated that co-operation becomes effective in the presence of social connections. Therefore, effective team selection requires the team members to be socially close as well as a division of the task among team members so that no user is overloaded by the assignment. In this work, we investigate how such teams can be formed on a social network. Since our team formation problems are proven to be NP-hard, we design efficient approximate algorithms for finding near optimum teams with provable guarantees. As traditional data-sets from on-line social networks (e.g., Twitter, Facebook, etc.) typically do not contain instances of large scale collaboration, we have crawled millions of software repositories spanning a period of four years and hundreds of thousands of developers from GitHub, a popular open-source social coding network. We perform large scale experiments on this data-set to evaluate the accuracy and efficiency of our algorithms. Experimental results suggest that our algorithms achieve significant improvement in finding effective teams, as compared to naive strategies, and scale well with the size of the data. Finally, we provide a validation of our techniques by comparing with existing software teams in GitHub.

127 citations


Journal Article
TL;DR: A comprehensive comparative study of a representative set of community detection methods, in which community-oriented topological measures are used to qualify the communities and evaluate their deviation from the reference structure and it turns out there is no equivalence between the two approaches.
Abstract: Community detection is one of the most active fields in complex networks analysis, due to its potential value in practical applications. Many works inspired by different paradigms are devoted to the development of algorithmic solutions that reveal the network structure in terms of such cohesive subgroups. Comparative studies reported in the literature usually rely on a performance measure considering the community structure as a partition (Rand Index, Normalized Mutual Information, etc.). However, this type of comparison neglects the topological properties of the communities. In this article, we present a comprehensive comparative study of a representative set of community detection methods, in which we adopt both types of evaluation. Community-oriented topological measures are used to qualify the communities and evaluate their deviation from the reference structure. In order to mimic real-world systems, we use artificially generated realistic networks. It turns out there is no equivalence between the two approaches: a high performance does not necessarily correspond to correct topological properties, and vice versa. They can therefore be considered as complementary, and we recommend applying both of them in order to perform a complete and accurate assessment.

Posted Content
TL;DR: The realms which can be predicted with current social media are discussed, available predictors and techniques of prediction are overviewed, and challenges and possible future directions are discussed.
Abstract: Social media comprises interactive applications and platforms for creating, sharing, and exchanging user-generated content. The past ten years have brought huge growth in social media, especially online social networking services, and it is changing the ways we organize and communicate. It aggregates the opinions and feelings of diverse groups of people at low cost. Mining the attributes and contents of social media gives us an opportunity to discover social structure characteristics, analyze action patterns qualitatively and quantitatively, and sometimes predict future human-related events. In this paper, we first discuss the realms which can be predicted with current social media, then overview available predictors and techniques of prediction, and finally discuss challenges and possible future directions.

Posted Content
TL;DR: A large user study on the ability of humans to detect today's Sybil accounts is conducted, using a large corpus of ground-truth Sybils from the Facebook and Renren networks and finds that while turkers vary significantly in their effectiveness, experts consistently produce near-optimal results.
Abstract: As popular tools for spreading spam and malware, Sybils (or fake accounts) pose a serious threat to online communities such as Online Social Networks (OSNs). Today, sophisticated attackers are creating realistic Sybils that effectively befriend legitimate users, rendering most automated Sybil detection techniques ineffective. In this paper, we explore the feasibility of a crowdsourced Sybil detection system for OSNs. We conduct a large user study on the ability of humans to detect today's Sybil accounts, using a large corpus of ground-truth Sybil accounts from the Facebook and Renren networks. We analyze detection accuracy by both "experts" and "turkers" under a variety of conditions, and find that while turkers vary significantly in their effectiveness, experts consistently produce near-optimal results. We use these results to drive the design of a multi-tier crowdsourcing Sybil detection system. Using our user study data, we show that this system is scalable, and can be highly effective either as a standalone system or as a complementary technique to current tools.

Journal Article
TL;DR: HyperMap is presented, a simple method to map a given real network to its hyperbolic space and has a remarkable predictive power: Using the resulting map, it can predict missing links in the Internet with high precision, outperforming popular existing methods.
Abstract: Recent years have shown promising progress in understanding the geometric underpinnings behind the structure, function, and dynamics of many complex networks in nature and society. However, these promises cannot be readily fulfilled and lead to important practical applications without a simple, reliable, and fast network mapping method to infer the latent geometric coordinates of nodes in a real network. Here we present HyperMap, a simple method to map a given real network to its hyperbolic space. The method utilizes a recent geometric theory of complex networks modeled as random geometric graphs in hyperbolic spaces. The method replays the network's geometric growth, estimating at each time step the hyperbolic coordinates of new nodes in a growing network by maximizing the likelihood of the network snapshot in the model. We apply HyperMap to the AS Internet, and find that: 1) the method produces meaningful results, identifying soft communities of ASs belonging to the same geographic region; 2) the method has a remarkable predictive power: using the resulting map, we can predict missing links in the Internet with high precision, outperforming popular existing methods; and 3) the resulting map is highly navigable, meaning that a vast majority of greedy geometric routing paths are successful and low-stretch. Even though the method is not without limitations, and is open for improvement, it occupies a uniquely attractive position in the space of trade-offs between simplicity, accuracy, and computational complexity.

Journal Article
TL;DR: In this paper, the authors identify actual clusters of patents and give predictions about the temporal changes of the structure of the clusters, and a predictor is defined for characterizing technological development to show how a patent cited by other patents belongs to various industrial fields.
Abstract: The network of patents connected by citations is an evolving graph, which provides a representation of the innovation process. A patent citing another implies that the cited patent reflects a piece of previously existing knowledge that the citing patent builds upon. A methodology presented here (i) identifies actual clusters of patents, i.e., technological branches, and (ii) gives predictions about the temporal changes of the structure of the clusters. A predictor, called the citation vector, is defined for characterizing technological development to show how a patent cited by other patents belongs to various industrial fields. The clustering technique adopted is able to detect the new emerging recombinations, and predicts emerging new technology clusters. The predictive ability of our new method is illustrated on the example of USPTO subcategory 11, Agriculture, Food, Textiles. A cluster of patents is determined based on citation data up to 1991, which shows significant overlap with the class 442 formed at the beginning of 1997. These new tools of predictive analytics could support policy decision making processes in science and technology, and help formulate recommendations for action.

Posted Content
TL;DR: This work proposes NetSimile -- a novel, effective, and scalable method for solving the node-correspondence problem, and shows how it enables several mining tasks such as clustering, visualization, discontinuity detection, network transfer learning, and re-identification across networks.
Abstract: Given a set of k networks, possibly with different sizes and no overlaps in nodes or edges, how can we quickly assess similarity between them, without solving the node-correspondence problem? Analogously, how can we extract a small number of descriptive, numerical features from each graph that effectively serve as the graph's "signature"? Having such features will enable a wealth of graph mining tasks, including clustering, outlier detection, visualization, etc. We propose NetSimile -- a novel, effective, and scalable method for solving the aforementioned problem. NetSimile has the following desirable properties: (a) It gives similarity scores that are size-invariant. (b) It is scalable, being linear on the number of edges for "signature" vector extraction. (c) It does not need to solve the node-correspondence problem. We present extensive experiments on numerous synthetic and real graphs from disparate domains, and show NetSimile's superiority over baseline competitors. We also show how NetSimile enables several mining tasks such as clustering, visualization, discontinuity detection, network transfer learning, and re-identification across networks.

Posted Content
TL;DR: It is shown that selecting the set of most influential source nodes in the continuous time influence maximization problem is NP-hard and an efficient approximation algorithm with provable near-optimal performance is developed.
Abstract: The problem of finding the optimal set of source nodes in a diffusion network that maximizes the spread of information, influence, and diseases in a limited amount of time depends dramatically on the underlying temporal dynamics of the network. However, this still remains largely unexplored to date. To this end, given a network and its temporal dynamics, we first describe how continuous time Markov chains allow us to analytically compute the average total number of nodes reached by a diffusion process starting in a set of source nodes. We then show that selecting the set of most influential source nodes in the continuous time influence maximization problem is NP-hard and develop an efficient approximation algorithm with provable near-optimal performance. Experiments on synthetic and real diffusion networks show that our algorithm outperforms other state of the art algorithms by at least ~20% and is robust across different network topologies.

Proceedings Article
TL;DR: This work proposes a new method based on wedge sampling that allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph.
Abstract: Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We propose a new method based on wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.
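
The wedge-sampling idea in the abstract above can be illustrated for one clustering-coefficient variant, the global clustering coefficient (the fraction of wedges that are closed into triangles). This is a minimal sketch, not the authors' implementation: the graph is assumed to be an undirected dictionary mapping each node to a set of neighbors, and the function name and parameters are hypothetical.

```python
import random


def wedge_sample_transitivity(adj, num_samples, seed=0):
    """Estimate the global clustering coefficient by sampling wedges.

    A wedge is a path a - v - b centered at v. Sampling a center with
    probability proportional to its wedge count, then two distinct
    neighbors uniformly, yields a uniformly random wedge.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    rng = random.Random(seed)
    # Only nodes with degree >= 2 are the center of at least one wedge.
    neighbors = {v: sorted(ns) for v, ns in adj.items() if len(ns) >= 2}
    if not neighbors:
        return 0.0
    centers = list(neighbors)
    wedge_counts = [len(neighbors[v]) * (len(neighbors[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in rng.choices(centers, weights=wedge_counts, k=num_samples):
        a, b = rng.sample(neighbors[v], 2)
        if b in adj[a]:  # the wedge is closed if its endpoints are adjacent
            closed += 1
    return closed / num_samples


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(wedge_sample_transitivity(graph, 10_000))
```

The example graph has five wedges, three of which are closed, so the exact value is 3/5 and the printed estimate should fall close to it. The cost depends on the number of samples rather than on the number of triangles, which is the source of the speed-up.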

Posted Content
TL;DR: In this article, the authors consider time-critical influence maximization, in which one wants to maximize influence spread within a given deadline, and extend the Independent Cascade (IC) model and the Linear Threshold (LT) model to incorporate the time delay aspect of influence diffusion among individuals in social networks.
Abstract: Influence maximization is a problem of finding a small set of highly influential users, also known as seeds, in a social network such that the spread of influence under certain propagation models is maximized. In this paper, we consider time-critical influence maximization, in which one wants to maximize influence spread within a given deadline. Since timing is considered in the optimization, we also extend the Independent Cascade (IC) model and the Linear Threshold (LT) model to incorporate the time delay aspect of influence diffusion among individuals in social networks. We show that time-critical influence maximization under the time-delayed IC and LT models maintains desired properties such as submodularity, which allows a greedy approximation algorithm to achieve an approximation ratio of $1-1/e$. To overcome the inefficiency of the greedy algorithm, we design two heuristic algorithms: the first one is based on a dynamic programming procedure that computes exact influence in tree structures and directed acyclic subgraphs, while the second one converts the problem to one in the original models and then applies existing fast heuristic algorithms to it. Our simulation results demonstrate that our algorithms achieve the same level of influence spread as the greedy algorithm while running a few orders of magnitude faster, and they also outperform existing fast heuristics that disregard the deadline constraint and delays in diffusion.
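
The greedy approximation mentioned in the abstract above relies on the objective being monotone and submodular. The sketch below shows the generic greedy seed-selection loop, using a simple set-coverage objective as a stand-in for influence spread. The `reach` table, the `coverage` function, and the node names are hypothetical; the time-delayed IC and LT models from the paper are not implemented here.

```python
def greedy_select(candidates, k, objective):
    """Pick up to k items, each time adding the one with the largest marginal gain.

    For a monotone submodular objective, this greedy rule attains at least
    a (1 - 1/e) fraction of the optimal value.
    """
    chosen = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(chosen)
        best = max(remaining, key=lambda c: objective(chosen + [c]) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Stand-in objective: which users each seed would reach before a deadline.
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(seeds):
    """Number of distinct users reached by the seed set (monotone, submodular)."""
    covered = set()
    for s in seeds:
        covered |= reach[s]
    return len(covered)


seeds = greedy_select(reach, 2, coverage)
print(seeds, coverage(seeds))  # → ['c', 'a'] 7
```

Each round re-evaluates the objective for every remaining candidate. When the objective is itself estimated by simulation, this repeated evaluation is the inefficiency that motivates faster heuristics.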

Posted Content
TL;DR: It is proved that the given threshold correctly identifies a transition on the behaviour of belief propagation from insensitive to sensitive, and that the same threshold corresponds to the transition in a related inference problem on a tree model from infeasible to feasible.
Abstract: We consider the problem of community detection from observed interactions between individuals, in the context where multiple types of interaction are possible. We use labelled stochastic block models to represent the observed data, where labels correspond to interaction types. Focusing on a two-community scenario, we conjecture a threshold for the problem of reconstructing the hidden communities in a way that is correlated with the true partition. To substantiate the conjecture, we prove that the given threshold correctly identifies a transition on the behaviour of belief propagation from insensitive to sensitive. We further prove that the same threshold corresponds to the transition in a related inference problem on a tree model from infeasible to feasible. Finally, numerical results using belief propagation for community detection give further support to the conjecture.

Proceedings Article
TL;DR: This work extends the classical Linear Threshold model to incorporate prices and valuations, and factor them into users' decision-making process of adopting a product, and shows that of the three algorithms, PAGE, which assigns prices dynamically based on the profit potential of each candidate seed, has the best performance both in the expected profit achieved and in running time.
Abstract: Influence maximization is the problem of finding a set of influential users in a social network such that the expected spread of influence under a certain propagation model is maximized. Much of the previous work has neglected the important distinction between social influence and actual product adoption. However, as recognized in the management science literature, an individual who gets influenced by social acquaintances may not necessarily adopt a product (or technology), due, e.g., to monetary concerns. In this work, we distinguish between influence and adoption by explicitly modeling the states of being influenced and of adopting a product. We extend the classical Linear Threshold (LT) model to incorporate prices and valuations, and factor them into users' decision-making process of adopting a product. We show that the expected profit function under our proposed model maintains submodularity under certain conditions, but no longer exhibits monotonicity, unlike the expected influence spread function. To maximize the expected profit under our extended LT model, we employ an unbudgeted greedy framework to propose three profit maximization algorithms. The results of our detailed experimental study on three real-world datasets demonstrate that of the three algorithms, PAGE, which assigns prices dynamically based on the profit potential of each candidate seed, has the best performance both in the expected profit achieved and in running time.

Journal ArticleDOI
TL;DR: It is shown that reciprocated and unreciprocated friendships obey different statistics, suggesting different formation processes, and that rankings are correlated with other characteristics of the participants that are traditionally associated with status, such as age and overall popularity as measured by total number of friends.
Abstract: In empirical studies of friendship networks participants are typically asked, in interviews or questionnaires, to identify some or all of their close friends, resulting in a directed network in which friendships can, and often do, run in only one direction between a pair of individuals. Here we analyze a large collection of such networks representing friendships among students at US high and junior-high schools and show that the pattern of unreciprocated friendships is far from random. In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one. We present a maximum-likelihood method for deducing such rankings from observed network data and conjecture that the rankings produced reflect a measure of social status. We note in particular that reciprocated and unreciprocated friendships obey different statistics, suggesting different formation processes, and that rankings are correlated with other characteristics of the participants that are traditionally associated with status, such as age and overall popularity as measured by total number of friends.

Posted Content
TL;DR: A model of the contagious spread of information in a global-scale, publicly-articulated social network is developed and it is shown that a simple method can yield not just early detection, but advance warning of contagious outbreaks.
Abstract: Recent research has focused on the monitoring of global-scale online data for improved detection of epidemics, mood patterns, movements in the stock market, political revolutions, box-office revenues, consumer behaviour and many other important phenomena. However, privacy considerations and the sheer scale of data available online are quickly making global monitoring infeasible, and existing methods do not take full advantage of local network structure to identify key nodes for monitoring. Here, we develop a model of the contagious spread of information in a global-scale, publicly-articulated social network and show that a simple method can yield not just early detection, but advance warning of contagious outbreaks. In this method, we randomly choose a small fraction of nodes in the network and then we randomly choose a "friend" of each node to include in a group for local monitoring. Using six months of data from most of the full Twittersphere, we show that this friend group is more central in the network and it helps us to detect viral outbreaks of the use of novel hashtags about 7 days earlier than we could with an equal-sized randomly chosen group. Moreover, the method actually works better than expected due to network structure alone because highly central actors are both more active and exhibit increased diversity in the information they transmit to others. These results suggest that local monitoring is not just more efficient, it is more effective, and it is possible that other contagious processes in global-scale networks may be similarly monitored.
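The sampling step described in this abstract (pick random nodes, then one random "friend" of each) can be sketched as follows. This is only an illustration on a toy undirected network, not the authors' code or data: the function names, the adjacency-list representation, and the star-shaped example are assumptions. It shows why the friend group tends to be more central: a random friend is reached through an edge, so high-degree nodes are more likely to be picked.

```python
import random

def sample_friend_group(adjacency, fraction, rng):
    """Choose a random fraction of nodes, then one random neighbour of each."""
    nodes = list(adjacency)
    k = max(1, int(len(nodes) * fraction))
    random_group = rng.sample(nodes, k)
    # Nodes without neighbours contribute no friend.
    friend_group = [rng.choice(adjacency[n]) for n in random_group if adjacency[n]]
    return random_group, friend_group

def mean_degree(adjacency, group):
    """Average number of neighbours over the members of a group."""
    return sum(len(adjacency[n]) for n in group) / len(group)

# Toy network (illustrative only): node 0 is a hub linked to five leaves.
toy_adjacency = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}

rng = random.Random(0)
random_group, friend_group = sample_friend_group(toy_adjacency, fraction=0.5, rng=rng)
random_mean = mean_degree(toy_adjacency, random_group)
friend_mean = mean_degree(toy_adjacency, friend_group)
```

On this toy star the friend group has a higher mean degree than the random group for any draw, because every leaf's only friend is the hub. Real networks show the same tendency on average rather than for every sample.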

Posted Content
TL;DR: In this paper, the authors use queueing theory to analyze the retainer model for real-time crowdsourcing, in particular its expected wait time and cost to requesters, and propose and analyze three techniques to improve performance: push notifications, shared retainer pools, and precruitment.
Abstract: Realtime crowdsourcing research has demonstrated that it is possible to recruit paid crowds within seconds by managing a small, fast-reacting worker pool. Realtime crowds enable crowd-powered systems that respond at interactive speeds: for example, cameras, robots and instant opinion polls. So far, these techniques have mainly been proof-of-concept prototypes: research has not yet attempted to understand how they might work at large scale or optimize their cost/performance trade-offs. In this paper, we use queueing theory to analyze the retainer model for realtime crowdsourcing, in particular its expected wait time and cost to requesters. We provide an algorithm that allows requesters to minimize their cost subject to performance requirements. We then propose and analyze three techniques to improve performance: push notifications, shared retainer pools, and precruitment, which involves recalling retainer workers before a task actually arrives. An experimental validation finds that precruited workers begin a task 500 milliseconds after it is posted, delivering results below the one-second cognitive threshold for an end-user to stay in flow.

Posted Content
TL;DR: This work extracts follower graphs of active Digg and Twitter users and tracks how interest in news stories cascades through the graph, comparing and contrasting properties of information cascades on both sites and elucidating what they tell us about dynamics of information flow on networks.
Abstract: Social networks have emerged as a critical factor in information dissemination, search, marketing, expertise and influence discovery, and potentially an important tool for mobilizing people. Social media has made social networks ubiquitous, and also given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying dynamics of individual and group behavior, the structure of networks and global patterns of the flow of information on them. However, in most previous studies, the structure of the underlying networks was not directly visible but had to be inferred from the flow of information from one individual to another. As a result, we do not yet understand dynamics of information spread on networks or how the structure of the network affects it. We address this gap by analyzing data from two popular social news sites. Specifically, we extract follower graphs of active Digg and Twitter users and track how interest in news stories cascades through the graph. We compare and contrast properties of information cascades on both sites and elucidate what they tell us about dynamics of information flow on networks.

Posted Content
TL;DR: A study of emerging social Q&A site Quora shows that primary sources of information on Quora are judged authoritative and users judge the reputation of other users based on their past contributions.
Abstract: As social Q&A sites gain popularity, it is important to understand how users judge the authoritativeness of users and content, build reputation, and identify and promote high quality content. We conducted a study of emerging social Q&A site Quora. First, we describe user activity on Quora by analyzing data across 60 question topics and 3917 users. Then we provide a rich understanding of issues of authority, reputation, and quality from in-depth interviews with ten Quora users. Our results show that primary sources of information on Quora are judged authoritative. Also, users judge the reputation of other users based on their past contributions. Social voting helps users identify and promote good content but is prone to preferential attachment. Combining social voting with sophisticated algorithms for ranking content might enable users to better judge others’ reputation and promote high quality content.

Posted Content
TL;DR: This work rigorously considers and evaluates various hypotheses about underlying consumer and merchant behavior in order to understand the Groupon effect, and suggests an additional novel hypothesis: reviews from Groupon users are lower on average because such reviews correspond to real, unbiased customers, while the body of reviews on Yelp contains some fraction of reviews from biased or even potentially fake sources.
Abstract: Daily deals sites such as Groupon offer deeply discounted goods and services to tens of millions of customers through geographically targeted daily e-mail marketing campaigns. In our prior work we observed that a negative side effect for merchants using Groupons is that, on average, their Yelp ratings decline significantly. However, this previous work was essentially observational, rather than explanatory. In this work, we rigorously consider and evaluate various hypotheses about underlying consumer and merchant behavior in order to understand this phenomenon, which we dub the Groupon effect. We use statistical analysis and mathematical modeling, leveraging a dataset we collected spanning tens of thousands of daily deals and over 7 million Yelp reviews. In particular, we investigate hypotheses such as whether Groupon subscribers are more critical than their peers, or whether some fraction of Groupon merchants provide significantly worse service to customers using Groupons. We suggest an additional novel hypothesis: reviews from Groupon subscribers are lower on average because such reviews correspond to real, unbiased customers, while the body of reviews on Yelp contains some fraction of reviews from biased or even potentially fake sources. Although we focus on a specific question, our work provides broad insights into both consumer and merchant behavior within the daily deals marketplace.

Posted Content
TL;DR: In this article, a mixed integer programming (MIP) formulation is proposed to maximize the expected spread of cascades in networks, where a cascade models the dispersal of wild animals through a fragmented landscape.
Abstract: We introduce a new optimization framework to maximize the expected spread of cascades in networks. Our model allows a rich set of actions that directly manipulate cascade dynamics by adding nodes or edges to the network. Our motivating application is one in spatial conservation planning, where a cascade models the dispersal of wild animals through a fragmented landscape. We propose a mixed integer programming (MIP) formulation that combines elements from network design and stochastic optimization. Our approach results in solutions with stochastic optimality guarantees and points to conservation strategies that are fundamentally different from naive approaches.

Book ChapterDOI
TL;DR: This study generates networks using the most realistic model available to date and applies five community detection algorithms to these networks, finding that the performance assessed quantitatively does not necessarily agree with a qualitative analysis of the identified communities.
Abstract: Community detection is a very active field in complex network analysis, which consists of identifying groups of nodes that are more densely interconnected relative to the rest of the network. The existing algorithms are usually tested and compared on real-world and artificial networks, their performance being assessed through some partition similarity measure. However, the realism of artificial networks can be questioned, and the appropriateness of those measures is not obvious. In this study, we take advantage of recent advances concerning the characterization of community structures to tackle these questions. We first generate networks using the most realistic model available to date. Their analysis reveals they display only some of the properties observed in real-world community structures. We then apply five community detection algorithms to these networks and find that the performance assessed quantitatively does not necessarily agree with a qualitative analysis of the identified communities. It therefore seems that both approaches should be applied to perform a relevant comparison of the algorithms.

Posted Content
TL;DR: This article develops a scalable approximation algorithm with provable near-optimal performance based on submodular maximization which achieves a high accuracy in such a scenario, solving an open problem first introduced by Gomez-Rodriguez et al. (2010).
Abstract: Diffusion and propagation of information, influence and diseases take place over increasingly larger networks. We observe when a node copies information, makes a decision or becomes infected, but networks are often hidden or unobserved. Since networks are highly dynamic, changing and growing rapidly, we only observe a relatively small set of cascades before a network changes significantly. Scalable network inference based on a small cascade set is then necessary for understanding the rapidly evolving dynamics that govern diffusion. In this article, we develop a scalable approximation algorithm with provable near-optimal performance based on submodular maximization which achieves a high accuracy in such a scenario, solving an open problem first introduced by Gomez-Rodriguez et al. (2010). Experiments on synthetic and real diffusion data show that our algorithm in practice achieves an optimal trade-off between accuracy and running time.