scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Social and Information Networks in 2013"


Posted Content
TL;DR: Data collected using Twitter's sampled API service is compared with data collected using the full, albeit costly, Firehose stream that includes every single published tweet to help researchers and practitioners understand the implications of using the Streaming API.
Abstract: Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.

848 citations


Posted Content
TL;DR: A definition of total variation for graph signals that naturally leads to a frequency ordering on graphs is proposed and defines low-, high-, and band-pass graph signals and filters.
Abstract: Signals and datasets that arise in physical and engineering applications, as well as social, genetics, biomolecular, and many other domains, are becoming increasingly larger and more complex. In contrast to traditional time and image signals, data in these domains are supported by arbitrary graphs. Signal processing on graphs extends concepts and techniques from traditional signal processing to data indexed by generic graphs. This paper studies the concepts of low and high frequencies on graphs, and low-, high-, and band-pass graph filters. In traditional signal processing, there concepts are easily defined because of a natural frequency ordering that has a physical interpretation. For signals residing on graphs, in general, there is no obvious frequency ordering. We propose a definition of total variation for graph signals that naturally leads to a frequency ordering on graphs and defines low-, high-, and band-pass graph signals and filters. We study the design of graph filters with specified frequency response, and illustrate our approach with applications to sensor malfunction detection and data classification.

574 citations


Journal ArticleDOI
TL;DR: An in-depth comparative review of the methods presented so far for clustering directed networks along with the relevant necessary methodological background and also related applications is offered.
Abstract: Networks (or graphs) appear as dominant structures in diverse domains, including sociology, biology, neuroscience and computer science. In most of the aforementioned cases graphs are directed - in the sense that there is directionality on the edges, making the semantics of the edges non symmetric. An interesting feature that real networks present is the clustering or community structure property, under which the graph topology is organized into modules commonly called communities or clusters. The essence here is that nodes of the same community are highly similar while on the contrary, nodes across communities present low similarity. Revealing the underlying community structure of directed complex networks has become a crucial and interdisciplinary topic with a plethora of applications. Therefore, naturally there is a recent wealth of research production in the area of mining directed graphs - with clustering being the primary method and tool for community detection and evaluation. The goal of this paper is to offer an in-depth review of the methods presented so far for clustering directed networks along with the relevant necessary methodological background and also related applications. The survey commences by offering a concise review of the fundamental concepts and methodological base on which graph clustering algorithms capitalize on. Then we present the relevant work along two orthogonal classifications. The first one is mostly concerned with the methodological principles of the clustering algorithms, while the second one approaches the methods from the viewpoint regarding the properties of a good cluster in a directed network. Further, we present methods and metrics for evaluating graph clustering results, demonstrate interesting application domains and provide promising future research directions.

507 citations


Journal ArticleDOI
TL;DR: In this article, the authors analyze geo-located Twitter messages in order to uncover global patterns of human mobility, revealing spatially cohesive regions that follow the regional division of the world.
Abstract: In the advent of a pervasive presence of location sharing services researchers gained an unprecedented access to the direct records of human activity in space and time. This paper analyses geo-located Twitter messages in order to uncover global patterns of human mobility. Based on a dataset of almost a billion tweets recorded in 2012 we estimate volumes of international travelers in respect to their country of residence. We examine mobility profiles of different nations looking at the characteristics such as mobility rate, radius of gyration, diversity of destinations and a balance of the inflows and outflows. The temporal patterns disclose the universal seasons of increased international mobility and the peculiar national nature of overseen travels. Our analysis of the community structure of the Twitter mobility network, obtained with the iterative network partitioning, reveals spatially cohesive regions that follow the regional division of the world. Finally, we validate our result with the global tourism statistics and mobility models provided by other authors, and argue that Twitter is a viable source to understand and quantify global mobility patterns.

402 citations


Posted Content
TL;DR: This article developed a latent factor recommendation system that explicitly accounts for each user's level of experience to recommend products that a user will enjoy now, while acknowledging that their tastes may have changed over time, and may change again in the future.
Abstract: Recommending products to consumers means not only understanding their tastes, but also understanding their level of experience. For example, it would be a mistake to recommend the iconic film Seven Samurai simply because a user enjoys other action movies; rather, we might conclude that they will eventually enjoy it -- once they are ready. The same is true for beers, wines, gourmet foods -- or any products where users have acquired tastes: the `best' products may not be the most `accessible'. Thus our goal in this paper is to recommend products that a user will enjoy now, while acknowledging that their tastes may have changed over time, and may change again in the future. We model how tastes change due to the very act of consuming more products -- in other words, as users become more experienced. We develop a latent factor recommendation system that explicitly accounts for each user's level of experience. We find that such a model not only leads to better recommendations, but also allows us to study the role of user experience and expertise on a novel dataset of fifteen million beer, wine, food, and movie reviews.

363 citations


Posted Content
TL;DR: In this article, a randomized algorithm for influence estimation in continuous-time diffusion networks is proposed, which can estimate the influence of every node in a network with |V| nodes and |E| edges to an accuracy of $\varepsilon$ using $n=O(1/ε)-ε 2$ randomizations and up to logarithmic factors O(n|E|+n|V|) computations.
Abstract: If a piece of information is released from a media site, can it spread, in 1 month, to a million web pages? This influence estimation problem is very challenging since both the time-sensitive nature of the problem and the issue of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with |V| nodes and |E| edges to an accuracy of $\varepsilon$ using $n=O(1/\varepsilon^2)$ randomizations and up to logarithmic factors O(n|E|+n|V|) computations. When used as a subroutine in a greedy influence maximization algorithm, our proposed method is guaranteed to find a set of nodes with an influence of at least (1-1/e)OPT-2$\varepsilon$, where OPT is the optimal value. Experiments on both synthetic and real-world data show that the proposed method can easily scale up to networks of millions of nodes while significantly improves over previous state-of-the-arts in terms of the accuracy of the estimated influence and the quality of the selected nodes in maximizing the influence.

227 citations


Proceedings ArticleDOI
TL;DR: This paper studies the predictive power of various machine learning features on the popularity of retail stores in the city through the use of a dataset collected from Foursquare in New York, suggesting that the retail success of a business may depend on multiple factors.
Abstract: The problem of identifying the optimal location for a new retail store has been the focus of past research, especially in the field of land economy, due to its importance in the success of a business. Traditional approaches to the problem have factored in demographics, revenue and aggregated human flow statistics from nearby or remote areas. However, the acquisition of relevant data is usually expensive. With the growth of location-based social networks, fine grained data describing user mobility and popularity of places has recently become attainable. In this paper we study the predictive power of various machine learning features on the popularity of retail stores in the city through the use of a dataset collected from Foursquare in New York. The features we mine are based on two general signals: geographic, where features are formulated according to the types and density of nearby places, and user mobility, which includes transitions between venues or the incoming flow of mobile users from distant areas. Our evaluation suggests that the best performing features are common across the three different commercial chains considered in the analysis, although variations may exist too, as explained by heterogeneities in the way retail facilities attract users. We also show that performance improves significantly when combining multiple features in supervised learning algorithms, suggesting that the retail success of a business may depend on multiple factors.

219 citations


Journal ArticleDOI
TL;DR: A thorough review of the different security and privacy risks, which threaten the well-being of OSN users in general, and children in particular, is presented and an overview of existing solutions that can provide better protection, security, and privacy forOSN users is presented.
Abstract: Many online social network (OSN) users are unaware of the numerous security risks that exist in these networks, including privacy violations, identity theft, and sexual harassment, just to name a few. According to recent studies, OSN users readily expose personal and private details about themselves, such as relationship status, date of birth, school name, email address, phone number, and even home address. This information, if put into the wrong hands, can be used to harm users both in the virtual world and in the real world. These risks become even more severe when the users are children. In this paper we present a thorough review of the different security and privacy risks which threaten the well-being of OSN users in general, and children in particular. In addition, we present an overview of existing solutions that can provide better protection, security, and privacy for OSN users. We also offer simple-to-implement recommendations for OSN users which can improve their security and privacy when using these platforms. Furthermore, we suggest future research directions.

209 citations


Posted Content
TL;DR: It is shown that proper cluster randomization can lead to exponentially lower estimator variance when experimentally measuring average treatment effects under interference, and if a graph satisfies a restricted-growth condition on the growth rate of neighborhoods, then there exists a natural clustering algorithm, based on vertex neighborhoods, for which the variance of the estimator can be upper bounded by a linear function of the degrees.
Abstract: A/B testing is a standard approach for evaluating the effect of online experiments; the goal is to estimate the `average treatment effect' of a new feature or condition by exposing a sample of the overall population to it. A drawback with A/B testing is that it is poorly suited for experiments involving social interference, when the treatment of individuals spills over to neighboring individuals along an underlying social network. In this work, we propose a novel methodology using graph clustering to analyze average treatment effects under social interference. To begin, we characterize graph-theoretic conditions under which individuals can be considered to be `network exposed' to an experiment. We then show how graph cluster randomization admits an efficient exact algorithm to compute the probabilities for each vertex being network exposed under several of these exposure conditions. Using these probabilities as inverse weights, a Horvitz-Thompson estimator can then provide an effect estimate that is unbiased, provided that the exposure model has been properly specified. Given an estimator that is unbiased, we focus on minimizing the variance. First, we develop simple sufficient conditions for the variance of the estimator to be asymptotically small in n, the size of the graph. However, for general randomization schemes, this variance can be lower bounded by an exponential function of the degrees of a graph. In contrast, we show that if a graph satisfies a restricted-growth condition on the growth rate of neighborhoods, then there exists a natural clustering algorithm, based on vertex neighborhoods, for which the variance of the estimator can be upper bounded by a linear function of the degrees. Thus we show that proper cluster randomization can lead to exponentially lower estimator variance when experimentally measuring average treatment effects under interference.

189 citations


Posted Content
TL;DR: DeltaCon as mentioned in this paper is a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g., employees of a company, customers of a mobile carrier).
Abstract: How much did a network change since yesterday? How different is the wiring between Bob's brain (a left-handed male) and Alice's brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we employ DeltaCon to real applications: (a) we classify people to groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.

184 citations


Proceedings ArticleDOI
TL;DR: This work uses data from a large sample of Facebook users to investigate a particular category of strong ties, those involving spouses or romantic partners, and offers methods for identifying types of structurally significant people in on-line applications and suggests a potential expansion of existing theories of tie strength.
Abstract: A crucial task in the analysis of on-line social-networking systems is to identify important people --- those linked by strong social ties --- within an individual's network neighborhood. Here we investigate this question for a particular category of strong ties, those involving spouses or romantic partners. We organize our analysis around a basic question: given all the connections among a person's friends, can you recognize his or her romantic partner from the network structure alone? Using data from a large sample of Facebook users, we find that this task can be accomplished with high accuracy, but doing so requires the development of a new measure of tie strength that we term `dispersion' --- the extent to which two people's mutual friends are not themselves well-connected. The results offer methods for identifying types of structurally significant people in on-line applications, and suggest a potential expansion of existing theories of tie strength.

Proceedings ArticleDOI
TL;DR: It is shown that it is possible to model accurately the overall traffic articles will ultimately receive by observing the first ten to twenty minutes of social media reactions, and significant improvements on the accuracy of the early prediction of shelf-life for news stories are described.
Abstract: This paper presents a study of the life cycle of news articles posted online. We describe the interplay between website visitation patterns and social media reactions to news content. We show that we can use this hybrid observation method to characterize distinct classes of articles. We also find that social media reactions can help predict future visitation patterns early and accurately. We validate our methods using qualitative analysis as well as quantitative analysis on data from a large international news network, for a set of articles generating more than 3,000,000 visits and 200,000 social media reactions. We show that it is possible to model accurately the overall traffic articles will ultimately receive by observing the first ten to twenty minutes of social media reactions. Achieving the same prediction accuracy with visits alone would require to wait for three hours of data. We also describe significant improvements on the accuracy of the early prediction of shelf-life for news stories.

Posted Content
TL;DR: This survey discusses both classical text-book type properties and some advanced properties of graph sampling, and provides a taxonomy of different graph sampling objectives and graph sampling approaches.
Abstract: Graph sampling is a technique to pick a subset of vertices and/ or edges from original graph. It has a wide spectrum of applications, e.g. survey hidden population in sociology [54], visualize social graph [29], scale down Internet AS graph [27], graph sparsification [8], etc. In some scenarios, the whole graph is known and the purpose of sampling is to obtain a smaller graph. In other scenarios, the graph is unknown and sampling is regarded as a way to explore the graph. Commonly used techniques are Vertex Sampling, Edge Sampling and Traversal Based Sampling. We provide a taxonomy of different graph sampling objectives and graph sampling approaches. The relations between these approaches are formally argued and a general framework to bridge theoretical analysis and practical implementation is provided. Although being smaller in size, sampled graphs may be similar to original graphs in some way. We are particularly interested in what graph properties are preserved given a sampling procedure. If some properties are preserved, we can estimate them on the sampled graphs, which gives a way to construct efficient estimators. If one algorithm relies on the perserved properties, we can expect that it gives similar output on original and sampled graphs. This leads to a systematic way to accelerate a class of graph algorithms. In this survey, we discuss both classical text-book type properties and some advanced properties. The landscape is tabularized and we see a lot of missing works in this field. Some theoretical studies are collected in this survey and simple extensions are made. Most previous numerical evaluation works come in an ad hoc fashion, i.e. evaluate different type of graphs, different set of properties, and different sampling algorithms. A systematical and neutral evaluation is needed to shed light on further graph sampling studies.

Journal ArticleDOI
TL;DR: It is proposed that the proposed Block Two-Level Erdss-Renyi (BTER) model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.
Abstract: Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdss-Renyi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for a modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.

Proceedings ArticleDOI
TL;DR: A maximum a posteriori (MAP) estimator is constructed to identify the rumor source using the susceptible-infected (SI) model and sheds insight into the behavior of the rumor spreading process not only in the asymptotic regime but also for the general finite-n regime.
Abstract: Suppose that a rumor originating from a single source among a set of suspects spreads in a network, how to root out this rumor source? With the a priori knowledge of suspect nodes and an observation of infected nodes, we construct a maximum a posteriori (MAP) estimator to identify the rumor source using the susceptible-infected (SI) model. The a priori suspect set and its associated connectivity bring about new ingredients to the problem, and thus we propose to use local rumor center, a generalized concept based on rumor centrality, to identify the source from suspects. For regular tree-type networks of node degree {\delta}, we characterize Pc(n), the correct detection probability of the estimator upon observing n infected nodes, in both the finite and asymptotic regimes. First, when every infected node is a suspect, Pc(n) asymptotically grows from 0.25 to 0.307 with {\delta} from 3 to infinity, a result first established in Shah and Zaman (2011, 2012) via a different approach; and it monotonically decreases with n and increases with {\delta}. Second, when the suspects form a connected subgraph of the network, Pc(n) asymptotically significantly exceeds the a priori probability if {\delta}>2, and reliable detection is achieved as {\delta} becomes large; furthermore, it monotonically decreases with n and increases with {\delta}. Third, when there are only two suspects, Pc(n) is asymptotically at least 0.75 if {\delta}>2; and it increases with the distance between the two suspects. Fourth, when there are multiple suspects, among all possible connection patterns, that they form a connected subgraph of the network achieves the smallest detection probability. Our analysis leverages ideas from the Polya's urn model in probability theory and sheds insight into the behavior of the rumor spreading process not only in the asymptotic regime but also for the general finite-n regime.

Posted Content
TL;DR: This work applies survival theory to develop general additive and multiplicative risk models under which the network inference problems can be solved efficiently by exploiting their convexity.
Abstract: Networks provide a skeleton for the spread of contagions, like, information, ideas, behaviors and diseases. Many times networks over which contagions diffuse are unobserved and need to be inferred. Here we apply survival theory to develop general additive and multiplicative risk models under which the network inference problems can be solved efficiently by exploiting their convexity. Our additive risk model generalizes several existing network inference models. We show all these models are particular cases of our more general model. Our multiplicative model allows for modeling scenarios in which a node can either increase or decrease the risk of activation of another node, in contrast with previous approaches, which consider only positive risk increments. We evaluate the performance of our network inference algorithms on large synthetic and real cascade datasets, and show that our models are able to predict the length and duration of cascades in real data.

Posted Content
TL;DR: In this paper, the authors consider the problem of controlling the propagation of an epidemic outbreak in an arbitrary contact network by distributing vaccination resources throughout the network and propose a convex framework to find cost-optimal distribution of vaccination resources when different levels of vaccination are allowed.
Abstract: We consider the problem of controlling the propagation of an epidemic outbreak in an arbitrary contact network by distributing vaccination resources throughout the network. We analyze a networked version of the Susceptible-Infected-Susceptible (SIS) epidemic model when individuals in the network present different levels of susceptibility to the epidemic. In this context, controlling the spread of an epidemic outbreak can be written as a spectral condition involving the eigenvalues of a matrix that depends on the network structure and the parameters of the model. We study the problem of finding the optimal distribution of vaccines throughout the network to control the spread of an epidemic outbreak. We propose a convex framework to find cost-optimal distribution of vaccination resources when different levels of vaccination are allowed. We also propose a greedy approach with quality guarantees for the case of all-or-nothing vaccination. We illustrate our approaches with numerical simulations in a real social network.

Journal ArticleDOI
TL;DR: In this article, a probabilistic model for the evolution of the retweets using a Bayesian approach, and form predictions using only observations on the retweet times and the local network or graph structure of the retweeters.
Abstract: We predict the popularity of short messages called tweets created in the micro-blogging site known as Twitter. We measure the popularity of a tweet by the time-series path of its retweets, which is when people forward the tweet to others. We develop a probabilistic model for the evolution of the retweets using a Bayesian approach, and form predictions using only observations on the retweet times and the local network or "graph" structure of the retweeters. We obtain good step ahead forecasts and predictions of the final total number of retweets even when only a small fraction (i.e., less than one tenth) of the retweet path is observed. This translates to good predictions within a few minutes of a tweet being posted, and has potential implications for understanding the spread of broader ideas, memes, or trends in social networks.

Posted Content
TL;DR: This work defines more general measures of social imbalance (MOIs) based on l-cycles in the network and gives a simple sign prediction rule, and proposes an effective low rank modeling approach for both sign prediction and clustering.
Abstract: The study of social networks is a burgeoning research area. However, most existing work deals with networks that simply encode whether relationships exist or not. In contrast, relationships in signed networks can be positive ("like", "trust") or negative ("dislike", "distrust"). The theory of social balance shows that signed networks tend to conform to some local patterns that, in turn, induce certain global characteristics. In this paper, we exploit both local as well as global aspects of social balance theory for two fundamental problems in the analysis of signed networks: sign prediction and clustering. Motivated by local patterns of social balance, we first propose two families of sign prediction methods: measures of social imbalance (MOIs), and supervised learning using high order cycles (HOCs). These methods predict signs of edges based on triangles and \ell-cycles for relatively small values of \ell. Interestingly, by examining measures of social imbalance, we show that the classic Katz measure, which is used widely in unsigned link prediction, actually has a balance theoretic interpretation when applied to signed networks. Furthermore, motivated by the global structure of balanced networks, we propose an effective low rank modeling approach for both sign prediction and clustering. For the low rank modeling approach, we provide theoretical performance guarantees via convex relaxations, scale it up to large problem sizes using a matrix factorization based algorithm, and provide extensive experimental validation including comparisons with local approaches. Our experimental results indicate that, by adopting a more global viewpoint of balance structure, we get significant performance and computational gains in prediction and clustering tasks on signed networks. Our work therefore highlights the usefulness of the global aspect of balance theory for the analysis of signed networks.

Posted Content
TL;DR: In this paper, the authors adopt a well-established model for social-opinion dynamics and formalize the campaign-design problem as the problem of identifying a set of target individuals whose positive opinion about an information item will maximize the overall positive opinion for the item in the social network.
Abstract: The process of opinion formation through synthesis and contrast of different viewpoints has been the subject of many studies in economics and social sciences. Today, this process manifests itself also in online social networks and social media. The key characteristic of successful promotion campaigns is that they take into consideration such opinion-formation dynamics in order to create a overall favorable opinion about a specific information item, such as a person, a product, or an idea. In this paper, we adopt a well-established model for social-opinion dynamics and formalize the campaign-design problem as the problem of identifying a set of target individuals whose positive opinion about an information item will maximize the overall positive opinion for the item in the social network. We call this problem CAMPAIGN. We study the complexity of the CAMPAIGN problem, and design algorithms for solving it. Our experiments on real data demonstrate the efficiency and practical utility of our algorithms.

Posted Content
TL;DR: Wang et al. as mentioned in this paper showed that location obfuscation techniques can all be easily destroyed by an attacker with the capability of no more than a regular LBSN user and proposed to use grid reference system and location classification to mitigate the attacks.
Abstract: Many popular location-based social networks (LBSNs) support built-in location-based social discovery with hundreds of millions of users around the world. While user (near) realtime geographical information is essential to enable location-based social discovery in LBSNs, the importance of user location privacy has also been recognized by leading real-world LBSNs. To protect user's exact geographical location from being exposed, a number of location protection approaches have been adopted by the industry so that only relative location information are publicly disclosed. These techniques are assumed to be secure and are exercised on the daily base. In this paper, we question the safety of these location-obfuscation techniques used by existing LBSNs. We show, for the first time, through real world attacks that they can all be easily destroyed by an attacker with the capability of no more than a regular LBSN user. In particular, by manipulating location information fed to LBSN client app, an ill-intended regular user can easily deduce the exact location information by running LBSN apps as location oracle and performing a series of attacking strategies. We develop an automated user location tracking system and test it on the most popular LBSNs including Wechat, Skout and Momo. We demonstrate its effectiveness and efficiency via a 3 week real-world experiment with 30 volunteers. Our evaluation results show that we could geo-locate a target with high accuracy and can readily recover users' Top 5 locations. We also propose to use grid reference system and location classification to mitigate the attacks. Our work shows that the current industrial best practices on user location privacy protection are completely broken, and it is critical to address this immediate threat.

Posted Content
TL;DR: Experimental results demonstrate that the prediction accuracy is significantly improved by incorporating the factor of structural diversity into existing methods and it is found that the popularity of content is well reflected by the structural diversity of the early adopters.
Abstract: Predicting the popularity of content is important for both the host and users of social media sites. The challenge of this problem comes from the inequality of the popularity of con- tent. Existing methods for popularity prediction are mainly based on the quality of content, the interface of social media site to highlight contents, and the collective behavior of user- s. However, little attention is paid to the structural charac- teristics of the networks spanned by early adopters, i.e., the users who view or forward the content in the early stage of content dissemination. In this paper, taking the Sina Weibo as a case, we empirically study whether structural character- istics can provide clues for the popularity of short messages. We find that the popularity of content is well reflected by the structural diversity of the early adopters. Experimental results demonstrate that the prediction accuracy is signif- icantly improved by incorporating the factor of structural diversity into existing methods.

Proceedings ArticleDOI
TL;DR: In this article, the authors define social resilience as the ability of a community to withstand changes and quantify resilience by using the k-core analysis, to identify subsets of the network in which all users have at least k friends.
Abstract: We empirically analyze five online communities: Friendster, Livejournal, Facebook, Orkut, Myspace, to identify causes for the decline of social networks. We define social resilience as the ability of a community to withstand changes. We do not argue about the cause of such changes, but concentrate on their impact. Changes may cause users to leave, which may trigger further leaves of others who lost connection to their friends. This may lead to cascades of users leaving. A social network is said to be resilient if the size of such cascades can be limited. To quantify resilience, we use the k-core analysis, to identify subsets of the network in which all users have at least k friends. These connections generate benefits (b) for each user, which have to outweigh the costs (c) of being a member of the network. If this difference is not positive, users leave. After all cascades, the remaining network is the k-core of the original network determined by the cost-to-benefit c/b ratio. By analysing the cumulative distribution of k-cores we are able to calculate the number of users remaining in each community. This allows us to infer the impact of the c/b ratio on the resilience of these online communities. We find that the different online communities have different k-core distributions. Consequently, similar changes in the c/b ratio have a different impact on the amount of active users. As a case study, we focus on the evolution of Friendster. We identify time periods when new users entering the network observed an insufficient c/b ratio. This measure can be seen as a precursor of the later collapse of the community. Our analysis can be applied to estimate the impact of changes in the user interface, which may temporarily increase the c/b ratio, thus posing a threat for the community to shrink, or even to collapse.

Posted Content
TL;DR: This work introduces a real-world layered network combining different kinds of online and offline relationships, and presents an innovative methodology and related analysis tools suggesting the existence of hidden motifs traversing and correlating different representation layers.
Abstract: The study of complex networks has been historically based on simple graph data models representing relationships between individuals. However, often reality cannot be accurately captured by a flat graph model. This has led to the development of multi-layer networks. These models have the potential of becoming the reference tools in network data analysis, but require the parallel development of specific analysis methods explicitly exploiting the information hidden in-between the layers and the availability of a critical mass of reference data to experiment with the tools and investigate the real-world organization of these complex systems. In this work we introduce a real-world layered network combining different kinds of online and offline relationships, and present an innovative methodology and related analysis tools suggesting the existence of hidden motifs traversing and correlating different representation layers. We also introduce a notion of betweenness centrality for multiple networks. While some preliminary experimental evidence is reported, our hypotheses are still largely unverified, and in our opinion this calls for the availability of new analysis methods but also new reference multi-layer social network data.

Journal ArticleDOI
TL;DR: Two of the most widely used inference methods can be mapped directly onto versions of the standard minimum-cut graph partitioning problem, which allows us to apply any of the many well-understood partitioning algorithms to the solution of community detection problems.
Abstract: Many methods have been proposed for community detection in networks. Some of the most promising are methods based on statistical inference, which rest on solid mathematical foundations and return excellent results in practice. In this paper we show that two of the most widely used inference methods can be mapped directly onto versions of the standard minimum-cut graph partitioning problem, which allows us to apply any of the many well-understood partitioning algorithms to the solution of community detection problems. We illustrate the approach by adapting the Laplacian spectral partitioning method to perform community inference, testing the resulting algorithm on a range of examples, including computer-generated and real-world networks. Both the quality of the results and the running time rival the best previous methods.

Posted Content
TL;DR: This work proposes an unsupervised method for integrating multiple data views to produce a single unified graph representation, based on the combination of the k-nearest neighbour sets for users derived from each view.
Abstract: In many social networks, several different link relations will exist between the same set of users. Additionally, attribute or textual information will be associated with those users, such as demographic details or user-generated content. For many data analysis tasks, such as community finding and data visualisation, the provision of multiple heterogeneous types of user data makes the analysis process more complex. We propose an unsupervised method for integrating multiple data views to produce a single unified graph representation, based on the combination of the k-nearest neighbour sets for users derived from each view. These views can be either relation-based or feature-based. The proposed method is evaluated on a number of annotated multi-view Twitter datasets, where it is shown to support the discovery of the underlying community structure in the data.

Posted Content
TL;DR: In this paper, a new concept of community is defined, which groups together nodes sharing memberships to the same monodimensional communities in different single dimensions, and an algorithm called ABACUS (Apriori-BAsed Community Discoverer in mUltidimensional networkS) is proposed.
Abstract: Community Discovery in complex networks is the problem of detecting, for each node of the network, its membership to one of more groups of nodes, the communities, that are densely connected, or highly interactive, or, more in general, similar, according to a similarity function. So far, the problem has been widely studied in monodimensional networks, i.e. networks where only one connection between two entities can exist. However, real networks are often multidimensional, i.e., multiple connections between any two nodes can exist, either reflecting different kinds of relationships, or representing different values of the same type of tie. In this context, the problem of Community Discovery has to be redefined, taking into account multidimensional structure of the graph. We define a new concept of community that groups together nodes sharing memberships to the same monodimensional communities in the different single dimensions. As we show, such communities are meaningful and able to group highly correlated nodes, even if they might not be connected in any of the monodimensional networks. We devise ABACUS (Apriori-BAsed Community discoverer in mUltidimensional networkS), an algorithm that is able to extract multidimensional communities based on the apriori itemset miner applied to monodimensional community memberships. Experiments on two different real multidimensional networks confirm the meaningfulness of the introduced concepts, and open the way for a new class of algorithms for community discovery that do not rely on the dense connections among nodes.

Posted Content
TL;DR: This work develops and evaluates a range of approaches for two key sub-problems inherent in conversational curation: length prediction and the novel task of re-entry prediction, based on an analysis of the network structure and arrival pattern among the participants, as well as a novel dichotomy in the structure of long threads.
Abstract: Discussion threads form a central part of the experience on many Web sites, including social networking sites such as Facebook and Google Plus and knowledge creation sites such as Wikipedia. To help users manage the challenge of allocating their attention among the discussions that are relevant to them, there has been a growing need for the algorithmic curation of on-line conversations --- the development of automated methods to select a subset of discussions to present to a user. Here we consider two key sub-problems inherent in conversational curation: length prediction --- predicting the number of comments a discussion thread will receive --- and the novel task of re-entry prediction --- predicting whether a user who has participated in a thread will later contribute another comment to it. The first of these sub-problems arises in estimating how interesting a thread is, in the sense of generating a lot of conversation; the second can help determine whether users should be kept notified of the progress of a thread to which they have already contributed. We develop and evaluate a range of approaches for these tasks, based on an analysis of the network structure and arrival pattern among the participants, as well as a novel dichotomy in the structure of long threads. We find that for both tasks, learning-based approaches using these sources of information yield improvements for all the performance metrics we used.

Posted Content
Laurent Massoulié1
TL;DR: This work proves that for logarithmic length ℓ, the leading eigenvectors of this modified matrix B provide a non-trivial reconstruction of the underlying structure, thereby settling the conjecture of a sharp threshold on model parameters for community detection in sparse random graphs drawn from the stochastic block model.
Abstract: Decelle et al.\cite{Decelle11} conjectured the existence of a sharp threshold for community detection in sparse random graphs drawn from the stochastic block model. Mossel et al.\cite{Mossel12} established the negative part of the conjecture, proving impossibility of meaningful detection below the threshold. However the positive part of the conjecture remained elusive so far. Here we solve the positive part of the conjecture. We introduce a modified adjacency matrix $B$ that counts self-avoiding paths of a given length $\ell$ between pairs of nodes and prove that for logarithmic $\ell$, the leading eigenvectors of this modified matrix provide non-trivial detection, thereby settling the conjecture. A key step in the proof consists in establishing a {\em weak Ramanujan property} of matrix $B$. Namely, the spectrum of $B$ consists in two leading eigenvalues $\rho(B)$, $\lambda_2$ and $n-2$ eigenvalues of a lower order $O(n^{\epsilon}\sqrt{\rho(B)})$ for all $\epsilon>0$, $\rho(B)$ denoting $B$'s spectral radius. $d$-regular graphs are Ramanujan when their second eigenvalue verifies $|\lambda|\le 2 \sqrt{d-1}$. Random $d$-regular graphs have a second largest eigenvalue $\lambda$ of $2\sqrt{d-1}+o(1)$ (see Friedman\cite{friedman08}), thus being {\em almost} Ramanujan. Erdős-Renyi graphs with average degree $d$ at least logarithmic ($d=\Omega(\log n)$) have a second eigenvalue of $O(\sqrt{d})$ (see Feige and Ofek\cite{Feige05}), a slightly weaker version of the Ramanujan property. However this spectrum separation property fails for sparse ($d=O(1)$) Erdős-Renyi graphs. Our result thus shows that by constructing matrix $B$ through neighborhood expansion, we regularize the original adjacency matrix to eventually recover a weak form of the Ramanujan property.

Journal ArticleDOI
TL;DR: The proposed UEOC, which means 'unfold and extract overlapping communities', is a Markov-dynamics-based algorithm that depends on only one parameter whose value can be easily set, and it requires no prior knowledge of the hidden community structures.
Abstract: Detection of overlapping communities in complex networks has motivated recent research in the relevant fields. Aiming this problem, we propose a Markov dynamics based algorithm, called UEOC, which means, 'unfold and extract overlapping communities'. In UEOC, when identifying each natural community that overlaps, a Markov random walk method combined with a constraint strategy, which is based on the corresponding annealed network (degree conserving random network), is performed to unfold the community. Then, a cutoff criterion with the aid of a local community function, called conductance, which can be thought of as the ratio between the number of edges inside the community and those leaving it, is presented to extract this emerged community from the entire network. The UEOC algorithm depends on only one parameter whose value can be easily set, and it requires no prior knowledge on the hidden community structures. The proposed UEOC has been evaluated both on synthetic benchmarks and on some real-world networks, and was compared with a set of competing algorithms. Experimental result has shown that UEOC is highly effective and efficient for discovering overlapping communities.