Showing papers in "arXiv: Social and Information Networks in 2014"


Proceedings ArticleDOI
TL;DR: DeepWalk is an online learning algorithm that builds useful incremental results and is trivially parallelizable, which makes it suitable for a broad class of real-world applications such as network classification and anomaly detection.
Abstract: We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk's representations can provide $F_1$ scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.

1,629 citations
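
The DeepWalk abstract above describes treating truncated random walks as the equivalent of sentences. A minimal sketch of that walk-generation step follows; the function name, parameters, and toy graph are hypothetical, and the later representation-learning step is not shown.

```python
import random

def truncated_random_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Return random walks of at most `walk_length` vertices from every vertex."""
    rng = random.Random(seed)
    walks = []
    nodes = list(adjacency)
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start vertices in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy undirected graph (hypothetical data, not from the paper).
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walks = truncated_random_walks(toy_graph, walks_per_node=2, walk_length=5)
```

Each walk is a list of vertex identifiers, so the resulting corpus could be passed to any word-embedding trainer that accepts tokenized sentences.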


Journal ArticleDOI
TL;DR: In this paper, the authors discuss the characteristics of modern, sophisticated social bots and how their presence can endanger online ecosystems and our society, and review current efforts to detect social bots on Twitter.
Abstract: The Turing test aimed to recognize the behavior of a human from that of a computer algorithm. Such a challenge is more relevant than ever in today's social media context, where limited attention and technology constrain the expressive power of humans, while incentives abound to develop software agents mimicking humans. These social bots interact, often unnoticed, with real people in social media ecosystems, but their abundance is uncertain. While many bots are benign, one can design harmful bots with the goals of persuading, smearing, or deceiving. Here we discuss the characteristics of modern, sophisticated social bots, and how their presence can endanger online ecosystems and our society. We then review current efforts to detect social bots on Twitter. Features related to content, network, sentiment, and temporal patterns of activity are imitated by bots but at the same time can help discriminate synthetic behaviors from human ones, yielding signatures of engineered social tampering.

1,229 citations


Posted Content
TL;DR: A comprehensive survey of the state-of-the-art methods for anomaly detection in data represented as graphs can be found in this article, where the authors highlight the effectiveness, scalability, generality, and robustness aspects of the methods.
Abstract: Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.

703 citations


Proceedings ArticleDOI
TL;DR: This paper develops Communities from Edge Structure and Node Attributes (CESNA), an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes that statistically models the interaction between the network structure and the node attributes, which leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure.
Abstract: Community detection algorithms are fundamental tools that allow us to uncover organizational principles in networks. When detecting communities, there are two possible sources of information one can use: the network structure, and the features and attributes of nodes. Even though communities form around nodes that have common edges and common attributes, typically, algorithms have only focused on one of these two data modalities: community detection algorithms traditionally focus only on the network structure, while clustering algorithms mostly consider only node attributes. In this paper, we develop Communities from Edge Structure and Node Attributes (CESNA), an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes. CESNA statistically models the interaction between the network structure and the node attributes, which leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure. CESNA has a linear runtime in the network size and is able to process networks an order of magnitude larger than comparable approaches. Last, CESNA also helps with the interpretation of detected communities by finding relevant node attributes for each community.

635 citations


Posted Content
TL;DR: TIM is presented, an algorithm that aims to bridge the theory and practice in influence maximization and outperforms the state-of-the-art solutions (with approximation guarantees) by up to four orders of magnitude in terms of running time.
Abstract: Given a social network $G$ and a constant $k$, the influence maximization problem asks for $k$ nodes in $G$ that (directly and indirectly) influence the largest number of nodes under a pre-defined diffusion model. This problem finds important applications in viral marketing, and has been extensively studied in the literature. Existing algorithms for influence maximization, however, either trade approximation guarantees for practical efficiency, or vice versa. In particular, among the algorithms that achieve constant factor approximations under the prominent independent cascade (IC) model or linear threshold (LT) model, none can handle a million-node graph without incurring prohibitive overheads. This paper presents TIM, an algorithm that aims to bridge the theory and practice in influence maximization. On the theory side, we show that TIM runs in $O((k+\ell)(n+m)\log n / \epsilon^2)$ expected time and returns a $(1-1/e-\epsilon)$-approximate solution with at least $1 - n^{-\ell}$ probability. The time complexity of TIM is near-optimal under the IC model, as it is only a $\log n$ factor larger than the $\Omega(m + n)$ lower-bound established in previous work (for fixed $k$, $\ell$, and $\epsilon$). Moreover, TIM supports the triggering model, which is a general diffusion model that includes both IC and LT as special cases. On the practice side, TIM incorporates novel heuristics that significantly improve its empirical efficiency without compromising its asymptotic performance. We experimentally evaluate TIM with the largest datasets ever tested in the literature, and show that it outperforms the state-of-the-art solutions (with approximation guarantees) by up to four orders of magnitude in terms of running time. In particular, when $k = 50$, $\epsilon = 0.2$, and $\ell = 1$, TIM requires less than one hour on a commodity machine to process a network with 41.6 million nodes and 1.4 billion edges.

609 citations
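
The TIM abstract above measures influence under a diffusion model such as independent cascade (IC). The sketch below is not TIM itself; it only illustrates, under the usual IC definition, how the spread of a given seed set can be estimated by Monte Carlo simulation. The function names, the edge-list layout, and the toy graphs are hypothetical.

```python
import random

def simulate_independent_cascade(out_edges, seeds, rng):
    """Run one independent cascade and return the set of activated nodes.

    out_edges maps node -> list of (neighbor, activation_probability).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor, prob in out_edges.get(node, []):
                # A newly activated node gets exactly one chance per outgoing edge.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active

def estimate_spread(out_edges, seeds, runs=1000, seed=0):
    """Average the number of activated nodes over `runs` independent cascades."""
    rng = random.Random(seed)
    total = sum(
        len(simulate_independent_cascade(out_edges, seeds, rng))
        for _ in range(runs)
    )
    return total / runs

# Toy directed chains (hypothetical data): edges that always or never fire.
certain = {"a": [("b", 1.0)], "b": [("c", 1.0)]}
never = {"a": [("b", 0.0)], "b": [("c", 0.0)]}
```

Repeated simulation of this kind becomes costly on large graphs, which is the regime the abstract says TIM is designed for.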


Posted Content
TL;DR: In this article, the authors consider methodological and conceptual challenges for this emergent field, with special attention to the validity and representativeness of social media big data analyses, and conclude with a call to action on practical steps to improve our analytic capacity in this promising, rapidly growing field.
Abstract: Large-scale databases of human activity in social media have captured scientific and policy attention, producing a flood of research and discussion. This paper considers methodological and conceptual challenges for this emergent field, with special attention to the validity and representativeness of social media big data analyses. Persistent issues include the over-emphasis of a single platform, Twitter, sampling biases arising from selection by hashtags, and vague and unrepresentative sampling frames. The socio-cultural complexity of user behavior aimed at algorithmic invisibility (such as subtweeting, mock-retweeting, use of "screen captures" for text, etc.) further complicates interpretation of big data social media. Other challenges include accounting for field effects, i.e. broadly consequential events that do not diffuse only through the network under study but affect the whole society. The application of network methods from other fields to the study of human social activity may not always be appropriate. The paper concludes with a call to action on practical steps to improve our analytic capacity in this promising, rapidly-growing field.

520 citations


Proceedings ArticleDOI
TL;DR: This work develops a framework for addressing cascade prediction problems, and finds that the relative growth of a cascade becomes more predictable as the authors observe more of its reshares, that temporal and structural features are key predictors of cascade size, and that initially, breadth, rather than depth, is a better indicator of larger cascades.
Abstract: On many social networking web sites such as Facebook and Twitter, resharing or reposting functionality allows users to share others' content with their own friends or followers. As content is reshared from user to user, large cascades of reshares can form. While a growing body of research has focused on analyzing and characterizing such cascades, a recent, parallel line of work has argued that the future trajectory of a cascade may be inherently unpredictable. In this work, we develop a framework for addressing cascade prediction problems. On a large sample of photo reshare cascades on Facebook, we find strong performance in predicting whether a cascade will continue to grow in the future. We find that the relative growth of a cascade becomes more predictable as we observe more of its reshares, that temporal and structural features are key predictors of cascade size, and that initially, breadth, rather than depth, in a cascade is a better indicator of larger cascades. This prediction performance is robust in the sense that multiple distinct classes of features all achieve similar performance. We also discover that temporal features are predictive of a cascade's eventual shape. Observing independent cascades of the same content, we find that while these cascades differ greatly in size, we are still able to predict which ends up the largest.

413 citations


Posted Content
TL;DR: In this paper, the authors propose an efficient algorithm based on a semidefinite programming relaxation of maximum likelihood (ML), which is proved to succeed in recovering the communities close to the threshold, while numerical experiments suggest it may achieve the threshold.
Abstract: The stochastic block model (SBM) with two communities, or equivalently the planted bisection model, is a popular model of random graph exhibiting a cluster behaviour. In the symmetric case, the graph has two equally sized clusters and vertices connect with probability $p$ within clusters and $q$ across clusters. In the past two decades, a large body of literature in statistics and computer science has focused on providing lower-bounds on the scaling of $|p-q|$ to ensure exact recovery. In this paper, we identify a sharp threshold phenomenon for exact recovery: if $\alpha=pn/\log(n)$ and $\beta=qn/\log(n)$ are constant (with $\alpha>\beta$), recovering the communities with high probability is possible if $\frac{\alpha+\beta}{2} - \sqrt{\alpha \beta}>1$ and impossible if $\frac{\alpha+\beta}{2} - \sqrt{\alpha \beta}<1$. In particular, this improves the existing bounds. This also sets a new line of sight for efficient clustering algorithms. While maximum likelihood (ML) achieves the optimal threshold (by definition), it is in the worst-case NP-hard. This paper proposes an efficient algorithm based on a semidefinite programming relaxation of ML, which is proved to succeed in recovering the communities close to the threshold, while numerical experiments suggest it may achieve the threshold. An efficient algorithm which succeeds all the way down to the threshold is also obtained using a partial recovery algorithm combined with a local improvement procedure.

392 citations
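
The exact-recovery threshold stated in the abstract above can be checked numerically. The sketch below evaluates the quantity $\frac{\alpha+\beta}{2} - \sqrt{\alpha \beta}$, which equals $(\sqrt{\alpha} - \sqrt{\beta})^2 / 2$; the function names are hypothetical, and the boundary case is returned as `None` because the abstract does not say what happens when the quantity equals 1.

```python
import math

def scaled_parameters(p, q, n):
    """Convert edge probabilities into alpha = p*n/log(n) and beta = q*n/log(n)."""
    scale = n / math.log(n)
    return p * scale, q * scale

def exact_recovery_possible(alpha, beta):
    """Apply the threshold from the abstract: True above 1, False below 1."""
    gap = (alpha + beta) / 2 - math.sqrt(alpha * beta)
    if gap > 1:
        return True
    if gap < 1:
        return False
    return None  # boundary case: not covered by the abstract

# alpha = 9, beta = 1 gives 5 - 3 = 2, so recovery is possible.
# alpha = 4, beta = 1 gives 2.5 - 2 = 0.5, so recovery is impossible.
```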


Posted Content
TL;DR: In this paper, the authors use data from a crowd-sourcing platform, on which users vote on which of two London street scenes looks more beautiful, quiet, and happy, to suggest routes that are not only short but also emotionally pleasant.
Abstract: When providing directions to a place, web and mobile mapping services are all able to suggest the shortest route. The goal of this work is to automatically suggest routes that are not only short but also emotionally pleasant. To quantify the extent to which urban locations are pleasant, we use data from a crowd-sourcing platform that shows two street scenes in London (out of hundreds), and a user votes on which one looks more beautiful, quiet, and happy. We consider votes from more than 3.3K individuals and translate them into quantitative measures of location perceptions. We arrange those locations into a graph upon which we learn pleasant routes. Based on a quantitative validation, we find that, compared to the shortest routes, the recommended ones add just a few extra walking minutes and are indeed perceived to be more beautiful, quiet, and happy. To test the generality of our approach, we consider Flickr metadata of more than 3.7M pictures in London and 1.3M in Boston, compute proxies for the crowdsourced beauty dimension (the one for which we have collected the most votes), and evaluate those proxies with 30 participants in London and 54 in Boston. These participants have not only rated our recommendations but have also carefully motivated their choices, providing insights for future work.

300 citations


Posted Content
TL;DR: This work examines the complete dynamics of the Twitter information network, where users post and reshare information while they also create and destroy connections, and develops a model that can successfully predict which information diffusion events will lead to bursts in network dynamics.
Abstract: In online social media systems users are not only posting, consuming, and resharing content, but also creating new and destroying existing connections in the underlying social network. While each of these two types of dynamics has individually been studied in the past, much less is known about the connection between the two. How does user information posting and seeking behavior interact with the evolution of the underlying social network structure? Here, we study ways in which network structure reacts to users posting and sharing content. We examine the complete dynamics of the Twitter information network, where users post and reshare information while they also create and destroy connections. We find that the dynamics of network structure can be characterized by steady rates of change, interrupted by sudden bursts. Information diffusion in the form of cascades of post re-sharing often creates such sudden bursts of new connections, which significantly change users' local network structure. These bursts transform users' networks of followers to become structurally more cohesive as well as more homogenous in terms of follower interests. We also explore the effect of the information content on the dynamics of the network and find evidence that the appearance of new topics and real-world events can lead to significant changes in edge creations and deletions. Lastly, we develop a model that quantifies the dynamics of the network and the occurrence of these bursts as a function of the information spreading through the network. The model can successfully predict which information diffusion events will lead to bursts in network dynamics.

201 citations


Posted Content
TL;DR: The design, implementation, and evaluation of EMBERS, an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources are described.
Abstract: We describe the design, implementation, and evaluation of EMBERS, an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012 which have been (and continue to be) evaluated by an independent T&E team (MITRE). Of note, EMBERS has successfully forecast the uptick and downtick of incidents during the June 2013 protests in Brazil. We outline the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion and suppression engine that supports trading off specific evaluation criteria. EMBERS also provides an audit trail interface that enables the investigation of why specific predictions were made along with the data utilized for forecasting. Through numerous evaluations, we demonstrate the superiority of EMBERS over baserate methods and its capability to forecast significant societal happenings.

Posted Content
TL;DR: A new algorithm for inferring the home location of Twitter users at different granularities, including city, state, time zone, or geographic region, using the content of users’ tweets and their tweeting behavior is presented.
Abstract: We present a new algorithm for inferring the home location of Twitter users at different granularities, including city, state, time zone or geographic region, using the content of users' tweets and their tweeting behavior. Unlike existing approaches, our algorithm uses an ensemble of statistical and heuristic classifiers to predict locations and makes use of a geographic gazetteer dictionary to identify place-name entities. We find that a hierarchical classification approach, where time zone, state or geographic region is predicted first and city is predicted next, can improve prediction accuracy. We have also analyzed movement variations of Twitter users, built a classifier to predict whether a user was travelling in a certain period of time, and used that to further improve the location detection accuracy. Experimental evidence suggests that our algorithm works well in practice and outperforms the best existing algorithms for predicting the home location of Twitter users.

Posted Content
TL;DR: The authors investigated the predictability of successful memes using their early spreading patterns in the underlying social networks and found that features based on community structure are the most powerful predictors of future success.
Abstract: We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

Posted Content
TL;DR: In this paper, a large-scale quantitative study of information overload and its impact on information dissemination in the Twitter social media site is conducted, where the authors model social media users as information processing systems that queue incoming information according to some policies, process information from the queue at some unknown rates and decide to forward some of the incoming information to other users.
Abstract: Information overload has become a ubiquitous problem in modern society. Social media users and microbloggers receive an endless flow of information, often at a rate far higher than their cognitive abilities to process the information. In this paper, we conduct a large-scale quantitative study of information overload and evaluate its impact on information dissemination in the Twitter social media site. We model social media users as information processing systems that queue incoming information according to some policies, process information from the queue at some unknown rates and decide to forward some of the incoming information to other users. We show how timestamped data about tweets received and forwarded by users can be used to uncover key properties of their queueing policies and estimate their information processing rates and limits. Such an understanding of users' information processing behaviors allows us to infer whether and to what extent users suffer from information overload. Our analysis provides empirical evidence of information processing limits for social media users and the prevalence of information overloading. The most active and popular social media users are often the ones that are overloaded. Moreover, we find that the rate at which users receive information impacts their processing behavior, including how they prioritize information from different sources, how much information they process, and how quickly they process information. Finally, the susceptibility of a social media user to social contagions depends crucially on the rate at which she receives information. An exposure to a piece of information, be it an idea, a convention or a product, is much less effective for users that receive information at higher rates, meaning they need more exposures to adopt a particular contagion.

Posted Content
TL;DR: In this article, a reinforced Poisson process is used to model explicitly the process through which individual items gain their popularity, and a generative probabilistic framework is proposed to predict the popularity dynamics of individual items within a complex evolving system.
Abstract: An ability to predict the popularity dynamics of individual items within a complex evolving system has important implications in an array of areas. Here we propose a generative probabilistic framework using a reinforced Poisson process to model explicitly the process through which individual items gain their popularity. This model distinguishes itself from existing models via its capability of modeling the arrival process of popularity and its remarkable power at predicting the popularity of individual items. It possesses the flexibility of applying Bayesian treatment to further improve the predictive power using a conjugate prior. Extensive experiments on a longitudinal citation dataset demonstrate that this model consistently outperforms existing popularity prediction methods.

Posted Content
TL;DR: In this article, the authors present a deterministic, local algorithm to compute the heat kernel, a particular type of graph diffusion that is useful in identifying a community nearby a starting seed node.
Abstract: The heat kernel is a particular type of graph diffusion that, like the much-used personalized PageRank diffusion, is useful in identifying a community nearby a starting seed node. We present the first deterministic, local algorithm to compute this diffusion and use that algorithm to study the communities that it produces. Our algorithm is formally a relaxation method for solving a linear system to estimate the matrix exponential in a degree-weighted norm. We prove that this algorithm stays localized in a large graph and has a worst-case constant runtime that depends only on the parameters of the diffusion, not the size of the graph. Our experiments on real-world networks indicate that the communities produced by this method have better conductance than those produced by PageRank, although they take slightly longer to compute on large graphs. On a real-world community identification task, the heat kernel communities perform better than those from the PageRank diffusion.
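
For orientation, the heat kernel diffusion discussed in the abstract above is commonly written as $h = e^{-t} \sum_{k \ge 0} \frac{t^k}{k!} P^k s$, where $P$ is the random-walk transition matrix and $s$ is the seed indicator vector; that formula is an assumption here, since the abstract does not state it. The sketch below is a plain truncated Taylor series that touches every node. It is not the paper's local, deterministic algorithm, and the function name, parameters, and toy graph are hypothetical.

```python
import math

def heat_kernel_diffusion(adjacency, seed_node, t=5.0, terms=60):
    """Approximate e^{-t} * sum_k (t^k / k!) P^k s by a truncated series.

    Assumes an undirected graph in which every node has at least one neighbor.
    """
    current = {node: 0.0 for node in adjacency}
    current[seed_node] = 1.0           # P^0 s: all mass starts on the seed
    result = {node: 0.0 for node in adjacency}
    weight = math.exp(-t)              # e^{-t} t^k / k! for k = 0
    for k in range(terms):
        for node, mass in current.items():
            result[node] += weight * mass
        # One random-walk step: each node splits its mass evenly among neighbors.
        stepped = {node: 0.0 for node in adjacency}
        for node, mass in current.items():
            share = mass / len(adjacency[node])
            for neighbor in adjacency[node]:
                stepped[neighbor] += share
        current = stepped
        weight *= t / (k + 1)          # move to the coefficient for k + 1
    return result

# Two triangles joined by a bridge (hypothetical data).
bridge_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
heat = heat_kernel_diffusion(bridge_graph, seed_node=0, t=5.0)
```

A community around the seed is typically extracted afterwards by ranking nodes by degree-normalized score and sweeping for low conductance; that step is omitted here.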

Posted Content
TL;DR: This work studies how Facebook users consumed different information at the edge of political discussion and news during the last Italian electoral competition and reveals that users who interact prominently with conspiracist information sources are more prone to interact with intentionally false claims.
Abstract: In this work we study, on a sample of 2.3 million individuals, how Facebook users consumed different information at the edge of political discussion and news during the last Italian electoral competition. Pages are categorized, according to their topics and the communities of interests they pertain to, in a) alternative information sources (diffusing topics that are neglected by science and mainstream media); b) online political activism; and c) mainstream media. We show that attention patterns are similar despite the different qualitative nature of the information, meaning that unsubstantiated claims (mainly conspiracy theories) reverberate for as long as other information. Finally, we categorize users according to their interaction patterns among the different topics and measure how a sample of this social ecosystem (1279 users) responded to the injection of 2788 false information posts. Our analysis reveals that users who interact prominently with alternative information sources (i.e. more exposed to unsubstantiated claims) are more prone to interact with false claims.

Posted Content
TL;DR: In this article, the effect of wording on the popularity of social media content has been studied, where the authors take advantage of the surprising fact that there are many pairs of tweets containing the same url and written by the same user but employing different wording.
Abstract: Consider a person trying to spread an important message on a social network. He/she can spend hours trying to craft the message. Does it actually matter? While there has been extensive prior work looking into predicting popularity of social-media content, the effect of wording per se has rarely been studied since it is often confounded with the popularity of the author and the topic. To control for these confounding factors, we take advantage of the surprising fact that there are many pairs of tweets containing the same URL and written by the same user but employing different wording. Given such pairs, we ask: which version attracts more retweets? This turns out to be a more difficult task than predicting popular topics. Still, humans can answer this question better than chance (but far from perfectly), and the computational methods we develop can do better than both an average human and a strong competing method trained on non-controlled data.

Proceedings ArticleDOI
TL;DR: Investigations suggest that Reddit has transformed itself from a dedicated gateway to the Web to an increasingly self-referential community that focuses on and reinforces its own user-generated image- and textual content over external sources.
Abstract: In the past few years, Reddit -- a community-driven platform for submitting, commenting and rating links and text posts -- has grown exponentially, from a small community of users into one of the largest online communities on the Web. To the best of our knowledge, this work represents the most comprehensive longitudinal study of Reddit's evolution to date, studying both (i) how user submissions have evolved over time and (ii) how the community's allocation of attention and its perception of submissions have changed over 5 years based on an analysis of almost 60 million submissions. Our work reveals an ever-increasing diversification of topics accompanied by a simultaneous concentration towards a few selected domains both in terms of posted submissions as well as perception and attention. By and large, our investigations suggest that Reddit has transformed itself from a dedicated gateway to the Web to an increasingly self-referential community that focuses on and reinforces its own user-generated image- and textual content over external sources.

Proceedings ArticleDOI
TL;DR: This work investigates three major aspects of the structural characteristics of Instagram's network of heterogeneous interactions, to unveil the emergence of self organization and topically-induced community structure, the dynamics of content production and consumption and the behavior of users labeling media with tags.
Abstract: Online socio-technical systems can be studied as a proxy of the real world to investigate human behavior and social interactions at scale. Here we focus on Instagram, a media-sharing online platform whose popularity has risen to hundreds of millions of users. Instagram exhibits a mixture of features including social structure, social tagging and media sharing. The network of social interactions among users models various dynamics including follower/followee relations and users' communication by means of posts/comments. Users can upload and tag media such as photos and pictures, and they can "like" and comment on each piece of information on the platform. In this work we investigate three major aspects of our Instagram dataset: (i) the structural characteristics of its network of heterogeneous interactions, to unveil the emergence of self organization and topically-induced community structure; (ii) the dynamics of content production and consumption, to understand how global trends and popular users emerge; (iii) the behavior of users labeling media with tags, to determine how they devote their attention and to explore the variety of their topical interests. Our analysis provides clues to understand human behavior dynamics on socio-technical systems, specifically users and content popularity, the mechanisms of users' interactions in online environments and how collective trends emerge from individuals' topical interests.

Posted Content
TL;DR: A generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory is proposed.
Abstract: Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks, etc.), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g. triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). To begin, the proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms will run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
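
The single-pass idea in this abstract can be illustrated with a minimal sketch. It assumes one simple sample-and-hold rule that the abstract does not spell out: an arriving edge is kept with probability `q` if it touches an already sampled edge, and with probability `p` otherwise. The function names, the parameters `p` and `q`, and the inverse-probability edge-count estimator are illustrative assumptions, not the authors' implementation.

```python
import random


def graph_sample_and_hold(edge_stream, p, q, seed=None):
    """Single pass over an edge stream; returns {edge: probability it was kept with}.

    Assumes each edge appears once in the stream (hypothetical input format).
    """
    rng = random.Random(seed)
    sampled = {}     # (u, v) -> probability used when the edge was kept
    touched = set()  # nodes incident to at least one sampled edge
    for u, v in edge_stream:
        # "Hold" with probability q if adjacent to the sample, else sample with p.
        prob = q if (u in touched or v in touched) else p
        if rng.random() < prob:
            sampled[(u, v)] = prob
            touched.update((u, v))
    return sampled


def estimate_edge_count(sampled):
    """Inverse-probability estimate of the total number of edges in the stream."""
    return sum(1.0 / prob for prob in sampled.values())


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5)]
    sample = graph_sample_and_hold(edges, p=0.5, q=0.9, seed=7)
    print(len(sample), estimate_edge_count(sample))
```

Only the sampled edges and the set of touched nodes are kept in memory, which is the "small state" the abstract refers to; weighting each kept edge by the inverse of its keep probability is what makes the count estimate unbiased under this assumed rule.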

Posted Content
TL;DR: The traditional SIR model of disease spread is modified by incorporating infectious recovery dynamics such that contact between a recovered and infected member of the population is required for recovery.
Abstract: The last decade has seen the rise of immense online social networks (OSNs) such as MySpace and Facebook. In this paper we use epidemiological models to explain user adoption and abandonment of OSNs, where adoption is analogous to infection and abandonment is analogous to recovery. We modify the traditional SIR model of disease spread by incorporating infectious recovery dynamics such that contact between a recovered and infected member of the population is required for recovery. The proposed infectious recovery SIR model (irSIR model) is validated using publicly available Google search query data for "MySpace" as a case study of an OSN that has exhibited both adoption and abandonment phases. The irSIR model is then applied to search query data for "Facebook," which is just beginning to show the onset of an abandonment phase. Extrapolating the best fit model into the future predicts a rapid decline in Facebook activity in the next few years.
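
The modification described here can be sketched numerically. The code below assumes the usual SIR form with the recovery term made proportional to contact between infected and recovered members, i.e. dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - nu*I*R/N, dR/dt = nu*I*R/N. The function name, the symbols `beta` and `nu`, the forward-Euler integration and all numeric values are illustrative assumptions, not fitted parameters from the paper.

```python
def simulate_irsir(s0, i0, r0, beta, nu, dt=0.01, steps=10000):
    """Forward-Euler integration of an SIR model with 'infectious recovery'.

    Recovery requires contact between an infected and a recovered member,
    so the recovery flow is nu * I * R / N instead of the classic gamma * I.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(steps):
        infections = beta * s * i / n * dt  # adoption of the network
        recoveries = nu * i * r / n * dt    # abandonment spread by ex-users
        s -= infections
        i += infections - recoveries
        r += recoveries
        trajectory.append((s, i, r))
    return trajectory


if __name__ == "__main__":
    # Hypothetical population and rates, chosen only to show the shape of the curves.
    path = simulate_irsir(s0=990.0, i0=9.0, r0=1.0, beta=1.0, nu=0.5)
    print(path[-1])
```

One consequence of this form is that with no initially recovered members (`r0 = 0`) the recovery term stays at zero, so abandonment can only start once at least a few users have already left.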

Posted Content
TL;DR: This paper investigates the novel problem of negative link prediction with only positive links and content-centric interactions in social media, and proposes a principled framework NeLP, which can exploit positive links and content-centric interactions to predict negative links.
Abstract: Signed network analysis has attracted increasing attention in recent years. This is in part because research on signed network analysis suggests that negative links have added value in the analytical process. A major impediment in their effective use is that most social media sites do not enable users to specify them explicitly. In other words, a gap exists between the importance of negative links and their availability in real data sets. Therefore, it is natural to explore whether one can predict negative links automatically from the commonly available social network data. In this paper, we investigate the novel problem of negative link prediction with only positive links and content-centric interactions in social media. We make a number of important observations about negative links, and propose a principled framework NeLP, which can exploit positive links and content-centric interactions to predict negative links. Our experimental results on real-world social networks demonstrate that the proposed NeLP framework can accurately predict negative links with positive links and content-centric interactions. Our detailed experiments also illustrate the relative importance of various factors to the effectiveness of the proposed framework.

Posted Content
TL;DR: This paper investigates the network structure inference problem for a general family of continuous-time diffusion models using an $l_1$-regularized likelihood maximization framework and shows that this framework outperforms other alternatives in practice.
Abstract: Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees? Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions remains largely unexplored in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an $l_1$-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe $O(d^3 \log N)$ cascades, where $d$ is the maximum number of parents of a node and $N$ is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.
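
For readers unfamiliar with soft-thresholding, the sketch below shows the generic operator commonly used for $l_1$-regularized problems, together with one proximal gradient update built from it. It is not the authors' inference algorithm: the function names, the `step` and `lam` parameters and the gradient values are hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def proximal_step(weights, gradient, step, lam):
    """One proximal gradient update for a smooth loss plus lam * ||w||_1."""
    return [soft_threshold(w - step * g, step * lam)
            for w, g in zip(weights, gradient)]


if __name__ == "__main__":
    # Hypothetical candidate edge weights and loss gradient.
    print(proximal_step([0.8, 0.05, -0.3], [0.1, 0.2, -0.1], step=0.5, lam=0.2))
```

Setting small coefficients to exactly zero is what makes the recovered structure sparse: a candidate edge whose weight ends at zero is treated as absent from the inferred network.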

Posted Content
TL;DR: The authors propose a supervised learning approach for predicting crowdsourced decisions on toxic behavior, using large-scale labeled data: over 10 million user reports involving 1.46 million toxic players, with the corresponding crowdsourced decisions.
Abstract: One problem facing players of competitive games is negative, or toxic, behavior. League of Legends, the largest eSport game, uses a crowdsourcing platform called the Tribunal to judge whether a reported toxic player should be punished or not. The Tribunal is a two stage system requiring reports from those players that directly observe toxic behavior, and human experts that review aggregated reports. While this system has successfully dealt with the vague nature of toxic behavior by majority rules based on many votes, it naturally requires tremendous cost, time, and human efforts. In this paper, we propose a supervised learning approach for predicting crowdsourced decisions on toxic behavior with large-scale labeled data collections; over 10 million user reports involved in 1.46 million toxic players and corresponding crowdsourced decisions. Our result shows good performance in detecting overwhelmingly majority cases and predicting crowdsourced decisions on them. We demonstrate good portability of our classifier across regions. Finally, we estimate the practical implications of our approach, potential cost savings and victim protection.

Posted Content
TL;DR: In this article, the authors develop a taxonomy of individual learner behavior in MOOCs, examine the different behavioral patterns of high- and low-achieving students, and investigate how forum participation relates to other parts of the course.
Abstract: The Web has enabled one of the most visible recent developments in education---the deployment of massive open online courses. With their global reach and often staggering enrollments, MOOCs have the potential to become a major new mechanism for learning. Despite this early promise, however, MOOCs are still relatively unexplored and poorly understood. In a MOOC, each student's complete interaction with the course materials takes place on the Web, thus providing a record of learner activity of unprecedented scale and resolution. In this work, we use such trace data to develop a conceptual framework for understanding how users currently engage with MOOCs. We develop a taxonomy of individual behavior, examine the different behavioral patterns of high- and low-achieving students, and investigate how forum participation relates to other parts of the course. We also report on a large-scale deployment of badges as incentives for engagement in a MOOC, including randomized experiments in which the presentation of badges was varied across sub-populations. We find that making badges more salient produced increases in forum engagement.

Journal ArticleDOI
TL;DR: In this article, the authors propose a scalable temporal latent space model for link prediction in dynamic social networks, where the goal is to predict links over time based on a sequence of previous graph snapshots.
Abstract: We propose a scalable temporal latent space model for link prediction in dynamic social networks, where the goal is to predict links over time based on a sequence of previous graph snapshots. The model assumes that each user lies in an unobserved latent space and interactions are more likely to form between similar users in the latent space representation. In addition, the model allows each user to gradually move its position in the latent space as the network structure evolves over time. We present a global optimization algorithm to effectively infer the temporal latent space, with a quadratic convergence rate. Two alternative optimization algorithms with local and incremental updates are also proposed, allowing the model to scale to larger networks without compromising prediction accuracy. Empirically, we demonstrate that our model, when evaluated on a number of real-world dynamic networks, significantly outperforms existing approaches for temporal link prediction in terms of both scalability and predictive power.

Posted Content
TL;DR: The conclusion of this paper is that the microblogging platform Twitter is Bitcoin's virtual trading floor, emotionally reflecting its trading dynamics.
Abstract: This paper analyzes correlations and causalities between Bitcoin market indicators and Twitter posts containing emotional signals on Bitcoin. Within a timeframe of 104 days (November 23rd 2013 - March 7th 2014), about 160,000 Twitter posts containing "bitcoin" and a positive, negative or uncertainty related term were collected and further analyzed. For instance, the terms "happy", "love", "fun", "good", "bad", "sad" and "unhappy" represent positive and negative emotional signals, while "hope", "fear" and "worry" are considered as indicators of uncertainty. The static (daily) Pearson correlation results show a significant positive correlation between emotional tweets and the close price, trading volume and intraday price spread of Bitcoin. However, a dynamic Granger causality analysis does not confirm a statistically significant effect of emotional tweets on Bitcoin market values. On the contrary, the analyzed data shows that a higher Bitcoin trading volume Granger-causes more signals of uncertainty within a 24 to 72-hour timeframe. This result leads to the interpretation that emotional sentiments mirror the market rather than make it predictable. Finally, the conclusion of this paper is that the microblogging platform Twitter is Bitcoin's virtual trading floor, emotionally reflecting its trading dynamics.
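
The static daily correlation mentioned in this abstract can be sketched as follows. The helper names and the two short series are hypothetical and are not the paper's data. The lagged variant only shifts one series against the other; it is a rough illustration of a lead-lag relation, not a Granger causality test.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lagged_pearson(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a positive lag in days."""
    return pearson(leader[:-lag], follower[lag:])


if __name__ == "__main__":
    # Hypothetical daily trading volume and counts of "uncertainty" tweets.
    volume = [10.0, 14.0, 9.0, 20.0, 17.0, 12.0]
    uncertainty_tweets = [30.0, 33.0, 41.0, 29.0, 55.0, 48.0]
    print(pearson(volume, uncertainty_tweets))
    print(lagged_pearson(volume, uncertainty_tweets, lag=1))
```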

Posted Content
TL;DR: The methodology applied to develop scalable solutions to network analysis problems is described, including techniques like parallelization, heuristics for computationally expensive problems, efficient data structures, and modular software architecture.
Abstract: We introduce NetworKit, an open-source software package for analyzing the structure of large complex networks. Appropriate algorithmic solutions are required to handle increasingly common large graph data sets containing up to billions of connections. We describe the methodology applied to develop scalable solutions to network analysis problems, including techniques like parallelization, heuristics for computationally expensive problems, efficient data structures, and modular software architecture. Our goal for the software is to package results of our algorithm engineering efforts and put them into the hands of domain experts. NetworKit is implemented as a hybrid combining the kernels written in C++ with a Python front end, enabling integration into the Python ecosystem of tested tools for data analysis and scientific computing. The package provides a wide range of functionality (including common and novel analytics algorithms and graph generators) and does so via a convenient interface. In an experimental comparison with related software, NetworKit shows the best performance on a range of typical analysis tasks.

Posted Content
TL;DR: The nucleus decomposition of a graph is defined, which represents the graph as a forest of nuclei, and provably efficient algorithms for nucleus decompositions are given, and empirically evaluate their behavior in a variety of real graphs.
Abstract: Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
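
The abstract notes that the nucleus decomposition generalizes k-cores. As background only, the sketch below computes classic k-core numbers by repeatedly peeling a minimum-degree vertex. It is not the nucleus decomposition algorithm from the paper; the function name and the adjacency-dictionary input format are assumptions, and the simple minimum search makes it quadratic rather than efficient.

```python
def core_numbers(adj):
    """Core number of every vertex by minimum-degree peeling.

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])  # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


if __name__ == "__main__":
    # A triangle a-b-c with a pendant vertex d attached to a.
    graph = {
        "a": {"b", "c", "d"},
        "b": {"a", "c"},
        "c": {"a", "b"},
        "d": {"a"},
    }
    print(core_numbers(graph))
```

In the example graph the triangle forms the 2-core while the pendant vertex only belongs to the 1-core, which is the simplest instance of the containment hierarchy the abstract describes.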