Showing papers in "Social Network Analysis and Mining in 2014"


Journal ArticleDOI
TL;DR: Novel findings show that the sentiment expressed in a tweet is statistically significantly predictive of both the size and survival of information flows of this nature.
Abstract: Little is currently known about the factors that promote the propagation of information in online social networks following terrorist events. In this paper we took the case of the terrorist event in Woolwich, London in 2013 and built models to predict information flow size and survival using data derived from the popular social networking site Twitter. We define information flows as the propagation over time of information posted to Twitter via the action of retweeting. Following a comparison of different predictive methods, and due to the distribution exhibited by our dependent size measure, we used the zero-truncated negative binomial (ZTNB) regression method. To model survival, the Cox regression technique was used because it estimates proportional hazard rates for independent measures. Following a principal component analysis to reduce the dimensionality of the data, social, temporal and content factors of the tweet were used as predictors in both models. Given the likely emotive reaction caused by the event, we emphasize the influence of emotive content on propagation in the discussion section. From a sample of Twitter data collected following the event (N=427,330), we report novel findings showing that the sentiment expressed in the tweet is statistically significantly predictive of both the size and survival of information flows of this nature. Furthermore, the number of offline press reports relating to the event published on the day the tweet was made was a significant predictor of size, as was the tension expressed in a tweet in relation to survival. Finally, time lags between retweets and the co-occurrence of URLs and hashtags also emerged as significant.

204 citations
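
To make the survival-modelling step concrete, here is a minimal sketch of fitting a Cox proportional hazards model to retweet cascades with the lifelines library. The data frame and its columns (duration_days, observed, sentiment, tension, press_reports) are invented stand-ins for the paper's predictors, not its actual data.

```python
# Minimal sketch (not the paper's code): fitting a Cox proportional hazards
# model to retweet cascades with the lifelines library. All values below are
# invented; duration_days is a cascade's lifetime and observed = 0 marks
# cascades still alive at the end of the collection window (censored).
import pandas as pd
from lifelines import CoxPHFitter

flows = pd.DataFrame({
    "duration_days": [0.5, 2.0, 1.2, 4.5, 0.1, 3.3, 0.8, 2.7],
    "observed":      [1, 1, 0, 1, 1, 0, 1, 1],
    "sentiment":     [0.8, -0.3, 0.1, 0.9, -0.6, 0.4, -0.2, 0.5],
    "tension":       [0.2, 0.7, 0.4, 0.1, 0.9, 0.3, 0.6, 0.2],
    "press_reports": [12, 30, 5, 44, 2, 18, 7, 25],
})

cph = CoxPHFitter()
cph.fit(flows, duration_col="duration_days", event_col="observed")
cph.print_summary()  # hazard ratio < 1 for a covariate implies longer survival
```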


Journal ArticleDOI
TL;DR: The Social Privacy Protector (SPP) software was developed to identify fake users in online social networks and to improve users' security and privacy by implementing different methods to detect fake profiles.
Abstract: The amount of personal information involuntarily exposed by users on online social networks is staggering, as shown in recent research. Moreover, recent reports indicate that these networks are inundated with tens of millions of fake user profiles, which may jeopardize users' security and privacy. To identify fake users in such networks and to improve users' security and privacy, we developed the Social Privacy Protector (SPP) software for Facebook. This software contains three protection layers that improve user privacy by implementing different methods to identify fake profiles. The first layer identifies a user's friends who might pose a threat and then restricts the access these "friends" have to the user's personal information. The second layer is an expansion of Facebook's basic privacy settings based on different types of social network usage profiles. The third layer alerts users about the number of installed applications on their Facebook profile that have access to their private information. An initial version of the SPP software received positive media coverage, and more than 3,000 users from more than 20 countries have installed the software, out of which 527 have used the software to restrict more than 9,000 friends. In addition, we estimate that more than 100 users have accepted the software's recommendations and removed nearly 1,800 Facebook applications from their profiles. By analyzing the unique dataset obtained by the software in combination with machine learning techniques, we developed classifiers that are able to predict which Facebook profiles have a high probability of being fake and consequently threaten the user's security and privacy. Moreover, in this study, we present statistics generated by the SPP software on both user privacy settings and the number of applications installed on Facebook profiles. These statistics alarmingly demonstrate how vulnerable Facebook users' information is to both fake profile attacks and third-party Facebook applications.

102 citations


Journal ArticleDOI
TL;DR: This paper frames the prediction of news article popularity from user comments as a ranking problem and indicates that popularity prediction methods are adequate solutions for this ranking task and could be considered a valuable alternative for automatic online news ranking.
Abstract: News articles are an engaging type of online content that captures the attention of a significant number of Internet users. They are particularly enjoyed by mobile users and massively spread through online social platforms. As a result, there is an increased interest in discovering the articles that will become popular among users. This objective falls under the broad scope of content popularity prediction and has direct implications in the development of new services for online advertisement and content distribution. In this paper, we address the problem of predicting the popularity of news articles based on user comments. We formulate the prediction task as a ranking problem, where the goal is not to infer the precise attention that a piece of content will receive but to accurately rank articles based on their predicted popularity. Using data obtained from two important news sites in France and the Netherlands, we analyze the ranking effectiveness of two prediction models. Our results indicate that popularity prediction methods are adequate solutions for this ranking task and could be considered a valuable alternative for automatic online news ranking.

97 citations
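
Because the task above is framed as ranking rather than exact prediction, a natural way to score a model is rank correlation between predicted and observed popularity. A small sketch with invented comment counts:

```python
# Sketch: scoring a popularity *ranking* rather than exact counts.
# The predicted/actual comment volumes are made-up numbers for illustration.
from scipy.stats import kendalltau, spearmanr

predicted = [120, 45, 300, 10, 78]   # model's predicted comment volumes
actual    = [150, 30, 280, 12, 90]   # observed comment volumes

rho, _ = spearmanr(predicted, actual)
tau, _ = kendalltau(predicted, actual)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```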


Journal ArticleDOI
TL;DR: This study surveys users on credibility perceptions regarding witness pictures posted on Twitter related to Hurricane Sandy, and unveils insight about tweet presentation, as well as features that users should look at when assessing the veracity of tweets in the context of fast-paced events.
Abstract: While Twitter provides an unprecedented opportunity to learn about breaking news and current events as they happen, it often produces skepticism among users, since not all of the information is accurate and hoaxes are sometimes spread. While avoiding the diffusion of hoaxes is a major concern during fast-paced events such as natural disasters, the study of how users trust and verify information from tweets in these contexts has received little attention so far. We survey users on credibility perceptions regarding witness pictures posted on Twitter related to Hurricane Sandy. By examining credibility perceptions of features suggested for information verification in the field of epistemology, we evaluate their accuracy in determining whether pictures were real or fake compared to professional evaluations performed by experts. Our study unveils insight about tweet presentation, as well as features that users should look at when assessing the veracity of tweets in the context of fast-paced events. Among our main findings: while author details not readily available on Twitter feeds should be emphasized in order to facilitate verification of tweets, showing multiple tweets corroborating a fact misleads users into trusting what is actually a hoax. We contrast some of the behavioral patterns found on tweets with the literature in psychology research.

93 citations


Journal ArticleDOI
TL;DR: This work builds prediction models based on a variety of contextual and behavioral features, training the models by resorting to a distant supervision approach and considering party candidates to have a predefined preference towards their respective parties, and uses the model to analyze the preference changes over the course of the election campaign.
Abstract: We study the problem of predicting the political preference of users on the Twitter network, showing that the political preference of users can be predicted from their Twitter behavior towards political parties. We show this by building prediction models based on a variety of contextual and behavioral features, training the models using a distant supervision approach and considering party candidates to have a predefined preference towards their respective parties. A language model for each party is learned from the content of the tweets by the party candidates, and the preference of a user is assessed based on the alignment of the user's tweets with the language models of the parties. We evaluate our work in the context of two real elections: the 2012 Albertan and 2013 Pakistani general elections. In both cases, we show that our model outperforms sentiment and text classification approaches in terms of the F-measure and is on par with human annotators. We further use our model to analyze preference changes over the course of the election campaign and report results that would be difficult to attain by human annotators.

87 citations
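
The language-model alignment idea can be illustrated with a small, hedged sketch: one add-one-smoothed unigram model per party, with a user assigned to the party whose model best explains their tweets. Party names and training tweets below are invented.

```python
# Sketch of the alignment idea: add-one-smoothed unigram language models,
# one per party, trained on candidate tweets; a user is assigned to the
# party whose model gives their tweets the highest log-likelihood.
# Party names and all tweets below are invented.
import math
from collections import Counter

def train_lm(docs):
    counts = Counter(w for d in docs for w in d.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

party_models = {
    "party_a": train_lm(["lower taxes now", "cut taxes create jobs"]),
    "party_b": train_lm(["invest in public health", "protect public services"]),
}

def predict_party(user_tweets):
    score = lambda lm: sum(lm(w) for t in user_tweets for w in t.lower().split())
    return max(party_models, key=lambda p: score(party_models[p]))

print(predict_party(["we need lower taxes"]))  # -> party_a
```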


Journal ArticleDOI
TL;DR: Two new models of competitive influence diffusion in social networks with the following three factors are proposed: a time deadline for information diffusion, random time delay of information exchange and personal interests regarding the acceptance of information.
Abstract: The spread of rumor or misinformation in social networks can have harmful effects on the public. Thus, it is necessary to find effective strategies to control the spread of rumor. Specifically, in our paper, we consider the following setting: initially, a subset of nodes is chosen as the set of protectors, and the influence of the protectors diffuses competitively with the diffusion of the rumor. However, in the real world, we generally have a limited budget (a limited number of protectors) and limited time to fight the rumor. Therefore, we study the problem of maximizing rumor containment within a fixed number of initial protectors and a given time deadline. Generalizing two seminal models in the field, the Independent Cascade (IC) model and the Linear Threshold (LT) model, we propose two new models of competitive influence diffusion in social networks with the following three factors: a time deadline for information diffusion, random time delay of information exchange, and personal interests regarding the acceptance of information. Under these two models, we show that the optimization problems are NP-hard. Furthermore, we prove that the objective functions are submodular. As a result, the greedy algorithm serves as a constant-factor approximation algorithm with performance guarantee \(1-\frac{1}{e}\) for the two optimization problems.

86 citations
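
The \(1-\frac{1}{e}\) guarantee rests on the standard greedy rule for monotone submodular objectives: repeatedly add the candidate with the largest marginal gain. A sketch of that loop follows; the toy coverage objective is a stand-in for the Monte Carlo estimate of expected protection under the paper's diffusion models.

```python
# Generic greedy loop behind the 1 - 1/e guarantee for monotone submodular
# objectives. `objective` is a placeholder: in the paper's setting it would
# be the (Monte Carlo estimated) expected number of nodes saved from the rumor.
def greedy_protectors(candidates, k, objective):
    seeds = set()
    for _ in range(k):
        best = max(
            (v for v in candidates if v not in seeds),
            key=lambda v: objective(seeds | {v}) - objective(seeds),
        )
        seeds.add(best)
    return seeds

# Toy submodular objective (neighbourhood coverage) to make the sketch runnable.
graph = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2, 5}, 5: {4}}

def coverage(seed_set):
    reached = set(seed_set)
    for v in seed_set:
        reached |= graph[v]
    return len(reached)

print(greedy_protectors(graph, k=2, objective=coverage))
```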


Journal ArticleDOI
TL;DR: This paper introduces three novel ranking algorithms for signed networks and compares their ability in predicting signs of edges with already existing ones and identifies a number of ranking algorithms that result in higher prediction accuracy compared to others.
Abstract: Social networks are an inevitable part of modern life. One class of social networks is those with both positive (friendship or trust) and negative (enmity or distrust) links. Ranking nodes in signed networks remains a hot topic in computer science. In this manuscript, we review different ranking algorithms to rank the nodes in signed networks, and apply them to the sign prediction problem. Ranking scores are used to obtain reputation and optimism, which are used as features in the sign prediction problem. The reputation of a node shows patterns of voting towards the node, and its optimism demonstrates how optimistically a node thinks about others. To assess the performance of different ranking algorithms, we apply them on three signed networks: Epinions, Slashdot and Wikipedia. In this paper, we introduce three novel ranking algorithms for signed networks and compare their ability to predict signs of edges with that of existing ones. We use logistic regression as the predictor, with the reputation and optimism values for the trustee and trustor as features (obtained based on the different ranking algorithms). We find that ranking algorithms that produce correlated ranking scores lead to almost the same prediction accuracy. Furthermore, our analysis identifies a number of ranking algorithms that result in higher prediction accuracy than others.

85 citations
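
A hedged sketch of the prediction step: logistic regression over reputation and optimism features of trustor and trustee. The feature values below are invented; in the paper they come from the ranking algorithms under comparison.

```python
# Sketch of the sign-prediction step: logistic regression over invented
# (reputation, optimism) features for trustor and trustee.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: [trustor_reputation, trustor_optimism, trustee_reputation, trustee_optimism]
X = np.array([
    [0.9, 0.8, 0.7, 0.6],
    [0.2, 0.1, 0.3, 0.9],
    [0.8, 0.7, 0.9, 0.5],
    [0.1, 0.3, 0.2, 0.8],
])
y = np.array([1, -1, 1, -1])  # edge sign: +1 trust, -1 distrust

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.75, 0.8, 0.55]]))  # -> [1]
```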


Journal ArticleDOI
TL;DR: This paper studies the sentiment prediction task over Twitter using machine-learning techniques, with consideration of Twitter-specific social network structure such as retweets, and combines the results of sentiment analysis with an influence factor generated from the retweet count to improve prediction accuracy.
Abstract: Online microblog-based social networks have been used for expressing public opinions through short messages. Among popular microblogs, Twitter has attracted the attention of several researchers in areas such as predicting consumer brands, democratic electoral events, movie box office, popularity of celebrities, the stock market, etc. Sentiment analysis over a Twitter-based social network offers a fast and efficient way of monitoring public sentiment. This paper studies the sentiment prediction task over Twitter using machine-learning techniques, with consideration of Twitter-specific social network structure such as retweets. We also concentrate on finding both direct and extended terms related to the event and thereby understanding its effect. We employed supervised machine-learning techniques such as support vector machines (SVM), Naive Bayes, maximum entropy and artificial neural networks to classify the Twitter data using unigram, bigram and unigram + bigram (hybrid) feature extraction models for the case study of the US Presidential Elections 2012 and the Karnataka State Assembly Elections (India) 2013. Further, we combined the results of sentiment analysis with the influence factor generated from the retweet count to improve the prediction accuracy of the task. Experimental results demonstrate that SVM outperforms all other classifiers, with a maximum accuracy of 88 % in predicting the outcome of the US Elections 2012, and 68 % for the Indian State Assembly Elections 2013.

59 citations
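
The hybrid unigram + bigram SVM pipeline evaluated in the paper can be sketched in a few lines with scikit-learn; the tiny training set here is invented, whereas the real experiments used election tweets.

```python
# Sketch of the unigram + bigram SVM pipeline; training tweets are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["great rally tonight", "terrible policy failure",
          "love this candidate", "worst speech ever"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
    LinearSVC(),
)
model.fit(tweets, labels)
print(model.predict(["great candidate"]))  # -> ['pos']
```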


Journal ArticleDOI
TL;DR: This work conducts a series of experiments on social interaction networks from Twitter, Flickr, and BibSonomy and investigates the relatedness concerning the interactions, their frequency, and the specific interaction characteristics, supporting the social distributional hypothesis.
Abstract: Applications of the Social Web are ubiquitous and have become an integral part of everyday life: users make friends, for example, with the help of online social networks, share thoughts via Twitter, or collaboratively write articles in Wikipedia. All such interactions leave digital traces; thus, users participate in the creation of heterogeneous, distributed, collaborative data collections. In linguistics, the Distributional Hypothesis states that words with similar distributional characteristics tend to be semantically related, i.e., words which occur in similar contexts are assumed to have a similar meaning. Considering users as (social) entities, their distributional characteristics can be observed by collecting interactions in social web applications. Accordingly, we state the social distributional hypothesis: we presume that users with similar interaction characteristics tend to be related. We conduct a series of experiments on social interaction networks from Twitter, Flickr, and BibSonomy and investigate the relatedness concerning the interactions, their frequency, and the specific interaction characteristics. The results indicate interrelations between structural similarity of interaction characteristics and semantic relatedness of users, supporting the social distributional hypothesis.

55 citations
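
The notion of "similar interaction characteristics" can be sketched by representing each user as a vector of interaction counts and comparing vectors with cosine similarity; the interaction types and counts below are illustrative only.

```python
# Sketch: users as vectors of interaction counts, compared with cosine
# similarity. Columns and values are invented for illustration.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# columns: [mentions_sent, retweets, replies, favourites]
users = np.array([
    [30, 5, 12, 40],   # user A
    [28, 6, 10, 38],   # user B: similar profile to A
    [ 2, 50, 1,  3],   # user C: very different behaviour
])
print(cosine_similarity(users).round(2))
# Under the social distributional hypothesis, a high A-B similarity would
# suggest the two users are semantically related.
```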


Journal ArticleDOI
TL;DR: This article proposes a computational model that integrates data mining with social computing to help users discover influential friends from a specific portion of the social networks that they are interested in and allows users to interactively change their mining parameters.
Abstract: Social networks, which are made of social entities (e.g., individual users) linked by some specific types of interdependencies such as friendship, have become popular to facilitate collaboration and knowledge sharing among users. Such interactions or interdependencies can be dependent on or influenced by user characteristics such as connectivity, centrality, weight, importance, and activity in the networks. As such, some users in the social networks can be considered highly influential to others. In this article, we propose a computational model that integrates data mining with social computing to help users discover influential friends from a specific portion of the social networks that they are interested in. Moreover, our social network analysis and mining model also allows users to interactively change their mining parameters (e.g., the scopes of their interested portions of the social networks).

54 citations


Journal ArticleDOI
TL;DR: Three models to extract reliability of similarities estimated in classic recommenders are proposed and it is shown that employing resource allocation in classical recommenders significantly improves their performance.
Abstract: Recommendation systems are an important part of electronic commerce, where appropriate items are recommended to potential users. The most common algorithms used for constructing recommender systems in commercial applications are collaborative filtering (CF) methods and their variants, mainly due to their simple implementation. In these methods, structural features of the bipartite network of users and items are used, and potential items are recommended to the users based on a similarity measure that shows how similar the behavior of the users is. Indeed, the performance of memory-based CF algorithms heavily depends on the quality of the similarities obtained among users/items: the more reliable the obtained similarities, the better the expected performance of the recommender system. In this paper, we propose three models to extract the reliability of similarities estimated in classic recommenders. We incorporate the obtained reliabilities to improve the performance of the recommender systems. In the proposed algorithms for reliability extraction, a number of elements are taken into account, including the structure of the user-item bipartite network, the individual profile of the users, i.e., how many items they have rated, and that of the items, i.e., how many users have rated them. Among the proposed methods, the method based on resource allocation provides the highest performance. Our numerical results on two benchmark datasets (Movielens and Netflix) show that employing resource allocation in classical recommenders significantly improves their performance. These results are of great importance since including resource allocation in the systems does not increase their computational complexity.
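
One plausible instantiation of resource-allocation similarity on the user-item bipartite network (a sketch, not necessarily the paper's exact formula): two users share "resource" through co-rated items, each item contributing the inverse of its degree, so popular items count less.

```python
# Sketch of resource-allocation similarity on a toy user-item bipartite
# network: each co-rated item contributes 1/degree(item) to the similarity.
from collections import defaultdict

ratings = {  # user -> set of rated items (toy data)
    "u1": {"i1", "i2", "i3"},
    "u2": {"i2", "i3", "i4"},
    "u3": {"i4", "i5"},
}
item_degree = defaultdict(int)
for items in ratings.values():
    for i in items:
        item_degree[i] += 1

def ra_similarity(u, v):
    shared = ratings[u] & ratings[v]
    return sum(1.0 / item_degree[i] for i in shared)

print(ra_similarity("u1", "u2"))  # popular items contribute less
```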

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed the first centrality methods specifically conceived for ranking lurkers in social networks, which utilizes only the network topology without probing into text contents or user relationships related to media.
Abstract: The massive presence of silent members in online communities, the so-called lurkers, has long attracted the attention of researchers in social science, cognitive psychology, and computer–human interaction. However, the study of lurking phenomena represents an unexplored opportunity of research in data mining, information retrieval and related fields. In this paper, we take a first step towards the formal specification and analysis of lurking in social networks. We address the new problem of lurker ranking and propose the first centrality methods specifically conceived for ranking lurkers in social networks. Our approach utilizes only the network topology without probing into text contents or user relationships related to media. Using Twitter, Flickr, FriendFeed and GooglePlus as cases in point, our methods’ performance was evaluated against data-driven rankings as well as existing centrality methods, including the classic PageRank and alpha-centrality. Empirical evidence has shown the significance of our lurker ranking approach, and its uniqueness in effectively identifying and ranking lurkers in an online social network.

Journal ArticleDOI
TL;DR: This investigation looks at the dynamics of hierarchical discussion threads on Reddit, asking to what extent discussion threads resemble a topical hierarchy.
Abstract: Social news and content aggregation Web sites have become massive repositories of valuable knowledge on a diverse range of topics. Millions of Web users are able to leverage these platforms to submit, view and discuss nearly anything. The users themselves exclusively curate the content with an intricate system of submissions, voting and discussion. Furthermore, the data on social news Web sites is extremely well organized by the user base, which, as in Wikipedia, opens the door for opportunities to leverage this data for other purposes. In this paper, we study a popular social news Web site called Reddit. Our investigation looks at the dynamics of hierarchical discussion threads, and we ask three questions: (1) to what extent do discussion threads resemble a topical hierarchy? (2) Can discussion threads be used to enhance Web search? and (3) What variables are the best predictors of high-scoring comments? We show interesting results for these questions on a very large snapshot of several sub-communities of the Reddit Web site. Finally, we discuss the implications of these results and suggest ways by which social news Web sites can be used to perform other tasks.

Journal ArticleDOI
TL;DR: The problem of online influence maximization is recast as a weighted maximum cut problem that analyzes the influence flow among graph vertices; an optimal seed set can be identified effectively by running a semidefinite programming-based (GW) algorithm on a complete influence graph.
Abstract: Online social networks are becoming a true growth point of the Internet. As individuals constantly desire to interact with each other, the ability of the Internet to deliver this networking influence becomes much stronger. In this paper, we study the online social influence maximization problem, which is to find a small group of influential users that maximizes the spread of influence through the network. After a thorough analysis of existing models, especially two classical ones, namely Independent Cascade and Linear Threshold, we argue that their assumption that each user can only be activated by its active neighbors is not applicable to online social networks, since in many applications there is no clear concept of "activation". In our proposed influence model, if there is a social influence path linking two nonadjacent individuals, the value of influence between these two individuals can be evaluated along the existing social path based on the influence transitivity property under certain constraints. To compute influence probabilities between two neighbors, we also develop a new method which leverages both the structure of the network and historical data. With reasonably learned influence probabilities, we recast the problem of online influence maximization as a weighted maximum cut problem which analyzes the influence flow among graph vertices. By running a semidefinite programming-based Goemans-Williamson (GW) algorithm on a complete influence graph, an optimal seed set can be identified effectively. We also provide experimental results on real online social networks, showing that our algorithm significantly outperforms greedy methods.
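
A hedged sketch of the Goemans-Williamson step under the assumption of a small, explicit weight matrix: solve the max-cut SDP relaxation with cvxpy, then round with a random hyperplane. W is a toy symmetric influence-weight matrix standing in for the paper's complete influence graph.

```python
# Sketch of Goemans-Williamson max-cut: SDP relaxation via cvxpy, then
# random-hyperplane rounding. W is an invented influence-weight matrix.
import cvxpy as cp
import numpy as np

W = np.array([[0, 3, 1],
              [3, 0, 2],
              [1, 2, 0]], dtype=float)
n = W.shape[0]

X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(
    cp.Maximize(cp.sum(cp.multiply(W, 1 - X)) / 4),
    [X >> 0, cp.diag(X) == 1],
)
prob.solve(solver=cp.SCS)

# Round the SDP solution into a cut: factor X, project onto a random vector.
V = np.linalg.cholesky(X.value + 1e-6 * np.eye(n))  # jitter for numerics
sides = np.sign(V @ np.random.randn(n))
print("vertices on the seed side of the cut:", np.where(sides > 0)[0])
```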

Journal ArticleDOI
Daniel Schall1
TL;DR: This work advances existing techniques and proposes mining of subgraph patterns that are used to predict links in networks such as GitHub, GooglePlus, and Twitter, and shows that the proposed metrics and techniques yield more accurate predictions when compared with metrics not accounting for the directed nature of the underlying networks.
Abstract: In today’s online social networks, it becomes essential to help newcomers as well as existing community members to find new social contacts. In scientific literature, this recommendation task is known as link prediction. Link prediction has important practical applications in social network platforms. It allows social network platform providers to recommend friends to their users. Another application is to infer missing links in partially observed networks. The shortcoming of many of the existing link prediction methods is that they mostly focus on undirected graphs only. This work closes this gap and introduces link prediction methods and metrics for directed graphs. Here, we compare well-known similarity metrics and their suitability for link prediction in directed social networks. We advance existing techniques and propose mining of subgraph patterns that are used to predict links in networks such as GitHub, GooglePlus, and Twitter. Our results show that the proposed metrics and techniques yield more accurate predictions when compared with metrics not accounting for the directed nature of the underlying networks.
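
One direction-aware neighbourhood metric of the kind compared here can be sketched as follows (an illustration, not the paper's exact metric): score a candidate pair by their shared followees, discounting popular ones with an Adamic/Adar-style weight.

```python
# Sketch of a direction-aware link-prediction score on a toy directed graph:
# shared followees of u and v, each weighted down by its popularity.
import math
import networkx as nx

G = nx.DiGraph([("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("e", "d")])

def directed_aa(G, u, v):
    shared = set(G.successors(u)) & set(G.successors(v))
    return sum(1 / math.log(1 + G.in_degree(z)) for z in shared)

print(directed_aa(G, "a", "b"))  # high score -> predict a link between a and b
```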

Journal ArticleDOI
TL;DR: This paper introduces novel features based on a person's social actions in general and towards specific individuals in particular, and shows how behavioral features can be customized to a specific medium and personality trait.
Abstract: In this paper, we study the problem of predicting personality with features based on social behavior. While network position and text analysis are often used in personality prediction, the use of social behavior is fairly new. Often, studies of social behavior either concentrate on a single behavior or trait, or simply use behavior to predict ties that are then used in an analysis of network position. To study this problem, we introduce novel features based on a person's social actions in general and towards specific individuals in particular. We also compute the variation of these actions among all the social contacts of a person, as well as the actions of friends. We show that social behavior alone, without the help of any textual or network position information, provides a good basis for personality prediction. We then provide a unique comparative study that finds the most significant features based on social behavior in predicting personality for three different communication mediums: Twitter, SMS and phone calls. These mediums provide us with social behavior from public and private contexts, containing messaging and voice call type exchanges. We find behaviors that are distinctive and normative among the ones we study. We also illustrate how behavioral features relate to different personality traits, and show the various similarities and differences between mediums in terms of social behavior. Note that all behavioral features are based on statistical properties of the number and the time of social actions and do not consider the textual content. As a result, they can be applied in many different settings. Furthermore, our findings show how behavioral features can be customized to a specific medium and personality trait.

Journal ArticleDOI
TL;DR: In this paper, a streaming framework for online detection and clustering of memes in social media, specifically Twitter, is described, where a pre-clustering procedure, namely protomeme detection, first isolates atomic tokens of information carried by the tweets and then aggregates them to obtain memes as cohesive groups of tweets reflecting actual concepts or topics of discussion.
Abstract: The problem of clustering content in social media has pervasive applications, including the identification of discussion topics, event detection, and content recommendation. Here, we describe a streaming framework for online detection and clustering of memes in social media, specifically Twitter. A pre-clustering procedure, namely protomeme detection, first isolates atomic tokens of information carried by the tweets. Protomemes are thereafter aggregated, based on multiple similarity measures, to obtain memes as cohesive groups of tweets reflecting actual concepts or topics of discussion. The clustering algorithm takes into account various dimensions of the data and metadata, including natural language, the social network, and the patterns of information diffusion. As a result, our system can build clusters of semantically, structurally, and topically related tweets. The clustering process is based on a variant of Online K-means that incorporates a memory mechanism, used to "forget" old memes and replace them over time with new ones. The evaluation of our framework is carried out using a dataset of Twitter trending topics. Over a 1-week period, we systematically determined whether our algorithm was able to recover the trending hashtags. We show that the proposed method outperforms baseline algorithms that only use content features, as well as a state-of-the-art event detection method that assumes full knowledge of the underlying follower network. We finally show that our online learning framework is flexible, due to its independence from the adopted clustering algorithm, and well suited to work in a streaming scenario.
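
A minimal sketch of an Online K-means with a forgetting mechanism in the spirit described above; the decay rate and the per-cluster learning rate are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch: online k-means whose cluster weights decay over time, so stale
# memes fade and centroids drift toward fresh content. Parameters invented.
import numpy as np

class ForgetfulKMeans:
    def __init__(self, k, dim, decay=0.95, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(k, dim))
        self.weights = np.ones(k)
        self.decay = decay

    def update(self, x):
        self.weights *= self.decay              # forget old evidence
        j = np.argmin(np.linalg.norm(self.centroids - x, axis=1))
        self.weights[j] += 1.0
        lr = 1.0 / self.weights[j]              # per-cluster learning rate
        self.centroids[j] += lr * (x - self.centroids[j])
        return j                                # cluster assigned to this item

km = ForgetfulKMeans(k=3, dim=5)
for vec in np.random.default_rng(1).normal(size=(100, 5)):
    km.update(vec)
```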

Journal ArticleDOI
TL;DR: This work provides a method to detect social capitalists, a special kind of user on Twitter who act like automatic accounts, and shows that these users form a highly connected group in the network by studying their neighborhoods and their local clustering coefficients.
Abstract: In this paper, we focus on the detection and behavior of social capitalists, a special kind of user on Twitter. Roughly speaking, social capitalists follow users regardless of their content, just hoping to increase their number of followers. They were first introduced by Ghosh et al. (Proceedings of the 21st international conference on World Wide Web, WWW'12, pp. 61–70, 2012). In this work, we provide a method to detect these users efficiently. Our algorithms do not rely on the tweets posted by the users, just on the topology of the Twitter graph. Then, we show that these users form a highly connected group in the network by studying their neighborhoods and their local clustering coefficients (Watts and Strogatz, Nature 393 (6684):440–442, 1998). We next study the evolution of such users between 2009 and 2013. Finally, we provide a behavioral analysis based on social capitalists that tweet on a special hashtag. Our work emphasizes that such users, who act like automatic accounts, are in fact, for the most part, real users.
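
The cohesion argument can be reproduced in miniature: compare the average local clustering coefficient of a suspected group against the rest of the graph. The graph and node set below are toy stand-ins.

```python
# Sketch: checking whether a detected group is unusually cohesive by
# comparing clustering coefficients. Graph and node set are stand-ins.
import networkx as nx

G = nx.karate_club_graph()            # toy stand-in for the Twitter graph
suspected = [0, 1, 2, 3, 7]           # hypothetical "social capitalist" nodes

group_cc = nx.average_clustering(G, nodes=suspected)
overall_cc = nx.average_clustering(G)
print(f"group {group_cc:.2f} vs overall {overall_cc:.2f}")
# A markedly higher group value supports the "highly connected group" finding.
```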

Journal ArticleDOI
TL;DR: A general framework is presented to estimate network properties using random walks, under the assumption that local characteristics of a vertex, for example the number of its neighbours and their labels, can be obtained during each step of the walk.
Abstract: Sampling from large graphs is an area of great interest, especially since the emergence of huge structures such as Online Social Networks and the World Wide Web (WWW). The large-scale properties of a network can be summarized in terms of parameters of the underlying graph, such as the total number of vertices, edges and triangles. However, the large size of these networks makes it computationally expensive to obtain such structural properties of the underlying graph by exhaustive search. If we can estimate these properties by taking small but representative samples from the network, then size is no longer such a problem. In this paper we present a general framework to estimate network properties using random walks. These methods work under the assumption that we are able to obtain local characteristics of a vertex during each step of the random walk, for example the number of its neighbours and their labels. As examples of this approach, we present practical methods to estimate the total number of edges/links m, the number of vertices/nodes n and the number of connected triads of vertices (triangles) t in graphs. We also give a general method to count any type of small connected subgraph, of which vertices, edges and triangles are specific examples. Additionally, we present experimental estimates for n, m and t obtained using our methods on real and synthetic networks. The synthetic networks were random graphs with power-law degree distributions, designed to have a large number of triangles; we used these graphs as they tend to correspond to the structure of large online networks. The real networks were samples of the WWW and social networks obtained from the SNAP database. In order to test that the methods are indeed practical, the total number of steps made by the walk was limited to at most the size n of the network. In fact, the estimates appear to converge to the correct value at a lower number of steps, indicating that our proposed methods are feasible in practice.
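
One estimator in this family (a sketch, not necessarily the paper's own) exploits the fact that a simple random walk samples vertices with probability proportional to degree, so the harmonic mean of sampled degrees estimates the average degree, from which m follows when n is known.

```python
# Sketch: degree-biased random walk sampling. Under the walk's stationary
# distribution pi(v) ~ deg(v), E[1/deg] = n/(2m), so the harmonic mean of
# sampled degrees estimates the average degree 2m/n. Here n is assumed known.
import random
import networkx as nx

G = nx.barabasi_albert_graph(2000, 5, seed=42)   # synthetic test network
v = random.choice(list(G.nodes))
inv_degrees = []
for _ in range(G.number_of_nodes()):             # walk length capped at n
    inv_degrees.append(1.0 / G.degree(v))
    v = random.choice(list(G.neighbors(v)))

avg_deg_hat = len(inv_degrees) / sum(inv_degrees)   # harmonic-mean estimator
m_hat = G.number_of_nodes() * avg_deg_hat / 2
print(f"estimated edges {m_hat:.0f}, true edges {G.number_of_edges()}")
```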

Journal ArticleDOI
TL;DR: A unified framework for coloring large complex networks, consisting of two main coloring variants that effectively balance the tradeoff between accuracy and efficiency, is developed and shown to be accurate (solutions close to optimal), fast and scalable for large networks, and flexible for use in a variety of applications.
Abstract: Given a large social or information network, how can we partition the vertices into sets (i.e., colors) such that no two vertices linked by an edge are in the same set, while minimizing the number of sets used? Despite the obvious practical importance of graph coloring, existing works have not systematically investigated or designed methods for large complex networks. In this work, we develop a unified framework for coloring large complex networks that consists of two main coloring variants and effectively balances the tradeoff between accuracy and efficiency. Using this framework as a fundamental basis, we propose coloring methods designed for the scale and structure of complex networks. In particular, the methods leverage triangles, triangle-cores, and other egonet properties and their combinations. We systematically compare the proposed methods across a wide range of networks (e.g., social, web, biological networks) and find a significant improvement over previous approaches in nearly all cases. Additionally, the solutions obtained are nearly optimal and sometimes provably optimal for certain classes of graphs (e.g., collaboration networks). We also propose a parallel algorithm for the problem of coloring neighborhood subgraphs and make several key observations. Overall, the coloring methods are shown to be (1) accurate, with solutions close to optimal; (2) fast and scalable for large networks; and (3) flexible for use in a variety of applications.
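
The flavour of the framework can be sketched with a greedy coloring that orders vertices by a triangle-based score before assigning each the smallest feasible color; the particular ordering heuristic below is an illustrative stand-in for the paper's triangle and triangle-core methods.

```python
# Sketch: greedy coloring with a triangle-aware vertex ordering.
# The ordering (triangle count, then degree) is an illustrative heuristic.
import networkx as nx

G = nx.gnm_random_graph(200, 1200, seed=7)
tri = nx.triangles(G)
order = sorted(G.nodes, key=lambda v: (tri[v], G.degree(v)), reverse=True)

colors = {}
for v in order:
    taken = {colors[u] for u in G.neighbors(v) if u in colors}
    colors[v] = next(c for c in range(len(G)) if c not in taken)

print("colors used:", max(colors.values()) + 1)
```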

Journal ArticleDOI
TL;DR: This work proposes a value-allocation model to compute the social capital and allocate the fair share of this capital to each individual involved in the collaboration, and shows that the allocation satisfies several axioms of fairness and falls in the same class as Myerson's allocation function.
Abstract: The existing methods for finding influencers use the process of information diffusion to discover the nodes with the maximum expected information spread. These models capture only the process of information diffusion and not the actual social value of collaborations in the network. We propose a method for finding influencers based on the notion that people generate more value for their work by collaborating with peers of high influence. The social value generated through such collaborations denotes the individual social capital. We hypothesize and show that players with high individual social capital are often key influencers in the network. We propose a value-allocation model to compute the social capital and allocate the fair share of this capital to each individual involved in the collaboration. We show that our allocation satisfies several axioms of fairness and falls in the same class as Myerson's allocation function. We implement our allocation rule using an efficient algorithm, SoCap, and show that our algorithm outperforms the baselines on several real-life data sets. Specifically, in the DBLP network, our algorithm outperforms the PageRank, PMIA and Weighted Degree baselines by up to 8 % in terms of precision, recall and $$F_1$$-measure.

Journal ArticleDOI
TL;DR: The proposed framework SPADE has numerous benefits: accuracy of spam detection is improved through cross-domain classification and associative classification; other techniques can be integrated and centralized; and new social networks can plug into the system easily, preventing spam at an early stage.
Abstract: Social media such as Facebook, MySpace, and Twitter have become increasingly important for attracting millions of users. Consequently, spammers are increasingly using such networks for propagating spam. Although existing filtering techniques such as collaborative filters and behavioral analysis filters are able to significantly reduce spam, each social network needs to build its own independent spam filter and support a spam team to keep spam prevention techniques current. To alleviate these problems, we propose a framework for spam analytics and detection which can be used across all social network sites. Specifically, the proposed framework SPADE has numerous benefits, including: (1) new spam detected on one social network can quickly be identified across social networks; (2) the accuracy of spam detection is improved through cross-domain classification and associative classification; (3) other techniques (such as blacklists and message shingling) can be integrated and centralized; (4) new social networks can plug into the system easily, preventing spam at an early stage. In SPADE, we present a uniform schema model to allow cross-social-network integration. In this paper, we define the user, message, and web page models. Moreover, we provide an experimental study of real datasets from social networks to demonstrate the flexibility and feasibility of our framework. We extensively evaluated two major classification approaches in SPADE: cross-domain classification and associative classification. In cross-domain classification, SPADE achieved over 0.92 F-measure and over 91 % detection accuracy on the web page model using a Naive Bayes classifier. In associative classification, SPADE achieved 0.89 F-measure on the message model and 0.87 F-measure on the user profile model, respectively. Both detection accuracies are above 85 %. Based on these results, SPADE has been demonstrated to be a competitive spam detection solution for social media.
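
The Naive Bayes classification step evaluated in SPADE can be sketched with scikit-learn on a toy message corpus; the real system applies it over a uniform schema covering user, message, and web page models.

```python
# Sketch of the Naive Bayes step on an invented toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["cheap pills buy now", "meeting notes attached",
          "win a free prize click here", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free pills click now"]))  # -> ['spam']
```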

Journal ArticleDOI
TL;DR: It is shown that although more than 80 % of all friendships on Twitter are created due to data interests, 83 % of all users have at least one friendship that can be explained neither by the users' past interests nor by the collective behavior of other similar users.
Abstract: As the popularity and usage of social media exploded over the years, understanding how social network users' interests evolve gained importance in diverse fields, ranging from sociological studies to marketing. In this paper, we use two snapshots from the Twitter network and analyze data interest patterns of users over time to understand individual and collective user behavior on social networks. Building topical profiles of users, we propose novel metrics to identify anomalous friendships, and validate our results with Amazon Mechanical Turk experiments. We show that although more than 80 % of all friendships on Twitter are created due to data interests, 83 % of all users have at least one friendship that can be explained neither by the users' past interests nor by the collective behavior of other similar users.

Journal ArticleDOI
TL;DR: This work proposes for the first time a suite of metrics that can be used to perform post-hoc analysis of the temporal communities of a large-scale citation network of the computer science domain, and quantifies the impact of a field, the influence imparted by one field on the other, the distribution of the “star” papers and authors, the degree of collaboration and seminal publications to characterize such research trends.
Abstract: In this work, we propose for the first time a suite of metrics that can be used to perform post-hoc analysis of the temporal communities of a large-scale citation network of the computer science domain. Each community refers to a particular research field in this network, and therefore, they act as natural sub-groupings of this network (i.e., ground-truths). The interactions between these ground-truth communities through citations over real time naturally unfold the evolutionary landscape of the dynamic research trends in computer science. These interactions are quantified in terms of a metric called inwardness that captures the effect of local citations to express the degree of authoritativeness of a community (research field) at a particular time instance. In particular, we quantify the impact of a field, the influence imparted by one field on another, the distribution of the "star" papers and authors, the degree of collaboration, and seminal publications to characterize such research trends. In addition, we tear the data into three subparts representing the continents of North America, Europe and the rest of the world, and analyze how each of them influences one another as well as the global dynamics. We point to how the results of our analysis correlate with the project funding decisions made by agencies like NSF. We believe that this measurement study with large real-world data is an important initial step towards understanding the dynamics of cluster-interactions in a temporal environment. Note that this paper, for the first time, systematically outlines a new avenue of research that one can practice post community detection.

Journal ArticleDOI
TL;DR: Experiments with support vector machines (SVM) show that a supervised learning procedure increases the performance of the content-based recommender systems, yielding the best results in terms of AUC in comparison with other investigated methodologies such as Matrix Factorization and Collaborative Topic Regression.
Abstract: This paper presents content-based recommender systems which propose relevant jobs to Facebook and LinkedIn users. These systems have been developed at Work4, the Global Leader in Social and Mobile Recruiting. The profile of a social network user contains two types of data: user data and user friend data; furthermore, the profiles of our users and the descriptions of our jobs consist of text fields. The first experiments suggest that predicting the interests of users in jobs using basic similarity measures together with data collected by Work4 can be improved upon. The next experiments propose a method to estimate the importance of the different fields of users and jobs in the task of job recommendation; taking these weights into account allows us to significantly improve the recommendations. The third part of this paper analyzes social recommendation approaches, validating their suitability for job recommendation to Facebook and LinkedIn users. The last experiments focus on machine learning algorithms to improve the results obtained with basic similarity measures. Supervised learning with support vector machines (SVM) increases the performance of our content-based recommender systems, yielding the best results in terms of AUC in comparison with other investigated methodologies such as Matrix Factorization and Collaborative Topic Regression.

Journal ArticleDOI
TL;DR: In this article, the mismatch between hyperlinks and clickstreams is quantified by applying network science to publicly available hyperlink and click-stream data for approximately 1,000 of the top Web sites.
Abstract: The core of the Web is a hyperlink navigation system collaboratively set up by webmasters to help users find desired information. While it is well known that search engines are important for navigation, the extent to which search has led to a mismatch between hyperlinks and the pathways that users actually take has not been quantified. By applying network science to publicly available hyperlink and clickstream data for approximately 1,000 of the top Web sites, we show that the mismatch between hyperlinks and clickstreams is indeed substantial. We demonstrate that this mismatch has arisen because webmasters attempt to build a global virtual world without geographical or cultural boundaries, but users in fact prefer to navigate within more fragmented, language-based groups of Web sites. We call this type of behavior “preferential navigation” and find that it is driven by “local” search engines.

Journal ArticleDOI
TL;DR: This work uses a variant of the Apriori algorithm called APPM to mine users' influence paths, and is the first attempt to use action-specific influence chains for mining community leaders from a social network.
Abstract: This paper proposes an approach to discover community leaders in a social network by means of a probabilistic time-based graph propagation model. In the study, community leaders are defined as those people who initiate many influence chains, according to the number of chains they belong to. To conduct the approach, we define an exponential time-decay function to measure the influence of leaders and construct the chains of leaders' action-specific influence. Then, we build the general chains by normalizing over all possible users' actions. Specifically, this paper uses a variant of the Apriori algorithm called APPM to mine users' influence paths. To the best of our knowledge, this work is the first attempt to use action-specific influence chains for mining community leaders from a social network. In our experiments, two datasets are collected to examine the performance of the proposed algorithm: one is a social network dataset from Facebook consisting of 134 nodes and 517 edges, and the other is a citation network dataset from DBLP and Google Scholar containing 2,525 authors, 3,240 papers, and 1,914 citations. In addition, several baselines are also carried out for comparison, including three naive approaches and one user-involved approach. The experimental results show that the proposed method outperforms the baselines in the task of detecting community leaders, and also obtains good performance in ranking community leaders.
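
An exponential time-decay function of the kind described can be sketched directly; the decay rate lam below is an illustrative parameter, not the paper's fitted value.

```python
# Sketch: influence credited to a leader for a follower's repeated action
# decays exponentially with the elapsed time. lam is illustrative.
import math

def influence_weight(delta_t_hours, lam=0.1):
    """Weight of an action performed delta_t_hours after the leader's."""
    return math.exp(-lam * delta_t_hours)

for dt in (1, 12, 48):
    print(dt, round(influence_weight(dt), 3))
# Influence chains are then built by linking actions whose weights are high.
```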

Journal ArticleDOI
TL;DR: A novel method is proposed that predicts new social links to insert among existing users of a social network, aiming directly at boosting information spread and increasing its reach; the method is suitable for real-world applications.
Abstract: Social media have become popular platforms for spreading information. Several applications, such as 'viral marketing', pose the requirement of attaining large-scale information spread in the form of word-of-mouth that reaches a large number of users. In this paper, we propose a novel method that predicts new social links that can be inserted among existing users of a social network, aiming directly at boosting information spread and increasing its reach. We refer to this task as 'link injection', because unlike most existing people-recommendation methods, it focuses directly on information spread. A set of candidate links for injection is first predicted in a collaborative-filtering fashion, which generates personalized candidate connections. We select among the candidate links a constrained number that will be finally injected, based on a novel application of a score that measures the importance of nodes in a social graph, following the strategy of injecting links adjacent to the most important nodes. The proposed method is suitable for real-world applications, because the injected links manage to substantially increase the reach of information spread while at the same time controlling the number of injected links so as not to affect the user experience. We evaluate the performance of our proposed methodology by examining several real data sets from social networks under several distinct factors. The experimentation demonstrates the effectiveness of our proposed method, which increases the spread by more than a twofold factor by injecting as few as half of the existing number of links.

Journal ArticleDOI
TL;DR: This work presents the first academic, large-scale exploratory study of insider filings and related data, based on the complete Form 4 filings from the U.S. Securities and Exchange Commission, to help financial regulators and policymakers understand the dynamics of the trades, and enable them to adapt their detection strategies toward these dynamics.
Abstract: How do company insiders trade? Do their trading behaviors differ based on their roles (e.g., chief executive officer vs. chief financial officer)? Do those behaviors change over time (e.g., impacted by the 2008 market crash)? Can we identify insiders who have similar trading behaviors? And what does that tell us? This work presents the first academic, large-scale exploratory study of insider filings and related data, based on the complete Form 4 filings from the U.S. Securities and Exchange Commission. We analyze 12 million transactions by 370 thousand insiders spanning 1986–2012, the largest reported in academia. We explore the temporal and network-based aspects of the trading behaviors of insiders, and make surprising and counterintuitive discoveries. We study how the trading behaviors of insiders differ based on their roles in their companies, the types of their transactions, their companies' sectors, and their relationships with other insiders. Our work raises exciting research questions and opens up many opportunities for future studies. Most importantly, we believe our work could form the basis of novel tools for financial regulators and policymakers to detect illegal insider trading, help them understand the dynamics of the trades, and enable them to adapt their detection strategies toward these dynamics.

Journal ArticleDOI
TL;DR: This is the first effort to introduce a practical solution for digital recruitment campaigns that is large-scale, inexpensive, efficient and reaches out to individuals in near real-time as their needs are expressed.
Abstract: Digital recruitment is increasingly becoming a popular avenue for identifying human subjects for various studies. The process starts with an online ad that describes the task and explains expectations. As social media has exploded in popularity, efforts are being made to use social media advertisement for various recruitment purposes. There are, however, many unanswered questions about how best to do that. In this paper, we present an innovative Twitter recruitment system for a smoking cessation nicotine patch study. The goals of the paper are to: (1) present the approach we have taken to solve the problem of digital recruitment; (2) provide the system specification and design of a rule-based system; (3) present the algorithms and data mining approaches (classification and association analysis) using Twitter data; and (4) present the promising outcome of the initial version of the system and summarize the results. This is the first effort to introduce a practical solution for digital recruitment campaigns that is large-scale, inexpensive, efficient and reaches out to individuals in near real-time as their needs are expressed. A continuous update on how our system is performing, in real-time, can be viewed at https://twitter.com/TobaccoQuit .