
Showing papers in "Social Network Analysis and Mining in 2019"


Journal ArticleDOI
TL;DR: A large dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using volume analysis and text mining techniques such as topic modeling and sentiment analysis to compare and contrast the nature of climate change discussion between different countries and over time.
Abstract: Social media websites can be used as a data source for mining public opinion on a variety of subjects including climate change. Twitter, in particular, allows for the evaluation of public opinion across both time and space because geotagged tweets include timestamps and geographic coordinates (latitude/longitude). In this study, a large dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using volume analysis and text mining techniques such as topic modeling and sentiment analysis. Latent Dirichlet allocation was applied for topic modeling to infer the different topics of discussion, and Valence Aware Dictionary and sEntiment Reasoner was applied for sentiment analysis to determine the overall feelings and attitudes found in the dataset. These techniques are used to compare and contrast the nature of climate change discussion between different countries and over time. Sentiment analysis shows that the overall discussion is negative, especially when users are reacting to political or extreme weather events. Topic modeling shows that the different topics of discussion on climate change are diverse, but some topics are more prevalent than others. In particular, the discussion of climate change in the USA is less focused on policy-related topics than other countries.
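
As a concrete illustration of the pipeline this abstract describes, the sketch below pairs gensim's LDA topic model with the VADER analyzer from the vaderSentiment package. It is a minimal stand-in for the study's workflow; the example tweets and parameter choices are placeholders, not data from the study.

```python
# Minimal sketch: LDA topic modeling (gensim) + VADER sentiment scoring.
from gensim import corpora, models
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

tweets = [
    "climate change is making hurricanes stronger",
    "new climate policy announced by the government",
    "global warming hoax claims debunked again",
]

# --- Topic modeling with Latent Dirichlet Allocation ---
tokenized = [t.lower().split() for t in tweets]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# --- Sentiment scoring with VADER ---
analyzer = SentimentIntensityAnalyzer()
for t in tweets:
    print(t, analyzer.polarity_scores(t)["compound"])  # compound score in [-1, 1]
```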

151 citations


Journal ArticleDOI
TL;DR: The main aim of the work is to process raw sentences from the Twitter dataset and find the actual polarity of each message; the proposed model performs well in normalizing and analyzing the sentiment of raw Twitter data enriched with hidden information.
Abstract: On social media platforms such as Twitter and Facebook, people express their views, arguments, and emotions about many events in daily life. Twitter is an international microblogging service featuring short messages, called "tweets", in different languages. These texts often contain noise in the form of incorrect grammar, abbreviations, freestyle writing, and typographical errors. Sentiment analysis (SA) aims to predict the actual emotions expressed by people in raw text using natural language processing (NLP). The main aim of our work is to process raw sentences from the Twitter dataset and find the actual polarity of each message. This paper proposes a text normalization with deep convolutional character-level embedding (Conv-char-Emb) neural network model for SA of unstructured data. This model tackles three problems: (1) processing noisy sentences for sentiment detection, (2) reducing the memory needed compared with word-level embedded learning, and (3) accurately analyzing the sentiment of unstructured data. The initial preprocessing stage for text normalization includes the following steps: tokenization, out-of-vocabulary (OOV) detection and replacement, lemmatization and stemming. A character-based embedding in a convolutional neural network (CNN) is an effective and efficient technique for SA that uses fewer learnable parameters in feature representation. Thus, the proposed method performs both normalization and sentiment classification for unstructured sentences. The experimental results are evaluated on the Twitter dataset with three polarity classes (positive, negative and neutral). As a result, our model performs well in normalization and sentiment analysis of raw Twitter data enriched with hidden information.
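
The authors' Conv-char-Emb architecture is not reproduced here; the sketch below shows only the general shape of a character-level CNN for three-way tweet polarity in tf.keras, with illustrative vocabulary size, sequence length and layer widths.

```python
# Minimal character-level CNN sketch for 3-class tweet polarity (illustrative sizes).
import tensorflow as tf

MAX_CHARS = 140   # pad/truncate tweets to a fixed character length
VOCAB_SIZE = 128  # e.g. ASCII code points as character ids

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_CHARS,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 16),        # character embedding
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),   # positive/negative/neutral
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```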

73 citations


Journal ArticleDOI
TL;DR: This work introduces a corpus of forty thousand labeled Arabic tweets spanning several topics and presents three deep learning models, namely CNN, LSTM and RCNN, for Arabic sentiment analysis, with the help of word embedding.
Abstract: Social media are considered an excellent source of information and can provide opinions, thoughts and insights toward various important topics. Sentiment analysis has become a hot research topic due to its importance in making decisions based on opinions derived from analyzing users' contents on social media. Although Arabic is one of the most widely spoken languages used for content sharing across social media, sentiment analysis of Arabic content is limited due to several challenges, including the morphological structure of the language, the variety of dialects and the lack of appropriate corpora. Hence, research in Arabic sentiment analysis has grown slowly in contrast to other languages such as English. The contribution of this paper is twofold: First, we introduce a corpus of forty thousand labeled Arabic tweets spanning several topics. Second, we present three deep learning models, namely CNN, LSTM and RCNN, for Arabic sentiment analysis. With the help of word embeddings, we validate the performance of the three models on the proposed corpus. The experimental results indicate that LSTM, with an average accuracy of 81.31%, outperforms CNN and RCNN. Also, applying data augmentation on the corpus increases LSTM accuracy by 8.3%.

62 citations


Journal ArticleDOI
TL;DR: In this article, the authors explore the effects of the Russian manipulation campaign on the 2016 U.S. election, taking a closer look at users who re-shared posts produced on Twitter by the Russian troll accounts publicly disclosed by the U.S. Congress investigation, and reveal that conservative trolls talk about refugees, terrorism, and Islam, while liberal trolls talk more about school shootings and the police.
Abstract: Until recently, social media were seen to promote democratic discourse on social and political issues. However, this powerful communication ecosystem has come under scrutiny for allowing hostile actors to exploit online discussions in an attempt to manipulate public opinion. A case in point is the ongoing U.S. Congress investigation of Russian interference in the 2016 U.S. election campaign, with Russia accused of, among other things, using trolls (malicious accounts created for the purpose of manipulation) and bots (automated accounts) to spread propaganda and politically biased information. In this study, we explore the effects of this manipulation campaign, taking a closer look at users who re-shared the posts produced on Twitter by the Russian troll accounts publicly disclosed by the U.S. Congress investigation. We collected a dataset of 13 million election-related posts shared on Twitter in 2016 by over a million distinct users. This dataset includes accounts associated with the identified Russian trolls as well as users sharing posts in the same time period on a variety of topics around the 2016 elections. We use label propagation to infer the users' ideology based on the news sources they share. We are able to classify a large number of the users as liberal or conservative with precision and recall above 84%. Conservative users who retweet Russian trolls produced significantly more tweets than liberal ones, about eight times as many. Additionally, the trolls' position in the retweet network is stable over time, unlike the users who retweet them, who form the core of the election-related retweet network by the end of 2016. Using state-of-the-art bot detection techniques, we estimate that about 5% of liberal users and 11% of conservative users are bots. Text analysis of the content shared by trolls reveals that conservative trolls talk about refugees, terrorism, and Islam, while liberal trolls talk more about school shootings and the police. Although an ideologically broad swath of Twitter users was exposed to Russian trolls in the period leading up to the 2016 U.S. Presidential election, it is mainly conservatives who helped amplify their message.
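
The study's label propagation operates on labels derived from shared news sources over a very large graph; the toy sketch below shows only the propagation mechanism itself, on a made-up retweet graph with two hand-labeled seed users.

```python
# Toy label propagation sketch; seeds and graph are invented placeholders.
import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
seeds = {"a": -1.0, "e": +1.0}  # -1 = liberal seed, +1 = conservative seed

scores = {n: seeds.get(n, 0.0) for n in G}
for _ in range(50):  # iterate to (approximate) convergence
    for n in G:
        if n in seeds:
            continue  # seed labels stay fixed
        nbrs = list(G.neighbors(n))
        scores[n] = sum(scores[m] for m in nbrs) / len(nbrs)

for n, s in scores.items():
    print(n, "conservative" if s > 0 else "liberal" if s < 0 else "undecided")
```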

58 citations


Journal ArticleDOI
TL;DR: The main goal of this paper is to give a comprehensive survey of community detection algorithms in social graphs, providing a taxonomy based on their computational nature (centralized or distributed) and on whether they target static or dynamic social networks.
Abstract: Community detection is an important research area in social network analysis concerned with discovering the structure of the social network. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. The problem is NP-hard and not yet solved to a satisfactory level, and its computational difficulty is compounded by two major factors. The first is the huge size of today's social networks, such as Facebook and Twitter, which reach billions of nodes. The second is the dynamic nature of social networks, whose structure evolves over time. For these reasons, community detection in social network analysis is gaining increasing attention in the scientific community, and a lot of research has been done in this area. The main goal of this paper is to give a comprehensive survey of community detection algorithms in social graphs. To this end, we provide a taxonomy of existing models based on their computational nature (centralized or distributed) and on whether they target static or dynamic social networks. In addition, we provide a comprehensive overview of existing applications of community detection in social networks. Finally, we provide further research directions as well as some open challenges.

57 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed strategies are more effective than classical alternatives that are agnostic of the community structure and outperform alternative local and global strategies designed for modular networks.
Abstract: Although community structure is ubiquitous in complex networks, few works exploit this topological property to control epidemics. In this work, devoted to networks with non-overlapping community structure (i.e., a node belongs to a single community), we propose and investigate three global immunization strategies. In order to characterize the influence of a node, various pieces of information are used such as the number of communities that the node can reach in one hop, the nature of the links (intra-community links, inter-community links), the size of the communities and the interconnection density between communities. Numerical simulations with the susceptible-infected-removed epidemiological model are conducted on both real-world and synthetic networks. Experimental results show that the proposed strategies are more effective than classical alternatives that are agnostic of the community structure. Additionally, they outperform alternative local and global strategies designed for modular networks.
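
A minimal sketch of one of the community-aware ideas described (ranking nodes by the number of communities they reach in one hop), followed by a discrete SIR simulation. Community detection via networkx's greedy modularity heuristic, the karate-club example graph, and all parameters are illustrative assumptions, not the paper's exact strategies.

```python
# Community-aware immunization sketch + discrete SIR simulation.
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()
communities = list(greedy_modularity_communities(G))
comm_of = {n: i for i, c in enumerate(communities) for n in c}

# Rank nodes by how many distinct communities they reach in one hop.
reach = {n: len({comm_of[m] for m in G.neighbors(n)}) for n in G}
immunized = set(sorted(reach, key=reach.get, reverse=True)[:5])

def sir_final_size(G, immunized, beta=0.3, gamma=0.2, seed=0):
    """Discrete-time SIR; immunized nodes never transmit or get infected."""
    rng = random.Random(seed)
    infected = {rng.choice([n for n in G if n not in immunized])}
    recovered = set()
    while infected:
        new_inf = {m for n in infected for m in G.neighbors(n)
                   if m not in immunized and m not in infected
                   and m not in recovered and rng.random() < beta}
        recovered |= {n for n in infected if rng.random() < gamma}
        infected = (infected | new_inf) - recovered
    return len(recovered)

print("final epidemic size:", sir_final_size(G, immunized))
```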

56 citations


Journal ArticleDOI
TL;DR: This work creates the first public Arabic dataset of tweets annotated for religious hate speech detection, along with three public Arabic lexicons of religion-related terms with hate scores, and presents a thorough analysis of the labeled dataset.
Abstract: Religious hatred is a serious problem on the Arabic Twitter space and has the potential to ignite terrorism and hate crimes beyond cyberspace. To the best of our knowledge, this is the first research effort investigating the problem of recognizing Arabic tweets that use inflammatory and dehumanizing language to promote hatred and violence against people on the basis of religious beliefs. In this work, we create the first public Arabic dataset of tweets annotated for religious hate speech detection. We also create three public Arabic lexicons of religion-related terms along with hate scores. We then present a thorough analysis of the labeled dataset, reporting the most targeted religious groups and the countries of origin of hateful and non-hateful tweets. The labeled dataset is then used to train seven classification models using lexicon-based, n-gram-based, and deep-learning-based approaches. These models are evaluated on a new, unseen dataset to assess the generalization ability of the developed classifiers. While using Gated Recurrent Units with pre-trained word embeddings provides the best precision (0.76) and F1 score (0.77), training the same neural network on additional temporal, user, and content features provides state-of-the-art performance in terms of recall (0.84).

40 citations


Journal ArticleDOI
TL;DR: A network-aware privacy score is defined that improves the measurement of user privacy risk according to the characteristics of the network, assuming that users who lie in an unsafe portion of the network are more at risk than users who are mostly surrounded by privacy-aware friends.
Abstract: Online social networks expose their users to privacy leakage risks. To measure the risk, privacy scores can be computed to quantify the users' profile exposure according to their privacy preferences or attitude. However, user privacy can also be influenced by external factors (e.g., the relative risk of the network, the position of the user within the social graph), and state-of-the-art scores do not consider such properties adequately. We define a network-aware privacy score that improves the measurement of user privacy risk according to the characteristics of the network. We assume that users who lie in an unsafe portion of the network are more at risk than users who are mostly surrounded by privacy-aware friends. The effectiveness of our measure is analyzed by means of extensive experiments on two simulated networks and a large graph of real social network users.
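
A minimal sketch of a network-aware privacy score in the spirit described: a user's risk combines their own profile exposure with their neighborhood's risk. The fixed-point formula, the weight ALPHA, and the toy exposure values are illustrative assumptions, not the paper's definition.

```python
# Network-aware privacy risk sketch via fixed-point iteration (illustrative).
import networkx as nx

G = nx.Graph([("u1", "u2"), ("u2", "u3"), ("u3", "u4")])
exposure = {"u1": 0.9, "u2": 0.2, "u3": 0.8, "u4": 0.1}  # intrinsic profile exposure

ALPHA = 0.6  # weight of a user's own exposure vs. the neighborhood's risk
risk = dict(exposure)
for _ in range(20):  # Jacobi-style fixed-point iteration
    risk = {
        n: ALPHA * exposure[n]
           + (1 - ALPHA) * sum(risk[m] for m in G[n]) / G.degree(n)
        for n in G
    }
print(risk)  # users near risky neighbors end up with higher scores
```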

30 citations


Journal ArticleDOI
TL;DR: This survey presents recent work on sentiment analysis in Arabic and describes emergent trends related to Arabic sentiment analysis, principally associated with the use of deep learning techniques.
Abstract: To determine whether a document or a sentence expresses a positive or negative sentiment, three main approaches are commonly used: the lexicon-based approach, the corpus-based approach, and a hybrid approach. English has the largest number of sentiment analysis studies, while research is more limited for other languages, including Arabic and its dialects. Lexicon-based approaches need annotated sentiment lexicons (containing the valence and intensity of their terms and expressions). Corpus-based sentiment analysis requires annotated sentences. One of the significant problems in the treatment of Arabic and its dialects is the lack of these resources. This survey presents the most recent resources and advances in Arabic sentiment analysis, covering work mostly published between 2015 and 2019 and classified by category (survey work or contribution work). For contribution work, we focus on the construction of sentiment lexicons and corpora. We also describe emergent trends related to Arabic sentiment analysis, principally associated with the use of deep learning techniques.

28 citations


Journal ArticleDOI
TL;DR: This work presents a systematic literature review of the detection of false information spread across online content and its role in decision making, and describes four deep learning and eight machine learning techniques for false information detection.
Abstract: This work presents a review of the detection of false information spread across online content and its role in decision making. The authenticity of information is an emerging issue that affects society and individuals and has a negative impact on people's decision-making capabilities. The purpose is to understand how different techniques can be used to address the challenge. Articles published between 2014 and 2018 were identified through a systematic literature review, in which 30 papers were selected by applying inclusion–exclusion criteria. This review classifies false information spreading on social media into four types. Furthermore, we describe four deep learning and eight machine learning techniques for false information detection. The outcomes of this review will provide researchers with an insight into the different types of false information, the associated detection techniques, and the relationship between false information and decision making. In the field of false information detection, previous studies provided reviews of the literature; however, we conducted a systematic literature review that provides specific answers to the proposed research questions. Our contribution is therefore novel to the field, because this type of study has not been performed previously.

26 citations


Journal ArticleDOI
TL;DR: In this article, the authors use motifs to characterize the information diffusion process in social networks and study the lifecycle of information cascades to understand what leads to saturation of growth in terms of cascade reshares, thereby resulting in expiration.
Abstract: Network motifs are patterns of over-represented node interactions in a network which have previously been used as building blocks to understand various aspects of social networks. In this paper, we use motif patterns to characterize the information diffusion process in social networks. We study the lifecycle of information cascades to understand what leads to saturation of growth in terms of cascade reshares, thereby resulting in expiration, an event we call "diffusion inhibition". In an attempt to understand what causes inhibition, we use motifs to dissect the network obtained from information cascades coupled with traces of historical diffusion or social network links. Our main results follow from experiments on a dataset of cascades from the Weibo platform and the Flixster movie ratings. We observe the temporal counts of 5-node undirected motifs from the cascade temporal networks leading to the inhibition stage. Empirical evidence from the analysis leads us to conclude the following about the stages preceding inhibition: (1) individuals tend to adopt information more from users they have known in the past through social networks or previous interactions, thereby creating patterns containing triads more frequently than acyclic patterns with linear chains, and (2) users need multiple exposures or rounds of social reinforcement to adopt information, and as a result the information starts spreading slowly, leading to the death of the cascade. Following these observations, we use motif-based features to predict the edge cardinality of the network exhibited at the time of inhibition. We test motif-pattern features using regression models, both for individual patterns and in combination, and we find that motifs as features are better predictors of the future network organization than individual node centralities.
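
The paper counts 5-node undirected motifs; the sketch below illustrates the same triads-versus-linear-chains contrast on 3-node patterns, which keeps the counting logic short. The toy graph is a placeholder.

```python
# Count closed triads (triangles) vs. open 2-paths (linear chains).
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)])

triangles = sum(nx.triangles(G).values()) // 3  # each triangle counted at 3 nodes
# Every node of degree d centers C(d, 2) two-paths; each triangle closes 3 of them.
open_chains = sum(d * (d - 1) // 2 for _, d in G.degree()) - 3 * triangles

print("triangles:", triangles, "open chains:", open_chains)  # 1 and 3 here
```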

Journal ArticleDOI
TL;DR: A novel method to measure group quality in Telegram is devised, based on several features extracted and studied in this paper, which aims to provide a deeper insight into Telegram users' behavior.
Abstract: Telegram is a cloud-based instant messenger with more than 200 million monthly active users. The new features introduced in Telegram, such as channels, bots, supergroups, and advanced sharing mechanisms, have raised the instant messenger to a higher level. Telegram is now a new paradigm between social networks and instant messengers. Telegram is very popular and growing rapidly in many countries such as Iran, Uzbekistan, Indonesia, Brazil, and Russia. The importance and popularity of Telegram are even more highlighted by the fact that, after censorship in Iran and Russia, users did not start using other messaging apps but instead turned to VPN services to circumvent the block, rendering the censorship ineffective. A detailed analysis of Iranian users' behavior in Telegram is shown in this research for the first time. The analysis and statistics may help interested parties gain a deeper insight into Telegram users' behavior. More than 900,000 Persian channels and 300,000 Persian supergroups have been discovered, crawled and inspected in this study. Based on our explorations, we devised a novel method to measure group quality in Telegram. Group quality measurement is accomplished through several features we have extracted and studied in this paper.

Journal ArticleDOI
TL;DR: This paper personalizes reward-based gamification based on users' interests by utilizing their intrinsic motivation, and shows that personalizing gamification using the proposed method increases the time users spend in a social network in the long run.
Abstract: Reward-based gamification, which increases user engagement in social networks, is known to provide extrinsic motivation to users over the short term. However, in every social network there are different types of users with different interests, and according to several studies, different users are interested in different game elements. In this paper, we aim to personalize reward-based gamification based on the users' interests. In this way, by utilizing the intrinsic motivation of users, their engagement with social networks will increase in the long run. To this end, the SNPG method, which applies a fuzzy-like concept to game elements, has been proposed. In addition, a two-round experiment on user engagement with a social network using the proposed method has been conducted. Results show that personalizing gamification using the proposed method increases the time users spend in a social network by 63.34% and their page views by 141.9% in the long run, in comparison with regular gamification. Furthermore, gender differences in using game elements have been studied; results show that men are 5% more interested in personalized gamification than women, and the leaderboard is the most popular game element among both men and women.

Journal ArticleDOI
TL;DR: In this study, prominent spam detection methods are analyzed; how real users and fake users are distinguished, as well as the weak and strong aspects of the methods for these processes, are compared and evaluated.
Abstract: Social networks have become an inseparable part of our lives today. Services such as Facebook, Twitter, Instagram, Google+ and LinkedIn in particular have had a significant place in Internet use in recent years. People establish instant interactions with each other over the Internet using these social services. They gain many advantages, such as creating their own groups, being informed about different interest areas and being able to make many contacts. Twitter is one of the most used platforms among the social networks. A social network that is used so commonly has become a target for malicious people (spammers). There is an increase in the number of spammers on Twitter too. Malicious content and messages (spam) prepared by spammers threaten both security and performance. The first and most important condition for protecting against this threat is to know the harmful methods of spam, which makes spam easier to detect and defend against. In this study, prominent spam detection methods are analyzed; how real users and fake users are distinguished, as well as the weak and strong aspects of the methods for these processes, are compared and evaluated.

Journal ArticleDOI
TL;DR: The results show that Twitter information could be converted into a performant tool for the candidates' digital campaign departments, helping to clarify the impact of their messages on future voters.
Abstract: Social networking services such as Twitter have become very popular communication tools for Internet and mobile users. Our paper contributes to the topical debate on whether tweets can shed light on future elections. Using tweets collected during the Spanish 2019 presidential campaign between April 12 and April 26, we perform a statistical and computational analysis (based on R software) in order to reveal the political discourse of the parties engaged, highlight the main messages conveyed, and measure their resulting impact on each candidate's share of voice. Our results show that Twitter information could be converted into a performant tool for the candidates' digital campaign departments, helping to clarify the impact of their messages on future voters. Our methodology is based on the use of different machine learning algorithms to clean and analyse 1.7 million tweets.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach enables the use of social network images for post-disaster infrastructure damage assessment and provides an inexpensive and feasible alternative to the more expensive GIS approach.
Abstract: Traditional post-disaster assessment of damage heavily relies on expensive geographic information system (GIS) data, especially remote sensing image data. In recent years, social media have become a rich source of disaster information that may be useful in assessing damage at a lower cost. Such information includes text (e.g., tweets) or images posted by eyewitnesses of a disaster. Most of the existing research explores the use of text in identifying situational awareness information useful for disaster response teams. The use of social media images to assess disaster damage is limited. We have recently proposed a novel approach, based on convolutional neural networks and class activation mapping, to locate building damage in a disaster image and to quantify the degree of the damage. In this paper, we study the usefulness of the proposed approach for other categories of infrastructure damage, specifically bridge and road damage, and compare two class activation mapping approaches in this context. Experimental results show that our proposed approach enables the use of social network images for post-disaster infrastructure damage assessment and provides an inexpensive and feasible alternative to the more expensive GIS approach.
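
Plain class activation mapping, one representative of the CAM-style approaches compared, reduces to a weighted sum of the last convolutional feature maps, as the numpy sketch below shows; shapes and values are illustrative placeholders.

```python
# Class activation mapping (CAM) sketch: with a CNN ending in global average
# pooling + a dense layer, the map for class c is the dense weights for c
# applied across the last conv feature maps.
import numpy as np

H, W, K, NUM_CLASSES = 7, 7, 512, 3          # feature grid, channels, classes
feature_maps = np.random.rand(H, W, K)       # last conv output for one image
dense_w = np.random.rand(K, NUM_CLASSES)     # final dense-layer weights

c = 2  # class of interest, e.g. a hypothetical "severe damage" class
cam = np.tensordot(feature_maps, dense_w[:, c], axes=([2], [0]))  # shape (H, W)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)          # normalize to [0, 1]
print(cam.shape)  # upsample to the image size to localize damaged regions
```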

Journal ArticleDOI
TL;DR: TaxoSoft, a methodology for building a soft-skill taxonomy, is developed; it uses DBpedia and Word2Vec to find terms related to different soft skills and uses social network analysis to build a hierarchy of terms.
Abstract: Soft skills are crucial for candidates in the job market, and analyzing these skills listed in job ads can help in identifying the most important soft skills required by recruiters. This analysis can benefit from building a taxonomy to extract soft skills. However, most prior work is primarily focused on building hard skill taxonomies. Unfortunately, methodologies for building hard skill taxonomies do not work well for soft skills, due to the wide variety of terminologies used to list soft skills in job ads. Moreover, prior work has mainly focused on extracting soft skills from job ads using a simple keyword search, which can fail to detect the different forms in which soft skills are listed in job ads. In this paper, we develop TaxoSoft, a methodology for building a soft skill taxonomy that uses DBpedia and Word2Vec in order to find terms related to different soft skills. TaxoSoft also uses social network analysis to build a hierarchy of terms. We use this method to build soft skill taxonomies in both English and French. We evaluate TaxoSoft on a sample of job ads and find that it achieves an F-score of 0.84, while taxonomies developed in prior work achieve an F-score of only 0.54. We then use the proposed methodology to analyze soft skills listed in job ads in order to find the skills most required in the American and Moroccan job markets. Our findings can offer insights to universities about the top soft skills requested in the job market.
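
A short sketch of the Word2Vec step: given job-ad sentences, retrieve candidate terms related to a seed soft skill with gensim. The corpus and parameters below are placeholders, not the paper's data.

```python
# Word2Vec sketch: find terms related to a seed soft skill (toy corpus).
from gensim.models import Word2Vec

sentences = [
    ["strong", "communication", "and", "teamwork", "skills"],
    ["excellent", "interpersonal", "communication", "abilities"],
    ["leadership", "and", "teamwork", "experience", "required"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("teamwork", topn=3))  # candidate related terms
```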

Journal ArticleDOI
TL;DR: An approach that extracts features of textual information, a widely available source of information in the venue category, to compute a confidence metric for the ratings derived from texts; this metric is then used, along with the computed rating, in the user similarity computation and venue rating prediction process.
Abstract: One of the major problems that social media face is to continuously produce successful, user-targeted information, in the form of recommendations, which are produced by applying methods from the area of recommender systems. One of the most important applications of recommender systems in social networks is venue recommendation, targeted by the majority of the leading social networks (Facebook, TripAdvisor, OpenTable, etc.). However, recommender systems' algorithms rely only on the existence of numeric ratings, which are typically entered by users, and in the context of social networks this information is scarce, since many social networks allow only reviews rather than explicit ratings. Even if explicit ratings are supported, users may still resort to expressing their views and rating their experiences through submitting posts, which is the predominant user practice in social networks, rather than entering explicit ratings. User posts contain textual information, which can be exploited to compute derived ratings, and these derived ratings can be used in the recommendation process in the absence of explicitly entered ratings. Emerging recommender systems encompass this approach, without, however, tackling the fact that the ratings computed on the basis of textual information may be inaccurate, due to the very nature of the computation process. In this paper, we present an approach that extracts features of textual information, a widely available source of information in the venue category, to compute a confidence metric for the ratings that are computed from texts; then, this confidence metric is used in the user similarity computation and venue rating prediction process, along with the computed rating. Furthermore, we propose a venue recommendation method that considers the generated venue rating predictions, along with venue QoS, similarity and spatial distance metrics, in order to generate venue recommendations for social network users. Finally, we validate the accuracy of the rating prediction method and the user satisfaction from the recommendations generated by the recommendation formulation algorithm. Conclusively, the introduction of the confidence level significantly improves rating prediction accuracy, leverages the ability to generate personalized recommendations for users and increases user satisfaction.
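
A minimal sketch of the confidence idea: weight each neighbor's text-derived rating by a confidence value inside a similarity-weighted prediction. The weighting scheme, data structures and values below are illustrative assumptions, not the authors' exact formula.

```python
# Confidence-weighted rating prediction sketch (illustrative formula).
def predict_rating(target_user, venue, ratings, confidence, similarity):
    """ratings[u][v]: text-derived rating; confidence[u][v] in [0, 1];
    similarity[u1][u2]: user-user similarity in [0, 1]."""
    num, den = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == target_user or venue not in their_ratings:
            continue
        w = similarity[target_user][other] * confidence[other][venue]
        num += w * their_ratings[venue]
        den += w
    return num / den if den else None

ratings = {"alice": {"cafe": 4.5}, "bob": {"cafe": 2.0}}
confidence = {"alice": {"cafe": 0.9}, "bob": {"cafe": 0.3}}  # trust in each text rating
similarity = {"carol": {"alice": 0.8, "bob": 0.6}}
print(predict_rating("carol", "cafe", ratings, confidence, similarity))  # -> 4.0
```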

Journal ArticleDOI
TL;DR: The findings suggest that Trump may have indeed possessed unique appeal to individuals drawn to hateful ideologies; however, such individuals constituted a small fraction of the sampled population.
Abstract: We characterize the Twitter networks of the major presidential candidates, Donald J. Trump and Hillary R. Clinton, with various American hate groups defined by the US Southern Poverty Law Center (SPLC) ...

Journal ArticleDOI
TL;DR: A novel methodology based on a gradient approach is proposed to deal with the problem of influence maximization in a social network; it provides a balance between influence spread and execution time.
Abstract: In social network analysis, one of the significant problems is finding the most influential entities within the network, which has been proved to be NP-hard. The problem of influence maximization in a social network is an optimization problem that ensures that the spread of influence in the network is maximized. Although many algorithms have been proposed for influence maximization, most of them achieve influence spread at the cost of execution time. Therefore, a novel methodology based on a gradient approach is proposed in this paper to deal with the problem. This approach provides a balance between influence spread and execution time. In this research, the performance of the proposed algorithm is compared with that of existing algorithms, and a better influence spread per second is observed. This task has significance in viral marketing, since the most influential entities can be targeted for endorsing new products in the market at a faster rate.
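
The gradient formulation itself is not given in this abstract; below is a sketch of the classic baseline whose runtime cost motivates it: greedy seed selection with Monte Carlo estimates of spread under the independent cascade model, on a synthetic graph.

```python
# Greedy influence maximization baseline under the independent cascade model.
import random
import networkx as nx

def ic_spread(G, seeds, p=0.1, runs=100):
    """Monte Carlo estimate of expected spread under independent cascade."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), set(seeds)
        while frontier:
            frontier = {m for n in frontier for m in G.successors(n)
                        if m not in active and random.random() < p}
            active |= frontier
        total += len(active)
    return total / runs

def greedy_im(G, k):
    """Classic greedy seed selection: strong spread, high runtime cost."""
    seeds = set()
    for _ in range(k):
        best = max((n for n in G if n not in seeds),
                   key=lambda n: ic_spread(G, seeds | {n}))
        seeds.add(best)
    return seeds

G = nx.gnp_random_graph(100, 0.05, directed=True, seed=1)
print("seed set:", greedy_im(G, k=3))
```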

Journal ArticleDOI
TL;DR: The DBSCAN clustering algorithm is implemented to detect outliers in the process of detecting communities in a social network; these outliers, also known as "noisy nodes", are removed from the main network graph.
Abstract: Detecting communities and depicting the interactions between entities and individuals in real-world network graphs is a challenging problem. There are many conventional ways to detect the interconnected nodes that lead to the detection of communities. The strength of the detected communities can be assessed by their modularity, a measurement of the structure of a graph, and increasing modularity is itself a challenging problem. In this work, the DBSCAN clustering algorithm is implemented to detect outliers in the process of detecting communities in a social network; these outliers, also known as "noisy nodes", are removed from the main network graph. The proposed algorithm mainly focuses on the detection and removal of noisy nodes or outliers in the detected communities, which improves the quality of the detected communities. Among previous community detection algorithms, some need the number of communities to be specified before the communities are formed, which can prevent forming good communities, while others cannot operate on huge amounts of data or require a huge amount of memory. The proposed algorithm does not require the number of communities to be specified in advance, has been tested on large networks of more than 1000 nodes, and does not require much space, thereby overcoming the mentioned limitations of previous community detection algorithms. The data have been collected from the social network websites Facebook and Twitter. The communities formed by the proposed algorithm have been compared with the results of four other community detection algorithms: the Louvain, Walktrap, Leading Eigenvector, and Fastgreedy algorithms. The proposed methodology performs well, increasing the strength of the detected communities.
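
A sketch of the described pipeline: mark DBSCAN noise points (the "noisy nodes") and remove them before detecting communities. Shortest-path distances and networkx's greedy modularity routine stand in for whatever node representation and detector the paper uses; a small pendant chain is appended to the karate-club example so that some noise actually appears.

```python
# DBSCAN noise removal before community detection (illustrative pipeline).
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.cluster import DBSCAN

G = nx.karate_club_graph()
G.add_edges_from([(33, 34), (34, 35)])  # pendant chain: node 35 ends up as noise

nodes = list(G)
# Pairwise shortest-path distances as a precomputed metric for DBSCAN.
dist = np.array([[nx.shortest_path_length(G, a, b) for b in nodes] for a in nodes])

labels = DBSCAN(eps=1.0, min_samples=4, metric="precomputed").fit_predict(dist)
noisy = {n for n, lab in zip(nodes, labels) if lab == -1}  # the "noisy nodes"

G_clean = G.subgraph(set(nodes) - noisy)
communities = list(greedy_modularity_communities(G_clean))
print("removed:", noisy, "communities found:", len(communities))
```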

Journal ArticleDOI
TL;DR: A Stacking-based Ensemble using Statistical features and Informative Words (SESIW) is proposed for detecting tweets related to damage assessment; it outperforms the baseline SVM with a Bag-of-Words model.
Abstract: Nowadays, Twitter has become popular among users for communicating information, especially during disasters. Identifying tweets related to the target event during a disaster is a challenging task. Many prior studies discussed situational and non-situational information related to disasters. Detecting tweets related to damage assessment is particularly difficult in social media because such tweets are a subset of situational information. Existing damage assessment works suffer from one of the following drawbacks: (1) they focus only on infrastructure damage and do not include human damage in the assessment, (2) they focus only on social media image data for damage assessment, or (3) they focus only on regional-language tweets. To overcome these issues, a Stacking-based Ensemble using Statistical features and Informative Words (SESIW) is proposed for detecting tweets related to damage assessment. It uses the proposed features, namely the frequency of hashtags, user mentions, wh-words and URLs, the count of numerals, and informative words. Informative words are mined using the term frequency–inverse document frequency technique. The SESIW method is tested on different Twitter disaster datasets, and it outperforms the baseline SVM with a Bag-of-Words model.
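
A minimal sketch of a stacking ensemble over a few of the listed statistical features, using scikit-learn. The paper's feature set and base learners are richer; the tiny training set here exists only to make the snippet runnable.

```python
# Stacking ensemble sketch over simple tweet statistics (illustrative).
import re
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def features(tweet):
    return [
        tweet.count("#"),                          # hashtag frequency
        tweet.count("@"),                          # user mentions
        len(re.findall(r"https?://\S+", tweet)),   # URLs
        len(re.findall(r"\d+", tweet)),            # count of numerals
    ]

train = [
    ("5 houses destroyed #flood http://pic.example/x", 1),
    ("bridge collapsed, 3 injured #earthquake", 1),
    ("lovely weather today @bob", 0),
    ("having coffee with @ann downtown", 0),
]
X = [features(t) for t, _ in train]
y = [label for _, label in train]

clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", LinearSVC())],
    final_estimator=LogisticRegression(),
    cv=2,
)
clf.fit(X, y)
print(clf.predict([features("roads flooded near 2 schools #storm")]))
```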

Journal ArticleDOI
TL;DR: The problem of influence maximization is studied using two aspects, node connectivity and node activity level, and the UAC-Rank algorithm is proposed for the identification of initial adopters.
Abstract: Influence maximization deals with the problem of identifying a k-size subset of nodes in a social network that can maximize the influence spread in the network. In this paper, the problem of influence maximization is studied using two aspects: node connectivity and node activity level. To measure node connectivity, the widely used and intuitive measure of node out-degree is used, and for node activity, the node's past interactions are taken into consideration. For studying influence spread, two activity-based diffusion models, namely the Activity-based Independent Cascade model and the Activity-based Linear Threshold model, are proposed, in which influence propagation is driven by the activity a node has actually performed in the past. Activity-based models aim to study influence spread by incorporating a more realistic aspect of user behavior. Motivated by the belief that activity is as important as connectivity, the UAC-Rank algorithm is proposed for the identification of initial adopters.

Journal ArticleDOI
TL;DR: In this paper, the authors presented a modeling approach for characterizing social interaction networks by jointly inferring user communities and interests based on social media interactions, and observed the interaction topics and communities related to two big events within Purdue University community, namely Purdue Day of Giving and Senator Bernie Sanders' visit to Purdue University as part of Indiana Primary Election 2016.
Abstract: Online social media have become an integral part of our social beings. Analyzing conversations on social media platforms can lead to complex probabilistic models for understanding social interaction networks. In this paper, we present a modeling approach for characterizing social interaction networks by jointly inferring user communities and interests based on social media interactions. We present several pattern inference models: (1) the interest pattern model (IPM) captures population-level interaction topics, (2) the user interest pattern model (UIPM) captures user-specific interaction topics, and (3) the community interest pattern model (CIPM) captures both community structures and user interests. We test our methods on Twitter data collected from the Purdue University community. From our model results, we observe the interaction topics and communities related to two big events within the Purdue University community, namely Purdue Day of Giving and Senator Bernie Sanders' visit to Purdue University as part of the Indiana Primary Election 2016. Constructing social interaction networks based on user interactions accounts for the similarity of users' interactions on various topics of interest and indicates their community belonging beyond mere connectivity. We observed that the degree distributions of such networks follow a power law, indicating that a few nodes in the network have high levels of interaction while many other nodes interact less. We also discuss the application of such networks as a useful tool for effectively disseminating specific information to a target audience when planning large-scale events, and demonstrate how to single out specific nodes in a given community by running network algorithms.

Journal ArticleDOI
TL;DR: A Web-based kernel function for measuring the semantic relatedness between concepts, used to disambiguate an expression against multiple concepts, is reintroduced and evaluated using an Arabic short text categorization system.
Abstract: Recently, short text messages, tweets, comments and so on have come to form a large portion of online text data. They are limited in length and differ from traditional documents in their shortness and sparseness. As a result, short text tends to be ambiguous, and the degree of ambiguity is not the same for all languages; since Arabic is a highly inflectional language, where a single word can have multiple meanings, short text representation plays a vital role in any text mining task. To address these issues, we propose an efficient representation for short text based on concepts instead of terms, using BabelNet as external knowledge. However, in the conceptualization process, multiple matches are detected when searching for the concepts corresponding to a polysemous term. Therefore, assigning a term to a concept is a crucial step, and we believe that short text similarity can be useful to overcome the problem of mapping a term to its corresponding concept. In this paper, we reintroduce a Web-based kernel function for measuring the semantic relatedness between concepts to disambiguate an expression against multiple concepts. The proposed method has been evaluated using an Arabic short text categorization system, and the obtained results illustrate the value of our contribution.

Journal ArticleDOI
TL;DR: The proposed lightweight crisis management framework integrates natural language processing and clustering techniques in order to produce a ranking of tweets relevant to a crisis situation based on their informativeness.
Abstract: Obtaining relevant timely information during crisis events is a challenging task that can be fundamental to handle the consequences deriving from both unexpected events (e.g., terrorist attacks) and partially predictable ones (i.e., natural disasters). Even though microblogging-based online social networks (e.g., Twitter) have become an attractive data source in these emergency situations, overcoming the information overload deriving from mass events is not trivial. The aim of this work was to enable unsupervised extraction of relevant information from Twitter data during a crisis event, offering a lightweight alternative to learning-based approaches. The proposed lightweight crisis management framework integrates natural language processing and clustering techniques in order to produce a ranking of tweets relevant to a crisis situation based on their informativeness. Experiments carried out on six Twitter collections in two languages (English and French) proved the significance and the flexibility of our approach.
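
A lightweight sketch of the unsupervised idea: cluster tweets on TF-IDF vectors and rank each tweet by closeness to its cluster centroid as a crude informativeness proxy. The paper's NLP pipeline and scoring are more elaborate; the tweets below are invented.

```python
# TF-IDF + clustering sketch for ranking crisis tweets (illustrative proxy).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "explosion reported downtown, several streets closed",
    "roads closed downtown after the explosion",
    "thoughts and prayers to everyone",
    "avoid the city center, emergency services on site",
]
X = TfidfVectorizer().fit_transform(tweets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Rank tweets by similarity to their nearest cluster centroid.
scores = cosine_similarity(X, km.cluster_centers_).max(axis=1)
for score, tweet in sorted(zip(scores, tweets), reverse=True):
    print(f"{score:.2f} {tweet}")
```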

Journal ArticleDOI
TL;DR: This paper presents a principled automated approach to distinguish different shades of untruth while assessing and classifying news articles and claims, based on a hierarchy of five kinds of fakeness, and systematically explores a variety of signals from social media.
Abstract: Fake news, doubtful statements and other unreliable content not only differ with regard to the level of misinformation but also with respect to the underlying intents. Prior work on algorithmic truth assessment has mostly pursued binary classifiers—factual versus fake—and disregarded these finer shades of untruth. In manual analyses of questionable content, in contrast, more fine-grained distinctions have been proposed, such as distinguishing between hoaxes, irony and propaganda or the six-way truthfulness ratings by the PolitiFact community. In this paper, we present a principled automated approach to distinguish these different cases while assessing and classifying news articles and claims. Our method is based on a hierarchy of five different kinds of fakeness and systematically explores a variety of signals from social media, capturing both the content and language of posts and the sharing and dissemination among users. The paper provides experimental results on the performance of our fine-grained classifier and a detailed analysis of the underlying features.

Journal ArticleDOI
TL;DR: The results indicate that complex networks from various domains have distinct structural properties that allow us to predict with high accuracy the category of a new, previously unseen network, and that synthetic graphs are trivial to classify, as the classification model can predict with near-certainty the graph model used to generate them.
Abstract: Complex networks arise in many domains and often represent phenomena such as brain activity, social relationships, molecular interactions, hyperlinks, and re-tweets. In this work, we study the problem of predicting the category (domain) of arbitrary networks. This includes complex networks from different domains as well as synthetically generated graphs from six different network models. We formulate this problem as a multiclass classification problem and learn a model to predict the domain of a new, previously unseen network using only a small set of simple structural features. The model is able to accurately predict the domain of arbitrary networks from 17 different domains with 95.7% accuracy. This work makes two important findings. First, our results indicate that complex networks from various domains have distinct structural properties that allow us to predict with high accuracy the category of a new, previously unseen network. Second, synthetic graphs are trivial to classify, as the classification model can predict with near-certainty the graph model used to generate them. Overall, the results demonstrate that networks drawn from different domains and graph models are distinguishable using a few simple structural features.
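
A sketch of the setup: a handful of simple structural features per graph, fed to an off-the-shelf classifier. Two synthetic graph models stand in for the paper's 17-domain corpus and six network models; features and sizes are illustrative.

```python
# Graph-domain classification sketch from simple structural features.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def graph_features(G):
    """A few simple whole-graph structural features."""
    degrees = [d for _, d in G.degree()]
    return [nx.density(G), nx.average_clustering(G),
            float(np.mean(degrees)), float(np.max(degrees))]

graphs, labels = [], []
for seed in range(20):  # 20 graphs per synthetic model
    graphs.append(nx.erdos_renyi_graph(60, 0.08, seed=seed))
    labels.append("ER")
    graphs.append(nx.barabasi_albert_graph(60, 3, seed=seed))
    labels.append("BA")

X = [graph_features(G) for G in graphs]
clf = RandomForestClassifier(random_state=0).fit(X, labels)

unseen = nx.barabasi_albert_graph(60, 3, seed=99)
print(clf.predict([graph_features(unseen)]))  # expected: ['BA']
```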

Journal ArticleDOI
TL;DR: This paper proposes a new ranking algorithm, named Suspiciousness Rank Back and Forth (SRBF), that, given one of the networks in the ICIJ Offshore Leaks Database, leverages the network structure and the blacklist ground truth to assign a degree of suspiciousness to each entity in the network.
Abstract: The ICIJ Offshore Leaks Database represents a large set of relationships between people, companies, and organizations involved in the creation of offshore companies in tax-haven territories, mainly for hiding their assets. These data are organized into four networks of entities and their interactions: Panama Papers, Paradise Papers, Offshore Leaks, and Bahamas Leaks. For instance, the entities involved in the Panama Papers network are people or companies that had dealings with the Panamanian offshore law firm Mossack Fonseca, often for the purpose of laundering money. In this paper, we address the problem of searching the ICIJ Offshore Leaks Database for people and companies that may be involved in illegal acts. We use a collection of international blacklists of sanctioned people and organizations as ground truth for bad entities. We propose a new ranking algorithm, named Suspiciousness Rank Back and Forth (SRBF), that, given one of the networks in the ICIJ Offshore Leaks Database, leverages the network structure and the blacklist ground truth to assign a degree of suspiciousness to each entity in the network. We experimentally show that our algorithm outperforms existing techniques for node classification, achieving an area under the ROC curve ranging from 0.69 to 0.85 and an area under the recall curve ranging from 0.70 to 0.84 on three of the four considered networks. Moreover, our algorithm retrieves bad entities earlier in the ranking than its competitors. Further, we show the effectiveness of SRBF in a case study on the Panama Papers network.
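
SRBF itself is not spelled out in this abstract; as a hedged stand-in, the sketch below ranks entities with personalized PageRank seeded on blacklisted nodes, a related propagation-from-ground-truth baseline. The toy graph and node names are invented.

```python
# Personalized PageRank from blacklist seeds as a suspiciousness-ranking baseline.
import networkx as nx

G = nx.Graph([("officer1", "shellA"), ("shellA", "bankX"),
              ("officer2", "shellB"), ("shellB", "bankX")])
blacklist = {"officer1"}  # ground-truth "bad" entities

# Restart mass concentrated on blacklisted nodes; scores flow to their neighbors.
personalization = {n: (1.0 if n in blacklist else 0.0) for n in G}
suspiciousness = nx.pagerank(G, alpha=0.85, personalization=personalization)
for node, score in sorted(suspiciousness.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f} {node}")
```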

Journal ArticleDOI
TL;DR: The paper describes the ParSoDA library and presents two social data analysis applications to assess its usability and scalability, and compares the programming effort required for coding a social media application using versus not using the Par soDA library.
Abstract: Software systems for social data mining provide algorithms and tools for extracting useful knowledge from user-generated social media data. ParSoDA (Parallel Social Data Analytics) is a high-level library for developing parallel data mining applications based on the extraction of useful knowledge from large data set gathered from social media. The library aims at reducing the programming skills needed for implementing scalable social data analysis applications. To reach this goal, ParSoDA defines a general structure for a social data analysis application that includes a number of configurable steps and provides a predefined (but extensible) set of functions that can be used for each step. User applications based on the ParSoDA library can be run on both Apache Hadoop and Spark clusters. The paper describes the ParSoDA library and presents two social data analysis applications to assess its usability and scalability. Concerning usability, we compare the programming effort required for coding a social media application using versus not using the ParSoDA library. The comparison shows that ParSoDA leads to a drastic reduction (i.e., about 65%) of lines of code, since the programmer only has to implement the application logic without worrying about configuring the environment and related classes. About scalability, using a cluster with 300 cores and 1.2 TB of RAM, ParSoDA is able to reduce the execution time of such applications up to 85%, compared to a cluster with 25 cores and 100 GB of RAM.