
Showing papers on "Microblogging" published in 2018


Journal ArticleDOI
TL;DR: This paper proposes an approach to detect hate expressions on Twitter based on unigrams and patterns that are automatically collected from the training set and used, among others, as features to train a machine learning algorithm.
Abstract: With the rapid growth of social networks and microblogging websites, communication between people from different cultural and psychological backgrounds has become more direct, resulting in more and more “cyber” conflicts between these people. Consequently, hate speech is used more and more, to the point where it has become a serious problem invading these open spaces. Hate speech refers to the use of aggressive, violent or offensive language, targeting a specific group of people sharing a common property, whether this property is their gender (i.e., sexism), their ethnic group or race (i.e., racism) or their beliefs and religion. While most online social networks and microblogging websites forbid the use of hate speech, the size of these networks and websites makes it almost impossible to control all of their content. It is therefore necessary to detect such speech automatically and to filter any content that contains hateful language or language inciting hatred. In this paper, we propose an approach to detect hate expressions on Twitter. Our approach is based on unigrams and patterns that are automatically collected from the training set. These patterns and unigrams are later used, among others, as features to train a machine learning algorithm. Our experiments on a test set composed of 2010 tweets show that our approach reaches an accuracy of 87.4% on detecting whether a tweet is offensive or not (binary classification), and an accuracy of 78.4% on detecting whether a tweet is hateful, offensive, or clean (ternary classification).

251 citations
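
The abstract above describes collecting unigrams from the training set and using them as classifier features. A minimal sketch of that idea follows, under stated assumptions: the function names `collect_unigrams` and `featurize`, the thresholds `min_count` and `min_ratio`, and the toy training data are all hypothetical and are not taken from the paper, which also uses patterns and other features not shown here.

```python
from collections import Counter


def collect_unigrams(train, min_count=2, min_ratio=2.0):
    """Return words that occur much more often in one class than the other.

    `train` is a list of (text, label) pairs with labels "offensive" or "clean".
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    counts = {"offensive": Counter(), "clean": Counter()}
    for text, label in train:
        counts[label].update(text.lower().split())

    features = {}
    for label, other in (("offensive", "clean"), ("clean", "offensive")):
        for word, n in counts[label].items():
            # Add-one smoothing on the other class avoids division by zero.
            if n >= min_count and n / (counts[other][word] + 1) >= min_ratio:
                features[word] = label
    return features


def featurize(text, features):
    """Count how many class-indicative unigrams a tweet contains, per class."""
    vector = {"offensive": 0, "clean": 0}
    for word in text.lower().split():
        if word in features:
            vector[features[word]] += 1
    return vector


# Toy data, invented purely for illustration.
train = [
    ("you are a jerk", "offensive"),
    ("what a jerk move", "offensive"),
    ("have a nice day", "clean"),
    ("nice weather today", "clean"),
]
features = collect_unigrams(train)
print(features)                                  # {'jerk': 'offensive', 'nice': 'clean'}
print(featurize("such a nice jerk", features))   # {'offensive': 1, 'clean': 1}
```

The resulting per-class counts could be fed, together with other features, to any standard classifier; which learner the authors used is not stated in the abstract.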


Journal ArticleDOI
TL;DR: A price-rising-based iterative matching algorithm is proposed to solve the formulated joint peer discovery, power control, and channel selection problem under various quality-of-service requirements and numerical results demonstrate the effectiveness and superiority of the proposed algorithm from the perspectives of weighted sum rate and matching satisfaction gains.
Abstract: By analogy with Internet of things, Internet of vehicles (IoV) that enables ubiquitous information exchange and content sharing among vehicles with little or no human intervention is a key enabler for the intelligent transportation industry. In this paper, we study how to combine both the physical and social layer information for realizing rapid content dissemination in device-to-device vehicle-to-vehicle (D2D-V2V)-based IoV networks. In the physical layer, headway distance of vehicles is modeled as a Wiener process, and the connection probability of D2D-V2V links is estimated by employing the Kolmogorov equation. In the social layer, the social relationship tightness that represents content selection similarities is obtained by Bayesian nonparametric learning based on real-world social big data, which are collected from the largest Chinese microblogging service Sina Weibo and the largest Chinese video-sharing site Youku. Then, a price-rising-based iterative matching algorithm is proposed to solve the formulated joint peer discovery, power control, and channel selection problem under various quality-of-service requirements. Finally, numerical results demonstrate the effectiveness and superiority of the proposed algorithm from the perspectives of weighted sum rate and matching satisfaction gains.

181 citations


Journal ArticleDOI
TL;DR: A survey of a wide variety of event detection methods applied to streaming Twitter data, classifying them according to shared common traits, and then discusses different aspects of the subtasks and challenges involved in event detection.
Abstract: The proliferation of social networking services has resulted in a rapid growth of their user base, spanning across the world. The collective information generated from these online platforms is ove...

152 citations


Journal ArticleDOI
TL;DR: It is suggested that there is a need for an accurate and tested tool for sentiment analysis of tweets trained using a health care setting–specific corpus of manually annotated tweets first.
Abstract: Background: Twitter is a microblogging service where users can send and read short 140-character messages called “tweets.” There are several unstructured, free-text tweets relating to health care being shared on Twitter, which is becoming a popular area for health care research. Sentiment is a metric commonly used to investigate the positive or negative opinion within these messages. Exploring the methods used for sentiment analysis in Twitter health care research may allow us to better understand the options available for future research in this growing field. Objective: The first objective of this study was to understand which tools would be available for sentiment analysis of Twitter health care research, by reviewing existing studies in this area and the methods they used. The second objective was to determine which method would work best in the health care settings, by analyzing how the methods were used to answer specific health care questions, their production, and how their accuracy was analyzed. Methods: A review of the literature was conducted pertaining to Twitter and health care research, which used a quantitative method of sentiment analysis for the free-text messages (tweets). The study compared the types of tools used in each case and examined methods for tool production, tool training, and analysis of accuracy. Results: A total of 12 papers studying the quantitative measurement of sentiment in the health care setting were found. More than half of these studies produced tools specifically for their research, 4 used open source tools available freely, and 2 used commercially available software. Moreover, 4 out of the 12 tools were trained using a smaller sample of the study’s final data. The sentiment method was trained against, on average, 0.45% (2816/627,024) of the total sample data. One of the 12 papers commented on the analysis of accuracy of the tool used. Conclusions: Multiple methods are used for sentiment analysis of tweets in the health care setting. These range from self-produced basic categorizations to more complex and expensive commercial software. The open source and commercial methods are developed on product reviews and generic social media messages. None of these methods have been extensively tested against a corpus of health care messages to check their accuracy. This study suggests that there is a need for an accurate and tested tool for sentiment analysis of tweets trained using a health care setting–specific corpus of manually annotated tweets first.

113 citations


Journal ArticleDOI
TL;DR: This paper presents a hybrid approach for detecting automated spammers by amalgamating community-based features with other feature categories, namely metadata-, content-, and interaction-based features, and the discrimination power of different feature categories is analyzed.
Abstract: Twitter is one of the most popular microblogging services, which is generally used to share news and updates through short messages restricted to 280 characters. However, its open nature and large user base are frequently exploited by automated spammers, content polluters, and other ill-intended users to commit various cybercrimes, such as cyberbullying, trolling, rumor dissemination, and stalking. Accordingly, a number of approaches have been proposed by researchers to address these problems. However, most of these approaches are based on user characterization and completely disregard mutual interactions. In this paper, we present a hybrid approach for detecting automated spammers by amalgamating community-based features with other feature categories, namely metadata-, content-, and interaction-based features. The novelty of the proposed approach lies in the characterization of users based on their interactions with their followers, given that a user can evade features that are related to his/her own activities, but evading those based on the followers is difficult. Nineteen different features, including six newly defined features and two redefined features, are identified for learning three classifiers, namely, random forest, decision tree, and Bayesian network, on a real dataset that comprises benign users and spammers. The discrimination power of different feature categories is also analyzed, and interaction- and community-based features are determined to be the most effective for spam detection, whereas metadata-based features are proven to be the least effective.

96 citations


Journal ArticleDOI
TL;DR: The findings revealed that the proposed method overcomes the limitations of previous methods by considering slang, emoticons, and domain‐specific terms.
Abstract: Of the many social media sites available, users prefer microblogging services such as Twitter to learn about product services, social events, and political trends. Twitter is considered an important source of information in sentiment analysis applications. Supervised and unsupervised machine learning-based techniques for Twitter data analysis have been investigated in the last few years, often resulting in an incorrect classification of sentiments. In this paper, we focus on these issues and present a unified framework for classifying tweets using a hybrid classification scheme. The proposed method aims at improving the performance of Twitter-based sentiment analysis systems by incorporating 4 classifiers: (a) a slang classifier, (b) an emoticon classifier, (c) the SentiWordNet classifier, and (d) an improved domain-specific classifier. After applying the preprocessing steps, the input text is passed through the emoticon and slang classifiers. In the next stage, SentiWordNet-based and domain-specific classifiers are applied to classify the text more accurately. Finally, sentiment classification is performed at sentence and document levels. The findings revealed that the proposed method overcomes the limitations of previous methods by considering slang, emoticons, and domain-specific terms.

90 citations
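
The staged scheme in the abstract above (emoticon and slang classifiers first, then SentiWordNet-based and domain-specific classifiers) can be sketched as a cascade of lookups. This is only an illustration under stated assumptions: the four tiny dictionaries are invented stand-ins (the `LEXICON` table is a placeholder, not SentiWordNet), and summing token scores is an assumed way of combining the stages that the abstract does not confirm.

```python
# Hypothetical stand-in resources; real systems would load much larger tables.
EMOTICONS = {":)": 1, ":-)": 1, ":(": -1, ":-(": -1}
SLANG = {"gr8": 1, "luv": 1, "meh": -1, "smh": -1}
LEXICON = {"good": 1, "love": 1, "bad": -1, "poor": -1}  # placeholder for SentiWordNet
DOMAIN = {"crisp": 1, "laggy": -1}                        # placeholder domain terms

# Earlier stages take priority, mirroring the order given in the abstract.
STAGES = (EMOTICONS, SLANG, LEXICON, DOMAIN)


def score_token(token):
    """Return the polarity of the first stage that recognizes the token."""
    for table in STAGES:
        if token in table:
            return table[token]
    return 0


def classify(text):
    """Label a whitespace-tokenized text by the sign of its summed token scores."""
    total = sum(score_token(token) for token in text.lower().split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(classify("gr8 screen :)"))        # positive
print(classify("laggy and bad smh"))    # negative
print(classify("it arrived today"))     # neutral
```

Tokens that none of the general resources recognize (here "laggy") are still scored by the domain-specific table, which is the gap the abstract says slang, emoticon, and domain handling are meant to close.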


28 Mar 2018
TL;DR: This paper proposes to use the linked users across social networking sites and e-commerce websites as a bridge to map users’ social networking features to another feature representation for product recommendation, and develops a feature-based matrix factorization approach which can leverage the learnt user embeddings for cold-start product recommendation.
Abstract: In recent years, the boundaries between e-commerce and social networking have become increasingly blurred. Many e-commerce websites support the mechanism of social login where users can sign on the websites using their social network identities such as their Facebook or Twitter accounts. Users can also post their newly purchased products on microblogs with links to the e-commerce product web pages. In this paper we propose a novel solution for cross-site cold-start product recommendation which aims to recommend products from e-commerce websites to users at social networking sites in “cold-start” situations, a problem which has rarely been explored before. A major challenge is how to leverage knowledge extracted from social networking sites for cross-site cold-start product recommendation. We propose to use the linked users across social networking sites and e-commerce websites (users who have social networking accounts and have made purchases on e-commerce websites) as a bridge to map users’ social networking features to another feature representation for product recommendation. Specifically, we propose learning both users’ and products’ feature representations (called user embeddings and product embeddings, respectively) from data collected from e-commerce websites using recurrent neural networks and then applying a modified gradient boosting trees method to transform users’ social networking features into user embeddings. We then develop a feature-based matrix factorization approach which can leverage the learnt user embeddings for cold-start product recommendation. Experimental results on a large dataset constructed from the largest Chinese microblogging service SINA WEIBO and the largest Chinese B2C e-commerce website JINGDONG have shown the effectiveness of our proposed framework.

83 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: This research is merging Support Vector Machine with Decision Tree and experimental results prove that the proposed approach is providing better classification results in terms of f-measure and accuracy in contrast to individual classifiers.
Abstract: Microblogging websites like Twitter and Facebook, in this new era, are loaded with opinions and data. One of the most widely used microblogging sites, Twitter, is where people share their ideas in the form of tweets, and it has therefore become one of the best sources for sentiment analysis. Opinions can be broadly grouped into three categories (good for positive, bad for negative, and neutral), and the process of analyzing differences of opinion and grouping them into these categories is known as sentiment analysis. Data mining is used to uncover relevant information from web pages, especially from social networking sites. By merging data mining with other fields like text mining, NLP and computational intelligence, we are able to classify tweets as good, bad or neutral. The main emphasis of this research is on classifying the emotions of tweet data gathered from Twitter. In the past, researchers used existing machine learning techniques for sentiment analysis, but those techniques did not provide satisfactory sentiment classification results. In order to improve classification results in the domain of sentiment analysis, we use ensemble machine learning techniques to increase the efficiency and reliability of the proposed approach. To this end, we merge Support Vector Machine with Decision Tree, and experimental results prove that our proposed approach provides better classification results in terms of F-measure and accuracy than the individual classifiers.

76 citations


Journal ArticleDOI
TL;DR: The GeoCorpora corpus building framework and software tools as well as a geo-annotated Twitter corpus built with these tools are presented to foster research and development in the areas of microblog/Twitter geoparsing and geographic information retrieval.
Abstract: In this article, we present the GeoCorpora corpus building framework and software tools as well as a geo-annotated Twitter corpus built with these tools to foster research and development in the areas of microblog/Twitter geoparsing and geographic information retrieval. The developed framework employs crowdsourcing and geovisual analytics to support the construction of large corpora of text in which the mentioned location entities are identified and geolocated to toponyms in existing geographical gazetteers. We describe how the approach has been applied to build a corpus of geo-annotated tweets that will be made freely available to the research community alongside this article to support the evaluation, comparison and training of geoparsers. Additionally, we report lessons learned related to corpus construction for geoparsing as well as insights about the notions of place and natural spatial language that we derive from application of the framework to building this corpus.

75 citations


Journal ArticleDOI
TL;DR: This work develops a novel classification-summarization framework which handles tweets in both English and Hindi, and is the first attempt to extract situational information from non-English tweets.
Abstract: Microblogging sites like Twitter have become important sources of real-time information during disaster events. A large amount of valuable situational information is posted in these sites during disasters; however, the information is dispersed among hundreds of thousands of tweets containing sentiments and opinions of the masses. To effectively utilize microblogging sites during disaster events, it is necessary to not only extract the situational information from the large amounts of sentiments and opinions, but also to summarize the large amounts of situational information posted in real-time. During disasters in countries like India, a sizable number of tweets are posted in local resource-poor languages besides the normal English-language tweets. For instance, in the Indian subcontinent, a large number of tweets are posted in Hindi/Devanagari (the national language of India), and some of the information contained in such non-English tweets is not available (or available at a later point of time) through English tweets. In this work, we develop a novel classification-summarization framework which handles tweets in both English and Hindi—we first extract tweets containing situational information, and then summarize this information. Our proposed methodology is developed based on the understanding of how several concepts evolve in Twitter during disaster. This understanding helps us achieve superior performance compared to the state-of-the-art tweet classifiers and summarization approaches on English tweets. Additionally, to our knowledge, this is the first attempt to extract situational information from non-English tweets.

66 citations


Journal ArticleDOI
TL;DR: A novel rumour detection system that leverages newly designed features, including influence potential and network characteristics measures, is presented; it is able to correctly detect about 90% of rumours, with acceptable levels of precision.
Abstract: In recent years, social networks have emerged as a critical means of spreading information, bringing several advantages. At the same time, unverified and instrumentally relevant statements in circulation, known as rumours, are becoming a potential threat to society. For this reason, although identifying which topics in social microblogs are rumours has been studied in several works, there is a need to detect whether an individual post is a rumour or not. In this paper we address this last challenge by presenting a novel rumour detection system that leverages newly designed features, including influence potential and network characteristics measures. We tested our approach on a real dataset composed of health-related posts collected from the Twitter microblog. We observe promising results, as the system is able to correctly detect about 90% of rumours, with acceptable levels of precision.

Journal ArticleDOI
TL;DR: A probabilistic topic modeling-based computational text analysis framework is introduced to answer three questions about CSR-related conversation in the Twitter-sphere: which topics are prevalent, how those topics are interrelated, and how they have changed over time.
Abstract: Corporate social responsibility (CSR) is an essential business practice in industry and a popular topic in academic research. Several studies have attempted to understand topics or categories in CSR contexts and some have used qualitative techniques to analyze data from traditional communication channels such as corporate reports, newspapers, and websites. This study adopts computational content analysis for understanding themes or topics from CSR-related conversations in the Twitter-sphere, the largest microblogging social media platform. Specifically, a probabilistic topic modeling-based computational text analysis framework is introduced to answer three questions: (1) What CSR-related topics are being communicated in the Twitter-sphere and what are the prevalent topics or themes in CSR conversation? (topic prevalence); (2) How are those topics interrelated? (topic correlation); (3) How have those topics changed over time? (topic evolution). The topic modeling results are discussed, and the direction for future research is presented.

Journal ArticleDOI
TL;DR: Wang et al. study the information spread on Weibo from three main perspectives: individual characteristics, the types of social relationships between interactive participants, and the topology of real interaction networks.
Abstract: Social media analytics has yielded new quantitative insights into human activity patterns. Many applications of social media analytics, from pandemic prediction to earthquake response, require an in-depth understanding of how these patterns change when humans encounter unfamiliar conditions. In this paper, we select two earthquakes in China as the social context in Sina-Weibo (or Weibo for short), the largest Chinese microblog site. After proposing a formalized Weibo information flow model to represent the information spread on Weibo, we study the information spread from three main perspectives: individual characteristics, the types of social relationships between interactive participants, and the topology of real interaction networks. The quantitative analyses draw the following conclusions. First, the shadow of Dunbar’s number is evident in the “declared friends/followers” distributions, and the number of each participant’s friends/followers who also participated in the earthquake information dissemination shows a typical power-law distribution, indicating a rich-gets-richer phenomenon. Second, an individual’s number of followers is the most critical factor in user influence. Strangers are very important forces for disseminating real-time news after an earthquake. Third, two types of real interaction networks share the scale-free and small-world property, but with a looser organizational structure. In addition, correlations between different influence groups indicate that when compared with other online social media, the discussion on Weibo is mainly dominated and influenced by verified users.

Journal ArticleDOI
TL;DR: The approach extends the classical sentiment analysis methods, which only consider text content, by adding a novel PageRank-based influential user finding algorithm and shows that the proposed sentiment analysis method is more effective in finding a topic-based microblogging community’s sentiment polarity.
Abstract: Nowadays, social microblogging services have become a popular platform for expressing what people think. People use these platforms to produce content on different topics, from finance, politics and sports to sociological fields, in real time. With the proliferation of social microblogging sites, a massive amount of opinion text has become available in digital form, thus enabling research on sentiment analysis to both deepen and broaden in different sociological fields. Previous sentiment analysis research on microblogging services generally focused on text as the unique source of information, and did not consider the social microblogging service network information. Inspired by social network analysis research and sentiment analysis studies, we find that people’s trust in a community has an important place in determining the community’s sentiment polarity about a topic. When studies in the literature are examined, it is seen that trusted users in a community are actually influential users. Hence, we propose a novel sentiment analysis approach that takes into account the social network information as well. We concentrate on the effect of influential users on the sentiment polarity of a topic-based microblogging community. Our approach extends the classical sentiment analysis methods, which only consider text content, by adding a novel PageRank-based influential user finding algorithm. We have carried out a comprehensive empirical study of two real-world Twitter datasets to analyze the correlation between the mood of the financial social community and the behavior of the stock exchange of Turkey, namely BIST100, using the Pearson correlation coefficient method. Experimental results validate our assumptions and show that the proposed sentiment analysis method is more effective in finding a topic-based microblogging community’s sentiment polarity.
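
To illustrate how user influence can change a community's measured polarity, the sketch below weights each user's sentiment by a PageRank score. It uses the classic PageRank power iteration, not the paper's novel variant, which the abstract does not specify; the follower graph, the polarity values, and the names `pagerank`, `follows`, and `polarity` are all hypothetical.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Classic PageRank by power iteration.

    `follows` maps each user to the list of users they follow, so an edge
    u -> v passes part of u's rank to v.
    """
    users = sorted(set(follows) | {v for vs in follows.values() for v in vs})
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # A user who follows nobody spreads their rank evenly.
                for v in users:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


# Toy community, invented for illustration: two users follow "cat".
follows = {"ann": ["cat"], "bob": ["cat"], "cat": []}
polarity = {"ann": -1.0, "bob": -1.0, "cat": 1.0}

rank = pagerank(follows)
plain_mean = sum(polarity.values()) / len(polarity)
# Ranks sum to 1, so this weighted sum is already a weighted mean.
weighted_mean = sum(rank[u] * polarity[u] for u in polarity)

print(f"plain mean polarity:    {plain_mean:+.3f}")
print(f"rank-weighted polarity: {weighted_mean:+.3f}")
```

In this toy graph the unweighted mean is negative, while the rank-weighted value is positive because the most-followed user holds the opposite opinion. That is the kind of effect a text-only method would miss.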

Journal ArticleDOI
25 May 2018 - PLOS ONE
TL;DR: It is found that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders.
Abstract: As global political preeminence gradually shifted from the United Kingdom to the United States, so did the capacity to culturally influence the rest of the world. In this work, we analyze how the world-wide varieties of written English are evolving. We study both the spatial and temporal variations of vocabulary and spelling of English using a large corpus of geolocated tweets and the Google Books datasets corresponding to books published in the US and the UK. The advantage of our approach is that we can address both standard written language (Google Books) and the more colloquial forms of microblogging messages (Twitter). We find that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders. Finally, we analyze how this trend has evolved over time and the impact that some cultural events have had in shaping it.

Proceedings Article
15 Jun 2018
TL;DR: A malicious practice perpetrated by coordinated groups of bots, likely aimed at promoting low-value stocks by exploiting the popularity of high-value ones, is uncovered, and the results call for the adoption of spam and bot detection techniques in all studies and applications that exploit user-generated content for predicting the stock market.
Abstract: Microblogs are increasingly exploited for predicting prices and traded volumes of stocks in financial markets. However, it has been demonstrated that much of the content shared in microblogging platforms is created and publicized by bots and spammers. Yet, the presence (or lack thereof) and the impact of fake stock microblogs has never systematically been investigated before. Here, we study 9M tweets related to stocks of the 5 main financial markets in the US. By comparing tweets with financial data from Google Finance, we highlight important characteristics of Twitter stock microblogs. More importantly, we uncover a malicious practice perpetrated by coordinated groups of bots and likely aimed at promoting low-value stocks by exploiting the popularity of high-value ones. Our results call for the adoption of spam and bot detection techniques in all studies and applications that exploit user-generated content for predicting the stock market.

Journal ArticleDOI
TL;DR: The results highlight that both large EROs and individual digital volunteers proactively used Twitter to disseminate and distribute fire related information, and it is found that the contents of tweets were more informative than directive.
Abstract: Social media plays a significant role in the rapid propagation of information when disasters occur. Among the four phases of the disaster management life cycle (prevention, preparedness, response, and recovery), this paper focuses on the use of social media during the response phase. It empirically examines the use of microblogging platforms by Emergency Response Organisations (EROs) during extreme natural events, and distinguishes the use of Twitter by EROs from that by digital volunteers during a fire hazard that occurred in the Australian state of Victoria in early February 2014. We analysed 7982 tweets on this event. While theories such as World System Theory and Institutional Theory traditionally focus on the role of powerful institutional information outlets, we found that platforms like Twitter challenge this notion by sharing the power between large institutional (e.g. EROs) and smaller non-institutional players (e.g. digital volunteers) in the dissemination of disaster information. Our results highlight that both large EROs and individual digital volunteers proactively used Twitter to disseminate and distribute fire related information. We also found that the contents of tweets were more informative than directive, and that while the total number of messages posted by top EROs was higher than the non-institutional ones, non-institutions presented a greater number of retweets.

Journal ArticleDOI
TL;DR: A supervised white-box microblogging SA system to analyse user reviews on certain products using rough set theory (RST)-based rule induction algorithms and results show the proposed method, when compared with baseline methods, is excellent, with regard to accuracy, coverage and the number of rules employed.
Abstract: The rapid evolution of microblogging and the emergence of sites such as Twitter have propelled online communities to flourish by enabling people to create, share and disseminate free-flowing messages and information globally. The exponential growth of product-based user reviews has become an ever-increasing resource playing a key role in emerging Twitter-based sentiment analysis (SA) techniques and applications to collect and analyse customer trends and reviews. Existing studies on supervised black-box sentiment analysis systems do not provide adequate information regarding the rules by which a certain review was assigned to a class. The accuracy in some ways is less than our personal judgement. To address these shortcomings, alternative approaches, such as supervised white-box classification algorithms, need to be developed to improve the classification of Twitter-based microblogs. The purpose of this study was to develop a supervised white-box microblogging SA system to analyse user reviews on certain products using rough set theory (RST)-based rule induction algorithms. RST classifies microblogging reviews of products into positive, negative, or neutral classes using different rules extracted from training decision tables using RST-centric rule induction algorithms. The primary focus of this study is also to perform sentiment classification of microblogs (also known as tweets) of product reviews using conventional and RST-based rule induction algorithms. The RST-centric rule induction algorithms used are Learning from Examples Module, version 2 (LEM2), and LEM2 + Corpus-based rules (LEM2 + CBR), an extension of the traditional LEM2 algorithm. Corpus-based rules are generated from tweets that the conventional LEM2 rules leave unclassified. Experimental results show the proposed method, when compared with baseline methods, is excellent, with regard to accuracy, coverage and the number of rules employed. The approach using this method achieves an average accuracy of 92.57% and an average coverage of 100%, with an average number of rules of 19.14.

Journal ArticleDOI
TL;DR: First, HRMF expands short text into long text, and then it simultaneously models multi-features of microblogs by designing a new topic model, which realizes hashtag recommendation by calculating the recommended score of each hashtag based on the generated topical representations of multi-features.
Abstract: Hashtag recommendation for microblogs is a very hot research topic that is useful to many applications involving microblogs. However, because the short text of microblogs and the low utilization rate of hashtags lead to the data sparsity problem, it is difficult for typical hashtag recommendation methods to achieve accurate recommendation. In light of this, we propose HRMF, a hashtag recommendation method based on multi-features of microblogs. First, HRMF expands short text into long text, and then it simultaneously models multi-features (i.e., user, hashtag, text) of microblogs by designing a new topic model. To further alleviate the data sparsity problem, HRMF exploits hashtags of both similar users and similar microblogs as the candidate hashtags. In particular, to find similar users, HRMF combines the designed topic model with a typical user-based collaborative filtering method. Finally, we realize hashtag recommendation by calculating the recommended score of each hashtag based on the generated topical representations of multi-features. Experimental results on a real-world dataset crawled from Sina Weibo demonstrate the effectiveness of HRMF for hashtag recommendation.

Journal ArticleDOI
TL;DR: Instagram shows potential as a source of public health information, though limitations in data collection and metadata availability may limit its use in comparison to platforms like Twitter.
Abstract: Background: Social media provides a complementary source of information for public health surveillance. The dominant data source for this type of monitoring is the microblogging platform Twitter, which is convenient due to the free availability of public data. Less is known about the utility of other social media platforms, despite their popularity. Objective: This work aims to characterize the health topics that are prominently discussed on the image-sharing platform Instagram, as a step toward understanding how this data might be used for public health research. Methods: The study uses a topic modeling approach to discover topics in a dataset of 96,426 Instagram posts containing hashtags related to health. We use a polylingual topic model, initially developed for datasets in different natural languages, to model different modalities of data: hashtags, caption words, and image tags automatically extracted using a computer vision tool. Results: We identified 47 health-related topics in the data (kappa=.77), covering ten broad categories: acute illness, alternative medicine, chronic illness and pain, diet, exercise, health care & medicine, mental health, musculoskeletal health and dermatology, sleep, and substance use. The most prevalent topics were related to diet (8,293/96,426; 8.6% of posts) and exercise (7,328/96,426; 7.6% of posts). Conclusions: A large and diverse set of health topics is discussed on Instagram. The extracted image tags were generally too coarse and noisy to be used for identifying posts but were in some cases accurate for identifying images relevant to studying diet and substance use. Instagram shows potential as a source of public health information, though limitations in data collection and metadata availability may limit its use in comparison to platforms like Twitter.

Journal ArticleDOI
TL;DR: Text mining and sentiment analysis of tweets about ISIS indicated that people used almost the same words when tweeting about ISIS, and that most users view ISIS as a source of threat and fear regardless of where they are from.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A neural keyphrase extraction framework for microblog posts that takes their conversation context into account is presented, where four types of neural encoders, namely, averaged embedding, RNN, attention, and memory networks, are proposed to represent the conversation context.
Abstract: Existing keyphrase extraction methods suffer from the data sparsity problem when they are applied to short and informal texts, especially microblog messages. Enriching context is one way to alleviate this problem. Considering that conversations are formed by reposting and replying messages, they provide useful clues for recognizing essential content in target posts and are therefore helpful for keyphrase identification. In this paper, we present a neural keyphrase extraction framework for microblog posts that takes their conversation context into account, where four types of neural encoders, namely, averaged embedding, RNN, attention, and memory networks, are proposed to represent the conversation context. Experimental results on Twitter and Weibo datasets show that our framework with such encoders outperforms state-of-the-art approaches.

Journal ArticleDOI
TL;DR: In this paper, a case study of the 2015 Tianjin explosions is used to explore resistance on Weibo and identify three discursive strategies of resistance in Weibo: (a) resisting by quoting cross-platform witness accounts; (b) creating rumors; (c) ridiculing the official discourse through satire.
Abstract: The internet and social media are increasingly seen as one of the major channels for public participation and through which anxiety and discontent about social and political issues are voiced and dealt with in non-democratic countries like China. Weibo, as a platform of this type, has been attracting growing attention in the fields of political communication and media studies in recent years. This study takes a discourse approach to explore resistance on Weibo by drawing upon a case study of the 2015 Tianjin explosions. Based on a discourse analysis of 1322 microblogs immediately following the explosions, this study unravels the details and dynamics of how Weibo users challenge the official discourse and offer an alternative discourse of the disaster online. It identifies three discursive strategies of resistance in Weibo: (a) resisting by quoting cross-platform witness accounts; (b) resisting by creating rumors; (c) resisting by ridiculing the official discourse through satire. These strategies of resistance exemplify how Chinese netizens actively use social media platforms to express sociopolitical arguments in non-democratic contexts, and in turn, reshape the power relations between the state and the public.

Journal ArticleDOI
TL;DR: The authors used a microblogging dictionary to analyze the content of tweets and found that the aggregate tone of Tweets contains significant information not in betting prices, particularly in the immediate aftermath of goals and red cards.
Abstract: Social media is now used as a forecasting tool by a variety of firms and agencies. But how useful are such data in forecasting outcomes? Can social media add any information to that produced by a prediction/betting market? We source 13.8 million posts from Twitter, and combine them with contemporaneous Betfair betting prices, to forecast the outcomes of English Premier League soccer matches as they unfold. Using a microblogging dictionary to analyze the content of Tweets, we find that the aggregate tone of Tweets contains significant information not in betting prices, particularly in the immediate aftermath of goals and red cards. (JEL G14, G17)
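As a rough illustration of what a dictionary-based tone measure can look like, the sketch below scores each tweet by the balance of positive and negative lexicon words and then averages over a batch (for example, all tweets in one time window of a match). The word lists, the function names `tweet_tone` and `aggregate_tone`, and the scoring formula are all assumptions made for this example; the abstract does not specify the dictionary or the aggregation the authors used.

```python
import re

# Hypothetical toy lexicon (an assumption); the paper's microblogging
# dictionary is not reproduced here.
POSITIVE = {"great", "win", "brilliant", "superb"}
NEGATIVE = {"awful", "lose", "poor", "terrible"}


def tweet_tone(text):
    """Return (pos - neg) / (pos + neg) over lexicon words, or 0.0 if none occur."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def aggregate_tone(tweets):
    """Average the per-tweet tone over a batch of tweets; 0.0 for an empty batch."""
    if not tweets:
        return 0.0
    return sum(tweet_tone(tweet) for tweet in tweets) / len(tweets)
```

A tone series built this way could then be compared against betting prices over the same windows; how the authors actually performed that comparison is not described in the abstract.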

Journal ArticleDOI
02 Feb 2018 - PLOS ONE
TL;DR: Different from previous work using direct user relations, this paper introduces structure similarity context into social contexts and proposes a method to measure structure similarity and also introduces topic context to model the semantic relations between microblogs.
Abstract: Analyzing massive user-generated microblogs is crucial in many fields and has attracted the attention of many researchers. However, it is very challenging to process such noisy and short microblogs. Most prior works only use texts to identify sentiment polarity and assume that microblogs are independent and identically distributed, which ignores the fact that microblogs are networked data. Therefore, their performance is not usually satisfactory. Inspired by two sociological theories (sentimental consistency and emotional contagion), in this paper, we propose a new method combining social context and topic context to analyze microblog sentiment. In particular, different from previous work using direct user relations, we introduce structure similarity context into social contexts and propose a method to measure structure similarity. In addition, we also introduce topic context to model the semantic relations between microblogs. Social context and topic context are combined by the Laplacian matrix of the graph built by these contexts, and Laplacian regularization is added into the microblog sentiment analysis model. Experimental results on two real Twitter datasets demonstrate that our proposed model can outperform baseline methods consistently and significantly.
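To make the idea of Laplacian regularization concrete, here is a minimal sketch under stated assumptions. It mixes a social-context graph and a topic-context graph with a hypothetical weight `alpha`, builds the unnormalized Laplacian L = D - W, and evaluates the penalty f^T L f, which equals half the sum over all pairs of W_ij (f_i - f_j)^2 and is therefore small when connected microblogs receive similar sentiment scores. The convex combination, the toy graphs, and all function names are assumptions for illustration; the abstract does not state how the authors combine the two contexts.

```python
def combine_graphs(w_social, w_topic, alpha):
    """Mix two symmetric weight matrices; alpha is a hypothetical mixing weight."""
    n = len(w_social)
    return [
        [alpha * w_social[i][j] + (1 - alpha) * w_topic[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian(w):
    """Unnormalized graph Laplacian L = D - W, where D holds the row sums of W."""
    n = len(w)
    return [
        [(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(scores, lap):
    """Quadratic form f^T L f for a list of per-microblog sentiment scores f."""
    n = len(scores)
    return sum(scores[i] * lap[i][j] * scores[j] for i in range(n) for j in range(n))


# Toy example with three microblogs: 0 and 1 are linked by social context,
# 1 and 2 by topic context.
W_SOCIAL = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
W_TOPIC = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
LAP = laplacian(combine_graphs(W_SOCIAL, W_TOPIC, alpha=0.5))

smooth = laplacian_penalty([1.0, 1.0, 1.0], LAP)   # all linked posts agree
rough = laplacian_penalty([1.0, -1.0, 1.0], LAP)   # linked posts disagree
```

In a full model this penalty would be added, scaled by a regularization coefficient, to the classifier's loss, so that training favors predictions that agree across linked microblogs.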

Journal ArticleDOI
TL;DR: This work proposes a novel generative method incorporating textual and visual information to solve the task of recommending hashtags for multimodal microblogs and demonstrates that the proposed method outperforms state-of-the-art methods using either textual or visual information.

Journal ArticleDOI
TL;DR: The sentiment of messages is positively associated with contemporaneous daily abnormal stock returns, message volume predicts 15-min follow-up returns, trading volume, and volatility, and an explanation for the efficient aggregation of information on microblogging platforms is offered.
Abstract: Scholars and practitioners alike increasingly recognize the importance of stock microblogs as they capture the market discussion and have predictive value for financial markets. This paper examines the extent to which stock microblog messages are related to financial market indicators and the mechanism leading to efficient aggregation of information. In particular, this paper investigates the information content of stock microblogs with respect to individual stocks and explores the effects of social influences on an interday and intraday basis. We collected more than 1.2 million stock-related messages (i.e., tweets) related to S&P 100 companies over a period of 7 months. Using methods from computational linguistics, we went through an elaborate process of message feature reduction, spam detection, language detection, and slang removal, which has led to an increase in classification accuracy for sentiment analysis. We analyzed the data on both a daily and a 15-min basis and found that the sentiment of messages is positively associated with contemporaneous daily abnormal stock returns and that message volume predicts 15-min follow-up returns, trading volume, and volatility. Disagreement in microblog messages positively influences stock features, both in interday and intraday analysis. Notably, if we give a greater share of voice to microblog messages depending on the social influence of microbloggers, this amplifies the relationship between bullishness and abnormal returns, market volume, and volatility. Following knowledgeable investors' advice results in more power in explaining changes in market features. This offers an explanation for the efficient aggregation of information on microblogging platforms. Furthermore, we simulated a set of trading strategies using microblog features and the results suggest that it is possible to exploit market inefficiencies even when transaction costs are included.
To our knowledge, this is the first study to comprehensively examine the association between the information content of stock microblogs and intraday stock market features. The insights from the study permit scholars and professionals to reliably identify stock microblog features, which may serve as valuable proxies for market sentiment and permit individual investors to make better investment decisions.

Journal ArticleDOI
TL;DR: A novel approach called Semi-Supervised Clue Fusion (SSCF) is proposed to conduct effective spammer detection in Sina Weibo and shows that this approach significantly outperforms state-of-the-art baselines.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A statistical model is proposed that jointly captures topics for representing user interests and conversation content, and discourse modes for describing user replying behavior and conversation dynamics that outperforms methods that only model content without considering discourse.
Abstract: Millions of conversations are generated every day on social media platforms. With limited attention, it is challenging for users to select which discussions they would like to participate in. Here we propose a new method for microblog conversation recommendation. While much prior work has focused on post-level recommendation, we exploit both the conversational context, and user content and behavior preferences. We propose a statistical model that jointly captures: (1) topics for representing user interests and conversation content, and (2) discourse modes for describing user replying behavior and conversation dynamics. Experimental results on two Twitter datasets demonstrate that our system outperforms methods that only model content without considering discourse.

Journal ArticleDOI
TL;DR: Concerns remain regarding how to assess the impact of journal social media outreach, abundant but unclear metrics, and the magnitude of benefit (if any), particularly given the substantial work required for substantive interactive engagement.
Abstract: Medical journals increasingly use social media to engage their audiences in a variety of ways, from simply broadcasting content via blogs, microblogs, and podcasts to more interactive methods such as Twitter chats and online journal clubs. Online discussion may increase readership and help improve peer review, for example, by providing postpublication peer review. Challenges remain, including the loss of nuance and context of shared work. Furthermore, uncertainty remains regarding how to assess the impact of journal social media outreach, abundant but unclear metrics, and the magnitude of benefit (if any), particularly given the substantial work required for substantive interactive engagement. Continued involvement and innovation from medical journals through social media offers potential in engaging journal audiences and improving knowledge translation.