
Showing papers in "Information Processing and Management in 2019"


Journal ArticleDOI
TL;DR: This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications – bringing out similarities and differences.
Abstract: With the ever-increasing size of the web, extracting relevant information from the Internet with a query formed by a few keywords has become a big challenge. Query Expansion (QE) plays a crucial role in improving Internet searches. Here, the user’s initial query is reformulated by adding meaningful terms of similar significance. QE – as part of information retrieval (IR) – has long attracted researchers’ attention. It has become very influential in fields such as personalized social documents, question answering, cross-language IR, information filtering and multimedia IR. Research in QE has gained further prominence because of IR-dedicated conferences such as TREC (Text REtrieval Conference) and CLEF (Conference and Labs of the Evaluation Forum). This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications – bringing out similarities and differences.

219 citations


Journal ArticleDOI
TL;DR: Experimental results show that feature vectors in terms of statistical, linguistic and sentiment knowledge, sentiment shifter rules and word-embedding can improve the classification accuracy of sentence-level sentiment analysis, and the neural model yields superior performance improvements in comparison with other well-known approaches in the literature.
Abstract: Sentiment analysis concerns the study of opinions expressed in a text. Given the huge volume of reviews, sentiment analysis plays a basic role in extracting significant information and the overall sentiment orientation of reviews. In this paper, we present a deep-learning-based method to classify a user's opinion expressed in reviews (called RNSA). To the best of our knowledge, a deep-learning-based method using a unified feature set representative of word embedding, sentiment knowledge, sentiment shifter rules, and statistical and linguistic knowledge has not been thoroughly studied for sentiment analysis. The RNSA employs a Recurrent Neural Network (RNN) composed of Long Short-Term Memory (LSTM) units to take advantage of sequential processing and overcome several flaws of traditional methods, in which word order and context information are lost. Furthermore, it uses sentiment knowledge, sentiment shifter rules and multiple strategies to overcome the following drawbacks: words with similar semantic context but opposite sentiment polarity; contextual polarity; sentence types; the word coverage limit of an individual lexicon; and word sense variations. To verify the effectiveness of our work, we conduct sentence-level sentiment classification on large-scale review datasets and obtain encouraging results. Experimental results show that (1) feature vectors in terms of (a) statistical, linguistic and sentiment knowledge, (b) sentiment shifter rules and (c) word embedding can improve the classification accuracy of sentence-level sentiment analysis; (2) our method, which learns from this unified feature set, obtains significantly better performance than one that learns from a feature subset; (3) our neural model yields superior performance improvements in comparison with other well-known approaches in the literature.
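As a rough illustration of the sequential-processing idea behind RNSA (not the authors' actual architecture, which additionally fuses lexicon knowledge, shifter rules and statistical/linguistic features), a minimal Keras sketch of an LSTM sentence-level sentiment classifier could look as follows; the vocabulary size, sequence length and all hyperparameters are placeholder assumptions:

```python
# Minimal sketch: LSTM sentence-level sentiment classifier (binary).
# All hyperparameters below are illustrative assumptions, not the paper's.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 60, 100   # assumed values

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM),      # word embeddings
    LSTM(128),                           # sequential encoder keeps word order
    Dense(1, activation="sigmoid"),      # positive/negative polarity
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Toy data: integer-encoded sentences and 0/1 polarity labels.
X = np.random.randint(1, VOCAB_SIZE, size=(256, MAX_LEN))
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=1, batch_size=32)
```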

169 citations


Journal ArticleDOI
TL;DR: This survey presents a comprehensive overview of the works done so far on Arabic SA and tries to identify the gaps in the current literature laying foundation for future studies in this field.
Abstract: Sentiment analysis (SA) is a continuing field of research that lies at the intersection of many fields such as data mining, natural language processing and machine learning. It is concerned with the automatic extraction of opinions conveyed in a certain text. Due to its vast applications, many studies have been conducted in the area of SA, especially on English texts, while other languages such as Arabic received less attention. This survey presents a comprehensive overview of the works done so far on Arabic SA (ASA). The survey groups published papers based on the SA-related problems they address and tries to identify the gaps in the current literature, laying the foundation for future studies in this field.

153 citations


Journal ArticleDOI
TL;DR: This article proposes a feature framework for detecting fake reviews, evaluated in the consumer electronics domain, where the AdaBoost classifier proved to be the best one by statistical means according to the Friedman test.
Abstract: The impact of online reviews on businesses has grown significantly in recent years, becoming crucial to business success in a wide array of sectors, ranging from restaurants and hotels to e-commerce. Unfortunately, some users resort to unethical means to improve their online reputation by writing fake reviews of their businesses or competitors. Previous research has addressed fake review detection in a number of domains, such as product or business reviews in restaurants and hotels. However, in spite of its economic interest, the domain of consumer electronics businesses has not yet been thoroughly studied. This article proposes a feature framework for detecting fake reviews that has been evaluated in the consumer electronics domain. The contributions are fourfold: (i) construction of a dataset for classifying fake reviews in the consumer electronics domain in four different cities based on scraping techniques; (ii) definition of a feature framework for fake review detection; (iii) development of a fake review classification method based on the proposed framework; and (iv) evaluation and analysis of the results for each of the cities under study. We reached an 82% F-score on the classification task, and the AdaBoost classifier proved to be the best one by statistical means according to the Friedman test.
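The paper's concrete feature set is its own contribution; purely to illustrate the pipeline shape (hand-built review features fed to an AdaBoost classifier), here is a hedged scikit-learn sketch in which every feature choice and label is an invented toy assumption:

```python
# Sketch: AdaBoost over a hand-built review-feature matrix.
# The features and labels here are toy assumptions, not the paper's framework.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def featurize(review, rating, n_reviews_by_author):
    """Toy feature vector: length, shouting ratio, rating extremity, activity."""
    words = review.split()
    return [
        len(words),                                            # review length
        sum(w.isupper() for w in words) / max(len(words), 1),  # ALL-CAPS ratio
        abs(rating - 3),                                       # extremity on a 1-5 scale
        n_reviews_by_author,                                   # reviewer activity
    ]

X = np.array([featurize("GREAT phone BUY IT", 5, 41),
              featurize("Decent camera, average battery life overall", 4, 3)])
y = np.array([1, 0])   # 1 = fake, 0 = genuine (toy labels)

clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))
```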

149 citations


Journal ArticleDOI
TL;DR: TwitterNews+, an event detection system that incorporates specialized inverted indices and an incremental clustering approach to provide a low computational cost solution to detect both major and minor newsworthy events in real-time from the Twitter data stream is proposed.
Abstract: Detecting events in real-time from the Twitter data stream has gained substantial attention in recent years from researchers around the world. Different event detection approaches have been proposed as a result of these research efforts. One of the major challenges faced in this context is the high computational cost associated with event detection in real-time. We propose TwitterNews+, an event detection system that incorporates specialized inverted indices and an incremental clustering approach to provide a low computational cost solution to detect both major and minor newsworthy events in real-time from the Twitter data stream. In addition, we conduct an extensive parameter sensitivity analysis to fine-tune the parameters used in TwitterNews+ to achieve the best performance. Finally, we evaluate the effectiveness of our system using a publicly available corpus as a benchmark dataset. The results of the evaluation show a significant improvement in terms of recall and precision over the five state-of-the-art baselines we have used.
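To make the incremental-clustering idea concrete, here is a deliberately simplified sketch: each incoming tweet either joins its most similar existing cluster or starts a new one. The similarity threshold, TF-IDF representation and centroid averaging are assumptions for illustration; TwitterNews+ itself additionally relies on specialized inverted indices to keep this fast.

```python
# Sketch: threshold-based incremental clustering of tweets (a simplification;
# the real system uses specialized inverted indices for low-cost lookups).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = ["earthquake hits city centre", "strong earthquake near the city",
          "team wins the cup final", "cup final won after penalties"]
X = TfidfVectorizer().fit_transform(tweets)

THRESHOLD = 0.3                      # assumed similarity threshold
centroids, assignment = [], []
for i in range(X.shape[0]):          # tweets arrive one at a time
    vec = X[i]
    if centroids:
        sims = [cosine_similarity(vec, c)[0, 0] for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= THRESHOLD:  # join the closest existing cluster/event
            centroids[best] = (centroids[best] + vec) / 2   # running centroid
            assignment.append(best)
            continue
    centroids.append(vec)            # otherwise start a new cluster/event
    assignment.append(len(centroids) - 1)

print(assignment)                    # e.g. [0, 0, 1, 1]
```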

137 citations


Journal ArticleDOI
TL;DR: A model to automatically assess the level of propagandistic content in an article based on different representations, from writing style and readability level to the presence of certain keywords is proposed.
Abstract: Propaganda is a mechanism to influence public opinion, which is inherently present in extremely biased and fake news. Here, we propose a model to automatically assess the level of propagandistic content in an article based on different representations, from writing style and readability level to the presence of certain keywords. We experiment thoroughly with different variations of such a model on a new publicly available corpus, and we show that character n-grams and other style features outperform existing alternatives that identify propaganda based on word n-grams. Unlike previous work, we make sure that the test data comes from news sources that were unseen during training, thus penalizing learning algorithms that model the news sources used at training time as opposed to solving the actual task. We integrate our supervised model into a public website, which organizes recent articles covering the same event on the basis of their propagandistic content. This allows users to quickly explore different perspectives of the same story, and it also enables investigative journalists to dig further into how different media use stories and propaganda to pursue their agenda.
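A character n-gram style classifier of the kind the paper finds effective can be prototyped in a few lines; the toy texts, labels and n-gram range below are assumptions, and the paper's actual feature set is richer (readability, keywords, style):

```python
# Sketch: character n-gram style features for propaganda detection.
# Texts, labels and the (2, 5) n-gram range are toy assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["They will stop at NOTHING to silence us!!!",
         "The committee published its annual budget report on Tuesday."]
labels = [1, 0]   # 1 = propagandistic, 0 = not (toy labels)

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # char n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["Read the shocking TRUTH they are hiding!!!"]))
```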

133 citations


Journal ArticleDOI
TL;DR: A coattention mechanism that alternately models target-level and context-level attention, so as to focus on the key words of targets and learn more effective context representations, and a Coattention-LSTM network that learns nonlinear representations of context and target simultaneously and extracts more effective sentiment features from the coattention mechanism.
Abstract: Aspect-based sentiment analysis aims to predict the sentiment polarities of specific targets in a given text. Recent research shows great interest in modeling the target and context with attention networks to obtain more effective feature representations for the sentiment classification task. However, using an average vector of the target to compute the attention score for the context is unfair, and the interaction mechanism is simple and thus needs further improvement. To solve the above problems, this paper first proposes a coattention mechanism that alternately models both target-level and context-level attention, so as to focus on the key words of targets and learn more effective context representations. On this basis, we implement a Coattention-LSTM network which learns nonlinear representations of context and target simultaneously and can extract more effective sentiment features from the coattention mechanism. Further, a Coattention-MemNet network which adopts a multiple-hop coattention mechanism is proposed to improve the sentiment classification result. Finally, we propose a new location-weighted function which considers location information to enhance the performance of the coattention mechanism. Extensive experiments on two public datasets demonstrate the effectiveness of all proposed methods, and our findings provide new insight for future development of attention mechanisms and deep neural networks for aspect-based sentiment analysis.

113 citations


Journal ArticleDOI
TL;DR: The findings suggest that relatively small domain-specific input corpora drawn from Twitter are better at extracting meaningful semantic relationships than generic pre-trained Word2Vec or GloVe embeddings; the accuracy of word vectors for identifying crisis-related actionable tweets is also explored.
Abstract: Unstructured tweet feeds are becoming the source of real-time information for various events. However, extracting actionable information in real-time from this unstructured text data is a challenging task. Hence, researchers are employing word embedding approaches to classify unstructured text data. We set our study in the contexts of the 2014 Ebola and 2016 Zika outbreaks and probed the accuracy of domain-specific word vectors for identifying crisis-related actionable tweets. Our findings suggest that relatively small domain-specific input corpora drawn from Twitter are better at extracting meaningful semantic relationships than generic pre-trained Word2Vec (trained on Google News) or GloVe (from the Stanford NLP group) embeddings. However, quality domain-specific tweet corpora are normally scant during the early stages of outbreaks, and identifying actionable tweets during early stages is crucial to stemming the proliferation of an outbreak. To overcome this challenge, we consider scholarly abstracts related to the Ebola and Zika viruses from PubMed and probe the efficiency of cross-domain resource utilization for word vector generation. Our findings demonstrate the relevance of PubMed abstracts for training when Twitter data (as input corpus) is scant during the early stages of an outbreak. Thus, this approach can be implemented to handle future outbreaks in real time. We also explore the accuracy of our word vectors for various model architectures and hyperparameter settings. We observe that Skip-gram accuracies are better than CBOW, and higher dimensions yield better accuracy.
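Training domain-specific skip-gram vectors of the kind the study favors is a one-liner with gensim; the tiny corpus and all hyperparameters below are illustrative assumptions (gensim 4.x API):

```python
# Sketch: domain-specific skip-gram vectors on a toy tweet corpus.
# Corpus and hyperparameters are illustrative assumptions (gensim 4.x API).
from gensim.models import Word2Vec

tweets = [["ebola", "outbreak", "confirmed", "in", "the", "region"],
          ["zika", "virus", "spread", "by", "mosquito", "bites"],
          ["new", "ebola", "case", "reported", "near", "the", "border"]]

model = Word2Vec(sentences=tweets, vector_size=100, window=5,
                 sg=1,            # sg=1 selects Skip-gram (vs. CBOW)
                 min_count=1, epochs=20, seed=0)

print(model.wv.most_similar("ebola", topn=3))
```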

111 citations


Journal ArticleDOI
TL;DR: This paper proposes a classification of methods for tracking community evolution in dynamic social networks into four main approaches, using the functioning principle as the criterion; the first approach is based on independent successive static detection and matching.
Abstract: This paper presents a survey of previous studies done on the problem of tracking community evolution over time in dynamic social networks. This problem is of crucial importance in the field of social network analysis. The goal of our paper is to classify existing methods dealing with the issue. We propose a classification of various methods for tracking community evolution in dynamic social networks into four main approaches using as a criterion the functioning principle: the first one is based on independent successive static detection and matching; the second is based on dependent successive static detection; the third is based on simultaneous study of all stages of community evolution; finally, the fourth and last one concerns methods working directly on temporal networks. Our paper starts by giving basic concepts about social networks, community structure and strategies for evaluating community detection methods. Then, it describes the different approaches, and exposes the strengths as well as the weaknesses of each.

109 citations


Journal ArticleDOI
TL;DR: It is suggested that product type moderates the effects of emotions on perceived review helpfulness, and fear embedded in a review is identified as an important emotional cue to positively affect the perceivedreview helpfulness with more persuasive messages.
Abstract: This paper extracted discrete emotions from online reviews based on an emotion classification approach, and examined the differential effects of three discrete emotions (anger, fear, sadness) on perceived review helpfulness. We empirically tested the hypotheses by analyzing the “verified purchase” reviews on Amazon.com. The findings of this study extend the previous research by suggesting that product type moderates the effects of emotions on perceived review helpfulness. Anger embedded in a customer review exerts a greater negative impact on perceived review helpfulness for experience goods than for search goods. Fear embedded in a review is identified as an important emotional cue to positively affect the perceived review helpfulness with more persuasive messages. As the level of sadness embedded in a review increases, perceived review helpfulness decreases. These findings contribute to a better understanding of the important role of emotions embedded in reviews on the perceived review helpfulness. This study also provides practical insights related to the presentation of online reviews and gives suggestions for consumers regarding how to select and write a helpful review.

104 citations


Journal ArticleDOI
TL;DR: According to the findings, Technology–Organization–Environment and Diffusion of Innovations are the most popular theoretical models used for big data adoption in various domains, and forty-two factors in technology, organization, environment, and innovation are revealed to have a significant influence on big data adoption.
Abstract: Big data adoption is a process through which businesses find innovative ways to enhance productivity and predict risk so as to satisfy customers' needs more efficiently. Despite the increase in demand for and importance of big data adoption, there is still a lack of comprehensive review and classification of the existing studies in this area. This research aims to gain a comprehensive understanding of the current state of the art by highlighting the theoretical models, influence factors, and research challenges of big data adoption. By adopting a systematic selection process, twenty studies were identified in the domain of big data adoption and reviewed in order to extract relevant information answering a set of research questions. According to the findings, Technology–Organization–Environment and Diffusion of Innovations are the most popular theoretical models used for big data adoption in various domains. This research also revealed forty-two factors in technology, organization, environment, and innovation that have a significant influence on big data adoption. Finally, the challenges found in current research on big data adoption are presented, and future research directions are recommended. This study is helpful for researchers and stakeholders taking initiatives that will alleviate the challenges and facilitate big data adoption in various fields.

Journal ArticleDOI
TL;DR: Three transfer learning-based approaches to using sentiment knowledge from external resources to improve the attention mechanism of recurrent neural models for capturing hidden patterns for incongruity are proposed.
Abstract: Irony as a literary technique is widely used in online texts such as Twitter posts. Accurate irony detection is crucial for tasks such as effective sentiment analysis. A text’s ironic intent is defined by its context incongruity. For example in the phrase “I love being ignored”, the irony is defined by the incongruity between the positive word “love” and the negative context of “being ignored”. Existing studies mostly formulate irony detection as a standard supervised learning text categorization task, relying on explicit expressions for detecting context incongruity. In this paper we formulate irony detection instead as a transfer learning task where supervised learning on irony labeled text is enriched with knowledge transferred from external sentiment analysis resources. Importantly, we focus on identifying the hidden, implicit incongruity without relying on explicit incongruity expressions, as in “I like to think of myself as a broken down Justin Bieber – my philosophy professor.” We propose three transfer learning-based approaches to using sentiment knowledge to improve the attention mechanism of recurrent neural models for capturing hidden patterns for incongruity. Our main findings are: (1) Using sentiment knowledge from external resources is a very effective approach to improving irony detection; (2) For detecting implicit incongruity, transferring deep sentiment features seems to be the most effective way. Experiments show that our proposed models outperform state-of-the-art neural models for irony detection.

Journal ArticleDOI
TL;DR: A method of sentiment lexicon embedding that represents the semantic relationships of sentiment words better than existing word embedding techniques, without a manually annotated sentiment corpus, is proposed; it improved the performance of sentiment classification.
Abstract: Although deep learning breakthroughs in NLP are based on learning distributed word representations by neural language models, these methods suffer from a classic drawback of unsupervised learning techniques. Furthermore, the performance of general word embeddings has been shown to be heavily task-dependent. To tackle this issue, recent studies have proposed learning sentiment-enhanced word vectors for sentiment analysis. However, the common limitation of these approaches is that they require external sentiment lexicon sources, and the construction and maintenance of these resources involves a set of complex, time-consuming, and error-prone tasks. In this regard, this paper proposes a method of sentiment lexicon embedding that represents the semantic relationships of sentiment words better than existing word embedding techniques, without a manually annotated sentiment corpus. The major distinguishing factor of the proposed framework is that it jointly encodes morphemes and their POS tags, and trains only important lexical morphemes in the embedding space. To verify the effectiveness of the proposed method, we conducted experiments comparing it with two baseline models. As a result, the revised embedding approach mitigated the problem of the conventional context-based word embedding method and, in turn, improved the performance of sentiment classification.
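One crude way to emulate the joint morpheme/POS encoding described above with off-the-shelf tooling is to fuse each morpheme with its tag into a single token before embedding, keeping only lexical morphemes; this is an assumed simplification for illustration, not the paper's exact procedure:

```python
# Sketch: emulate joint morpheme+POS encoding by fusing each morpheme with
# its tag into one token (an assumed simplification of the paper's method).
from gensim.models import Word2Vec

tagged_sentences = [
    [("service", "NN"), ("be", "VB"), ("excellent", "JJ")],
    [("food", "NN"), ("be", "VB"), ("terrible", "JJ")],
]
CONTENT_TAGS = {"NN", "VB", "JJ", "RB"}   # keep only lexical morphemes

corpus = [[f"{m}/{t}" for m, t in sent if t in CONTENT_TAGS]
          for sent in tagged_sentences]   # e.g. ["service/NN", "be/VB", ...]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("excellent/JJ", topn=2))
```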

Journal ArticleDOI
TL;DR: The mid-level visual features extracted by the conventional SentiBank approach are used to represent visual concepts, with the integration of other features, including textual, visual and social features, to develop a machine learning sentiment analysis approach.
Abstract: Social media users are increasingly using both images and text to express their opinions and share their experiences, instead of only using text in the conventional social media. Consequently, the conventional text-based sentiment analysis has evolved into more complicated studies of multimodal sentiment analysis. To tackle the challenge of how to effectively exploit the information from both visual content and textual content from image-text posts, this paper proposes a new image-text consistency driven multimodal sentiment analysis approach. The proposed approach explores the correlation between the image and the text, followed by a multimodal adaptive sentiment analysis method. To be more specific, the mid-level visual features extracted by the conventional SentiBank approach are used to represent visual concepts, with the integration of other features, including textual, visual and social features, to develop a machine learning sentiment analysis approach. Extensive experiments are conducted to demonstrate the superior performance of the proposed approach.

Journal ArticleDOI
TL;DR: This article proposes a novel approach to simultaneously train a vanilla sentiment classifier and adapt word polarities to the target domain and sequentially track the wrongly predicted sentences and use them as the supervision instead of addressing the gold standard as a whole to emulate the life-long cognitive process of lexicon learning.
Abstract: Sentiment lexicons are essential tools for polarity classification and opinion mining. In contrast to machine learning methods that only leverage text features or raw text for sentiment analysis, methods that use sentiment lexicons embrace higher interpretability. Although a number of domain-specific sentiment lexicons are made available, it is impractical to build an ex ante lexicon that fully reflects the characteristics of the language usage in endless domains. In this article, we propose a novel approach to simultaneously train a vanilla sentiment classifier and adapt word polarities to the target domain. Specifically, we sequentially track the wrongly predicted sentences and use them as the supervision instead of addressing the gold standard as a whole to emulate the life-long cognitive process of lexicon learning. An exploration-exploitation mechanism is designed to trade off between searching for new sentiment words and updating the polarity score of one word. Experimental results on several popular datasets show that our approach significantly improves the sentiment classification performance for a variety of domains by means of improving the quality of sentiment lexicons. Case-studies also illustrate how polarity scores of the same words are discovered for different domains.

Journal ArticleDOI
TL;DR: The findings demonstrate the power of the role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base in Wikipedia.
Abstract: Automatic text summarization attempts to provide an effective solution to today’s unprecedented growth of textual data. This paper proposes an innovative graph-based text summarization framework for generic single- and multi-document summarization. The summarizer benefits from two well-established text semantic representation techniques, Semantic Role Labelling (SRL) and Explicit Semantic Analysis (ESA), as well as the constantly evolving collective human knowledge in Wikipedia. SRL is used to achieve sentence semantic parsing, whose word tokens are represented as vectors of weighted Wikipedia concepts using the ESA method. The essence of the developed framework is to construct a unique concept graph representation underpinned by semantic role-based multi-node (sub-sentence-level) vertices for summarization. We have empirically evaluated the summarization system using the standard publicly available dataset from the Document Understanding Conference 2002 (DUC 2002). Experimental results indicate that the proposed summarizer outperforms all state-of-the-art comparators in single-document summarization on the ROUGE-1 and ROUGE-2 measures, while ranking second in the ROUGE-1 and ROUGE-SU4 scores for multi-document summarization. The testing also demonstrates the scalability of the system: varying the evaluation data size is shown to have little impact on summarizer performance, particularly for the single-document summarization task. In a nutshell, the findings demonstrate the power of the role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base in Wikipedia.

Journal ArticleDOI
TL;DR: The preliminary results are promising, with the system being able to detect outbreaks of influenza-like illness symptoms which could then be confirmed by existing official sources, and shows that using social media data can improve prediction for multiple diseases over simply using traditional data sources.
Abstract: Interest in real-time syndromic surveillance based on social media data has greatly increased in recent years. The ability to detect disease outbreaks earlier than traditional methods would be highly useful for public health officials. This paper describes a software system which is built upon recent developments in machine learning and data processing to achieve this goal. The system is built from reusable modules integrated into data processing pipelines that are easily deployable and configurable. It applies deep learning to the problem of classifying health-related tweets and is able to do so with high accuracy. It has the capability to detect illness outbreaks from Twitter data and then to build up and display information about these outbreaks, including relevant news articles, to provide situational awareness. It also provides nowcasting functionality of current disease levels from previous clinical data combined with Twitter data. The preliminary results are promising, with the system being able to detect outbreaks of influenza-like illness symptoms which could then be confirmed by existing official sources. The Nowcasting module shows that using social media data can improve prediction for multiple diseases over simply using traditional data sources.

Journal ArticleDOI
TL;DR: The proposed hybrid algorithm not only helps to reduce the complexity of the SAGASW algorithm and effectively extract the optimal feature subset, but also obtains the maximum classification accuracy and minimum misclassification cost.
Abstract: Breast cancer is one of the leading causes of death among women worldwide. Accurate and early detection of breast cancer can ensure long-term survival for patients. However, traditional classification algorithms usually aim only to maximize classification accuracy, failing to take into consideration the misclassification costs between different categories. Furthermore, the costs associated with missing a cancer case (false negative) are clearly much higher than those of mislabeling a benign one (false positive). To overcome this drawback and further improve the classification accuracy of breast cancer diagnosis, this work proposes a novel breast cancer intelligent diagnosis approach that employs an information gain directed simulated annealing genetic algorithm wrapper (IGSAGAW) for feature selection. In this process, we rank features according to the IG algorithm and extract the top m optimal features using the cost-sensitive support vector machine (CSSVM) learning algorithm. Our proposed feature selection approach not only helps to reduce the complexity of the SAGASW algorithm and effectively extract the optimal feature subset to a certain extent, but also obtains the maximum classification accuracy and minimum misclassification cost. The efficacy of our proposed approach is tested on the Wisconsin Original Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets, and the results demonstrate that our proposed hybrid algorithm outperforms other comparison methods. The main objective of this study was to apply our research in real clinical diagnostic systems and thereby assist clinical physicians in making correct and effective decisions in the future. Moreover, our proposed method could also be applied to the diagnosis of other illnesses.
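The cost-sensitive piece can be illustrated in isolation: class weights make a missed cancer (false negative) costlier than a false alarm. The 5:1 cost ratio below is an assumption, the data is scikit-learn's copy of the WDBC dataset, and the paper's IG ranking and SA/GA wrapper feature selection are omitted:

```python
# Sketch: a cost-sensitive SVM (the CSSVM idea) in isolation, penalizing
# false negatives more than false positives via class weights.
# The 5:1 cost ratio is an assumption; feature selection is omitted.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)   # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", class_weight={0: 5.0, 1: 1.0})  # missing a cancer costs 5x
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```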

Journal ArticleDOI
TL;DR: The sparsity problem is ameliorated by presenting a novel fuzzy topic modeling (FTM) approach for short text from a fuzzy perspective; the classification accuracies of FTM on the snippets and questions datasets are higher than those of state-of-the-art baseline topic models.
Abstract: In this era, the proliferating role of social media in our lives has popularized the posting of short texts. Short texts contain limited context with unique characteristics that make them difficult to handle. Every day, billions of short texts are produced in the form of tags, keywords, tweets, phone messages, messenger conversations, social network posts, etc. The analysis of these short texts is imperative in the field of text mining and content analysis. The extraction of precise topics from large-scale short text documents is a critical and challenging task. Conventional approaches fail to obtain word co-occurrence patterns in topics due to the sparsity problem in short texts, such as text over the web, social media like Twitter, and news headlines. Therefore, in this paper, the sparsity problem is ameliorated by presenting a novel fuzzy topic modeling (FTM) approach for short text from a fuzzy perspective. In this research, the local and global term frequencies are computed through a bag-of-words (BOW) model. To remove the negative impact of high dimensionality on global term weighting, principal component analysis is adopted; thereafter, the fuzzy c-means algorithm is employed to retrieve the semantically relevant topics from the documents. The experiments are conducted on three real-world short text datasets: the snippets dataset is small, whereas the other two, Twitter and questions, are larger. Experimental results show that the proposed approach discovers topics more precisely and performs better than state-of-the-art baseline topic models such as GLTM, CSTM, LTM, LDA, Mix-gram, BTM, SATM, and DREx+LDA. The performance of FTM is also demonstrated in classification, clustering, topic coherence and execution time. FTM classification accuracy is 0.95, 0.94, 0.91, 0.89 and 0.87 on the snippets dataset with 50, 75, 100, 125 and 200 topics, and 0.73, 0.74, 0.70, 0.68 and 0.78 on the questions dataset with the same numbers of topics. The classification accuracies of FTM on the snippets and questions datasets are higher than those of the state-of-the-art baseline topic models.
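The PCA + fuzzy c-means core of this pipeline can be sketched compactly; the BOW counts, cluster count c and fuzzifier m below are toy assumptions, and a plain fuzzy c-means is implemented inline to keep the sketch self-contained:

```python
# Sketch: the PCA + fuzzy c-means core of an FTM-style pipeline.
# BOW counts, c (clusters) and m (fuzzifier) are toy assumptions.
import numpy as np
from sklearn.decomposition import PCA

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: returns (cluster centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))        # random soft memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # fuzzy-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / d ** (2 / (m - 1))                  # standard FCM update
        U /= U.sum(axis=1, keepdims=True)             # rows sum to 1
    return centers, U

docs = np.array([[3, 0, 1, 0], [2, 0, 2, 0],          # toy BOW counts with
                 [0, 4, 0, 1], [0, 3, 0, 2]], float)  # two latent topics
X = PCA(n_components=2).fit_transform(docs)           # reduce dimensionality
centers, U = fuzzy_cmeans(X, c=2)
print(np.round(U, 2))   # soft topic membership of each document
```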

Journal ArticleDOI
TL;DR: An opinion monitoring service implementing a set of unsupervised strategies for aspect-based opinion mining, together with a monitoring tool supporting users in visualizing the analyzed data; its effectiveness has been compared with the results obtained by domain-adapted techniques.
Abstract: One of the most important opinion mining research directions is the extraction of polarities referring to specific entities (aspects) contained in the analyzed texts. The detection of such aspects can be very critical, especially when documents come from unknown domains. Indeed, while in some contexts it is possible to train domain-specific models to improve the effectiveness of aspect extraction algorithms, in others the most suitable solution is to apply unsupervised techniques, making such algorithms domain-independent and more efficient in a real-time environment. Moreover, an emerging need is to exploit the results of aspect-based analysis for triggering actions based on these data. This leads to the necessity of providing solutions supporting both an effective analysis of user-generated content and an efficient and intuitive way of visualizing the collected data. In this work, we implemented an opinion monitoring service providing (i) a set of unsupervised strategies for aspect-based opinion mining together with (ii) a monitoring tool supporting users in visualizing the analyzed data. The aspect extraction strategies are based on an open information extraction approach. The effectiveness of the platform has been tested on benchmarks provided by the SemEval campaign and compared with the results obtained by domain-adapted techniques.

Journal ArticleDOI
TL;DR: This study presents an overview of approaches that can be applied to extract, and later present in a brief, clear and concise way, the valuable information nuggets residing within text.
Abstract: With the advent of Web 2.0, many online platforms, such as social networks, online blogs and magazines, produce massive amounts of textual data. This textual data carries information that can be used for the betterment of humanity; hence, there is a dire need to extract the potential information out of it. This study presents an overview of approaches that can be applied to extract, and later present in a brief, clear and concise way, these valuable information nuggets residing within text. In this regard, two major tasks, automatic keyword extraction and text summarization, are reviewed. To compile the literature, scientific articles were collected from major digital computing research repositories. In light of the acquired literature, the survey covers early approaches all the way to recent advancements using machine learning solutions. The survey finds that annotated benchmark datasets for various textual data generators, such as Twitter and social forums, are not available; this scarcity of datasets has resulted in relatively little progress in many domains. Applications of deep learning techniques to the task of automatic keyword extraction are also relatively unaddressed, so the impact of various deep architectures stands as an open research direction. For the text summarization task, deep learning techniques have been applied since the advent of word vectors and currently govern the state of the art for abstractive summarization. One of the major remaining challenges in these tasks is semantics-aware evaluation of generated results.

Journal ArticleDOI
Wen Zhou, Wenbo Han
TL;DR: A novel graph-based ranking oriented recommendation algorithm that exploits both explicit and implicit feedback of users, and utilizes a user-preference-item tripartite graph model and modified resource allocation process to match the target user with users who share similar preferences, and make personalized recommendations.
Abstract: Graph-based recommendation approaches use a graph model to represent the relationships between users and items, and exploit the graph structure to make recommendations. Recent graph-based recommendation approaches have focused on capturing users’ pairwise preferences and utilized a graph model to exploit the relationships between different entities in the graph. In this paper, we focus on the impact of pairwise preferences on the diversity of recommendations. We propose a novel graph-based ranking-oriented recommendation algorithm that exploits both explicit and implicit feedback of users. The algorithm utilizes a user-preference-item tripartite graph model and a modified resource allocation process to match the target user with users who share similar preferences, and to make personalized recommendations. The principle of the additional preference layer is to capture users’ pairwise preferences and provide detailed information about users for further recommendations. Empirical analysis on four benchmark datasets demonstrated that our proposed algorithm performs better in most situations than other graph-based and ranking-oriented benchmark algorithms.
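For intuition on the resource allocation process the algorithm modifies, here is the standard two-step mass diffusion on a plain user-item bipartite graph (the paper's tripartite model adds a preference layer on top of this); the interaction matrix is a toy assumption:

```python
# Sketch: standard two-step resource allocation on a user-item bipartite graph.
# The paper's tripartite model adds a preference layer and modifies this process.
import numpy as np

# Rows = users, cols = items; 1 = user collected the item (toy data).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)

k_items = A.sum(axis=0)                        # item degrees
k_users = A.sum(axis=1)                        # user degrees

target = 0                                     # recommend for user 0
f = A[target].copy()                           # unit resource on user 0's items
f_users = (A / k_items) @ f                    # step 1: items spread to users
f_items = (A / k_users[:, None]).T @ f_users   # step 2: users spread to items

f_items[A[target] == 1] = -np.inf              # mask already-collected items
print("recommend item", int(np.argmax(f_items)))   # -> item 2
```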

Journal ArticleDOI
TL;DR: The proposed Adversarial-neural Topic Model (ATM) models topics with a Dirichlet prior and employs a generator network to capture the semantic patterns among latent topics; experiments show that ATM generates more coherent topics, outperforming a number of competitive baselines.
Abstract: Topic models are widely used for thematic structure discovery in text. But traditional topic models often require dedicated inference procedures for specific tasks at hand. Also, they are not designed to generate word-level semantic representations. To address these limitations, we propose a neural topic modeling approach based on Generative Adversarial Nets (GANs), called the Adversarial-neural Topic Model (ATM). To the best of our knowledge, this work is the first attempt to use adversarial training for topic modeling. The proposed ATM models topics with a Dirichlet prior and employs a generator network to capture the semantic patterns among latent topics. Meanwhile, the generator can also produce word-level semantic representations. In addition, to illustrate the feasibility of porting ATM to tasks other than topic modeling, we apply ATM to open-domain event extraction. To validate the effectiveness of the proposed ATM, two topic modeling benchmark corpora and an event dataset are employed in the experiments. Our experimental results on the benchmark corpora show that ATM generates more coherent topics (considering five topic coherence measures), outperforming a number of competitive baselines. Moreover, the experiments on the event dataset also validate that the proposed approach is able to extract meaningful events from news articles.

Journal ArticleDOI
TL;DR: The multi-centrality index (MCI) approach is presented, which aims to find the optimal combination of word rankings according to the selection of centrality measures in co-occurrence word-graph representations of documents.
Abstract: Keyword extraction aims to capture the main topics of a document and is an important step in natural language processing (NLP) applications. The use of different graph centrality measures has been proposed to extract automatic keywords. However, there is no consensus yet on how these measures compare in this task. Here, we present the multi-centrality index (MCI) approach, which aims to find the optimal combination of word rankings according to the selection of centrality measures. We analyze nine centrality measures (Betweenness, Clustering Coefficient, Closeness, Degree, Eccentricity, Eigenvector, K-Core, PageRank, Structural Holes) for identifying keywords in co-occurrence word-graphs representation of documents. We perform experiments on three datasets of documents and demonstrate that all individual centrality methods achieve similar statistical results, while the proposed MCI approach significantly outperforms the individual centralities, three clustering algorithms, and previously reported results in the literature.
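The building blocks of such an approach are easy to assemble with networkx: build a co-occurrence graph, score words under several centralities, and combine the rankings. The naive rank averaging below is a placeholder assumption; MCI instead searches for the optimal combination:

```python
# Sketch: centrality-based keyword candidates from a co-occurrence graph.
# The naive rank average stands in for MCI's learned optimal combination.
import itertools
import networkx as nx

sentences = [["graph", "centrality", "keyword", "extraction"],
             ["keyword", "extraction", "graph", "model"],
             ["centrality", "measures", "rank", "keyword"]]

G = nx.Graph()
for sent in sentences:                  # co-occurrence within a sentence
    G.add_edges_from(itertools.combinations(set(sent), 2))

measures = {                            # four of the nine measures analyzed
    "degree": nx.degree_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "pagerank": nx.pagerank(G),
}

def to_rank(scores):                    # word -> rank position (0 = best)
    order = sorted(scores, key=scores.get, reverse=True)
    return {w: i for i, w in enumerate(order)}

ranks = [to_rank(s) for s in measures.values()]
combined = {w: sum(r[w] for r in ranks) / len(ranks) for w in G}
print(sorted(combined, key=combined.get)[:3])   # top-3 keyword candidates
```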

Journal ArticleDOI
TL;DR: Google's Word2Vec, a deep learning language model, is applied to enhance keywords with more complete semantic information, and the semantic space is developed as an urban geographic space, as if keywords were the changing lands of an evolving city.
Abstract: Topic evolution has been described by many approaches, from a macro level to a detailed level, by extracting topic dynamics from text in the literature and other media types. However, why the evolution happens is less studied. In this paper, we focus on whether and how keyword semantics can invoke or affect topic evolution. We assume that the semantic relatedness among keywords can affect topic popularity during the literature surveying and citing process, thus invoking evolution. However, this assumption needs to be confirmed by an approach that fully considers the semantic interactions among topics. Traditional topic evolution analyses in scientometric domains cannot provide such support because they use limited semantic meanings. To address this problem, we apply Google's Word2Vec, a deep learning language model, to enhance the keywords with more complete semantic information. We further develop the semantic space as an urban geographic space. We analyze topic evolution geographically using measures of spatial autocorrelation, as if keywords were the changing lands of an evolving city. Keyword citations (a keyword citation is counted each time a paper containing the keyword obtains a citation) are used as an indicator of keyword popularity. Using bibliographical datasets from the geographical natural hazard field, experimental results demonstrate that in some local areas, the popularity of keywords affects that of the surrounding keywords, though there are no significant impacts on the evolution of all keywords. The spatial autocorrelation analysis identifies the interaction patterns (including High-High leading and High-Low suppressing) among the keywords in local areas. This approach can be regarded as an analysis framework borrowed from geospatial modeling. Moreover, the prediction results in local areas are demonstrated to be more accurate when spatial autocorrelation is considered.

Journal ArticleDOI
TL;DR: A comparative performance evaluation using various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers, reveals that a three-hidden-layer feedforward network with 1024 neurons obtains the highest document classification performance on the INFUSE dataset.
Abstract: This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories utilizing deep learning. The model architecture consists of two main modules: document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. Acquisition of terminology integrated into the ontology extends the capabilities of semantically rich document representations with in-depth coverage of concepts, thereby capturing the whole conceptualization involved in documents. The semantically rich representations obtained from the first module serve as input to the document classification module, which aims at finding the most appropriate category for each document through deep learning. Three different deep learning networks, each belonging to a different category of machine learning techniques, are used for ontological document classification with a real-life ontology. Multiple simulations are carried out with various deep neural network configurations, and our findings reveal that a three-hidden-layer feedforward network with 1024 neurons obtains the highest document classification performance on the INFUSE dataset. The performance in terms of F1 score is further increased by almost five percentage points to 78.10% for the same network configuration when the relevant terminology integrated into the ontology is applied to enrich document representation. Furthermore, we conducted a comparative performance evaluation using various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers.
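The reported best classifier shape is easy to reproduce in outline; the sketch below assumes 1024 units in each of the three hidden layers and uses random placeholder vectors where the paper's ontology-enriched document representations would go:

```python
# Sketch: a three-hidden-layer feedforward classifier (1024 units per layer
# is an assumption). Random vectors stand in for the paper's ontology-enriched
# document representations.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))    # assumed 300-dim document vectors
y = rng.integers(0, 4, size=200)   # assumed 4 document categories

clf = MLPClassifier(hidden_layer_sizes=(1024, 1024, 1024),
                    max_iter=50, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))             # training accuracy on the toy data
```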

Journal ArticleDOI
TL;DR: Long Short-Term Memory network based models for identifying rumors by capturing the dynamic changes of forwarding contents, spreaders and diffusion structures of the whole or only the beginning part of the spreading process are proposed.
Abstract: In the social media environment, rumors are constantly breeding and rapidly spreading, which has become a severe social problem, often leading to serious consequences (e.g., social panic and even chaos). Therefore, how to identify rumors quickly and accurately has become a key prerequisite for taking effective measures to curb the spread of rumors and reduce their influence. However, most existing studies employ machine learning based methods to carry out automatic rumor identification by extracting features of rumor contents, posters, and static spreading processes (e.g., follow-ups, thumb-ups, etc.) or by learning the presentation of forwarding contents. These studies fail to take into account the dynamic differences between the spreaders and diffusion structures of rumors and non-rumors. To fill this gap, this paper proposes Long Short-Term Memory (LSTM) network based models for identifying rumors by capturing the dynamic changes of forwarding contents, spreaders and diffusion structures of the whole (in the afterwards identification mode) or only the beginning part (in the halfway identification mode, i.e., early rumor identification) of the spreading process. Experiments conducted on a rumor and non-rumor dataset from Sina Weibo show that the proposed models perform better than existing baselines.

Journal ArticleDOI
TL;DR: This paper proposes a location inference method that utilises a ranking approach combined with a majority voting of tweets, where each vote is weighted based on evidence gathered from the ranking, that can overcome the limitations of geotagged tweets and precisely map incident-related tweets at the real location of the incident.
Abstract: Recently, geolocalisation of tweets has become important for a wide range of real-time applications, including real-time event detection, topic detection or disaster and emergency analysis. However, the number of relevant geotagged tweets available to enable such tasks remains insufficient. To overcome this limitation, predicting the location of non-geotagged tweets, while challenging, can increase the sample of geotagged data and has consequences for a wide range of applications. In this paper, we propose a location inference method that utilises a ranking approach combined with a majority voting of tweets, where each vote is weighted based on evidence gathered from the ranking. Using geotagged tweets from two cities, Chicago and New York (USA), our experimental results demonstrate that our method (statistically) significantly outperforms state-of-the-art baselines in terms of accuracy and error distance, in both cities, with the cost of decreased coverage. Finally, we investigated the applicability of our method in a real-time scenario by means of a traffic incident detection task. Our analysis shows that our fine-grained geolocalisation method can overcome the limitations of geotagged tweets and precisely map incident-related tweets at the real location of the incident.
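The voting step at the heart of the method can be illustrated with a tiny sketch: each candidate geotagged tweet votes for its location, weighted by its retrieval-ranking score. The candidate list and scores below are invented, and the ranking model that would produce them is omitted:

```python
# Sketch: majority voting over candidate geotagged tweets, each vote weighted
# by ranking evidence. Candidates and scores are toy assumptions.
from collections import defaultdict

# (candidate location, ranking score) pairs for one non-geotagged tweet.
candidates = [("Loop, Chicago", 0.92), ("Loop, Chicago", 0.81),
              ("Hyde Park, Chicago", 0.85), ("Loop, Chicago", 0.40),
              ("Hyde Park, Chicago", 0.30)]

votes = defaultdict(float)
for location, score in candidates:
    votes[location] += score                  # evidence-weighted vote

predicted = max(votes, key=votes.get)
print(predicted, round(votes[predicted], 2))  # Loop, Chicago 2.13
```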

Journal ArticleDOI
TL;DR: This study analyzes the relationship between returns volatility and two independent variables, user opinion difference and user interest in the rapidly growing Bitcoin market and finds that the Bitcoin market exhibits greater market efficiency than the general financial market.
Abstract: We study returns volatility and information availability in the rapidly growing Bitcoin market. The market microstructure for bitcoins is highly developed in terms of information generation and transfer. Therefore, returns volatility is highly likely to be affected by market information availability. We analyze the relationship between returns volatility and two independent variables, user opinion difference and user interest. This study adopted the GJR-GARCH model, appropriate for volatility studies. First, we find that, for volatility asymmetry, the Bitcoin market exhibits greater market efficiency than the general financial market. Also, the persistence of volatility is greater. The Bitcoin market is still relatively unregulated; hence, studying the relationship between information asymmetry and regulation in the market is salient. Moreover, we can infer that the ratio of reasonable users is high in the Bitcoin market. Second, the Bitcoin market supports the sequential information arrival hypothesis in that day trading volume, which is a proxy for user differences of opinion, has a statistically significant effect on returns volatility. Third, for the proxies of user interest, namely, the growth rate of page views on Google Trends and Wikipedia, only the growth rate of Google Trends shows statistically significant effects on Bitcoin returns volatility. This study can provide useful information to the financial market and policy makers on the behavior of the Bitcoin market, which may help to lower future entry barriers and opportunity costs of the Bitcoin market.
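A GJR-GARCH fit of the kind used in the study takes a few lines with the `arch` package; synthetic fat-tailed returns stand in for Bitcoin data here, and the study's exogenous regressors (opinion difference, user interest) are omitted:

```python
# Sketch: fitting a GJR-GARCH(1,1,1); synthetic returns stand in for Bitcoin
# data, and the study's exogenous regressors are omitted.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
returns = rng.standard_t(df=5, size=1000) * 0.5   # fat-tailed toy returns (%)

# o=1 adds the asymmetric (leverage) term that turns GARCH into GJR-GARCH.
model = arch_model(returns, vol="GARCH", p=1, o=1, q=1, dist="t")
res = model.fit(disp="off")
print(res.params)   # includes gamma[1], the volatility-asymmetry coefficient
```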

Journal ArticleDOI
TL;DR: According to the experimental results on the STS Benchmark and SICK datasets from SemEval, M-MaxLSTM-CNN outperforms the state-of-the-art methods for textual similarity tasks.
Abstract: Recently, using pretrained word embeddings to represent words has achieved success in many natural language processing tasks. Depending on their objective functions, different word embedding models capture different aspects of linguistic properties. However, the Semantic Textual Similarity task, which evaluates the similarity/relation between two sentences, requires taking these linguistic aspects into account. Therefore, this research aims to encode various characteristics from multiple sets of word embeddings into one embedding, and then learn the similarity/relation between sentences via this novel embedding. Representing each word by multiple word embeddings, the proposed MaxLSTM-CNN encoder generates a novel sentence embedding. We then learn the similarity/relation between our sentence embeddings via multi-level comparison. Our method, M-MaxLSTM-CNN, consistently shows strong performance in several tasks (i.e., measuring textual similarity, identifying paraphrase, recognizing textual entailment). Our model does not use hand-crafted features (e.g., alignment features, n-gram overlaps, dependency features) and does not require the pre-trained word embeddings to have the same dimension.