
Showing papers on "Latent Dirichlet allocation published in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors investigated scholarly articles published between 2003 and 2016 related to LDA-based topic modeling to discover the research development, current trends, and intellectual structure of topic modeling.
Abstract: Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles on topic modeling and applied it in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling; Latent Dirichlet Allocation (LDA) is one of the most popular. Researchers have proposed various models based on LDA for topic modeling. Based on this previous work, this paper serves as a useful introduction to LDA approaches in topic modeling. In this paper, we investigated scholarly articles published between 2003 and 2016 related to LDA-based topic modeling to discover the research development, current trends, and intellectual structure of topic modeling. In addition, we summarize challenges and introduce well-known tools and datasets for LDA-based topic modeling.

608 citations


Journal ArticleDOI
TL;DR: This paper transforms a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec).
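All three representations named above are available in standard Python libraries. Below is a minimal sketch, assuming gensim and scikit-learn, with a toy corpus standing in for the paper's data; none of the parameters are taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["topic models find latent themes",
        "neural embeddings encode documents",
        "tf idf weights words by rarity"]
tokens = [d.split() for d in docs]

# 1) TF-IDF over the bag-of-words scheme
tfidf = TfidfVectorizer().fit_transform(docs)   # (n_docs, n_terms) sparse matrix

# 2) Topic distribution from LDA
dictionary = Dictionary(tokens)
bow = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
theta = [lda.get_document_topics(b, minimum_probability=0.0) for b in bow]

# 3) Doc2Vec document embeddings (vector_size is illustrative)
tagged = [TaggedDocument(t, [i]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=40)
vecs = [d2v.dv[i] for i in range(len(docs))]
```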

270 citations


Journal ArticleDOI
TL;DR: The results indicate that the proposed methodology can obtain effective analysis results with lower cost and shorter time since online reviews are publicly available and easily collected.

165 citations


Journal ArticleDOI
TL;DR: A methodology that can analyse online reviews using machine learning techniques in such a way that practitioners in the fields of tourism and destination management can understand and apply the technique to improve their attractions is developed.

165 citations


Journal ArticleDOI
TL;DR: A text-mining approach using a Bayesian statistical topic model called latent Dirichlet allocation is employed to conduct a comprehensive analysis of 150 articles from 115 journals, revealing seven relevant topics.

162 citations


Journal ArticleDOI
TL;DR: The aim of the paper is to enable researchers to use topic modelling by presenting a step-by-step framework on a case and sharing a code template, which enables large collections of papers to be reviewed in a transparent, reliable, fast, and reproducible way.
Abstract: Manual exploratory literature reviews should be a thing of the past, as technology and machine learning methods have matured. The learning curve for using machine learning methods is rapidly declining, opening new possibilities for all researchers. A framework is presented for how to use topic modelling on a large collection of papers for an exploratory literature review, and for how that can feed into a full literature review. The aim of the paper is to enable researchers to use topic modelling by presenting a step-by-step framework on a case and sharing a code template. The framework consists of three steps: pre-processing, topic modelling, and post-processing, where the topic model Latent Dirichlet Allocation is used. The framework enables large collections of papers to be reviewed in a transparent, reliable, fast, and reproducible way.
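As a rough illustration of the three-step framework (the paper shares its own code template, which this does not reproduce), here is a minimal gensim-based sketch with invented toy abstracts and an illustrative stopword list:

```python
import re
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-ins for paper abstracts; a real review would load thousands
abstracts = ["latent dirichlet allocation finds topics in large paper collections",
             "neural networks classify images of handwritten digits",
             "topic models support exploratory reviews of research papers"]
stopwords = {"the", "of", "and", "in", "a", "to", "for"}

# Step 1: pre-processing (lowercase, tokenize, drop stopwords and short tokens)
tokens = [[w for w in re.findall(r"[a-z]+", a.lower())
           if w not in stopwords and len(w) > 2] for a in abstracts]

# Step 2: topic modelling with LDA
dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Step 3: post-processing: top words per topic and each paper's dominant topic
for k, words in lda.print_topics(num_words=5):
    print("topic", k, ":", words)
dominant = [max(lda.get_document_topics(b), key=lambda p: p[1])[0] for b in corpus]
print(dominant)
```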

156 citations


Journal ArticleDOI
TL;DR: A large dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using volume analysis and text mining techniques such as topic modeling and sentiment analysis to compare and contrast the nature of climate change discussion between different countries and over time.
Abstract: Social media websites can be used as a data source for mining public opinion on a variety of subjects including climate change. Twitter, in particular, allows for the evaluation of public opinion across both time and space because geotagged tweets include timestamps and geographic coordinates (latitude/longitude). In this study, a large dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using volume analysis and text mining techniques such as topic modeling and sentiment analysis. Latent Dirichlet allocation was applied for topic modeling to infer the different topics of discussion, and Valence Aware Dictionary and sEntiment Reasoner was applied for sentiment analysis to determine the overall feelings and attitudes found in the dataset. These techniques are used to compare and contrast the nature of climate change discussion between different countries and over time. Sentiment analysis shows that the overall discussion is negative, especially when users are reacting to political or extreme weather events. Topic modeling shows that the different topics of discussion on climate change are diverse, but some topics are more prevalent than others. In particular, the discussion of climate change in the USA is less focused on policy-related topics than other countries.
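The sentiment step is concrete enough to sketch: VADER is distributed as the vaderSentiment Python package, and the compound-score cutoffs of +/-0.05 below are the conventional thresholds, not values reported by the paper. The example tweets are invented.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
tweets = ["Climate change is destroying our coast!",      # invented examples
          "Great news: emissions fell this year."]
for t in tweets:
    scores = analyzer.polarity_scores(t)   # {'neg':..., 'neu':..., 'pos':..., 'compound':...}
    label = ("positive" if scores["compound"] >= 0.05
             else "negative" if scores["compound"] <= -0.05 else "neutral")
    print(label, scores["compound"], t)
```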

151 citations


Journal ArticleDOI
TL;DR: This study proposes a feature grouping method based on the Latent Dirichlet Allocation (LDA) topic model to distinguish the effects of different online news topics, and finds that the proposed topic-sentiment synthesis forecasting models outperform the benchmark models.

128 citations


Journal ArticleDOI
TL;DR: A research paper classification system is proposed that can cluster research papers into meaningful classes in which the papers are very likely to have similar subjects.
Abstract: With the continuing advance of computer and information technologies, numerous research papers have been published online as well as offline, and as new research fields are continually created, users have great difficulty finding and categorizing the research papers that interest them. To overcome these limitations, this paper proposes a research paper classification system that can cluster research papers into meaningful classes in which the papers are very likely to have similar subjects. The proposed system extracts representative keywords from the abstract of each paper and topics by the Latent Dirichlet allocation (LDA) scheme. Then, the K-means clustering algorithm is applied to group the papers into sets with similar subjects, based on the Term frequency-inverse document frequency (TF-IDF) values of each paper.
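A minimal sketch of the clustering stage, using scikit-learn with invented abstracts; here plain TF-IDF vectors stand in for the paper's combination of LDA-derived keywords and TF-IDF values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented abstracts standing in for the paper corpus
abstracts = ["lda topic model for text",
             "kmeans clusters tfidf vectors",
             "topic models of research papers",
             "clustering documents with kmeans"]

# TF-IDF representation of each abstract
X = TfidfVectorizer().fit_transform(abstracts)

# K-means groups papers with similar subjects; n_clusters is illustrative
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```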

118 citations


Journal ArticleDOI
TL;DR: This work proposes an ontology- and latent Dirichlet allocation (OLDA)-based topic modeling and word embedding approach for sentiment classification, which achieves an accuracy of 93%, showing that the proposed approach is effective for sentiment classification.
Abstract: Social networks play a key role in providing a new approach to collecting information regarding mobility and transportation services. To study this information, sentiment analysis can make useful observations to support intelligent transportation systems (ITSs) in examining traffic control and management systems. However, sentiment analysis faces technical challenges: extracting meaningful information from social network platforms and transforming the extracted data into valuable information. In addition, accurate topic modeling and document representation are other challenging tasks in sentiment analysis. We propose an ontology- and latent Dirichlet allocation (OLDA)-based topic modeling and word embedding approach for sentiment classification. The proposed system retrieves transportation content from social networks, removes irrelevant content to extract meaningful information, and generates topics and features from the extracted data using OLDA. It also represents documents using word embedding techniques, and then employs lexicon-based approaches to enhance the accuracy of the word embedding model. The proposed ontology and the intelligent model are developed using Web Ontology Language and Java, respectively. Machine learning classifiers are used to evaluate the proposed word embedding system. The method achieves an accuracy of 93%, which shows that the proposed approach is effective for sentiment classification.

113 citations


Journal ArticleDOI
TL;DR: In this paper, an intelligent approach based on latent Dirichlet allocation (LDA) is proposed to analyze CFPB consumer complaints; it extracts latent topics in the consumer complaint narratives and explores their trends over time.
Abstract: The Consumer Financial Protection Bureau (CFPB), created by Congress in 2011, receives and processes consumer complaints pertaining to various financial services. Every complaint narrative provides insight into problems that consumers are experiencing. With the increasing number of CFPB complaint narratives, manual review of these documents by human experts is not feasible. This calls for an intelligent system that analyzes narratives automatically and provides insightful knowledge to the experts. In this paper, we propose an intelligent approach based on latent Dirichlet allocation (LDA) to analyze CFPB consumer complaints. The proposed approach extracts latent topics in the CFPB complaint narratives and explores their trends over time. The time trends are then used to evaluate the effectiveness of CFPB regulations and expectations on financial institutions in creating a consumer-oriented culture. The technology-human partnership between the proposed approach and the CFPB experts could improve the consumer experience by providing more efficient and effective investigations of consumer complaint narratives.
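Once each narrative has a dominant LDA topic, the trend analysis reduces to grouping topic assignments by time. A small pandas sketch with invented topic labels and dates (not CFPB data):

```python
import pandas as pd

# Hypothetical per-complaint records: filing month plus the dominant LDA topic
# inferred for the narrative (topic ids and dates are invented for illustration)
df = pd.DataFrame({
    "month": pd.to_datetime(["2015-01", "2015-01", "2015-02", "2015-02", "2015-03"]),
    "topic": ["billing", "credit_report", "billing", "billing", "credit_report"],
})

# Share of each topic per month: the kind of time trend the paper inspects
trend = (df.groupby(["month", "topic"]).size()
           .unstack(fill_value=0)
           .pipe(lambda c: c.div(c.sum(axis=1), axis=0)))
print(trend)
```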

Journal ArticleDOI
TL;DR: In this paper, the authors apply techniques from Bayesian generative statistical modeling to uncover hidden features in jet substructure observables that discriminate between different a priori unknown underlying short distance physical processes in multijet events.
Abstract: We apply techniques from Bayesian generative statistical modeling to uncover hidden features in jet substructure observables that discriminate between different a priori unknown underlying short distance physical processes in multijet events. In particular, we use a mixed membership model known as latent Dirichlet allocation to build a data-driven unsupervised top-quark tagger and $t\overline{t}$ event classifier. We compare our proposal to existing traditional and machine learning approaches to top-jet tagging. Finally, employing a toy vector-scalar boson model as a benchmark, we demonstrate the potential for discovering new physics signatures in multijet events in a model independent and unsupervised way.

Journal ArticleDOI
TL;DR: Online reviews are utilized to show the information gains from the consideration of factors identified from topic modeling of unstructured data which provide a flexible extension to numerical scores to understand customer satisfaction and subsequently service quality, explaining the success of low-cost carriers in the airline market.
Abstract: Service quality is a multi-dimensional construct which is not accurately measured by aspects deriving from numerical ratings and their associated weights. Extant literature in the expert and intelligent systems examines this issue by relying mainly on such constrained information sets. In this study, we utilize online reviews to show the information gains from the consideration of factors identified from topic modeling of unstructured data which provide a flexible extension to numerical scores to understand customer satisfaction and subsequently service quality. When numerical and textual features are combined, the explained variation in overall satisfaction improves significantly. We further present how such information can be of value for firms for corporate strategy decision-making when incorporated in an expert system that acts as a tool to perform market analysis and assess their competitive performance. We apply our methodology on airline passengers’ online reviews using Structural Topic Models (STM), a recent probabilistic extension to Latent Dirichlet Allocation (LDA) that allows the incorporation of document level covariates. This innovation allows us to capture dominant drivers of satisfaction along with their dynamics and interdependencies. Results unveil the orthogonality of the low-cost aspect of airline competition when all other service quality dimensions are considered, thus explaining the success of low-cost carriers in the airline market.

Journal ArticleDOI
TL;DR: This research detected that the topics with positive sentiment for identifying key factors in startup business success are startup tools, technology-based startups, the attitude of the founders, and the development of the startup methodology.
Abstract: The main aim of this study is to identify the key factors in User Generated Content (UGC) on the Twitter social network for the creation of successful startups, as well as to identify factors for sustainable startups and business models. New technologies were used in the proposed research methodology to identify the key factors for the success of startup projects. First, a Latent Dirichlet Allocation (LDA) model, a state-of-the-art thematic modeling tool implemented in Python, was used to determine the topics in the database by analyzing tweets for the #Startups hashtag on Twitter (n = 35,401 tweets). Second, a Sentiment Analysis was performed with a Support Vector Machine (SVM) algorithm implemented with machine learning in Python. This was applied to the LDA results to divide the identified startup topics into negative, positive, and neutral sentiments. Third, a Textual Analysis was carried out on the topics in each sentiment with text data mining techniques using NVivo software. This research detected that the topics with positive sentiment for identifying key factors in startup business success are startup tools, technology-based startups, the attitude of the founders, and the development of the startup methodology. The negative topics are the frameworks and programming languages, the type of job offers, and the business angels' requirements. The identified neutral topics are the development of the business plan, the type of startup project, and the geolocation of incubators and startups. The limitations of the investigation are the number of tweets in the analyzed sample and the limited time horizon. Future lines of research could improve the methodology used to determine key factors for the creation of successful startups and could also study sustainability issues.
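The SVM sentiment step can be sketched with scikit-learn's LinearSVC; the training tweets and labels below are invented, and the pipeline is a generic stand-in for the paper's Python setup rather than its actual classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny invented training set standing in for labelled startup tweets
tweets = ["love this startup tool", "great founders, great attitude",
          "terrible job offers", "requirements from angels are unreasonable"]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features feeding a linear support vector machine
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["this methodology is great"]))
```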

Journal ArticleDOI
08 Jul 2019
TL;DR: This work demonstrates a semi-supervised machine-learning method to classify inorganic materials synthesis procedures from written natural language, and shows that a Markov chain representation of the order of experimental steps accurately reconstructs a flowchart of possible synthesis procedures.
Abstract: Digitizing large collections of scientific literature can enable new informatics approaches for scientific analysis and meta-analysis. However, most content in the scientific literature is locked up in written natural language, which is difficult to parse into databases using explicitly hard-coded classification rules. In this work, we demonstrate a semi-supervised machine-learning method to classify inorganic materials synthesis procedures from written natural language. Without any human input, latent Dirichlet allocation can cluster keywords into topics corresponding to specific experimental materials synthesis steps, such as “grinding” and “heating”, “dissolving” and “centrifuging”, etc. Guided by a modest amount of annotation, a random forest classifier can then associate these steps with different categories of materials synthesis, such as solid-state or hydrothermal synthesis. Finally, we show that a Markov chain representation of the order of experimental steps accurately reconstructs a flowchart of possible synthesis procedures. Our machine-learning approach provides a scalable way to unlock the large amount of inorganic materials synthesis information in the literature and to process it into a standardized, machine-readable database.
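The final step, a Markov chain over experimental steps, amounts to counting transitions between consecutive steps and normalizing. A self-contained sketch with invented step sequences (not the paper's extracted procedures):

```python
from collections import defaultdict

# Invented step sequences standing in for classified synthesis procedures
procedures = [["grinding", "heating", "cooling"],
              ["dissolving", "heating", "centrifuging"],
              ["grinding", "heating", "centrifuging"]]

# First-order Markov chain: count transitions, then normalize per source step
counts = defaultdict(lambda: defaultdict(int))
for steps in procedures:
    for a, b in zip(steps, steps[1:]):
        counts[a][b] += 1
probs = {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
         for a, nxt in counts.items()}
print(probs["heating"])   # e.g. {'cooling': 0.33..., 'centrifuging': 0.66...}
```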

Journal ArticleDOI
TL;DR: This article proposed to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA, which allows topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics.
Abstract: Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic $n$-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.

Journal ArticleDOI
TL;DR: In this article, a quantitative approach for describing entertainment products, in a way that allows for improving the predictive performance of consumer choice models for these products, has been proposed to improve the prediction performance of these models.
Abstract: The authors propose a quantitative approach for describing entertainment products, in a way that allows for improving the predictive performance of consumer choice models for these products. Their ...

Journal ArticleDOI
TL;DR: The findings of this paper are expected to help evaluate and improve IT professionals’ vocational knowledge and skills, identify professional roles and competencies in personnel recruitment processes of companies, and meet the skill requirements of the industry through software engineering education programs.
Abstract: Software engineering is a data-driven discipline and an integral part of data science. The introduction of big data systems has led to a great transformation in the architecture, methodologies, knowledge domains, and skills related to software engineering. Accordingly, education programs are now required to adapt themselves to up-to-date developments by first identifying the competencies concerning big data software engineering to meet the industrial needs and follow the latest trends. This paper aims to reveal the knowledge domains and skill sets required for big data software engineering and develop a taxonomy by mapping these competencies. A semi-automatic methodology is proposed for the semantic analysis of the textual contents of online job advertisements related to big data software engineering. This methodology uses the latent Dirichlet allocation (LDA), a probabilistic topic-modeling technique to discover the hidden semantic structures from a given textual corpus. The output of this paper is a systematic competency map comprising the essential knowledge domains, skills, and tools for big data software engineering. The findings of this paper are expected to help evaluate and improve IT professionals’ vocational knowledge and skills, identify professional roles and competencies in personnel recruitment processes of companies, and meet the skill requirements of the industry through software engineering education programs. Additionally, the proposed model can be extended to blogs, social networks, forums, and other online communities to allow automatic identification of emerging trends and generate contextual tags.

Journal ArticleDOI
TL;DR: A Latent Dirichlet Allocation (LDA) model is innovatively developed to understand latent driving states and the quantified structure of driving behavior patterns from individualized driving (documents) using driving behaviors (words).
Abstract: Automatic driving technology has become one of the hottest research topics in the Intelligent Transportation System (ITS) and Artificial Intelligence (AI) fields in recent years. The development of automatic driving technology can be promoted through understanding the driving states of each driver (individualized driving). Although some methods for understanding driving states have been proposed by previous studies, the latent driving states and structured driving behaviors have not yet been automatically discovered. The purpose of this study is to develop an unsupervised method for deeply understanding individualized driving. First, an encoding method is proposed to extract driving behaviors from vehicle motion data. Then, a Latent Dirichlet Allocation (LDA) model is innovatively developed to understand latent driving states and the quantified structure of driving behavior patterns (topics) from individualized driving (documents) using driving behaviors (words). In order to validate the performance and effectiveness of the proposed method, twenty-two drivers (15 males and 7 females) were recruited to carry out road experiments in Wuhan, China, for experimental data collection. In addition, two typical unsupervised methods, k-means and a random baseline, were established and their performances compared in our experiments. Experimental results verify the superiority of the proposed method compared with the other methods.
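The encoding idea (drivers as documents, discretized behaviors as words) maps directly onto a standard LDA call. A gensim sketch with invented behavior tokens, not the paper's actual encoding of motion data:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Invented symbolic driving-behavior "words" per driver ("documents")
drivers = [["hard_brake", "fast_accel", "lane_change", "fast_accel"],
           ["gentle_brake", "steady_speed", "steady_speed", "lane_keep"],
           ["hard_brake", "fast_accel", "hard_brake", "lane_change"]]

dictionary = Dictionary(drivers)
corpus = [dictionary.doc2bow(d) for d in drivers]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each driver's mixture over latent driving states ("topics")
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow, minimum_probability=0.0))
```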

Journal ArticleDOI
TL;DR: This study examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency, and found that global semantic similarity as reported by word2vec was an important predictor of coherence ratings.
Abstract: This article introduces the second version of the Tool for the Automatic Analysis of Cohesion (TAACO 2.0). Like its predecessor, TAACO 2.0 is a freely available text analysis tool that works on the Windows, Mac, and Linux operating systems; is housed on a user's hard drive; is easy to use; and allows for batch processing of text files. TAACO 2.0 includes all the original indices reported for TAACO 1.0, but it adds a number of new indices related to local and global cohesion at the semantic level, reported by latent semantic analysis, latent Dirichlet allocation, and word2vec. The tool also includes a source overlap feature, which calculates lexical and semantic overlap between a source and a response text (i.e., cohesion between the two texts based on measures of text relatedness). In the first study in this article, we examined the effects that cohesion features, prompt, essay elaboration, and enhanced cohesion had on expert ratings of text coherence, finding that global semantic similarity as reported by word2vec was an important predictor of coherence ratings. A second study was conducted to examine the source and response indices. In this study we examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency. The results indicated that the percentage of keywords found in both the source and response and the similarity between the source document and the response, as reported by word2vec, were significant predictors of speaking quality. Combined, these findings help validate the new indices reported for TAACO 2.0.
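The word2vec source-response similarity can be approximated as the cosine between the mean word vectors of the two texts, which gensim exposes as n_similarity. A toy sketch (TAACO ships its own pretrained resources; this trains a throwaway model on an invented corpus instead):

```python
from gensim.models import Word2Vec

# Toy corpus standing in for a pretrained word2vec model
sentences = [["climate", "change", "policy"],
             ["students", "write", "essays"],
             ["essays", "discuss", "climate", "policy"]]
w2v = Word2Vec(sentences, vector_size=16, min_count=1, epochs=50, seed=0)

source = ["climate", "change", "policy"]      # invented source text tokens
response = ["essays", "discuss", "climate"]   # invented response tokens

# Cosine similarity between the mean word vectors of source and response
print(w2v.wv.n_similarity(source, response))
```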

Journal ArticleDOI
TL;DR: It was found that the topic modelling approach was able to group texts into ‘topics’ that were truly thematically coherent only with mixed success, while the more traditional approach to discourse analysis consistently provided a more nuanced perspective on the data that was ultimately closer to the ‘reality’ of the texts.
Abstract: This article explores and critically evaluates the potential contribution to discourse studies of topic modelling, a group of machine learning methods which have been used with the aim of automatic...

Journal ArticleDOI
TL;DR: A text sentiment analysis method combining Latent Dirichlet Allocation (LDA) text representation and a convolutional neural network (CNN) is proposed that can effectively improve the accuracy of text sentiment classification.
Abstract: In order to improve the performance of internet public sentiment analysis, a text sentiment analysis method combining Latent Dirichlet Allocation (LDA) text representation and a convolutional neural network (CNN) is proposed. First, review texts are collected from the network and preprocessed. Then, the LDA topic model is used to learn the latent semantic representation (topic distribution) of each short text, and a short-text feature vector representation based on the topic distribution is constructed. Finally, a CNN with a gated recurrent unit (GRU) is used as the classifier. Based on the input feature matrix, the GRU-CNN strengthens the relationships between words and between texts, so as to achieve highly accurate text classification. The simulation results show that this method can effectively improve the accuracy of text sentiment classification.
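A minimal PyTorch sketch of a GRU-plus-CNN classifier over sequences of LDA-derived feature vectors; all layer sizes are invented, and this is a generic reading of the architecture rather than the paper's exact network:

```python
import torch
import torch.nn as nn

class GRUCNN(nn.Module):
    """GRU encodes the sequence, a 1-D convolution plus global max pooling
    summarizes it, and a linear layer emits class logits."""
    def __init__(self, n_topics=50, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_topics, hidden, batch_first=True)
        self.conv = nn.Conv1d(hidden, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):                              # x: (batch, seq_len, n_topics)
        h, _ = self.gru(x)                             # (batch, seq_len, hidden)
        c = torch.relu(self.conv(h.transpose(1, 2)))   # (batch, 32, seq_len)
        return self.fc(c.max(dim=2).values)            # global max pool -> logits

model = GRUCNN()
logits = model(torch.randn(4, 10, 50))   # 4 texts, 10 positions, 50 topic features
print(logits.shape)                      # torch.Size([4, 2])
```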

Journal ArticleDOI
TL;DR: This study proposes the development of a medical big-data mining process that employs topic modeling, and it evaluates the accuracy and efficiency of topic modeling based on the proposed process.
Abstract: With the development of convergence information technology, all of the spaces and objects of human living have become digitized. In the health- and medical-service areas, IT supports Internet of things (IoT)-based medical services and health-care systems for patients. Medical facilities have been advanced on the basis of such IoT devices, and the digitized information on human behaviors and health makes the delivery of efficient and convenient health care possible. Under these circumstances, health and medical care have been actively researched; in some of this research, patient-health data were collected using IoT-based medical devices and served as a tool for medical diagnosis and treatment. This study proposes the development of a medical big-data mining process that employs topic modeling. The proposed method uses the big data offered by the open system of health- and medical-services big data from the Health Insurance Review and Assessment Service, and its application follows the guidelines of the knowledge-discovery-in-big-data process for data mining and topic modeling. For the topic modeling of the medical data, the public structured health- and medical-services big data, Open API, and patient datasets were used. For document classification in the semantic context of a topic, the Bag of Words technique and the latent Dirichlet allocation method were applied to find document associations for the development of the medical big-data mining process. In addition, this study evaluated the topic-modeling accuracy and efficiency based on the medical big-data mining process and examined the effectiveness of the proposed method.

Posted Content
TL;DR: The dynamic embedded topic model (D-ETM) is developed, a generative model of documents that combines dynamic latent Dirichlet allocation and word embeddings; it outperforms D-LDA on a document completion task and learns more diverse and coherent topics while requiring significantly less time to fit.
Abstract: Topic modeling analyzes documents to learn meaningful patterns of words. For documents collected in sequence, dynamic topic models capture how these patterns vary over time. We develop the dynamic embedded topic model (D-ETM), a generative model of documents that combines dynamic latent Dirichlet allocation (D-LDA) and word embeddings. The D-ETM models each word with a categorical distribution parameterized by the inner product between the word embedding and a per-time-step embedding representation of its assigned topic. The D-ETM learns smooth topic trajectories by defining a random walk prior over the embedding representations of the topics. We fit the D-ETM using structured amortized variational inference with a recurrent neural network. On three different corpora (a collection of United Nations debates, a set of ACL abstracts, and a dataset of Science Magazine articles), we found that the D-ETM outperforms D-LDA on a document completion task. We further found that the D-ETM learns more diverse and coherent topics than D-LDA while requiring significantly less time to fit.
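In the notation assumed here (symbols are not spelled out in the abstract: $\rho$ is the word-embedding matrix, $\alpha_k^{(t)}$ the embedding of topic $k$ at time step $t$, and $z_{dn}$ the topic assignment of word $w_{dn}$), the two ingredients the abstract describes can be written compactly as:

```latex
% Categorical word distribution parameterized by the embedding inner product:
w_{dn} \mid z_{dn} = k \;\sim\; \mathrm{Cat}\!\left(\mathrm{softmax}\!\left(\rho^{\top}\alpha_{k}^{(t_d)}\right)\right)
% Random-walk prior that yields smooth topic trajectories:
\alpha_{k}^{(t)} \;\sim\; \mathcal{N}\!\left(\alpha_{k}^{(t-1)},\, \delta^{2} I\right)
```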

Journal ArticleDOI
TL;DR: A bibliometric analysis of the current research landscape, which objectively evaluates the productivity of global researchers or institutions in this field, along with exploratory factor analysis (EFA) and latent Dirichlet allocation (LDA), showed that the most well-studied application of AI was the utilization of machine learning to identify clinical characteristics in depression, which accounted for more than 60% of all publications.
Abstract: Artificial intelligence (AI)-based techniques have been widely applied in depression research and treatment. Nonetheless, there is currently no systematic review or bibliometric analysis in the medical literature about the applications of AI in depression. We performed a bibliometric analysis of the current research landscape, which objectively evaluates the productivity of global researchers or institutions in this field, along with exploratory factor analysis (EFA) and latent Dirichlet allocation (LDA). From 2010 onwards, the total number of papers and citations on using AI to manage depressive disorder have risen considerably. In terms of the global AI research network, researchers from the United States were the major contributors to this field. Exploratory factor analysis showed that the most well-studied application of AI was the utilization of machine learning to identify clinical characteristics in depression, which accounted for more than 60% of all publications. Latent Dirichlet allocation identified specific research themes, which include diagnosis accuracy, structural imaging techniques, gene testing, drug development, pattern recognition, and electroencephalography (EEG)-based diagnosis. Although the rapid development and widespread use of AI provide various benefits for both health providers and patients, interventions to enhance privacy and confidentiality issues are still limited and require further research.

Journal ArticleDOI
TL;DR: This study introduces a new feature selection method able to take advantage of a semantic ontology to group words into topics and use them to build feature vectors, and shows the suitability and additional benefits of topic-driven methods to develop and deploy high-performance spam filters.

Journal ArticleDOI
TL;DR: A method for tracking the evolution of news topics over time is proposed in this paper to realize topic tracking and evolution in a news text set; experiments show it can effectively detect and track topics and clearly reflect the trend of topic evolution.
Abstract: With the rapid development of the Internet, the amount of data has grown exponentially. On the one hand, the accumulation of big data provides the basic support for artificial intelligence. On the other hand, in the face of such huge amounts of data, how to extract the knowledge of interest has become a matter of general concern. Topic tracking can help people explore the process of topic development within huge and complex collections of network texts. By effectively organizing large-scale news documents, a method for tracking the evolution of news topics over time is proposed in this paper. First, the LDA (latent Dirichlet allocation) model is used to extract topics from news texts, and the Gibbs sampling method is used to estimate the parameters. Topic mining with the K-means method is used as a comparison to highlight the advantages of using LDA for topic discovery. Second, an improved single-pass algorithm is used to track news topics. The JS (Jensen-Shannon) divergence is used to measure topic similarity, and a time decay function is introduced to increase the similarity between topics that are close in time. Finally, the strength of news topics and the changes in topic content across different time windows are analyzed. The experiments show that the proposed method can effectively detect and track topics and clearly reflect the trend of topic evolution.
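The similarity step can be sketched directly: SciPy's jensenshannon returns the JS distance (the square root of the divergence), and the exponential decay form below is an assumed illustration, not the paper's exact function:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_similarity(p, q, t1, t2, tau=7.0):
    """Similarity between two topic distributions, boosted when the topics
    are close in time. The decay form and tau are assumed, not from the
    paper; t1 and t2 are timestamps in days."""
    js_div = jensenshannon(p, q, base=2) ** 2   # scipy returns the JS *distance*
    decay = np.exp(-abs(t1 - t2) / tau)         # time-decay weighting
    return (1.0 - js_div) * decay               # in [0, 1] for base-2 JS

p = np.array([0.7, 0.2, 0.1])   # invented topic distributions
q = np.array([0.6, 0.3, 0.1])
print(topic_similarity(p, q, t1=0, t2=3))
```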

Journal ArticleDOI
TL;DR: A novel model for short text topic modeling is proposed, referred to as the Conditional Random Field regularized Topic Model (CRFTM), which not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment.
Abstract: Short texts have become the prevalent format of information on the Internet. Inferring the topics of this type of message is a critical and challenging task for many applications. Due to the length of short texts, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from a severe data sparsity problem that makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have proven effective at capturing semantic and syntactic information about words, which can be used to induce similarity measures and semantic correlations among words. Enlightened by this, in this paper, we design a novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method can extract more coherent topics and significantly outperforms state-of-the-art baselines on several evaluation metrics.

Journal ArticleDOI
TL;DR: The results indicate that the proposed method provides an efficient and economic performance summary of a university and its competitors, and could help its leaders in recruitment and retention efforts.

Journal ArticleDOI
10 Apr 2019-Symmetry
TL;DR: A three-step research methodology based on data text mining (DTM) that can be used for business intelligence analysis (BIA) strategies to analyze user generated content (UGC) in social networks and on digital platforms is proposed.
Abstract: The global development of the Internet, which enabled the analysis of large amounts of data and the services linked to their use, has led companies to modify their business strategies in search of new ways to increase marketing productivity and profitability. Many strategies are based on business intelligence (BI) and marketing intelligence (MI), which make it possible to extract profitable knowledge and insights from the large amounts of data generated by company customers in digital environments. In this context, the present study proposes a three-step research methodology based on data text mining (DTM). In further research, this methodology can be used in business intelligence analysis (BIA) strategies to analyze user generated content (UGC) in social networks and on digital platforms. The proposed methodology unfolds in the following three stages. First, a Latent Dirichlet Allocation (LDA) model is used to determine the topics in the database. Second, a sentiment analysis (SA) is applied to the LDA results to divide the identified topics into negative, positive, and neutral sentiments. Third, a textual analysis (TA) with data text mining techniques is applied to the topics in each sentiment. The proposed methodology offers important advances in data text mining in terms of accuracy, reliability, and insight generation for both researchers and practitioners seeking to improve BIA processes in business and other sectors.