
Showing papers on "Latent Dirichlet allocation published in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors investigated scholarly articles published between 2003 and 2016 related to LDA-based topic modeling to discover the research development, current trends, and intellectual structure of topic modeling.
Abstract: Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles on topic modeling and applied it in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling; Latent Dirichlet Allocation (LDA) is one of the most popular. Researchers have proposed various models based on LDA for topic modeling. Based on this previous work, this paper serves as a useful introduction to LDA approaches in topic modeling. In this paper, we investigated scholarly articles published between 2003 and 2016 related to LDA-based topic modeling to discover the research development, current trends, and intellectual structure of topic modeling. In addition, we summarize challenges and introduce well-known tools and datasets for LDA-based topic modeling.

608 citations


Journal ArticleDOI
TL;DR: This paper transforms a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec).
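All three representations named above are available in standard Python libraries. Below is a minimal sketch, assuming gensim and scikit-learn, with a toy corpus standing in for the paper's data; none of the parameters are taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["topic models find latent themes",
        "neural embeddings encode documents",
        "tf idf weights words by rarity"]
tokens = [d.split() for d in docs]

# 1) TF-IDF over the bag-of-words scheme
tfidf = TfidfVectorizer().fit_transform(docs)   # (n_docs, n_terms) sparse matrix

# 2) Topic distribution from LDA
dictionary = Dictionary(tokens)
bow = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
theta = [lda.get_document_topics(b, minimum_probability=0.0) for b in bow]

# 3) Doc2Vec document embeddings (vector_size is illustrative)
tagged = [TaggedDocument(t, [i]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=40)
vecs = [d2v.dv[i] for i in range(len(docs))]
```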

270 citations


Journal ArticleDOI
TL;DR: The results indicate that the proposed methodology can obtain effective analysis results with lower cost and shorter time since online reviews are publicly available and easily collected.

165 citations


Journal ArticleDOI
TL;DR: A methodology that can analyse online reviews using machine learning techniques in such a way that practitioners in the fields of tourism and destination management can understand and apply the technique to improve their attractions is developed.

165 citations


Journal ArticleDOI
TL;DR: A text-mining approach using a Bayesian statistical topic model called latent Dirichlet allocation is employed to conduct a comprehensive analysis of 150 articles from 115 journals, revealing seven relevant topics.

162 citations


Journal ArticleDOI
TL;DR: The aim of the paper is to enable researchers to use topic modelling by presenting a step-by-step framework on a case and sharing a code template, which enables large collections of papers to be reviewed in a transparent, reliable, fast, and reproducible way.
Abstract: Manual exploratory literature reviews should be a thing of the past, as technology and machine learning methods have matured. The learning curve for using machine learning methods is rapidly declining, opening new possibilities for all researchers. A framework is presented for how to use topic modelling on a large collection of papers for an exploratory literature review, and for how that can feed into a full literature review. The aim of the paper is to enable researchers to use topic modelling by presenting a step-by-step framework on a case and sharing a code template. The framework consists of three steps: pre-processing, topic modelling, and post-processing, where the topic model Latent Dirichlet Allocation is used. The framework enables large collections of papers to be reviewed in a transparent, reliable, fast, and reproducible way.
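As a rough illustration of the three-step framework (the paper shares its own code template, which this does not reproduce), here is a minimal gensim-based sketch with invented toy abstracts and an illustrative stopword list:

```python
import re
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-ins for paper abstracts; a real review would load thousands
abstracts = ["latent dirichlet allocation finds topics in large paper collections",
             "neural networks classify images of handwritten digits",
             "topic models support exploratory reviews of research papers"]
stopwords = {"the", "of", "and", "in", "a", "to", "for"}

# Step 1: pre-processing (lowercase, tokenize, drop stopwords and short tokens)
tokens = [[w for w in re.findall(r"[a-z]+", a.lower())
           if w not in stopwords and len(w) > 2] for a in abstracts]

# Step 2: topic modelling with LDA
dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Step 3: post-processing: top words per topic and each paper's dominant topic
for k, words in lda.print_topics(num_words=5):
    print("topic", k, ":", words)
dominant = [max(lda.get_document_topics(b), key=lambda p: p[1])[0] for b in corpus]
print(dominant)
```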

156 citations


Journal ArticleDOI
TL;DR: A large dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using volume analysis and text mining techniques such as topic modeling and sentiment analysis to compare and contrast the nature of climate change discussion between different countries and over time.
Abstract: Social media websites can be used as a data source for mining public opinion on a variety of subjects including climate change. Twitter, in particular, allows for the evaluation of public opinion across both time and space because geotagged tweets include timestamps and geographic coordinates (latitude/longitude). In this study, a large dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using volume analysis and text mining techniques such as topic modeling and sentiment analysis. Latent Dirichlet allocation was applied for topic modeling to infer the different topics of discussion, and Valence Aware Dictionary and sEntiment Reasoner was applied for sentiment analysis to determine the overall feelings and attitudes found in the dataset. These techniques are used to compare and contrast the nature of climate change discussion between different countries and over time. Sentiment analysis shows that the overall discussion is negative, especially when users are reacting to political or extreme weather events. Topic modeling shows that the different topics of discussion on climate change are diverse, but some topics are more prevalent than others. In particular, the discussion of climate change in the USA is less focused on policy-related topics than other countries.
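The sentiment step is concrete enough to sketch: VADER is distributed as the vaderSentiment Python package, and the compound-score cutoffs of +/-0.05 below are the conventional thresholds, not values reported by the paper. The example tweets are invented.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
tweets = ["Climate change is destroying our coast!",      # invented examples
          "Great news: emissions fell this year."]
for t in tweets:
    scores = analyzer.polarity_scores(t)   # {'neg':..., 'neu':..., 'pos':..., 'compound':...}
    label = ("positive" if scores["compound"] >= 0.05
             else "negative" if scores["compound"] <= -0.05 else "neutral")
    print(label, scores["compound"], t)
```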

151 citations


Journal ArticleDOI
TL;DR: This study proposes a feature grouping method based on the Latent Dirichlet Allocation (LDA) topic model to distinguish the effects of different online news topics, and finds that the proposed topic-sentiment synthesis forecasting models outperform the benchmark models.

128 citations


Journal ArticleDOI
TL;DR: A research paper classification system is proposed that can cluster research papers into meaningful classes in which the papers are very likely to have similar subjects.
Abstract: With the continuing advance of computer and information technologies, numerous research papers have been published online as well as offline, and as new research fields are continually created, users have great difficulty finding and categorizing the research papers that interest them. To overcome these limitations, this paper proposes a research paper classification system that can cluster research papers into meaningful classes in which the papers are very likely to have similar subjects. The proposed system extracts representative keywords from the abstract of each paper and topics by the Latent Dirichlet allocation (LDA) scheme. Then, the K-means clustering algorithm is applied to group the papers into sets with similar subjects, based on the Term frequency-inverse document frequency (TF-IDF) values of each paper.
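A minimal sketch of the clustering stage, using scikit-learn with invented abstracts; here plain TF-IDF vectors stand in for the paper's combination of LDA-derived keywords and TF-IDF values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented abstracts standing in for the paper corpus
abstracts = ["lda topic model for text",
             "kmeans clusters tfidf vectors",
             "topic models of research papers",
             "clustering documents with kmeans"]

# TF-IDF representation of each abstract
X = TfidfVectorizer().fit_transform(abstracts)

# K-means groups papers with similar subjects; n_clusters is illustrative
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```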

118 citations


Journal ArticleDOI
TL;DR: This work proposes an ontology- and latent Dirichlet allocation (OLDA)-based topic modeling and word embedding approach for sentiment classification, which achieves an accuracy of 93%, showing that the proposed approach is effective for sentiment classification.
Abstract: Social networks play a key role in providing a new approach to collecting information regarding mobility and transportation services. To study this information, sentiment analysis can make useful observations to support intelligent transportation systems (ITSs) in examining traffic control and management systems. However, sentiment analysis faces technical challenges: extracting meaningful information from social network platforms and transforming the extracted data into valuable information. In addition, accurate topic modeling and document representation are other challenging tasks in sentiment analysis. We propose an ontology- and latent Dirichlet allocation (OLDA)-based topic modeling and word embedding approach for sentiment classification. The proposed system retrieves transportation content from social networks, removes irrelevant content to extract meaningful information, and generates topics and features from the extracted data using OLDA. It also represents documents using word embedding techniques, and then employs lexicon-based approaches to enhance the accuracy of the word embedding model. The proposed ontology and the intelligent model are developed using Web Ontology Language and Java, respectively. Machine learning classifiers are used to evaluate the proposed word embedding system. The method achieves an accuracy of 93%, which shows that the proposed approach is effective for sentiment classification.

113 citations


Journal ArticleDOI
TL;DR: In this paper, an intelligent approach based on latent Dirichlet allocation (LDA) is proposed to analyze CFPB consumer complaints; it extracts latent topics in the consumer complaint narratives and explores their trends over time.
Abstract: The Consumer Financial Protection Bureau (CFPB), created by Congress in 2011, receives and processes consumer complaints pertaining to various financial services. Every complaint narrative provides insight into problems that consumers are experiencing. With the increasing number of CFPB complaint narratives, manual review of these documents by human experts is not feasible. This calls for an intelligent system that analyzes narratives automatically and provides insightful knowledge to the experts. In this paper, we propose an intelligent approach based on latent Dirichlet allocation (LDA) to analyze CFPB consumer complaints. The proposed approach extracts latent topics in the CFPB complaint narratives and explores their trends over time. The time trends are then used to evaluate the effectiveness of CFPB regulations and expectations on financial institutions in creating a consumer-oriented culture. The technology-human partnership between the proposed approach and the CFPB experts could improve the consumer experience by providing more efficient and effective investigations of consumer complaint narratives.
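Once each narrative has a dominant LDA topic, the trend analysis reduces to grouping topic assignments by time. A small pandas sketch with invented topic labels and dates (not CFPB data):

```python
import pandas as pd

# Hypothetical per-complaint records: filing month plus the dominant LDA topic
# inferred for the narrative (topic ids and dates are invented for illustration)
df = pd.DataFrame({
    "month": pd.to_datetime(["2015-01", "2015-01", "2015-02", "2015-02", "2015-03"]),
    "topic": ["billing", "credit_report", "billing", "billing", "credit_report"],
})

# Share of each topic per month: the kind of time trend the paper inspects
trend = (df.groupby(["month", "topic"]).size()
           .unstack(fill_value=0)
           .pipe(lambda c: c.div(c.sum(axis=1), axis=0)))
print(trend)
```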

Journal ArticleDOI
TL;DR: In this paper, the authors apply techniques from Bayesian generative statistical modeling to uncover hidden features in jet substructure observables that discriminate between different a priori unknown underlying short distance physical processes in multijet events.
Abstract: We apply techniques from Bayesian generative statistical modeling to uncover hidden features in jet substructure observables that discriminate between different a priori unknown underlying short distance physical processes in multijet events. In particular, we use a mixed membership model known as latent Dirichlet allocation to build a data-driven unsupervised top-quark tagger and $t\overline{t}$ event classifier. We compare our proposal to existing traditional and machine learning approaches to top-jet tagging. Finally, employing a toy vector-scalar boson model as a benchmark, we demonstrate the potential for discovering new physics signatures in multijet events in a model independent and unsupervised way.

Journal ArticleDOI
TL;DR: Online reviews are utilized to show the information gains from the consideration of factors identified from topic modeling of unstructured data which provide a flexible extension to numerical scores to understand customer satisfaction and subsequently service quality, explaining the success of low-cost carriers in the airline market.
Abstract: Service quality is a multi-dimensional construct which is not accurately measured by aspects deriving from numerical ratings and their associated weights. Extant literature in the expert and intelligent systems examines this issue by relying mainly on such constrained information sets. In this study, we utilize online reviews to show the information gains from the consideration of factors identified from topic modeling of unstructured data which provide a flexible extension to numerical scores to understand customer satisfaction and subsequently service quality. When numerical and textual features are combined, the explained variation in overall satisfaction improves significantly. We further present how such information can be of value for firms for corporate strategy decision-making when incorporated in an expert system that acts as a tool to perform market analysis and assess their competitive performance. We apply our methodology on airline passengers’ online reviews using Structural Topic Models (STM), a recent probabilistic extension to Latent Dirichlet Allocation (LDA) that allows the incorporation of document level covariates. This innovation allows us to capture dominant drivers of satisfaction along with their dynamics and interdependencies. Results unveil the orthogonality of the low-cost aspect of airline competition when all other service quality dimensions are considered, thus explaining the success of low-cost carriers in the airline market.

Journal ArticleDOI
TL;DR: This research detected that the topics with positive sentiment for identifying key factors in startup business success are startup tools, technology-based startups, the attitude of the founders, and the development of the startup methodology.
Abstract: The main aim of this study is to identify the key factors in User Generated Content (UGC) on the Twitter social network for the creation of successful startups, as well as to identify factors for sustainable startups and business models. New technologies were used in the proposed research methodology to identify the key factors for the success of startup projects. First, a Latent Dirichlet Allocation (LDA) model, a state-of-the-art thematic modeling tool implemented in Python, was used to determine the topics in the database by analyzing tweets for the #Startups hashtag on Twitter (n = 35,401 tweets). Second, a Sentiment Analysis was performed with a Support Vector Machine (SVM) algorithm implemented with machine learning in Python. This was applied to the LDA results to divide the identified startup topics into negative, positive, and neutral sentiments. Third, a Textual Analysis was carried out on the topics in each sentiment with text data mining techniques using NVivo software. This research detected that the topics with positive sentiment for identifying key factors in startup business success are startup tools, technology-based startups, the attitude of the founders, and the development of the startup methodology. The negative topics are the frameworks and programming languages, the type of job offers, and the business angels' requirements. The identified neutral topics are the development of the business plan, the type of startup project, and the geolocation of incubators and startups. The limitations of the investigation are the number of tweets in the analyzed sample and the limited time horizon. Future lines of research could improve the methodology used to determine key factors for the creation of successful startups and could also study sustainability issues.
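The SVM sentiment step can be sketched with scikit-learn's LinearSVC; the training tweets and labels below are invented, and the pipeline is a generic stand-in for the paper's Python setup rather than its actual classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny invented training set standing in for labelled startup tweets
tweets = ["love this startup tool", "great founders, great attitude",
          "terrible job offers", "requirements from angels are unreasonable"]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features feeding a linear support vector machine
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["this methodology is great"]))
```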

Journal ArticleDOI
08 Jul 2019
TL;DR: This work demonstrates a semi-supervised machine-learning method to classify inorganic materials synthesis procedures from written natural language, and shows that a Markov chain representation of the order of experimental steps accurately reconstructs a flowchart of possible synthesis procedures.
Abstract: Digitizing large collections of scientific literature can enable new informatics approaches for scientific analysis and meta-analysis. However, most content in the scientific literature is locked up in written natural language, which is difficult to parse into databases using explicitly hard-coded classification rules. In this work, we demonstrate a semi-supervised machine-learning method to classify inorganic materials synthesis procedures from written natural language. Without any human input, latent Dirichlet allocation can cluster keywords into topics corresponding to specific experimental materials synthesis steps, such as “grinding” and “heating”, “dissolving” and “centrifuging”, etc. Guided by a modest amount of annotation, a random forest classifier can then associate these steps with different categories of materials synthesis, such as solid-state or hydrothermal synthesis. Finally, we show that a Markov chain representation of the order of experimental steps accurately reconstructs a flowchart of possible synthesis procedures. Our machine-learning approach provides a scalable way to unlock the large amount of inorganic materials synthesis information in the literature and to process it into a standardized, machine-readable database.
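The final step, a Markov chain over experimental steps, amounts to counting transitions between consecutive steps and normalizing. A self-contained sketch with invented step sequences (not the paper's extracted procedures):

```python
from collections import defaultdict

# Invented step sequences standing in for classified synthesis procedures
procedures = [["grinding", "heating", "cooling"],
              ["dissolving", "heating", "centrifuging"],
              ["grinding", "heating", "centrifuging"]]

# First-order Markov chain: count transitions, then normalize per source step
counts = defaultdict(lambda: defaultdict(int))
for steps in procedures:
    for a, b in zip(steps, steps[1:]):
        counts[a][b] += 1
probs = {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
         for a, nxt in counts.items()}
print(probs["heating"])   # e.g. {'cooling': 0.33..., 'centrifuging': 0.66...}
```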

Journal ArticleDOI
TL;DR: This article proposed to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA, which allows topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics.
Abstract: Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic $n$-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.

Journal ArticleDOI
TL;DR: In this article, a quantitative approach for describing entertainment products, in a way that allows for improving the predictive performance of consumer choice models for these products, has been proposed to improve the prediction performance of these models.
Abstract: The authors propose a quantitative approach for describing entertainment products, in a way that allows for improving the predictive performance of consumer choice models for these products. Their ...

Journal ArticleDOI
TL;DR: The findings of this paper are expected to help evaluate and improve IT professionals’ vocational knowledge and skills, identify professional roles and competencies in personnel recruitment processes of companies, and meet the skill requirements of the industry through software engineering education programs.
Abstract: Software engineering is a data-driven discipline and an integral part of data science. The introduction of big data systems has led to a great transformation in the architecture, methodologies, knowledge domains, and skills related to software engineering. Accordingly, education programs are now required to adapt themselves to up-to-date developments by first identifying the competencies concerning big data software engineering to meet the industrial needs and follow the latest trends. This paper aims to reveal the knowledge domains and skill sets required for big data software engineering and develop a taxonomy by mapping these competencies. A semi-automatic methodology is proposed for the semantic analysis of the textual contents of online job advertisements related to big data software engineering. This methodology uses the latent Dirichlet allocation (LDA), a probabilistic topic-modeling technique to discover the hidden semantic structures from a given textual corpus. The output of this paper is a systematic competency map comprising the essential knowledge domains, skills, and tools for big data software engineering. The findings of this paper are expected to help evaluate and improve IT professionals’ vocational knowledge and skills, identify professional roles and competencies in personnel recruitment processes of companies, and meet the skill requirements of the industry through software engineering education programs. Additionally, the proposed model can be extended to blogs, social networks, forums, and other online communities to allow automatic identification of emerging trends and generate contextual tags.

Journal ArticleDOI
TL;DR: A Latent Dirichlet Allocation (LDA) model is innovatively developed to understand latent driving states and the quantified structure of driving behavior patterns from individualized driving (documents) using driving behaviors (words).
Abstract: Automatic driving technology has become one of the hottest research topics in the Intelligent Transportation System (ITS) and Artificial Intelligence (AI) fields in recent years. The development of automatic driving technology can be promoted through understanding the driving states of each driver (individualized driving). Although some methods for understanding driving states have been proposed by previous studies, the latent driving states and structured driving behaviors have not yet been automatically discovered. The purpose of this study is to develop an unsupervised method for deeply understanding individualized driving. First, an encoding method is proposed to extract driving behaviors from vehicle motion data. Then, a Latent Dirichlet Allocation (LDA) model is innovatively developed to understand latent driving states and the quantified structure of driving behavior patterns (topics) from individualized driving (documents) using driving behaviors (words). In order to validate the performance and effectiveness of the proposed method, twenty-two drivers (15 males and 7 females) were recruited to carry out road experiments in Wuhan, China, for experimental data collection. In addition, two typical unsupervised methods, k-means and a random baseline, were established and their performances compared in our experiments. Experimental results verify the superiority of the proposed method compared with the other methods.
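The encoding idea (drivers as documents, discretized behaviors as words) maps directly onto a standard LDA call. A gensim sketch with invented behavior tokens, not the paper's actual encoding of motion data:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Invented symbolic driving-behavior "words" per driver ("documents")
drivers = [["hard_brake", "fast_accel", "lane_change", "fast_accel"],
           ["gentle_brake", "steady_speed", "steady_speed", "lane_keep"],
           ["hard_brake", "fast_accel", "hard_brake", "lane_change"]]

dictionary = Dictionary(drivers)
corpus = [dictionary.doc2bow(d) for d in drivers]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each driver's mixture over latent driving states ("topics")
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow, minimum_probability=0.0))
```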

Journal ArticleDOI
TL;DR: This study examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency, and found that global semantic similarity as reported by word2vec was an important predictor of coherence ratings.
Abstract: This article introduces the second version of the Tool for the Automatic Analysis of Cohesion (TAACO 2.0). Like its predecessor, TAACO 2.0 is a freely available text analysis tool that works on the Windows, Mac, and Linux operating systems; is housed on a user's hard drive; is easy to use; and allows for batch processing of text files. TAACO 2.0 includes all the original indices reported for TAACO 1.0, but it adds a number of new indices related to local and global cohesion at the semantic level, reported by latent semantic analysis, latent Dirichlet allocation, and word2vec. The tool also includes a source overlap feature, which calculates lexical and semantic overlap between a source and a response text (i.e., cohesion between the two texts based on measures of text relatedness). In the first study in this article, we examined the effects that cohesion features, prompt, essay elaboration, and enhanced cohesion had on expert ratings of text coherence, finding that global semantic similarity as reported by word2vec was an important predictor of coherence ratings. A second study was conducted to examine the source and response indices. In this study we examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency. The results indicated that the percentage of keywords found in both the source and response and the similarity between the source document and the response, as reported by word2vec, were significant predictors of speaking quality. Combined, these findings help validate the new indices reported for TAACO 2.0.
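The word2vec source-response similarity can be approximated as the cosine between the mean word vectors of the two texts, which gensim exposes as n_similarity. A toy sketch (TAACO ships its own pretrained resources; this trains a throwaway model on an invented corpus instead):

```python
from gensim.models import Word2Vec

# Toy corpus standing in for a pretrained word2vec model
sentences = [["climate", "change", "policy"],
             ["students", "write", "essays"],
             ["essays", "discuss", "climate", "policy"]]
w2v = Word2Vec(sentences, vector_size=16, min_count=1, epochs=50, seed=0)

source = ["climate", "change", "policy"]      # invented source text tokens
response = ["essays", "discuss", "climate"]   # invented response tokens

# Cosine similarity between the mean word vectors of source and response
print(w2v.wv.n_similarity(source, response))
```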

Journal ArticleDOI
TL;DR: It was found that the topic modelling approach was able to group texts into ‘topics’ that were truly thematically coherent only with mixed success, while the more traditional approach to discourse analysis consistently provided a more nuanced perspective on the data that was ultimately closer to the ‘reality’ of the texts.
Abstract: This article explores and critically evaluates the potential contribution to discourse studies of topic modelling, a group of machine learning methods which have been used with the aim of automatic...

Journal ArticleDOI
TL;DR: A text sentiment analysis method combining Latent Dirichlet Allocation (LDA) text representation and a convolutional neural network (CNN) is proposed that can effectively improve the accuracy of text sentiment classification.
Abstract: In order to improve the performance of internet public sentiment analysis, a text sentiment analysis method combining Latent Dirichlet Allocation (LDA) text representation and a convolutional neural network (CNN) is proposed. First, review texts are collected from the network and preprocessed. Then, the LDA topic model is used to learn the latent semantic representation (topic distribution) of each short text, and a short-text feature vector representation based on the topic distribution is constructed. Finally, a CNN with a gated recurrent unit (GRU) is used as the classifier. Based on the input feature matrix, the GRU-CNN strengthens the relationships between words and between texts, so as to achieve highly accurate text classification. The simulation results show that this method can effectively improve the accuracy of text sentiment classification.
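A minimal PyTorch sketch of a GRU-plus-CNN classifier over sequences of LDA-derived feature vectors; all layer sizes are invented, and this is a generic reading of the architecture rather than the paper's exact network:

```python
import torch
import torch.nn as nn

class GRUCNN(nn.Module):
    """GRU encodes the sequence, a 1-D convolution plus global max pooling
    summarizes it, and a linear layer emits class logits."""
    def __init__(self, n_topics=50, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_topics, hidden, batch_first=True)
        self.conv = nn.Conv1d(hidden, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):                              # x: (batch, seq_len, n_topics)
        h, _ = self.gru(x)                             # (batch, seq_len, hidden)
        c = torch.relu(self.conv(h.transpose(1, 2)))   # (batch, 32, seq_len)
        return self.fc(c.max(dim=2).values)            # global max pool -> logits

model = GRUCNN()
logits = model(torch.randn(4, 10, 50))   # 4 texts, 10 positions, 50 topic features
print(logits.shape)                      # torch.Size([4, 2])
```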

Journal ArticleDOI
TL;DR: This study proposes the development of a medical big-data mining process that employs topic modeling, and it evaluates the accuracy and efficiency of topic modeling based on the proposed process.
Abstract: With the development of convergence information technology, all of the spaces and objects of human living have become digitized. In the health- and medical-service areas, IT supports Internet of things (IoT)-based medical services and health-care systems for patients. Medical facilities have been advanced on the basis of such IoT devices, and the digitized information on human behaviors and health makes the delivery of efficient and convenient health care possible. Under these circumstances, health and medical care have been actively researched; in some of this research, patient-health data were collected using IoT-based medical devices and served as a tool for medical diagnosis and treatment. This study proposes the development of a medical big-data mining process that employs topic modeling. The proposed method uses the big data offered by the open system of health- and medical-services big data from the Health Insurance Review and Assessment Service, and its application follows the guidelines of the knowledge-discovery-in-big-data process for data mining and topic modeling. For the topic modeling of the medical data, the public structured health- and medical-services big data, Open API, and patient datasets were used. For document classification in the semantic context of a topic, the Bag of Words technique and the latent Dirichlet allocation method were applied to find document associations for the development of the medical big-data mining process. In addition, this study evaluated the topic-modeling accuracy and efficiency based on the medical big-data mining process and examined the effectiveness of the proposed method.

Posted Content
TL;DR: The dynamic embedded topic model (D-ETM) is developed, a generative model of documents that combines dynamic latent Dirichlet allocation and word embeddings; it outperforms D-LDA on a document completion task and learns more diverse and coherent topics while requiring significantly less time to fit.
Abstract: Topic modeling analyzes documents to learn meaningful patterns of words. For documents collected in sequence, dynamic topic models capture how these patterns vary over time. We develop the dynamic embedded topic model (D-ETM), a generative model of documents that combines dynamic latent Dirichlet allocation (D-LDA) and word embeddings. The D-ETM models each word with a categorical distribution parameterized by the inner product between the word embedding and a per-time-step embedding representation of its assigned topic. The D-ETM learns smooth topic trajectories by defining a random walk prior over the embedding representations of the topics. We fit the D-ETM using structured amortized variational inference with a recurrent neural network. On three different corpora (a collection of United Nations debates, a set of ACL abstracts, and a dataset of Science Magazine articles), we found that the D-ETM outperforms D-LDA on a document completion task. We further found that the D-ETM learns more diverse and coherent topics than D-LDA while requiring significantly less time to fit.
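In the notation assumed here (symbols are not spelled out in the abstract: $\rho$ is the word-embedding matrix, $\alpha_k^{(t)}$ the embedding of topic $k$ at time step $t$, and $z_{dn}$ the topic assignment of word $w_{dn}$), the two ingredients the abstract describes can be written compactly as:

```latex
% Categorical word distribution parameterized by the embedding inner product:
w_{dn} \mid z_{dn} = k \;\sim\; \mathrm{Cat}\!\left(\mathrm{softmax}\!\left(\rho^{\top}\alpha_{k}^{(t_d)}\right)\right)
% Random-walk prior that yields smooth topic trajectories:
\alpha_{k}^{(t)} \;\sim\; \mathcal{N}\!\left(\alpha_{k}^{(t-1)},\, \delta^{2} I\right)
```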

Journal ArticleDOI
TL;DR: A bibliometric analysis of the current research landscape, which objectively evaluates the productivity of global researchers or institutions in this field, along with exploratory factor analysis (EFA) and latent Dirichlet allocation (LDA), showed that the most well-studied application of AI was the utilization of machine learning to identify clinical characteristics in depression, which accounted for more than 60% of all publications.
Abstract: Artificial intelligence (AI)-based techniques have been widely applied in depression research and treatment. Nonetheless, there is currently no systematic review or bibliometric analysis in the medical literature about the applications of AI in depression. We performed a bibliometric analysis of the current research landscape, which objectively evaluates the productivity of global researchers or institutions in this field, along with exploratory factor analysis (EFA) and latent Dirichlet allocation (LDA). From 2010 onwards, the total number of papers and citations on using AI to manage depressive disorder have risen considerably. In terms of the global AI research network, researchers from the United States were the major contributors to this field. Exploratory factor analysis showed that the most well-studied application of AI was the utilization of machine learning to identify clinical characteristics in depression, which accounted for more than 60% of all publications. Latent Dirichlet allocation identified specific research themes, which include diagnosis accuracy, structural imaging techniques, gene testing, drug development, pattern recognition, and electroencephalography (EEG)-based diagnosis. Although the rapid development and widespread use of AI provide various benefits for both health providers and patients, interventions to enhance privacy and confidentiality issues are still limited and require further research.

Journal ArticleDOI
TL;DR: This study introduces a new feature selection method able to take advantage of a semantic ontology to group words into topics and use them to build feature vectors, and shows the suitability and additional benefits of topic-driven methods to develop and deploy high-performance spam filters.

Journal ArticleDOI
TL;DR: A method for tracking the evolution of news topics over time is proposed in this paper to realize topic tracking and evolution in a news text set; experiments show it can effectively detect and track topics and clearly reflect the trend of topic evolution.
Abstract: With the rapid development of the Internet, the amount of data has grown exponentially. On the one hand, the accumulation of big data provides the basic support for artificial intelligence. On the other hand, in the face of such huge amounts of data, how to extract the knowledge of interest has become a matter of general concern. Topic tracking can help people explore the process of topic development within huge and complex collections of network texts. By effectively organizing large-scale news documents, a method for tracking the evolution of news topics over time is proposed in this paper. First, the LDA (latent Dirichlet allocation) model is used to extract topics from news texts, and the Gibbs sampling method is used to estimate the parameters. Topic mining with the K-means method is used as a comparison to highlight the advantages of using LDA for topic discovery. Second, an improved single-pass algorithm is used to track news topics. The JS (Jensen-Shannon) divergence is used to measure topic similarity, and a time decay function is introduced to increase the similarity between topics that are close in time. Finally, the strength of news topics and the changes in topic content across different time windows are analyzed. The experiments show that the proposed method can effectively detect and track topics and clearly reflect the trend of topic evolution.
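The similarity step can be sketched directly: SciPy's jensenshannon returns the JS distance (the square root of the divergence), and the exponential decay form below is an assumed illustration, not the paper's exact function:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_similarity(p, q, t1, t2, tau=7.0):
    """Similarity between two topic distributions, boosted when the topics
    are close in time. The decay form and tau are assumed, not from the
    paper; t1 and t2 are timestamps in days."""
    js_div = jensenshannon(p, q, base=2) ** 2   # scipy returns the JS *distance*
    decay = np.exp(-abs(t1 - t2) / tau)         # time-decay weighting
    return (1.0 - js_div) * decay               # in [0, 1] for base-2 JS

p = np.array([0.7, 0.2, 0.1])   # invented topic distributions
q = np.array([0.6, 0.3, 0.1])
print(topic_similarity(p, q, t1=0, t2=3))
```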

Journal ArticleDOI
TL;DR: A novel model for short text topic modeling is proposed, referred to as the Conditional Random Field regularized Topic Model (CRFTM), which not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment.
Abstract: Short texts have become the prevalent format of information on the Internet. Inferring the topics of this type of message is a critical and challenging task for many applications. Due to the length of short texts, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from a severe data sparsity problem that makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have proven effective at capturing semantic and syntactic information about words, which can be used to induce similarity measures and semantic correlations among words. Enlightened by this, in this paper, we design a novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method can extract more coherent topics and significantly outperforms state-of-the-art baselines on several evaluation metrics.

Journal ArticleDOI
TL;DR: The results indicate that the proposed method provides an efficient and economic performance summary of a university and its competitors, and could help its leaders in recruitment and retention efforts.

Journal ArticleDOI
10 Apr 2019-Symmetry
TL;DR: A three-step research methodology based on data text mining (DTM) that can be used for business intelligence analysis (BIA) strategies to analyze user generated content (UGC) in social networks and on digital platforms is proposed.
Abstract: The global development of the Internet, which enabled the analysis of large amounts of data and the services linked to their use, has led companies to modify their business strategies in search of new ways to increase marketing productivity and profitability. Many strategies are based on business intelligence (BI) and marketing intelligence (MI), which make it possible to extract profitable knowledge and insights from the large amounts of data generated by company customers in digital environments. In this context, the present study proposes a three-step research methodology based on data text mining (DTM). In further research, this methodology can be used in business intelligence analysis (BIA) strategies to analyze user generated content (UGC) in social networks and on digital platforms. The proposed methodology unfolds in the following three stages. First, a Latent Dirichlet Allocation (LDA) model is used to determine the topics in the database. Second, a sentiment analysis (SA) is applied to the LDA results to divide the identified topics into negative, positive, and neutral sentiments. Third, a textual analysis (TA) with data text mining techniques is applied to the topics in each sentiment. The proposed methodology offers important advances in data text mining in terms of accuracy, reliability, and insight generation for both researchers and practitioners seeking to improve BIA processes in business and other sectors.