
Showing papers on "Latent Dirichlet allocation published in 2017"


Journal ArticleDOI
TL;DR: This paper identifies the key dimensions of customer service voiced by hotel visitors using a data mining approach, latent Dirichlet allocation (LDA), which uncovers 19 controllable dimensions that are key for hotels to manage their interactions with visitors.

570 citations


Posted Content
TL;DR: In this article, the authors investigate the research development, current trends, and intellectual structure of topic modeling based on Latent Dirichlet Allocation (LDA), summarize open challenges, and introduce well-known tools and datasets for LDA-based topic modeling.
Abstract: Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among text documents. Researchers have published many articles in the field of topic modeling and applied it in various fields such as software engineering, political science, and medical and linguistic science. There are various methods for topic modeling, of which Latent Dirichlet allocation (LDA) is one of the most popular, and researchers have proposed many models based on LDA. This makes the paper a useful introduction to LDA-based approaches to topic modeling. In this paper, we investigate scholarly articles (published between 2003 and 2016) closely related to LDA-based topic modeling to uncover the research development, current trends, and intellectual structure of the field. We also summarize challenges and introduce well-known tools and datasets for topic modeling based on LDA.

546 citations
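For readers new to the technique this survey covers, a minimal LDA run looks roughly like the sketch below, here with the Gensim library; the toy corpus and every parameter are illustrative, not taken from the paper.

```python
# Minimal LDA sketch with Gensim; the toy corpus and hyperparameters are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "topic models discover latent themes in text".split(),
    "latent dirichlet allocation assigns topics to documents".split(),
    "software engineering papers apply topic modeling to code".split(),
]

dictionary = Dictionary(docs)                      # map tokens to integer ids
corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words representation

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])         # top words per inferred topic
```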


Journal ArticleDOI
TL;DR: This research develops a statistical framework to help discover semantically meaningful topics and functional regions based on the co‐occurrence patterns of POI types and demonstrates the effectiveness of the proposed methodology by identifying distinctive types of latent topics and by extracting urban functional regions.
Abstract: Data about points of interest (POI) have been widely used in studying urban land use types and for sensing human behavior. However, it is difficult to quantify the correct mix or the spatial relations among different POI types indicative of specific urban functions. In this research, we develop a statistical framework to help discover semantically meaningful topics and functional regions based on the co-occurrence patterns of POI types. The framework applies the latent Dirichlet allocation (LDA) topic modeling technique and incorporates user check-in activities on location-based social networks. Using a large corpus of about 100,000 Foursquare venues and user check-in behavior in the 10 most populated urban areas of the US, we demonstrate the effectiveness of our proposed methodology by identifying distinctive types of latent topics and, further, by extracting urban functional regions using K-means clustering and Delaunay triangulation spatial constraints clustering. We show that a region can support multiple functions but with different probabilities, while the same type of functional region can span multiple geographically non-adjacent locations. Since each region can be modeled as a vector consisting of multinomial topic distributions, similar regions with regard to their thematic topic signatures can be identified. Compared with remote sensing images which mainly uncover the physical landscape of urban environments, our popularity-based POI topic modeling approach can be seen as a complementary social sensing view on urban space based on human activities.

283 citations
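A rough sketch of the pipeline this abstract describes: treat each region as a "document" of POI type tokens, fit LDA to get region-topic proportions, and cluster those vectors into functional regions. The region data, POI labels, and all parameters below are hypothetical, and the check-in weighting and spatially constrained clustering used in the paper are not reproduced.

```python
# Sketch: regions as bags of POI types -> LDA topic vectors -> K-means "functional regions".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

region_poi_types = [                                 # hypothetical regions
    "coffee_shop office bank office gym",
    "bar nightclub restaurant bar music_venue",
    "park playground school residential residential",
]

vec = CountVectorizer()
X = vec.fit_transform(region_poi_types)              # region-by-POI-type counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                         # region-topic proportions

regions = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(theta)
print(regions)                                       # functional-region label per region
```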


Journal ArticleDOI
TL;DR: In this paper, the authors examined trends in 10-K disclosure over the period 1996-2013, with increases in length, boilerplate, stickiness, and redundancy and decreases in specificity, readability, and the relative amount of hard information.

259 citations


Posted Content
TL;DR: This work presents what is, to the authors' knowledge, the first effective AEVB-based inference method for latent Dirichlet allocation (LDA), which they call Autoencoded Variational Inference For Topic Model (AVITM).
Abstract: Topic models are one of the most popular methods for learning representations of text, but a major challenge is that any change to the topic model requires mathematically deriving a new inference algorithm. A promising approach to address this problem is autoencoding variational Bayes (AEVB), but it has proven difficult to apply to topic models in practice. We present what is to our knowledge the first effective AEVB based inference method for latent Dirichlet allocation (LDA), which we call Autoencoded Variational Inference For Topic Model (AVITM). This model tackles the problems caused for AEVB by the Dirichlet prior and by component collapsing. We find that AVITM matches traditional methods in accuracy with much better inference time. Indeed, because of the inference network, we find that it is unnecessary to pay the computational cost of running variational optimization on test data. Because AVITM is black box, it is readily applied to new topic models. As a dramatic illustration of this, we present a new topic model called ProdLDA, that replaces the mixture model in LDA with a product of experts. By changing only one line of code from LDA, we find that ProdLDA yields much more interpretable topics, even if LDA is trained via collapsed Gibbs sampling.

258 citations
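A stripped-down PyTorch sketch of the idea: an inference network encodes a bag-of-words vector into a latent document representation, and the decoder mixes topic-word logits before a single softmax, which is the ProdLDA-style product of experts. This uses a plain Gaussian prior rather than the paper's Laplace approximation to the Dirichlet, so it is only a rough approximation of AVITM; the vocabulary size and layer widths are arbitrary.

```python
# Simplified ProdLDA/AVITM-style sketch; Gaussian prior instead of the paper's
# Laplace approximation to the Dirichlet, so this is an approximation only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProdLDASketch(nn.Module):
    def __init__(self, vocab_size, num_topics=50, hidden=200):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)
        # topic-word logits; softmax is applied after mixing (product of experts)
        self.beta = nn.Linear(num_topics, vocab_size, bias=False)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterization
        theta = F.softmax(z, dim=-1)                                 # document-topic proportions
        log_recon = F.log_softmax(self.beta(theta), dim=-1)          # ProdLDA: softmax after mixing
        nll = -(bow * log_recon).sum(1)                              # reconstruction term
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1)  # KL vs N(0, I)
        return (nll + kld).mean()

model = ProdLDASketch(vocab_size=2000)
bow = torch.randint(0, 3, (8, 2000)).float()     # a toy batch of word counts
print(model(bow))                                # ELBO-style loss to minimize
```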


Proceedings ArticleDOI
01 Oct 2017
TL;DR: It is shown that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics, and that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher.
Abstract: This paper assesses topic coherence and human topic ranking of uncovered latent topics from scientific publications when utilizing the topic model latent Dirichlet allocation (LDA) on abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis that states that words with similar meaning tend to co-occur within a similar context. Although LDA has gained much attention from machine-learning researchers, most notably with its adaptations and extensions, little is known about the effects of different types of textual data on generated topics. Our research is the first to explore these practical effects and shows that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher. Differences between abstract and full-text data are more apparent within small document collections, with differences as large as 90% high-quality topics for full-text data, compared to 50% high-quality topics for abstract data.

219 citations
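Topic coherence of the kind evaluated in this paper can be computed with Gensim's CoherenceModel; a self-contained sketch on a toy corpus follows (the corpus and settings are illustrative, not the paper's data).

```python
# Sketch: c_v topic coherence with Gensim on a toy corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    "neural networks learn latent representations".split(),
    "topic models learn latent topics from documents".split(),
    "coherence scores approximate human topic judgements".split(),
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())            # higher usually indicates more coherent topics
```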


Proceedings Article
26 Apr 2017
TL;DR: This paper proposed a new topic model called ProdLDA, which replaces the mixture model in LDA with a product of experts and showed that the new model yields much more interpretable topics.
Abstract: Topic models are one of the most popular methods for learning representations of text, but a major challenge is that any change to the topic model requires mathematically deriving a new inference algorithm. A promising approach to address this problem is autoencoding variational Bayes (AEVB), but it has proven difficult to apply to topic models in practice. We present what is to our knowledge the first effective AEVB based inference method for latent Dirichlet allocation (LDA), which we call Autoencoded Variational Inference For Topic Model (AVITM). This model tackles the problems caused for AEVB by the Dirichlet prior and by component collapsing. We find that AVITM matches traditional methods in accuracy with much better inference time. Indeed, because of the inference network, we find that it is unnecessary to pay the computational cost of running variational optimization on test data. Because AVITM is black box, it is readily applied to new topic models. As a dramatic illustration of this, we present a new topic model called ProdLDA, that replaces the mixture model in LDA with a product of experts. By changing only one line of code from LDA, we find that ProdLDA yields much more interpretable topics, even if LDA is trained via collapsed Gibbs sampling.

213 citations


Journal ArticleDOI
TL;DR: This work identifies two important problems along the way to using topic models in qualitative studies: lack of a good quality metric that closely matches human judgement in understanding topics and the need to indicate specific subtopics that a specific qualitative study may be most interested in mining.
Abstract: Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modeling as a tool are currently few and far between. In this work, we identify two important problems along the way to using topic models in qualitative studies: the lack of a good quality metric that closely matches human judgement in understanding topics, and the need to indicate specific subtopics that a specific qualitative study may be most interested in mining. For the first problem, we propose a new quality metric, tf-idf coherence, that reflects human judgement more accurately than regular coherence, and conduct an experiment to verify this claim. For the second problem, we propose an interval semi-supervised approach (ISLDA) where certain predefined sets of keywords that define the topics researchers are interested in are restricted to specific intervals of topic assignments. Our experiments show that ISLDA is better for topic extraction than LDA in terms of tf-idf coherence, the number of topics identified for predefined keywords, and topic stability. We also present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.

204 citations
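The idea of tying predefined keywords to particular topics can be loosely approximated in standard toolkits by putting an asymmetric prior on the topic-word distributions. The sketch below uses Gensim's eta matrix for plain seeded LDA; this is not the authors' ISLDA with interval restrictions, and the texts, seed words, and prior values are hypothetical.

```python
# Sketch: seeded LDA via an asymmetric eta prior in Gensim; an approximation of the
# semi-supervised idea, not the authors' exact ISLDA.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    "ethnic discourse in blog posts about migration".split(),
    "football match results and league tables".split(),
    "migration policy debate in online communities".split(),
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

num_topics, vocab_size = 2, len(dictionary)
eta = np.full((num_topics, vocab_size), 0.01)        # weak symmetric base prior
for seed_word in ["migration", "discourse"]:         # hypothetical seed keywords
    eta[0, dictionary.token2id[seed_word]] = 1.0     # boost the seeds in topic 0 only

lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
               eta=eta, passes=20, random_state=0)
print(lda.show_topics(num_words=4))
```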


Journal ArticleDOI
TL;DR: The authors examine analyst information intermediary roles through a textual analysis of analyst reports and corporate disclosures, employing a topic modeling methodology from computational linguistics.
Abstract: This study examines analyst information intermediary roles using a textual analysis of analyst reports and corporate disclosures. We employ a topic modeling methodology from computational linguistics...

203 citations


Journal ArticleDOI
TL;DR: The experimental results show that iDoctor provides higher prediction ratings and significantly increases the accuracy of healthcare recommendation; it is compared with previous healthcare recommendation methods using real datasets.

203 citations


Journal ArticleDOI
TL;DR: An empirical analysis of 17,163 articles published in 22 leading transportation journals from 1990 to 2015 using a latent Dirichlet allocation (LDA) model to infer 50 key topics is presented, suggesting that research communities in different regions tend to focus on different sub-fields.
Abstract: Transportation research is a key area in both science and engineering. In this paper, we present an empirical analysis of 17,163 articles published in 22 leading transportation journals from 1990 to 2015. We apply a latent Dirichlet allocation (LDA) model on article abstracts to infer 50 key topics. We show that those characterized topics are both representative and meaningful, mostly corresponding to established sub-fields in transportation research. These identified fields reveal a research landscape for transportation. Based on the results of LDA, we quantify the similarity of journals and countries/regions in terms of their aggregated topic distributions. By measuring the variation of topic distributions over time, we find general research trends; for example, topics on sustainability, travel behavior, and non-motorized mobility are becoming increasingly popular over time. We also carry out this temporal analysis for each journal, observing a high degree of consistency for most journals. However, some interesting anomalies, such as special issues on particular topics, are also detected from the temporal variation. By quantifying the temporal trends at the country/region level, we find that countries/regions display clearly distinguishable patterns, suggesting that research communities in different regions tend to focus on different sub-fields. Our results could benefit different parties in the academic community—including researchers, journal editors and funding agencies—in terms of identifying promising research topics/projects, searching for candidate journals for a submission, and realigning focus for journal development.
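The trend analysis described above reduces to fitting LDA on abstracts and averaging the document-topic proportions by publication year; a small sketch under hypothetical data follows.

```python
# Sketch: topic trends over time as mean document-topic proportions per year.
# Abstracts, years, and parameters are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "travel behavior survey of commuters in urban areas",
    "pavement materials and maintenance of road infrastructure",
    "cycling and walking as sustainable non motorized mobility",
    "traffic flow models for signalized intersections",
]
years = [1995, 2000, 2010, 2015]

X = CountVectorizer(stop_words="english").fit_transform(abstracts)
theta = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)

trends = pd.DataFrame(theta).assign(year=years).groupby("year").mean()
print(trends)                 # rows: years, columns: average topic shares
```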

Journal ArticleDOI
TL;DR: Zhang et al. propose a novel framework, namely multi-query expansions, to retrieve semantically robust landmarks in two steps, where the top-k photos regarding the latent topics of a query landmark are identified to construct a multi-query set so as to remedy its possible low-quality shape.
Abstract: Given a query photo issued by a user (q-user), the landmark retrieval is to return a set of photos with their landmarks similar to those of the query, while the existing studies on the landmark retrieval focus on exploiting geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users over social media community may convey different geometry information depending on the viewpoints and/or angles, and may, subsequently, yield very different results. In fact, dealing with the landmarks with low quality shapes caused by the photography of q-users is often nontrivial and has seldom been studied. In this paper, we propose a novel framework, namely, multi-query expansions, to retrieve semantically robust landmarks by two steps. First, we identify the top-k photos regarding the latent topics of a query landmark to construct a multi-query set so as to remedy its possible low quality shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Then, motivated by the typical collaborative filtering methods, we propose to learn collaborative deep network-based semantic, nonlinear, and high-level features over the latent factors for landmark photos, using a training set formed by matrix factorization over the collaborative user-photo matrix regarding the multi-query set. The learned deep network is further applied to generate the features for all the other photos, meanwhile resulting in a compact multi-query set within such space. Then, the final ranking scores are calculated over the high-level feature space between the multi-query set and all other photos, which are ranked to serve as the final ranking list of landmark retrieval. Extensive experiments are conducted on real-world social media data with both landmark photos together with their user information to show the superior performance over the existing methods, especially our recently proposed multi-query based mid-level pattern representation method [1].

Journal ArticleDOI
TL;DR: It is shown that social media analytics captures spatial patterns within the city that relate to the presence of users and the environmental and topical engagement, and how these patterns serve as an input to value creation for smart urban tourism.

Journal ArticleDOI
TL;DR: Hierarchical semantic cognition is presented in this study; it serves as a general cognition structure for recognizing urban functional zones and can further support urban planning and management.
Abstract: As the basic units of urban areas, functional zones are essential for city planning and management, but functional-zone maps are hardly available in most cities, as traditional urban investigations focus mainly on land-cover objects instead of functional zones. As a result, an automatic/semi-automatic method for mapping urban functional zones is highly required. Hierarchical semantic cognition (HSC) is presented in this study, and serves as a general cognition structure for recognizing urban functional zones. Unlike traditional classification methods, the HSC relies on geographic cognition and considers four semantic layers, i.e., visual features, object categories, spatial object patterns, and zone functions, as well as their hierarchical relations. Here, we used HSC to classify functional zones in Beijing with a very-high-resolution (VHR) satellite image and point-of-interest (POI) data. Experimental results indicate that this method can produce more accurate results than Support Vector Machine (SVM) and Latent Dirichlet Allocation (LDA), with a higher overall accuracy of 90.8%. Additionally, the contributions of diverse semantic layers are quantified: the object-category layer is the most important and makes a 54% contribution to functional-zone classification, while the other semantic layers are less important but their contributions cannot be ignored. Consequently, the presented HSC is effective in classifying urban functional zones, and can further support urban planning and management.

Journal ArticleDOI
TL;DR: The authors presented a framework that automatically derives latent brand topics and classifies brand sentiments on 1.7 million unique tweets for 20 brands across five industries: fast food, department store, footwear, electronics, and telecommunications.
Abstract: The big data of user-generated content (UGC) on social media are laden with potential value for brand managers. However, there are many obstacles to using big data to answer brand-management questions. This article presents a framework that automatically derives latent brand topics and classifies brand sentiments. It applies text mining with latent Dirichlet allocation (LDA) and sentiment analysis on 1.7 million unique tweets for 20 brands across five industries: fast food, department store, footwear, electronics, and telecommunications. The framework is used to explore four brand-related questions on Twitter. There are three main findings. First, product, service, and promotions are the dominant topics of interest when consumers interact with brands on Twitter. Second, consumer sentiments toward brands vary within and across industries. Third, separate company-specific analyses of positive and negative tweets generate a more accurate understanding of Twitter users' major brand topics and sentiments. Our ...
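The two building blocks of the framework, topic extraction and tweet-level sentiment, can each be sketched with off-the-shelf tools. The snippet below pairs scikit-learn's LDA with NLTK's VADER sentiment scorer on toy tweets; the tweets and parameters are hypothetical and this is not the authors' exact pipeline.

```python
# Sketch: LDA topics plus VADER sentiment on toy tweets; not the authors' exact pipeline.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

nltk.download("vader_lexicon", quiet=True)

tweets = [
    "love the new sneakers great quality",
    "terrible customer service at the store today",
    "the burger promotion is a great deal",
]

X = CountVectorizer(stop_words="english").fit_transform(tweets)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

sia = SentimentIntensityAnalyzer()
for tweet, theta in zip(tweets, topics):
    # dominant topic index and compound sentiment score per tweet
    print(theta.argmax(), sia.polarity_scores(tweet)["compound"], tweet)
```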

Journal ArticleDOI
TL;DR: Several findings were unveiled, including that hotel food generates ordinary positive sentiments while hospitality generates both ordinary and strong positive feelings; such results are valuable for hospitality management and validate the proposed approach.
Abstract: The development of the Internet and mobile devices enabled the emergence of travel and hospitality review sites, leading to a large number of customer opinion posts. While such comments may influence future demand of the targeted hotels, they can also be used by hotel managers to improve customer experience. In this article, sentiment classification of an eco-hotel is assessed through a text mining approach using several different sources of customer reviews. The latent Dirichlet allocation modeling algorithm is applied to gather relevant topics that characterize a given hospitality issue by a sentiment. Several findings were unveiled including that hotel food generates ordinary positive sentiments, while hospitality generates both ordinary and strong positive feelings. Such results are valuable for hospitality management, validating the proposed approach.

Journal ArticleDOI
TL;DR: The results confirm that the features derived using the proposed lexicon outperform those from state-of-the-art lexicons learnt using supervised Latent Dirichlet Allocation (sLDA) and Point-Wise Mutual Information (PMI).

Journal ArticleDOI
TL;DR: The TGSC-PMF model exploits textual information, geographical information, social information, categorical information and popularity information, and incorporates these factors effectively and achieves significantly superior recommendation quality compared to other state-of-the-art POI recommendation techniques.

Journal ArticleDOI
TL;DR: A bilevel feature-extraction-based text mining approach is proposed that integrates features extracted at both the syntax and semantic levels to improve fault classification performance, enhancing the precision of fault diagnosis for all fault classes, particularly minority ones.
Abstract: A vast amount of text data is recorded in the forms of repair verbatim in railway maintenance sectors. Efficient text mining of such maintenance data plays an important role in detecting anomalies and improving fault diagnosis efficiency. However, unstructured verbatim, high-dimensional data, and imbalanced fault class distribution pose challenges for feature selections and fault diagnosis. We propose a bilevel feature extraction-based text mining that integrates features extracted at both syntax and semantic levels with the aim of improving the fault classification performance. We first perform an improved χ² statistics-based feature selection at the syntax level to overcome the learning difficulty caused by an imbalanced data set. Then, we perform a prior latent Dirichlet allocation-based feature selection at the semantic level to reduce the data set into a low-dimensional topic space. Finally, we fuse fault features derived from both syntax and semantic levels via serial fusion. The proposed method uses fault features at different levels and enhances the precision of fault diagnosis for all fault classes, particularly minority ones. Its performance has been validated by using a railway maintenance data set collected from 2008 to 2014 by a railway corporation. It outperforms traditional approaches.
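A compact sketch of the bilevel idea: chi-square-selected term features at the syntax level are fused (concatenated) with LDA topic features at the semantic level and fed to a classifier. The verbatim records and labels are hypothetical, and the paper's improved chi-square statistic and prior-LDA variant are not reproduced.

```python
# Sketch: fuse chi2-selected term features with LDA topic features, then classify.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

verbatim = [                                # hypothetical repair verbatim
    "signal relay failure in interlocking cabinet",
    "broken rail detected by track circuit",
    "relay contact corrosion caused signal fault",
    "rail crack found during track inspection",
]
labels = [0, 1, 0, 1]                       # fault classes

X_counts = CountVectorizer().fit_transform(verbatim)

syntax = SelectKBest(chi2, k=5).fit_transform(X_counts, labels)       # syntax-level terms
semantic = LatentDirichletAllocation(n_components=2,
                                     random_state=0).fit_transform(X_counts)

fused = np.hstack([syntax.toarray(), semantic])                       # serial fusion
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.predict(fused))
```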

Journal ArticleDOI
TL;DR: Correlation Explanation is introduced, an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework that generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions.
Abstract: While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.
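CorEx is distributed as the corextopic Python package; a minimal anchored run might look like the sketch below. The corpus, anchor words, and parameters are hypothetical, and the exact call shapes should be double-checked against the package documentation.

```python
# Sketch: anchored CorEx topics via the corextopic package (pip install corextopic).
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

docs = [                                     # hypothetical documents
    "healthy salad menu with fresh vegetables",
    "late delivery and cold fries again",
    "calorie counts now posted on the menu",
]

vec = CountVectorizer(binary=True)           # CorEx works on binary document-word counts
X = vec.fit_transform(docs)
words = list(vec.get_feature_names_out())

model = ct.Corex(n_hidden=2, seed=0)
model.fit(X, words=words, anchors=[["healthy", "calorie"]], anchor_strength=2)

for i, topic in enumerate(model.get_topics()):
    print(i, [w for w, *rest in topic])      # top words per topic
```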

Book ChapterDOI
23 May 2017
TL;DR: This paper proposes the Embedding-based Topic Model (ETM) to learn latent topics from short texts by aggregating short texts into long pseudo-texts and by using a Markov Random Field regularized model that gives correlated words a better chance of being put into the same topic.
Abstract: Inferring topics from the overwhelming amount of short texts becomes a critical but challenging task for many content analysis tasks. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well since only very limited word co-occurrence information is available in short texts. This paper studies how to incorporate external word correlation knowledge into short texts to improve the coherence of topic modeling. Based on recent results in word embeddings that learn semantic representations of words from a large corpus, we introduce a novel method, Embedding-based Topic Model (ETM), to learn latent topics from short texts. ETM not only solves the problem of very limited word co-occurrence information by aggregating short texts into long pseudo-texts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance to be put into the same topic. Experiments on real-world datasets validate the effectiveness of our model compared with state-of-the-art models.
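One common way to realize the "aggregate short texts into long pseudo-texts" step is to embed each short text, cluster the embeddings, and concatenate the texts within a cluster before running a standard topic model. The sketch below does that with averaged Word2Vec vectors and K-means; it illustrates the aggregation strategy only, not the authors' full ETM with its Markov Random Field regularizer, and all data and settings are hypothetical.

```python
# Sketch: aggregate short texts into pseudo-texts via embedding clusters, then run LDA.
import numpy as np
from gensim.models import Word2Vec, LdaModel
from gensim.corpora import Dictionary
from sklearn.cluster import KMeans

short_texts = [
    "cheap flights to rome".split(),
    "rome city break deals".split(),
    "new phone battery drains fast".split(),
    "phone screen cracked after drop".split(),
]

w2v = Word2Vec(short_texts, vector_size=50, min_count=1, seed=0)
doc_vecs = np.array([np.mean([w2v.wv[w] for w in t], axis=0) for t in short_texts])

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(doc_vecs)
pseudo_texts = [sum((t for t, l in zip(short_texts, labels) if l == c), [])
                for c in range(2)]                   # concatenate texts per cluster

dictionary = Dictionary(pseudo_texts)
corpus = [dictionary.doc2bow(t) for t in pseudo_texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.show_topics(num_words=4))
```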

Journal ArticleDOI
TL;DR: A scalable Bayesian topic model is proposed to measure and understand changes in consumer opinion about health (and other topics) and calibrate the model on 761,962 online reviews of restaurants posted over eight years.
Abstract: In 2008, New York City mandated that all chain restaurants post calorie information on their menus. For managers of chain and standalone restaurants, as well as for policy makers, a pertinent goal might be to monitor the impact of this regulation on consumer conversations. We propose a scalable Bayesian topic model to measure and understand changes in consumer opinion about health (and other topics). We calibrate the model on 761,962 online reviews of restaurants posted over eight years. Our model allows managers to specify prior topics of interest such as “health” for a calorie posting regulation. It also allows the distribution of topic proportions within a review to be affected by its length, valence, and the experience level of its author. Using a difference-in-differences estimation approach, we isolate the potentially causal effect of the regulation on consumer opinion. Following the regulation, there was a statistically small but significant increase in the proportion of discussion of the health to...
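The difference-in-differences step mentioned in the abstract can be sketched as an OLS regression of a review's health-topic share on a treatment indicator, a post-regulation indicator, and their interaction; the data frame and column names below are hypothetical.

```python
# Sketch: difference-in-differences on a review-level health-topic share.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({                          # hypothetical review-level data
    "health_share": [0.05, 0.06, 0.04, 0.09, 0.05, 0.07],
    "chain":        [1, 1, 0, 1, 0, 0],      # treated group: chain restaurants
    "post":         [0, 0, 0, 1, 1, 1],      # after the calorie-posting regulation
})

model = smf.ols("health_share ~ chain * post", data=df).fit()
print(model.params["chain:post"])            # DiD estimate of the regulation's effect
```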

Journal ArticleDOI
TL;DR: This study summarises the research trends in SEE based upon a corpus of 1178 articles and identifies the core research areas and trends, which may help researchers understand and discern the research patterns in a large literature dataset.
Abstract: Context: Software effort estimation (SEE) is one of the most crucial activities in the field of software engineering. Vast research has been conducted in SEE, resulting in a tremendous increase in literature. Thus it is of utmost importance to identify the core research areas and trends in SEE, which may lead researchers to understand and discern the research patterns in a large literature dataset. Objective: To identify unobserved research patterns through natural language processing from a large set of research articles on SEE published during the period 1996 to 2016. Method: A generative statistical method, called Latent Dirichlet Allocation (LDA), was applied to a literature dataset of 1178 articles published on SEE. Results: As many as twelve core research areas and sixty research trends have been revealed, and the identified research trends have been semantically mapped to their associated core research areas. Conclusions: This study summarises the research trends in SEE based upon a corpus of 1178 articles. The patterns and trends identified through this research can help in finding potential research areas.

Journal ArticleDOI
TL;DR: An unsupervised-learning-based analysis of the leading telecommunication firms between 2001 and 2014, based on about 160,000 USPTO full-text patents, shows company-specific differences in their knowledge profiles, as well as the evolution of industry leaders' knowledge profiles from hardware- to software-focussed technology strategies.

Journal ArticleDOI
TL;DR: A framework for topic modeling of short text that generates larger pseudo-document representations from the original documents is proposed, and two simple, effective, and efficient methods that specialize the general framework to create such pseudo-documents are presented.

Journal ArticleDOI
TL;DR: A topic-based bibliometric study to detect and predict the topic changes of KnoSys from 1991 to 2016 is conducted; it indicates that the interest of KnoSys communities in the area of computational intelligence has risen and that the ability to construct practical systems through knowledge use and accurate prediction models is highly emphasized.
Abstract: The journal Knowledge-based Systems (KnoSys) has been published for over 25 years, during which time its main foci have been extended to a broad range of studies in computer science and artificial intelligence. Answering the questions "What is the KnoSys community interested in?" and "How does such interest change over time?" is important to both the editorial board and audience of KnoSys. This paper conducts a topic-based bibliometric study to detect and predict the topic changes of KnoSys from 1991 to 2016. A Latent Dirichlet Allocation model is used to profile the hotspots of KnoSys and predict possible future trends from a probabilistic perspective. A model of scientific evolutionary pathways applies a learning-based process to detect the topic changes of KnoSys in sequential time slices. Six main research areas of KnoSys are identified, i.e., expert systems, machine learning, data mining, decision making, optimization, and fuzzy, and the results also indicate that the interest of KnoSys communities in the area of computational intelligence has risen, and the ability to construct practical systems through knowledge use and accurate prediction models is highly emphasized. Such empirical insights can be used as a guide for KnoSys submissions.

Proceedings ArticleDOI
15 Jun 2017
TL;DR: Methods of topic modeling, including the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA), are discussed along with their features and limitations.
Abstract: Topic modeling is a powerful technique for the analysis of a huge collection of documents. Topic modeling is used for discovering hidden structure from a collection of documents. A topic is viewed as a recurring pattern of co-occurring words. A topic includes a group of words that often occur together. Topic modeling can link words with the same context and differentiate across uses of words with different meanings. In this paper, we discuss methods of topic modeling, which include the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA), with their features and limitations. After that, we discuss tools available for topic modeling such as Gensim, the Stanford topic modeling toolbox, MALLET, and BigARTM. Then some of the applications of topic modeling are covered. Topic models have a wide range of applications, such as tag recommendation, text categorization, keyword extraction, information filtering, and similarity search, in the fields of text mining and information retrieval.
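Of the methods surveyed, LSI and LDA are both one-liners in Gensim once a bag-of-words corpus exists; a toy side-by-side sketch follows, with an illustrative corpus and topic count.

```python
# Sketch: LSI vs LDA on the same toy corpus with Gensim.
from gensim.corpora import Dictionary
from gensim.models import LsiModel, LdaModel

docs = [
    "information retrieval ranks documents by relevance".split(),
    "topic models group words that co occur".split(),
    "keyword extraction supports information filtering".split(),
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

print(lsi.print_topics(num_topics=2, num_words=4))   # SVD-based topics
print(lda.print_topics(num_topics=2, num_words=4))   # probabilistic topics
```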

Proceedings ArticleDOI
07 Jul 2017
TL;DR: An approach is contributed for preprocessing requirements that standardizes and normalizes them before applying classification algorithms, enabling automated classification of NFRs into sub-categories such as usability, availability, or performance.
Abstract: Classifying requirements into functional requirements (FR) and non-functional ones (NFR) is an important task in requirements engineering. However, automated classification of requirements written in natural language is not straightforward, due to the variability of natural language and the absence of a controlled vocabulary. This paper investigates how automated classification of requirements into FR and NFR can be improved and how well several machine learning approaches work in this context. We contribute an approach for preprocessing requirements that standardizes and normalizes requirements before applying classification algorithms. Further, we report on how well several existing machine learning methods perform for automated classification of NFRs into sub-categories such as usability, availability, or performance. Our study is performed on 625 requirements provided by the OpenScience tera-PROMISE repository. We found that our preprocessing improved the performance of an existing classification method. We further found significant differences in the performance of approaches such as Latent Dirichlet Allocation, Biterm Topic Modeling, or Naive Bayes for the sub-classification of NFRs.
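A baseline version of the NFR sub-classification step can be sketched as a standard text-classification pipeline. The requirement sentences and labels below are hypothetical, and the preprocessing is reduced to lowercasing and stop-word removal rather than the authors' full normalization.

```python
# Sketch: baseline NFR sub-classification with TF-IDF features and Naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

requirements = [                               # hypothetical requirements
    "the system shall be available 99.9 percent of the time",
    "users shall complete checkout within three clicks",
    "pages shall load within two seconds under normal load",
    "the interface shall be usable by first time visitors",
]
labels = ["availability", "usability", "performance", "usability"]

clf = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                    MultinomialNB())
clf.fit(requirements, labels)
print(clf.predict(["the service shall respond in under one second"]))
```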

Proceedings ArticleDOI
01 Jun 2017
TL;DR: An augmented LDA model is proposed which leverages the high-quality word vectors obtained by Word2vec to improve the performance of Web services clustering, achieving an average improvement of 5.3% in clustering accuracy across various metrics.
Abstract: Due to the rapid growth in both the number and diversity of Web services on the web, it becomes increasingly difficult for us to find the desired and appropriate Web services nowadays. Clustering Web services according to their functionalities becomes an efficient way to facilitate the Web services discovery as well as the services management. Existing methods for Web services clustering mostly focus on utilizing directly key features from WSDL documents, e.g., input/output parameters and keywords from description text. The probabilistic topic model Latent Dirichlet Allocation (LDA) has also been adopted, extracting latent topic features of WSDL documents to represent Web services and improve the accuracy of Web services clustering. However, the power of the basic LDA model for clustering is limited to some extent, and some auxiliary features can be exploited to enhance its ability. Since the word vectors obtained by Word2vec are of higher quality than those obtained by the LDA model, we propose, in this paper, an augmented LDA model (named WE-LDA) which leverages these high-quality word vectors to improve the performance of Web services clustering. In WE-LDA, the word vectors obtained by Word2vec are clustered into word clusters by the K-means++ algorithm and these word clusters are incorporated to semi-supervise the LDA training process, which can elicit better distributed representations of Web services. A comprehensive experiment is conducted to validate the performance of the proposed method based on a ground truth dataset crawled from ProgrammableWeb. Compared with the state-of-the-art, our approach has an average improvement of 5.3% in clustering accuracy across various metrics.
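The word-clustering step of WE-LDA, clustering Word2vec vectors with k-means++, is easy to sketch; the service descriptions and settings below are hypothetical, and the semi-supervised incorporation of the clusters into LDA training is not shown.

```python
# Sketch: the WE-LDA word-clustering step (Word2vec vectors + k-means++ clustering).
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

descriptions = [                               # hypothetical service descriptions
    "weather forecast api returns temperature and humidity".split(),
    "payment api charges credit cards securely".split(),
    "map api returns routes and travel distance".split(),
]

w2v = Word2Vec(descriptions, vector_size=50, min_count=1, seed=0)
words = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in words])

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(vectors)
clusters = {c: [w for w, l in zip(words, km.labels_) if l == c] for c in range(3)}
print(clusters)                                # word clusters used to guide LDA training
```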

Journal ArticleDOI
TL;DR: Empirical evaluation confirms that UMM generated emotion language models (topics) have significantly lower perplexity compared to those from state-of-the-art generative models like supervised Latent Dirichlet Allocation (sLDA).
Abstract: General-purpose emotion lexicons (GPELs) that associate words with emotion categories remain a valuable resource for emotion detection. However, the static and formal nature of their vocabularies makes them an inadequate resource for detecting emotions in domains that are inherently dynamic in nature. This calls for lexicons that are not only adaptive to the lexical variations in a domain but which also provide finer-grained quantitative estimates to accurately capture word-emotion associations. In this article, the authors demonstrate how to harness labeled emotion text (such as blogs and news headlines) and weakly labeled emotion text (such as tweets) to learn a word-emotion association lexicon by jointly modeling the emotionality and neutrality of words using a generative unigram mixture model (UMM). Empirical evaluation confirms that UMM-generated emotion language models (topics) have significantly lower perplexity compared to those from state-of-the-art generative models like supervised Latent Dirichlet Allocation (sLDA). Further emotion detection tasks involving word-emotion classification and document-emotion ranking confirm that the UMM lexicon significantly outperforms GPELs as well as state-of-the-art domain-specific lexicons.