
Showing papers on "Latent Dirichlet allocation published in 2014"


Proceedings ArticleDOI
06 Oct 2014
TL;DR: In this paper, the authors propose a parameter server framework for distributed machine learning problems, where both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices.
Abstract: We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance. To demonstrate the scalability of the proposed framework, we show experimental results on petabytes of real data with billions of examples and parameters on problems ranging from Sparse Logistic Regression to Latent Dirichlet Allocation and Distributed Sketching.

1,034 citations
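The push/pull interaction between workers and server nodes that the paper describes can be sketched in miniature. The following is a toy, single-process Python stand-in (all names hypothetical; no networking, asynchrony, or fault tolerance) that only illustrates the idea of sparse shared parameters updated by multiple workers:

```python
from collections import defaultdict

class ToyParameterServer:
    """Single-process stand-in for a server node: it holds globally
    shared parameters as a sparse vector (key -> float)."""
    def __init__(self):
        self.params = defaultdict(float)

    def push(self, grads, lr=0.1):
        # Workers push (possibly sparse) gradient contributions.
        for key, g in grads.items():
            self.params[key] -= lr * g

    def pull(self, keys):
        # Workers pull only the keys they need (sparse access pattern).
        return {k: self.params[k] for k in keys}

# Two "workers" push sparse gradients touching overlapping keys.
server = ToyParameterServer()
server.push({"w1": 1.0, "w3": -2.0})
server.push({"w1": 0.5})
print(server.pull(["w1", "w3"]))  # w1 has accumulated both updates
```

In the real system, pushes and pulls are asynchronous network messages and the consistency model governs how stale a worker's pulled parameters may be.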


Proceedings ArticleDOI
01 Jan 2014
TL;DR: LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation, is built using a combination of R and D3; the authors also propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation.
Abstract: We present LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation that is built using a combination of R and D3. Our visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. First, we propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation, in which we define the relevance of a term to a topic. Second, we present results from a user study that suggest that ranking terms purely by their probability under a topic is suboptimal for topic interpretation. Last, we describe LDAvis, our visualization system that allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model.

836 citations
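The term-relevance idea at the heart of LDAvis can be written down directly: the relevance of a term to a topic is a λ-weighted blend of the term's topic probability and its lift (topic probability over corpus-wide probability). A minimal sketch, where λ=0.6 follows the paper's user-study recommendation and the toy probabilities are invented for illustration:

```python
import math

def relevance(phi_wt, p_w, lam=0.6):
    """Relevance of term w to topic t: lam weights the term's
    probability under the topic, (1 - lam) weights its lift.
    lam=1 ranks purely by p(w|t); lam=0 ranks purely by lift."""
    return lam * math.log(phi_wt) + (1 - lam) * math.log(phi_wt / p_w)

# A frequent, stopword-like term vs. a rarer but topic-specific term.
common = relevance(phi_wt=0.02, p_w=0.019)     # high prob, lift near 1
specific = relevance(phi_wt=0.01, p_w=0.0005)  # lower prob, lift of 20
print(specific > common)  # True: lift rewards topic-specific terms
```

This is why ranking purely by p(w|t) is suboptimal: globally frequent terms crowd the top of every topic unless lift is blended in.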


Journal ArticleDOI
TL;DR: In this article, the authors propose a unified framework for this purpose using unsupervised latent Dirichlet allocation, which enables marketers to track dimensions' importance over time and allows for dynamic mapping of competitive brand positions on those dimensions over time.
Abstract: Online chatter, or user-generated content, constitutes an excellent emerging source for marketers to mine meaning at a high temporal frequency. This article posits that this meaning consists of extracting the key latent dimensions of consumer satisfaction with quality and ascertaining the valence, labels, validity, importance, dynamics, and heterogeneity of those dimensions. The authors propose a unified framework for this purpose using unsupervised latent Dirichlet allocation. The sample of user-generated content consists of rich data on product reviews across 15 firms in five markets over four years. The results suggest that a few dimensions with good face validity and external validity are enough to capture quality. Dynamic analysis enables marketers to track dimensions' importance over time and allows for dynamic mapping of competitive brand positions on those dimensions over time. For vertically differentiated markets (e.g., mobile phones, computers), objective dimensions dominate and are similar acr...

562 citations


Proceedings ArticleDOI
01 Apr 2014
TL;DR: This work explores the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provides recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
Abstract: Topic models based on latent Dirichlet allocation and related methods are used in a range of user-focused tasks including document navigation and trend analysis, but evaluation of the intrinsic quality of the topic model and topics remains an open research area. In this work, we explore the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provide recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.

493 citations
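Automatic topic evaluation of the kind studied here typically scores a topic's top terms by how often they co-occur in the corpus. A minimal sketch of one standard measure, UMass coherence — this particular formula is one common variant, not necessarily the exact metric implemented in the paper's toolkit:

```python
import math
from itertools import combinations

def umass_coherence(topic_terms, docs):
    """UMass-style coherence for one topic's ranked top terms: for
    each ordered pair, the log of (co-document frequency + 1) over
    the earlier term's document frequency. Scores are <= 0; values
    closer to 0 mean the top terms actually co-occur in documents."""
    def doc_freq(*terms):
        return sum(1 for d in docs if all(t in d for t in terms))
    score = 0.0
    for earlier, later in combinations(topic_terms, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

docs = [{"apple", "fruit", "sweet"}, {"apple", "fruit"},
        {"car", "engine"}, {"car", "wheel", "engine"}]
print(umass_coherence(["apple", "fruit"], docs))   # terms co-occur
print(umass_coherence(["apple", "engine"], docs))  # never co-occur: lower
```

Whole-model evaluation then aggregates such per-topic scores, e.g. by averaging across all topics.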


Journal ArticleDOI
TL;DR: This paper proposes a novel way of short text topic modeling, referred to as the biterm topic model (BTM), which learns topics by directly modeling the generation of word co-occurrence patterns in the corpus, making the inference effective with the rich corpus-level information.
Abstract: Short texts are popular on today’s web, especially with the emergence of social media. Inferring topics from large-scale short texts has become a critical but challenging task for many content analysis applications. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, and their inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel way of short text topic modeling, referred to as the biterm topic model (BTM). BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, making the inference effective with the rich corpus-level information. To cope with large-scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-world short text collections show that BTM can discover more prominent and coherent topics, and significantly outperforms the state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms in both time efficiency and topic learning.

452 citations
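The core move in BTM — modeling corpus-level word co-occurrence patterns instead of per-document mixtures — starts from extracting biterms: unordered word pairs drawn from each short text and pooled over the whole corpus. A minimal sketch (whitespace tokenization and unlimited pairing within a text are simplifying assumptions; BTM implementations typically restrict biterms to a context window):

```python
from itertools import combinations
from collections import Counter

def extract_biterms(short_texts):
    """Extract the unordered word pairs ("biterms") that BTM models
    directly, pooled across the whole corpus rather than kept per
    document. Each pair is stored in sorted order so (a, b) == (b, a)."""
    biterms = Counter()
    for text in short_texts:
        words = text.lower().split()
        for w1, w2 in combinations(words, 2):
            biterms[tuple(sorted((w1, w2)))] += 1
    return biterms

tweets = ["cheap phone deal", "phone deal today"]
b = extract_biterms(tweets)
print(b[("deal", "phone")])  # 2: the pair appears in both short texts
```

Because biterms are aggregated corpus-wide, even three-word texts contribute usable co-occurrence evidence — which is exactly what per-document inference in LDA lacks on short texts.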


Posted Content
TL;DR: This paper studies Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently.
Abstract: Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.

303 citations
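The constrained setting described here is commonly handled by weighting the usual expected-improvement acquisition by the modeled probability that the constraint is satisfied, since the constraint is evaluated independently of the objective. A hedged sketch of that acquisition, assuming Gaussian posteriors for both objective and constraint and a feasibility threshold of 0 — this illustrates the general recipe, not the paper's exact formulation:

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def constrained_ei(mu, sigma, best, c_mu, c_sigma):
    """Constraint-weighted expected improvement for minimization:
    standard EI under the objective's Gaussian posterior (mu, sigma),
    multiplied by the probability that the independently modeled,
    noisy constraint value falls below its threshold of 0."""
    z = (best - mu) / sigma
    ei = sigma * (z * normal_cdf(z) + normal_pdf(z))
    prob_feasible = normal_cdf((0.0 - c_mu) / c_sigma)
    return ei * prob_feasible

# A point with a good predicted objective but a likely-violated
# constraint scores below a slightly worse, almost surely feasible one.
risky = constrained_ei(mu=0.2, sigma=0.1, best=0.5, c_mu=1.0, c_sigma=0.5)
safe = constrained_ei(mu=0.3, sigma=0.1, best=0.5, c_mu=-1.0, c_sigma=0.5)
print(safe > risky)
```

The multiplicative form is what lets objective and constraint evaluations be decoupled: each factor only needs its own model's posterior.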


Proceedings ArticleDOI
24 Aug 2014
TL;DR: This work examined the use of latent variable models to decompose free-text hospital notes into meaningful features, and found that latent topic-derived features were effective in determining patient mortality under three timelines: in-hospital, 30-day post-discharge, and 1-year post-discharge mortality.
Abstract: Accurate knowledge of a patient's disease state and trajectory is critical in a clinical setting. Modern electronic healthcare records contain an increasingly large amount of data, and the ability to automatically identify the factors that influence patient outcomes stands to greatly improve the efficiency and quality of care. We examined the use of latent variable models (viz. Latent Dirichlet Allocation) to decompose free-text hospital notes into meaningful features, and the predictive power of these features for patient mortality. We considered three prediction regimes: (1) baseline prediction, (2) dynamic (time-varying) outcome prediction, and (3) retrospective outcome prediction. In each, our prediction task differs from the familiar time-varying situation whereby data accumulates; since fewer patients have long ICU stays, as we move forward in time fewer patients are available and the prediction task becomes increasingly difficult. We found that latent topic-derived features were effective in determining patient mortality under three timelines: in-hospital, 30-day post-discharge, and 1-year post-discharge mortality. Our results demonstrated that the latent topic features important in predicting hospital mortality are very different from those that are important in post-discharge mortality. In general, latent topic features were more predictive than structured features, and a combination of the two performed best. The time-varying models that combined latent topic features and baseline features had AUCs that reached 0.85, 0.80, and 0.77 for in-hospital, 30-day post-discharge, and 1-year post-discharge mortality, respectively. Our results agreed with other work suggesting that the first 24 hours of patient information are often the most predictive of hospital mortality.
Retrospective models that used a combination of latent topic features and structured features achieved AUCs of 0.96, 0.82, and 0.81 for in-hospital, 30-day, and 1-year mortality prediction. Our work focuses on the dynamic (time-varying) setting because models from this regime could facilitate an ongoing severity stratification system that helps direct care-staff resources and inform treatment strategies.

206 citations
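The AUCs reported above are areas under the ROC curve: for a scored binary outcome, the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch of that rank-based computation, with invented toy labels and scores:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    formulation: the fraction of positive/negative pairs where the
    positive is scored higher, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))  # 0.75
```

An AUC of 0.5 corresponds to random scoring; the 0.96 reported for the retrospective in-hospital model means almost every deceased/surviving patient pair is ranked correctly.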


Journal ArticleDOI
TL;DR: This paper investigates methods, including LDA and its extensions, for separating a set of scientific publications into several clusters and explores potential scientometric applications of such text analysis capabilities.
Abstract: Topic modeling is a type of statistical model for discovering the latent "topics" that occur in a collection of documents through machine learning. Currently, latent Dirichlet allocation (LDA) is a popular and common modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents that contain academic papers from several different fields and see whether papers in the same field will be clustered together. We explore potential scientometric applications of such text analysis capabilities.

184 citations
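Clustering publications with LDA, as studied here, usually means fitting topics and then assigning each document to its dominant topic. A compact, illustrative collapsed Gibbs sampler in pure Python — the toy corpus and hyperparameters are invented for demonstration, and a real scientometric study would use an optimized library:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. docs: list of token
    lists. Returns per-document topic counts, usable for clustering
    documents by their dominant topic."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dt = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_tw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_t = [0] * n_topics                                # topic totals
    z = []                                              # assignment per token
    for di, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            n_dt[di][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
        z.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]  # remove token, then resample its topic
                n_dt[di][t] -= 1; n_tw[t][w] -= 1; n_t[t] -= 1
                weights = [(n_dt[di][k] + alpha) *
                           (n_tw[k][w] + beta) / (n_t[k] + V * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights)[0]
                z[di][wi] = t
                n_dt[di][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    return n_dt

docs = [["gene", "dna", "cell"], ["dna", "gene", "protein"],
        ["stock", "market", "trade"], ["market", "stock", "price"]]
counts = lda_gibbs(docs, n_topics=2)
clusters = [max(range(2), key=lambda k: c[k]) for c in counts]
print(clusters)  # documents sharing vocabulary tend to share a topic
```

Evaluating whether papers from the same field land in the same cluster, as the paper does, then reduces to comparing these dominant-topic labels against the known field labels.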


Journal ArticleDOI
TL;DR: In this paper, the authors applied active learning with two criteria (certainty and uncertainty) and several enhancements in both clinical medicine and social science (specifically, public health) areas, and compared the results in both.

141 citations


Journal ArticleDOI
TL;DR: A Latent Dirichlet Allocation (LDA) based model is proposed, Foreground and Background LDA (FB-LDA), to distill foreground topics and filter out longstanding background topics to interpret sentiment variations.
Abstract: Millions of users share their opinions on Twitter, making it a valuable platform for tracking and analyzing public sentiment. Such tracking and analysis can provide critical information for decision making in various domains. Therefore, it has attracted attention in both academia and industry. Previous research mainly focused on modeling and tracking public sentiment. In this work, we move one step further to interpret sentiment variations. We observed that emerging topics (named foreground topics) within the sentiment variation periods are highly related to the genuine reasons behind the variations. Based on this observation, we propose a Latent Dirichlet Allocation (LDA) based model, Foreground and Background LDA (FB-LDA), to distill foreground topics and filter out longstanding background topics. These foreground topics can give potential interpretations of the sentiment variations. To further enhance the readability of the mined reasons, we select the most representative tweets for foreground topics and develop another generative model called Reason Candidate and Background LDA (RCB-LDA) to rank them with respect to their “popularity” within the variation period. Experimental results show that our methods can effectively find foreground topics and rank reason candidates. The proposed models can also be applied to other tasks such as finding topic differences between two sets of documents.

136 citations


Journal ArticleDOI
TL;DR: A new approach is proposed for using sources of metadata about votes to estimate the degree to which those votes are about common issues, applying latent Dirichlet allocation to discover the extent to which different issues were at stake in different cases and estimating justice preferences within each of those issues.
Abstract: Item response theory models for roll-call voting data provide political scientists with parsimonious descriptions of political actors' relative preferences. However, models using only voting data tend to obscure variation in preferences across different issues due to identification and labeling problems that arise in multidimensional scaling models. We propose a new approach to using sources of metadata about votes to estimate the degree to which those votes are about common issues. We demonstrate our approach with votes and opinion texts from the U.S. Supreme Court, using latent Dirichlet allocation to discover the extent to which different issues were at stake in different cases and estimating justice preferences within each of those issues. This approach can be applied using a variety of unsupervised and supervised topic models for text, community detection models for networks, or any other tool capable of generating discrete or mixture categorization of subject matter from relevant vote-specific metadata.

Proceedings Article
21 Jun 2014
TL;DR: This paper proposes a suite of models for clustering high-dimensional data on a unit sphere based on von Mises-Fisher distribution and for discovering more intuitive clusters than existing approaches and develops fast variational methods as well as collapsed Gibbs sampling techniques for posterior inference.
Abstract: This paper proposes a suite of models for clustering high-dimensional data on a unit sphere based on the von Mises-Fisher (vMF) distribution and for discovering more intuitive clusters than existing approaches. The proposed models include a) a Bayesian formulation of the vMF mixture that enables information sharing among clusters, b) a Hierarchical vMF mixture that provides multiscale shrinkage and a tree-structured view of the data, and c) a Temporal vMF mixture that captures the evolution of clusters in temporal data. For posterior inference, we develop fast variational methods as well as collapsed Gibbs sampling techniques for all three models. Our experiments on six datasets provide strong empirical support in favour of vMF-based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation.
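Clustering on the unit sphere, as vMF mixtures do, can be previewed with its simplest relative, spherical k-means: assign points by cosine similarity and re-normalize mean directions. This sketch omits what the vMF models add (per-cluster concentration parameters and full Bayesian inference), and the naive first-k initialization is only for determinism:

```python
import math

def spherical_kmeans(points, k, iters=10):
    """Cosine-similarity clustering on the unit sphere. Naive
    deterministic init: the first k points become the centers."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    points = [normalize(p) for p in points]
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the most cosine-similar center.
        labels = [max(range(k),
                      key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
                  for p in points]
        # Re-estimate each center as the normalized mean direction.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = normalize([sum(dim) for dim in zip(*members)])
    return labels

# Two directions near the x-axis, two near the y-axis.
pts = [(1, 0.1), (1, 0.2), (0.1, 1), (0.2, 1)]
print(spherical_kmeans(pts, k=2))  # [0, 0, 1, 1]
```

Directional magnitude is discarded by the normalization, which is exactly the regime where vMF-based models are more natural than Euclidean K-means.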

Journal ArticleDOI
TL;DR: This study proposes a novel approach to CP pattern discovery by modeling CPs using mixtures of an extension to the Latent Dirichlet Allocation family that jointly models various treatment activities and their occurring time stamps in CPs.

Proceedings Article
21 Jun 2014
TL;DR: This paper presents theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages.
Abstract: Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA's behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This paper seeks to address this gap, by providing a systematic analysis of factors which characterize the LDA's performance. We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results we provide practical guidance on how to identify suitable data sets for topic models, and how to specify particular model parameters.

Journal ArticleDOI
TL;DR: A first step towards evaluating topic models in the analysis of software evolution is taken by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit.

Journal ArticleDOI
TL;DR: A novel weakly supervised cybercriminal network mining method is developed to facilitate cybercrime forensics; this is the first successful application of a probabilistic generative model to mining cybercriminal networks from online social media.
Abstract: There has been a rapid growth in the number of cybercrimes that cause tremendous financial loss to organizations. Recent studies reveal that cybercriminals tend to collaborate or even transact cyber-attack tools via the "dark markets" established in online social media. Accordingly, it presents unprecedented opportunities for researchers to tap into these underground cybercriminal communities to develop better insights about collaborative cybercrime activities so as to combat the ever-increasing number of cybercrimes. The main contribution of this paper is the development of a novel weakly supervised cybercriminal network mining method to facilitate cybercrime forensics. In particular, the proposed method is underpinned by a probabilistic generative model enhanced by a novel context-sensitive Gibbs sampling algorithm. Evaluated based on two social media corpora, our experimental results reveal that the proposed method significantly outperforms the Latent Dirichlet Allocation (LDA) based method and the Support Vector Machine (SVM) based method by 5.23% and 16.62% in terms of Area Under the ROC Curve (AUC), respectively. It also achieves comparable performance as the state-of-the-art Partially Labeled Dirichlet Allocation (PLDA) method. To the best of our knowledge, this is the first successful application of a probabilistic generative model to mining cybercriminal networks from online social media.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A semi-supervised model, called mixing population and individual property PU learning (MPIPUL), is proposed, in which spy examples and their similarity weights are incorporated into an SVM (Support Vector Machine) to build an accurate classifier that outperforms the state-of-the-art baselines.
Abstract: Deceptive review detection has attracted significant attention from both business and research communities. However, due to the difficulty of the human labeling needed for supervised learning, the problem remains highly challenging. This paper proposes a novel angle on the problem by modeling it as PU (positive-unlabeled) learning. A semi-supervised model, called mixing population and individual property PU learning (MPIPUL), is proposed. Firstly, some reliable negative examples are identified from the unlabeled dataset. Secondly, some representative positive examples and negative examples are generated based on LDA (Latent Dirichlet Allocation). Thirdly, for the remaining unlabeled examples (we call them spy examples), which cannot be explicitly identified as positive or negative, two similarity weights are assigned, which express the probability of a spy example belonging to the positive class and the negative class. Finally, spy examples and their similarity weights are incorporated into an SVM (Support Vector Machine) to build an accurate classifier. Experiments on a gold-standard dataset demonstrate the effectiveness of MPIPUL, which outperforms the state-of-the-art baselines.

Journal ArticleDOI
TL;DR: This paper proposes TWILITE, a recommendation system for Twitter using probabilistic modeling based on latent Dirichlet allocation, which recommends top-K users to follow and top-K tweets to read for a user, and develops an inference algorithm based on the variational EM algorithm for learning model parameters.

Proceedings ArticleDOI
Bing Xiang1, Liang Zhou
01 Jun 2014
TL;DR: The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.
Abstract: In this paper, we present multiple approaches to improve sentiment analysis on Twitter data. We first establish a state-of-the-art baseline with a rich feature set. Then we build a topic-based sentiment mixture model with topic-specific data in a semi-supervised training framework. The topic information is generated through topic modeling based on an efficient implementation of Latent Dirichlet Allocation (LDA). The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.

Journal ArticleDOI
TL;DR: The key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context.
Abstract: Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.
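A recurring detail in the text-extractor configurations such studies vary is how source-code identifiers are tokenized into natural-language-like terms before indexing. A sketch of the usual camelCase/PascalCase/snake_case splitting — the regex below is one common recipe, not this paper's exact extractor:

```python
import re

def split_identifier(name):
    """Split a source-code identifier into lowercase terms for an
    LDA corpus: break on underscores and non-word characters, then
    on case humps, keeping runs of capitals (acronyms) together."""
    parts = re.split(r"[_\W]+", name)
    words = []
    for p in parts:
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", p)
    return [w.lower() for w in words if w]

print(split_identifier("parseHTTPResponse_v2"))
# ['parse', 'http', 'response', 'v', '2']
```

Choices like whether to also keep the original unsplit identifier, and whether comments and literals enter the corpus at all, are exactly the configuration dimensions whose accuracy effects the study measures.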

Book ChapterDOI
01 Jan 2014
TL;DR: This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain.
Abstract: Text is a very important type of data within the biomedical domain. For example, patient records contain large amounts of text which has been entered in a non-standardized format, consequently posing a lot of challenges to processing of such data. For the clinical doctor the written text in the medical findings is still the basis for decision making – neither images nor multimedia data. However, the steadily increasing volumes of unstructured information need machine learning approaches for data mining, i.e. text mining. This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain. Finally, we provide some open problems and future challenges, particularly from the clinical domain, that we expect to stimulate future research.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes SupDocNADE, a supervised extension of DocNADE that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model, and shows how to employ SupDocNADE to learn a joint representation from image visual words, annotation words and class label information.
Abstract: Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to deal with multimodal data, such as in image annotation tasks. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for text document modeling. In this work, we show how to successfully apply and extend this model to multimodal data, such as simultaneous image classification and annotation. Specifically, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model and show how to employ SupDocNADE to learn a joint representation from image visual words, annotation words and class label information. We also describe how to leverage information about the spatial position of the visual words for SupDocNADE to achieve better performance in a simple, yet effective manner. We test our model on the LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA and a Spatial Matching Pyramid (SPM) approach.

Journal ArticleDOI
13 Feb 2014-PLOS ONE
TL;DR: A novel variant of Latent Dirichlet Allocation topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes and produces superior models to all three baseline strategies.
Abstract: The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation carried out through log-likelihood on held-out data and topic coherence of produced topics and qualitative assessment of topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
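Baseline (iii) above — removing redundant paragraphs before running LDA — can be approximated with a simple near-duplicate filter. This sketch uses word-set Jaccard similarity with an invented threshold, which is only one plausible way to detect copy-pasted text, not necessarily the paper's released implementation:

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two paragraphs."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / (len(a | b) or 1)

def drop_redundant_paragraphs(record, threshold=0.8):
    """Walk a record's paragraphs in order and keep only those whose
    similarity to every already-kept paragraph is below the threshold;
    copy-pasted text is a near-duplicate of an earlier note, so it is
    dropped before the record is handed to LDA."""
    kept = []
    for para in record:
        if all(jaccard(para, k) < threshold for k in kept):
            kept.append(para)
    return kept

record = ["patient stable on meds",
          "patient stable on meds",          # copy-pasted earlier note
          "new fever reported overnight"]
print(drop_redundant_paragraphs(record))     # duplicate paragraph dropped
```

Red-LDA's contribution is that it models the redundancy inside the generative process instead of relying on such a hard, threshold-based pre-filter.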

Journal ArticleDOI
TL;DR: An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive intelligence analysis and reveals emerging hot spots of LTE technology, and finds that major companies in this field have been focused on different technological fields with different competitive positions.
Abstract: An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive intelligence analysis. After part-of-speech tagging and defining the noun phrase extraction rules, technological words have been extracted from patent titles and abstracts. This allows us to go one step further and perform patent analysis at the content level. The LDA model is then used for identifying underlying topic structures based on latent relationships among the extracted technological words. This helps us to review research hot spots and directions in subclasses of patented technology in a certain field. To extend the traditional LDA model, another institution-topic probability level is added to the original LDA model. Direct competing enterprises' distribution probability and their technological positions are identified in each topic. A case study is then carried out on one of the core patented technologies in next-generation telecommunications: LTE. This empirical study reveals emerging hot spots of LTE technology, and finds that major companies in this field have focused on different technological fields with different competitive positions.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: This paper proposes an LDA-based opinion model named Twitter Opinion Topic Model (TOTM) for opinion mining and sentiment analysis, which leverages hashtags, mentions, emoticons and strong sentiment words that are present in tweets in its discovery process.
Abstract: Aspect-based opinion mining is widely applied to review data to aggregate or summarize opinions of a product, and the current state-of-the-art is achieved with Latent Dirichlet Allocation (LDA)-based model. Although social media data like tweets are laden with opinions, their "dirty" nature (as natural language) has discouraged researchers from applying LDA-based opinion model for product review mining. Tweets are often informal, unstructured and lacking labeled data such as categories and ratings, making it challenging for product opinion mining. In this paper, we propose an LDA-based opinion model named Twitter Opinion Topic Model (TOTM) for opinion mining and sentiment analysis. TOTM leverages hashtags, mentions, emoticons and strong sentiment words that are present in tweets in its discovery process. It improves opinion prediction by modeling the target-opinion interaction directly, thus discovering target specific opinion words, neglected in existing approaches. Moreover, we propose a new formulation of incorporating sentiment prior information into a topic model, by utilizing an existing public sentiment lexicon. This is novel in that it learns and updates with the data. We conduct experiments on 9 million tweets on electronic products, and demonstrate the improved performance of TOTM in both quantitative evaluations and qualitative analysis. We show that aspect-based opinion analysis on massive volume of tweets provides useful opinions on products.

Journal ArticleDOI
TL;DR: An unsupervised dependency analysis-based approach is presented to extract Appraisal Expression Patterns (AEPs) from reviews, which represent the manner in which people express opinions regarding products or services and can be regarded as a condensed representation of the syntactic relationship between aspect and sentiment words.
Abstract: With the considerable growth of user-generated content, online reviews are becoming extremely valuable sources for mining customers' opinions on products and services. However, most of the traditional opinion mining methods are coarse-grained and cannot understand natural languages. Thus, aspect-based opinion mining and summarization are of great interest in academic and industrial research. In this paper, we study an approach to extract product and service aspect words, as well as sentiment words, automatically from reviews. An unsupervised dependency analysis-based approach is presented to extract Appraisal Expression Patterns (AEPs) from reviews, which represent the manner in which people express opinions regarding products or services and can be regarded as a condensed representation of the syntactic relationship between aspect and sentiment words. AEPs are high-level, domain-independent types of information, and have excellent domain adaptability. An AEP-based Latent Dirichlet Allocation (AEP-LDA) model is also proposed. This is a sentence-level, probabilistic generative model which assumes that all words in a sentence are drawn from one topic – a generally true assumption, based on our observation. The model also assumes that every review corpus is composed of several mutually corresponding aspect and sentiment topics, as well as a background word topic. The AEP information is incorporated into the AEP-LDA model for mining aspect and sentiment words simultaneously. The experimental results on reviews of restaurants, hotels, MP3 players, and cameras show that the AEP-LDA model outperforms other approaches in identifying aspect and sentiment words.

Journal ArticleDOI
Shulong Tan, Jiajun Bu, Xuzhen Qin, Chun Chen, Deng Cai 
TL;DR: This paper proposes a Bayesian hierarchical approach based on Latent Dirichlet Allocation (LDA) to transfer user interests cross domains or media, and combines multi-type media information: media descriptions, user-generated text data and ratings.

Proceedings Article
23 Jul 2014
TL;DR: In this article, Bayesian optimization for constrained problems is studied in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently.
Abstract: Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.
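The constrained setting described in this abstract is commonly handled by weighting an acquisition function such as expected improvement by the posterior probability that the constraint holds. A minimal sketch under the assumption of independent Gaussian posteriors for the objective and the constraint (all function names are ours, not the paper's):

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expected_improvement(mu, sigma, best):
    """EI for minimization, given a Gaussian posterior N(mu, sigma^2) on f(x)
    and the best feasible objective value observed so far."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)

def constrained_ei(mu_f, sigma_f, best, mu_c, sigma_c):
    """Weight EI by the posterior probability that the constraint c(x) <= 0 holds,
    using an (assumed) independent Gaussian posterior N(mu_c, sigma_c^2) on c(x)."""
    p_feasible = normal_cdf(-mu_c / sigma_c)
    return expected_improvement(mu_f, sigma_f, best) * p_feasible
```

In use, the next evaluation point is the candidate maximizing `constrained_ei`; a point that looks promising on the objective but is probably infeasible is down-weighted toward zero.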

Posted Content
TL;DR: In this article, the authors perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results which are not accurate in inferring the most suitable model parameters.
Abstract: Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state-of-the-art in topic classification. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results which are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm which displays high reproducibility, high accuracy, and high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.
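For background, the standard LDA inference this paper critiques is often done with collapsed Gibbs sampling. A minimal, unoptimized sampler (not the paper's network-based algorithm; hyperparameters are chosen for illustration):

```python
import random
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of token-id lists.
    Returns doc-topic and topic-word count matrices."""
    rng = random.Random(seed)
    n_dk = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(n_topics)                 # topic totals
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k); n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional over topics for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                p = p / p.sum()
                k = min(int(np.searchsorted(np.cumsum(p), rng.random())), n_topics - 1)
                z[d][i] = k                  # resample and restore counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```

The paper's point is that samplers and variational optimizers like this can land far from the best parameters; the counts above only characterize one posterior sample.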

Journal ArticleDOI
TL;DR: A variational approach for fitting the mixture of latent trait models is developed; it is shown to yield intuitive clustering results and a much better fit than either latent class analysis or latent trait analysis alone.
Abstract: Model-based clustering methods for continuous data are well established and commonly used in a wide range of applications. However, model-based clustering methods for categorical data are less standard. Latent class analysis is a commonly used method for model-based clustering of binary data and/or categorical data, but due to an assumed local independence structure there may not be a correspondence between the estimated latent classes and groups in the population of interest. The mixture of latent trait analyzers model extends latent class analysis by assuming a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates group structure and the continuous latent trait accommodates dependence within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. We develop a variational approach for fitting the mixture of latent trait models and this provides an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much better fit than either latent class analysis or latent trait analysis alone.
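The latent class baseline mentioned in the abstract (independent Bernoulli items within each class, i.e. the local-independence assumption) can be fit with a short EM loop. A minimal sketch, not the paper's variational algorithm for the full mixture of latent trait analyzers:

```python
import numpy as np

def latent_class_em(X, n_classes, iters=100, seed=0):
    """EM for a latent class model on an (n x m) binary data matrix X:
    within each class, the m items are independent Bernoullis."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)              # class weights
    theta = rng.uniform(0.25, 0.75, size=(n_classes, m))  # item probabilities
    for _ in range(iters):
        # E-step: responsibilities r[i, g] from log p(x_i | class g) + log pi[g]
        log_r = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_r -= log_r.max(axis=1, keepdims=True)         # stabilize before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate weights and item probabilities
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, r
```

When groups in the data violate local independence, this model needs extra spurious classes to fit them, which is the motivation for adding the continuous latent trait.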