
Showing papers on "Latent Dirichlet allocation published in 2013"


Journal ArticleDOI
TL;DR: Stochastic variational inference lets us apply complex Bayesian models to massive data sets, and it is shown that the Bayesian nonparametric topic model outperforms its parametric counterpart.
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.

2,291 citations
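The update at the heart of stochastic variational inference is cheap: sample a document, compute its expected topic-word statistics, scale them up as if the whole corpus looked like that document, and blend the result into the current variational parameters with a decreasing step size. A minimal sketch of that update for LDA's topic parameters (the function, variable names, and schedule constants are illustrative, not the paper's exact code):

```python
import numpy as np

def svi_update(lam, doc_stats, D, t, eta=0.01, tau0=1.0, kappa=0.7):
    """One SVI step for LDA's K x V topic parameters `lam`.
    `doc_stats` holds expected word-topic counts from one sampled
    document; scaling by the corpus size D gives the noisy
    natural-gradient target, and rho is the Robbins-Monro rate."""
    rho = (tau0 + t) ** (-kappa)       # decreasing learning rate
    lam_hat = eta + D * doc_stats      # estimate from this one document
    return (1.0 - rho) * lam + rho * lam_hat

K, V, D = 2, 5, 1000
lam = np.ones((K, V))
stats = np.zeros((K, V))
stats[0, 0] = 0.003                     # expected count for word 0 in topic 0
lam = svi_update(lam, stats, D, t=0)    # at t=0 the step fully replaces lam
```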


Journal ArticleDOI
01 Dec 2013-Poetics
TL;DR: This paper used Latent Dirichlet Allocation (LDA) to analyze how one policy domain, government assistance to artists and arts organizations, was framed in almost 8,000 articles published in five U.S. newspapers between 1986 and 1997.

653 citations


Proceedings ArticleDOI
28 Jul 2013
TL;DR: This paper empirically establishes that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a range of pooling schemes.
Abstract: Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a range of other pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.

475 citations
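The pooling step itself is plain preprocessing: before running LDA, tweets sharing a hashtag are concatenated into one pseudo-document, so the model sees longer, more coherent texts. A rough sketch of the idea (the regex and the fallback for untagged tweets are illustrative choices, not the paper's exact scheme):

```python
import re
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Aggregate tweets into hashtag pseudo-documents: every hashtag
    collects the full text of each tweet mentioning it; tweets with
    no hashtag remain singleton documents."""
    pools = defaultdict(list)
    for i, tweet in enumerate(tweets):
        tags = re.findall(r"#(\w+)", tweet.lower())
        for tag in tags:
            pools[tag].append(tweet)
        if not tags:
            pools[f"_tweet_{i}"].append(tweet)
    return {key: " ".join(texts) for key, texts in pools.items()}

docs = pool_by_hashtag([
    "rain again #weather",
    "sunny day #weather #mood",
    "no tags here",
])
```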


Proceedings ArticleDOI
18 May 2013
TL;DR: This work designs MARA (Mobile App Review Analyzer), a prototype for automatic retrieval of mobile app feature requests from online reviews, which is informed by an investigation of the ways users express feature requests through reviews.
Abstract: Mobile app reviews are valuable repositories of ideas coming directly from app users. Such ideas span various topics, and in this paper we show that 23.3% of them represent feature requests, i.e. comments through which users either suggest new features for an app or express preferences for the re-design of already existing features of an app. One of the challenges app developers face when trying to make use of such feedback is the massive amount of available reviews. This makes it difficult to identify specific topics and recurring trends across reviews. Through this work, we aim to support such processes by designing MARA (Mobile App Review Analyzer), a prototype for automatic retrieval of mobile app feature requests from online reviews. The design of the prototype is a) informed by an investigation of the ways users express feature requests through reviews, b) developed around a set of pre-defined linguistic rules, and c) evaluated on a large sample of online reviews. The results of the evaluation were further analyzed using Latent Dirichlet Allocation for identifying common topics across feature requests, and the results of this analysis are reported in this paper.

322 citations


Posted Content
TL;DR: SDA-Bayes as mentioned in this paper is a framework for streaming and distributed computation of a Bayesian posterior, which makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive.
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data---a case where SVI may be applied---and in the streaming setting, where SVI does not apply.

291 citations
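The streaming recipe underneath SDA-Bayes is that the posterior after one batch becomes the prior for the next, with a user-supplied primitive approximating each batch posterior. With a conjugate model the update is exact, which makes the idea easy to see in a few lines (a toy Beta-Bernoulli stand-in, not the paper's variational-Bayes primitive for LDA):

```python
def streaming_update(prior, batch):
    """One streaming update: the running posterior is the prior for
    the next batch. Beta-Bernoulli keeps it exact; SDA-Bayes swaps in
    variational Bayes as the per-batch primitive when no closed form
    exists (e.g., for LDA)."""
    alpha, beta = prior
    heads = sum(batch)
    return (alpha + heads, beta + len(batch) - heads)

posterior = (1.0, 1.0)                            # Beta(1, 1) prior
for batch in ([1, 1, 0], [1, 0], [1, 1, 1, 0]):   # batches arrive over time
    posterior = streaming_update(posterior, batch)
```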


Proceedings ArticleDOI
18 May 2013
TL;DR: A novel solution to adapt, configure and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks is proposed.
Abstract: Information Retrieval (IR) methods, and in particular topic models, have recently been used to support essential software engineering (SE) tasks, by enabling software textual retrieval and analysis. In all these approaches, topic models have been used on software artifacts in a similar manner as they were used on natural language documents (e.g., using the same settings and parameters) because the underlying assumption was that source code and natural language documents are similar. However, applying topic models on software data using the same settings as for natural language text did not always produce the expected results. Recent research investigated this assumption and showed that source code is much more repetitive and predictable than natural language text. Our paper builds on this new fundamental finding and proposes a novel solution to adapt, configure and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks. Our paper introduces a novel solution called LDA-GA, which uses Genetic Algorithms (GA) to determine a near-optimal configuration for LDA in the context of three different SE tasks: (1) traceability link recovery, (2) feature location, and (3) software artifact labeling. The results of our empirical studies demonstrate that LDA-GA is able to identify robust LDA configurations, which lead to a higher accuracy on all the datasets for these SE tasks as compared to previously published results, heuristics, and the results of a combinatorial search.

272 citations
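The LDA-GA idea reduces to a standard genetic-algorithm loop over LDA configurations (number of topics and the two Dirichlet priors), scored by a clustering-quality fitness. A toy sketch with a made-up fitness function, only to show the search structure (the paper scores real LDA runs with the silhouette coefficient):

```python
import random

def lda_ga(fitness, pop_size=8, generations=20, seed=0):
    """Genetic search over LDA configurations (k topics, alpha, beta):
    keep the fitter half of the population each generation and refill
    with mutated copies. `fitness` scores one configuration."""
    rng = random.Random(seed)

    def random_cfg():
        return (rng.randint(2, 50), rng.uniform(0.01, 1.0), rng.uniform(0.01, 1.0))

    def mutate(cfg):
        k, a, b = cfg
        return (max(2, k + rng.randint(-2, 2)),
                min(1.0, max(0.01, a * rng.uniform(0.8, 1.2))),
                min(1.0, max(0.01, b * rng.uniform(0.8, 1.2))))

    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(rng.choice(survivors)) for _ in survivors]
    return max(pop, key=fitness)

# Hypothetical fitness: pretend quality peaks at k = 10 topics.
best = lda_ga(lambda cfg: -abs(cfg[0] - 10))
```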


Journal ArticleDOI
TL;DR: This work proposes a reliable and flexible visual analytics system for topic modeling called UTOPIAN (User-driven Topic modeling based on Interactive Nonnegative Matrix Factorization), which enables users to interact with the topic modeling method and steer the result in a user-driven manner.
Abstract: Topic modeling has been widely used for analyzing text document collections. Recently, there have been significant advancements in various topic modeling techniques, particularly in the form of probabilistic graphical modeling. State-of-the-art techniques such as Latent Dirichlet Allocation (LDA) have been successfully applied in visual text analytics. However, most of the widely-used methods based on probabilistic modeling have drawbacks in terms of consistency from multiple runs and empirical convergence. Furthermore, due to the complexity of its formulation and algorithm, LDA cannot easily incorporate various types of user feedback. To tackle this problem, we propose a reliable and flexible visual analytics system for topic modeling called UTOPIAN (User-driven Topic modeling based on Interactive Nonnegative Matrix Factorization). Centered around its semi-supervised formulation, UTOPIAN enables users to interact with the topic modeling method and steer the result in a user-driven manner. We demonstrate the capability of UTOPIAN via several usage scenarios with real-world document corpora such as the InfoVis/VAST paper data set and product review data sets.

252 citations
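UTOPIAN's choice of nonnegative matrix factorization over probabilistic inference is what buys run-to-run consistency and easy user steering. The unsupervised core can be sketched with classic Lee-Seung multiplicative updates on a term-document matrix (UTOPIAN itself uses a semi-supervised variant that lets users pin reference values in the factors; the toy matrix below is invented):

```python
import numpy as np

def nmf_topics(X, k, iters=300, seed=0, eps=1e-9):
    """Factor a nonnegative term-document matrix X (m x n) as W @ H
    with Lee-Seung multiplicative updates; columns of W play the
    role of topics."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy 4-term x 3-document matrix with two obvious "topics".
X = np.array([[3., 0., 3.],
              [2., 0., 2.],
              [0., 4., 0.],
              [0., 1., 0.]])
W, H = nmf_topics(X, k=2)
```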


Proceedings Article
05 Dec 2013
TL;DR: A new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large-scale data, is proposed and achieves substantial performance improvements over state-of-the-art online variational Bayesian methods.
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large-scale data. We apply this method to latent Dirichlet allocation in an online mini-batch setting, and demonstrate that it achieves substantial performance improvements over state-of-the-art online variational Bayesian methods.

245 citations
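For orientation, a plain stochastic gradient Langevin step looks as follows; SGRLD's contribution is to precondition this update with a Riemannian metric so it behaves well on the probability simplex, which this sketch does not attempt (names and arguments are illustrative):

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, N, n, eps, rng):
    """One stochastic gradient Langevin dynamics step. The minibatch
    log-likelihood gradient is rescaled by N/n, and injected Gaussian
    noise with variance eps makes the iterates sample from (an
    approximation to) the posterior rather than optimize it."""
    drift = 0.5 * eps * (grad_log_prior(theta) + (N / n) * grad_log_lik(theta))
    noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
    return theta + drift + noise

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])
same = sgld_step(theta, lambda t: -t, lambda t: np.zeros_like(t),
                 N=1000, n=10, eps=0.0, rng=rng)    # eps = 0: a no-op
moved = sgld_step(theta, lambda t: -t, lambda t: np.zeros_like(t),
                  N=1000, n=10, eps=0.01, rng=rng)
```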


Journal ArticleDOI
TL;DR: This paper proposes an unsupervised approach to automatically discover the aspects discussed in Chinese social reviews and also the sentiments expressed in different aspects, and applies the Latent Dirichlet Allocation model to discover multi-aspect global topics of social reviews.
Abstract: User-generated reviews on the Web reflect users' sentiment about products, services and social events. Existing research mostly focuses on document-level sentiment classification of product and service reviews. Reviews of social events such as economic and political activities, which are called social reviews, have specific characteristics different from the reviews of products and services. In this paper, we propose an unsupervised approach to automatically discover the aspects discussed in Chinese social reviews and also the sentiments expressed in different aspects. The approach is called Multi-aspect Sentiment Analysis for Chinese Online Social Reviews (MSA-COSR). We first apply the Latent Dirichlet Allocation (LDA) model to discover multi-aspect global topics of social reviews, and then extract the local topic and associated sentiment based on a sliding window context over the review text. The aspect of the local topic is identified by a trained LDA model, and the polarity of the associated sentiment is classified by the HowNet lexicon. The experimental results show that MSA-COSR not only obtains good topic partitioning results but also helps to improve sentiment analysis accuracy. It helps to simultaneously discover multi-aspect fine-grained topics and associated sentiment.

236 citations
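The sliding-window step the approach relies on is simple to picture: local topic and sentiment are read off from short overlapping spans of the review rather than from the whole document. A minimal sketch of such windows (size and step are illustrative):

```python
def sliding_windows(tokens, size=4, step=2):
    """Overlapping context windows over a review's tokens; the method
    pairs a local topic with sentiment words found in the same
    window."""
    stop = max(1, len(tokens) - size + 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]

wins = sliding_windows(list("abcdefgh"), size=4, step=2)
```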


Proceedings ArticleDOI
13 May 2013
TL;DR: This paper proposes a novel method for unsupervised and content-based hashtag recommendation for tweets that relies on Latent Dirichlet Allocation (LDA) to model the underlying topic assignment of language classified tweets.
Abstract: Since the introduction of microblogging services, there has been a continuous growth of short-text social networking on the Internet. With the generation of large amounts of microposts, there is a need for effective categorization and search of the data. Twitter, one of the largest microblogging sites, allows users to make use of hashtags to categorize their posts. However, the majority of tweets do not contain tags, which hinders the quality of the search results. In this paper, we propose a novel method for unsupervised and content-based hashtag recommendation for tweets. Our approach relies on Latent Dirichlet Allocation (LDA) to model the underlying topic assignment of language classified tweets. The advantage of our approach is the use of a topic distribution to recommend general hashtags.

211 citations
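Once LDA has produced topic distributions, the recommendation step can be as simple as ranking candidate hashtags by how close their aggregate topic vectors are to the tweet's. A sketch using cosine similarity as the closeness measure (an assumption standing in for the paper's exact ranking rule; the vectors are invented):

```python
import numpy as np

def recommend_hashtags(tweet_vec, hashtag_vecs, top_n=2):
    """Rank candidate hashtags by cosine similarity between the
    tweet's inferred topic distribution and each hashtag's aggregate
    topic distribution."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(hashtag_vecs,
                    key=lambda tag: cos(tweet_vec, hashtag_vecs[tag]),
                    reverse=True)
    return ranked[:top_n]

tweet = np.array([0.8, 0.1, 0.1])            # mostly topic 0
tags = {"#politics": np.array([0.7, 0.2, 0.1]),
        "#food":     np.array([0.1, 0.8, 0.1]),
        "#sports":   np.array([0.1, 0.1, 0.8])}
recs = recommend_hashtags(tweet, tags)
```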


11 Sep 2013
TL;DR: This work presents a new way of picking words to represent a topic, and presents a novel method for interactive topic modeling that allows the user to give live feedback on the topics, and allows the inference algorithm to use that feedback to guide the LDA parameter search.
Abstract: Topics discovered by the latent Dirichlet allocation (LDA) method are sometimes not meaningful for humans. The goal of our work is to improve the quality of topics presented to end-users. Our contributions are two-fold. First, we present a new way of picking words to represent a topic. Instead of simply selecting the top words by frequency, we penalize words that are shared across multiple topics; this down-weights background words and reveals what is specific about each topic. Second, we present a novel method for interactive topic modeling. The method allows the user to give live feedback on the topics, and allows the inference algorithm to use that feedback to guide the LDA parameter search. The user can indicate that words should be removed from a topic, that topics should be merged, or that a topic should be split or deleted. After each item of user feedback, we change the internal state of the variational EM algorithm in a way that preserves correctness, then re-run the algorithm until convergence. Experiments show that both contributions are successful in practice.
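The first contribution, re-ranking a topic's words by penalizing terms shared across topics, can be sketched as a score that mixes a word's in-topic probability with its probability relative to the across-topic average (the exact penalty in the paper may differ; `lam` weights the mix and the toy data is invented):

```python
import numpy as np

def rerank_topic_words(phi, vocab, lam=0.5, top_n=3):
    """Re-rank each topic's words by down-weighting words that are
    probable across all topics: score mixes log p(w|t) with the log
    of p(w|t) relative to the across-topic average."""
    avg = phi.mean(axis=0) + 1e-12
    ranked = []
    for t in range(phi.shape[0]):
        score = lam * np.log(phi[t] + 1e-12) \
                + (1 - lam) * np.log((phi[t] + 1e-12) / avg)
        order = np.argsort(-score)
        ranked.append([vocab[i] for i in order[:top_n]])
    return ranked

phi = np.array([[0.40, 0.30, 0.25, 0.03, 0.02],   # topic 0: sports-like
                [0.40, 0.02, 0.03, 0.25, 0.30]])  # topic 1: law-like
vocab = ["the", "game", "team", "court", "law"]
top = rerank_topic_words(phi, vocab)
```

Plain frequency would put the background word "the" first in both topics; the penalized score demotes it.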

Proceedings ArticleDOI
28 Jul 2013
TL;DR: This paper describes a method that accounts for nascent information culled from Twitter to provide relevant recommendations in such cold-start situations and significantly outperforms other state-of-the-art recommendation techniques by up to 33%.
Abstract: As a tremendous number of mobile applications (apps) are readily available, users have difficulty in identifying apps that are relevant to their interests. Recommender systems that depend on previous user ratings (i.e., collaborative filtering, or CF) can address this problem for apps that have sufficient ratings from past users. But for apps that are newly released, CF does not have any user ratings to base recommendations on, which leads to the cold-start problem. In this paper, we describe a method that accounts for nascent information culled from Twitter to provide relevant recommendations in such cold-start situations. We use Twitter handles to access an app's Twitter account and extract the IDs of their Twitter-followers. We create pseudo-documents that contain the IDs of Twitter users interested in an app and then apply latent Dirichlet allocation to generate latent groups. At test time, a target user seeking recommendations is mapped to these latent groups. By using the transitive relationship of latent groups to apps, we estimate the probability of the user liking the app. We show that by incorporating information from Twitter, our approach overcomes the difficulty of cold-start app recommendation and significantly outperforms other state-of-the-art recommendation techniques by up to 33%.
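The pseudo-document construction is the key trick: each app becomes a "document" whose "words" are the IDs of its Twitter followers, so LDA's machinery applies unchanged. A minimal sketch (the input format and app names are assumptions):

```python
def follower_pseudo_docs(app_followers):
    """One pseudo-document per app whose 'words' are the IDs of the
    app's Twitter followers; running LDA over these documents yields
    the latent user groups used for cold-start recommendation."""
    return {app: " ".join(str(uid) for uid in uids)
            for app, uids in app_followers.items()}

docs = follower_pseudo_docs({"appA": [101, 202], "appB": [202, 303]})
```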

Proceedings ArticleDOI
Bin Liu1, Hui Xiong1
01 Jan 2013
TL;DR: A Topic and Location-aware probabilistic matrix factorization (TL-PMF) method is proposed for POI recommendation to consider both the extent to which a user interest matches the POI in terms of topic distribution and the word-of-mouth opinions of the POIs.
Abstract: The widespread use of location-based social networks (LBSNs) has enabled the opportunities for better location-based services through Point-of-Interest (POI) recommendation. Indeed, the problem of POI recommendation is to provide personalized recommendations of places of interest. Unlike traditional recommendation tasks, POI recommendation is personalized, location-aware, and context-dependent. In light of this difference, this paper proposes a topic and location aware POI recommender system by exploiting associated textual and context information. Specifically, we first exploit an aggregated latent Dirichlet allocation (LDA) model to learn the interest topics of users and to infer POIs of interest by mining textual information associated with POIs. Then, a Topic and Location-aware probabilistic matrix factorization (TL-PMF) method is proposed for POI recommendation. A unique perspective of TL-PMF is to consider both the extent to which a user interest matches the POI in terms of topic distribution and the word-of-mouth opinions of the POIs. Finally, experiments on real-world LBSN data show that the proposed recommendation method outperforms state-of-the-art probabilistic latent factor models by a significant margin. Also, we have studied the impact of personalized interest topics and word-of-mouth opinions on POI recommendations.

Journal ArticleDOI
TL;DR: The techniques used in this study provide a possible toolset for computational social scientists in general, and health researchers in particular, to better understand health problems from large conversational datasets.
Abstract: Public health related tweets are difficult to identify in large conversational datasets like Twitter.com. Even more challenging is the visualization and analyses of the spatial patterns encoded in tweets. This study has the following objectives: how can topic modeling be used to identify relevant public health topics such as obesity on Twitter.com? What are the common obesity related themes? What is the spatial pattern of the themes? What are the research challenges of using large conversational datasets from social networking sites? Obesity is chosen as a test theme to demonstrate the effectiveness of topic modeling using Latent Dirichlet Allocation (LDA) and spatial analysis using Geographic Information System (GIS). The dataset is constructed from tweets (originating from the United States) extracted from Twitter.com on obesity-related queries. Examples of such queries are ‘food deserts’, ‘fast food’, and ‘childhood obesity’. The tweets are also georeferenced and time stamped. Three cohesive and meanin...

Proceedings Article
01 Jan 2013
TL;DR: The Structural Topic Model (STM), a general way to incorporate corpus structure or document metadata into the standard topic model, is developed which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content.
Abstract: We develop the Structural Topic Model which provides a general way to incorporate corpus structure or document metadata into the standard topic model. Document-level covariates enter the model through a simple generalized linear model framework in the prior distributions controlling either topical prevalence or topical content. We demonstrate the model's use in two applied problems: the analysis of open-ended responses in a survey experiment about immigration policy, and understanding differing media coverage of China's rise.

1 Topic Models and Social Science

Over the last decade, probabilistic topic models such as Latent Dirichlet Allocation (LDA) have become a common tool for understanding large text corpora [1]. Although originally developed for descriptive and exploratory purposes, social scientists are increasingly seeing the value of topic models as a tool for measurement of latent linguistic, political and psychological variables [2]. The defining element of this work is the presence of additional document-level information (e.g. author, partisan affiliation, date) on which variation in either topical prevalence or topical content is of theoretic interest. As a practical matter, this generally involves running an off-the-shelf implementation of LDA and then performing a post-hoc evaluation of variation with a covariate of interest. A better alternative to post-hoc comparisons is to build the additional information about the structure of the corpus into the model itself by altering the prior distributions to partially pool information amongst similar documents. Numerous special cases of this framework have been developed for particular types of corpus structure affecting both topic prevalence (e.g. time [3], author [4]) and topical content (e.g. ideology [5], geography [6]). Applied users have been slow to adopt these models because it is often difficult to find a model that exactly fits their specific corpus.
We develop the Structural Topic Model (STM) which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content. The central idea is to …

(Footnotes: Prepared for the NIPS 2013 Workshop on Topic Models: Computation, Application, and Evaluation; a forthcoming R package implements the methods described here. These authors contributed equally. We assume a general familiarity with LDA throughout; see [1] for a review. By "topical prevalence" we mean the proportion of a document devoted to a given topic; by "topical content" we mean the rate of word use within a given topic.)

Proceedings ArticleDOI
04 Feb 2013
TL;DR: In this paper, a graph-based approach to topic labelling is proposed that uses graph centrality measures over the DBpedia graph to identify the concepts that best represent each topic.
Abstract: Automated topic labelling brings benefits for users aiming at analysing and understanding document collections, as well as for search engines targeting the linkage between groups of words and their inherent topics. Current approaches to achieve this suffer in quality, but we argue their performance might be improved by setting the focus on the structure in the data. Building upon research for concept disambiguation and linking to DBpedia, we take a novel approach to topic labelling by making use of structured data exposed by DBpedia. We start from the hypothesis that words co-occurring in text likely refer to concepts that belong closely together in the DBpedia graph. Using graph centrality measures, we show that we are able to identify the concepts that best represent the topics. We comparatively evaluate our graph-based approach and the standard text-based approach on topics extracted from three corpora, based on results gathered in a crowd-sourcing experiment. Our research shows that graph-based analysis of DBpedia can achieve better results for topic labelling in terms of both precision and topic coverage.
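The labelling step boils down to building a graph over the DBpedia concepts linked from a topic's top words and picking the most central node. A toy sketch with plain degree centrality (the paper evaluates several centrality measures; the edge list here is invented):

```python
from collections import defaultdict

def central_concept(edges):
    """Label a topic with the most central node of the concept graph
    linking its top words; plain degree centrality stands in for the
    centrality measures evaluated in the paper."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(degree, key=degree.get)

# Invented mini-graph: edges between DBpedia-style concepts.
edges = [("Football", "Sport"), ("Basketball", "Sport"),
         ("Tennis", "Sport"), ("Football", "Ball")]
label = central_concept(edges)
```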

Proceedings Article
05 Dec 2013
TL;DR: SDA-Bayes, a framework for streaming, distributed, asynchronous computation of a Bayesian posterior, is presented; with variational Bayes (VB) as the primitive, its usefulness is demonstrated by fitting the latent Dirichlet allocation model to two large-scale document collections.
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply.

Proceedings ArticleDOI
11 Aug 2013
TL;DR: This article proposed a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state-of-the-art method, and showed that the algorithm converges faster and often to a better solution than previous methods.
Abstract: There has been an explosion in the amount of digital text information available in recent years, leading to challenges of scale for traditional inference algorithms for topic models. Recent advances in stochastic variational inference algorithms for latent Dirichlet allocation (LDA) have made it feasible to learn topic models on very large-scale corpora, but these methods do not currently take full advantage of the collapsed representation of the model. We propose a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state-of-the-art method. In experiments on large-scale text corpora, the algorithm was found to converge faster and often to a better solution than previous methods. Human-subject experiments also demonstrated that the method can learn coherent topics in seconds on small corpora, facilitating the use of topic models in interactive document analysis software.
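A single collapsed variational (CVB0-style) token update works on expected counts rather than sampled assignments: subtract the token's current responsibilities, recompute them from the collapsed posterior, and add them back. A sketch of that update (the stochastic version in the paper applies it over minibatches with step sizes; variable names and toy values are illustrative):

```python
import numpy as np

def cvb0_token_update(gamma, n_theta_j, n_phi_w, n_z, alpha, eta, V):
    """CVB0 update for one token: remove its current responsibilities
    `gamma` from the expected counts, recompute them from the
    collapsed posterior, and add them back (counts updated in place).
    n_theta_j: doc-topic counts; n_phi_w: topic counts for this word;
    n_z: total topic counts; all are length-K expected counts."""
    n_theta_j -= gamma
    n_phi_w -= gamma
    n_z -= gamma
    gamma = (n_theta_j + alpha) * (n_phi_w + eta) / (n_z + V * eta)
    gamma /= gamma.sum()
    n_theta_j += gamma
    n_phi_w += gamma
    n_z += gamma
    return gamma

gamma = cvb0_token_update(np.array([0.5, 0.5]), np.array([1.0, 1.0]),
                          np.array([0.8, 0.5]), np.array([3.0, 2.0]),
                          alpha=0.1, eta=0.01, V=3)
```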

Proceedings ArticleDOI
18 Mar 2013
TL;DR: Latent Dirichlet Allocation (LDA), a well known topic modeling approach, is used to analyze the contents of tens of thousands of questions and answers, and LDA provides an alternative perspective different from that of Treude et al. for categorizing StackOverflow questions.
Abstract: StackOverflow provides a popular platform where developers post and answer questions. Recently, Treude et al. manually labeled 385 questions in StackOverflow and grouped them into 10 categories based on their contents. They also analyzed how tags are used in StackOverflow. In this study, we extend their work to obtain a deeper understanding of how developers interact with one another on such a question and answer web site. First, we analyze the distributions of developers who ask and answer questions. We also investigate whether there is a segregation of the StackOverflow community into questioners and answerers. We also perform automated text mining to find the various kinds of topics asked by developers. We use Latent Dirichlet Allocation (LDA), a well-known topic modeling approach, to analyze the contents of tens of thousands of questions and answers, and produce five topics. Our topic modeling strategy provides an alternative perspective, different from that of Treude et al., for categorizing StackOverflow questions. Each question can now be categorized into several topics with different probabilities, and the learned topic model can automatically assign a new question to several categories with varying probabilities. Last but not least, we show the distributions of questions and developers belonging to the various topics generated by LDA.

Book ChapterDOI
02 Dec 2013
TL;DR: This paper proposes a novel approach that seamlessly integrates tagging data and WSDL documents through augmented Latent Dirichlet Allocation (LDA) and develops three strategies to preprocess tagging data before being integrated into the LDA framework for clustering.
Abstract: Clustering Web services, i.e., grouping together services with similar functionalities, helps improve both the accuracy and efficiency of Web service search engines. An important limitation of existing Web service clustering approaches is that they solely focus on utilizing WSDL (Web Service Description Language) documents. There has been a recent trend of using user-contributed tagging data to improve the performance of service clustering. Nonetheless, these approaches fail to completely leverage the information carried by the tagging data and hence only trivially improve the clustering performance. In this paper, we propose a novel approach that seamlessly integrates tagging data and WSDL documents through augmented Latent Dirichlet Allocation (LDA). We also develop three strategies to preprocess tagging data before being integrated into the LDA framework for clustering. Comprehensive experiments based on real data and the implementation of a Web service search engine demonstrate the effectiveness of the proposed LDA-based service clustering approach.

Proceedings Article
01 Oct 2013
TL;DR: It is shown that straightforward topic modeling using Latent Dirichlet Allocation yields interpretable, psychologically relevant “themes” that add value in prediction of clinical assessments.
Abstract: We investigate the value-add of topic modeling in text analysis for depression, and for neuroticism as a strongly associated personality measure. Using Pennebaker’s Linguistic Inquiry and Word Count (LIWC) lexicon to provide baseline features, we show that straightforward topic modeling using Latent Dirichlet Allocation (LDA) yields interpretable, psychologically relevant “themes” that add value in prediction of clinical assessments.

Proceedings ArticleDOI
27 Apr 2013
TL;DR: Using Latent Dirichlet Allocation (LDA), this work identifies topics from more than half a million Facebook status updates and determines which topics are more likely to receive feedback, such as likes and comments.
Abstract: Although both men and women communicate frequently on Facebook, we know little about what they talk about, whether their topics differ and how their network responds. Using Latent Dirichlet Allocation (LDA), we identify topics from more than half a million Facebook status updates and determine which topics are more likely to receive feedback, such as likes and comments. Women tend to share more personal topics (e.g., family matters), while men discuss more public ones (e.g., politics and sports). Generally, women receive more feedback than men, but "male" topics (those more often posted by men) receive more feedback, especially when posted by women.

Proceedings ArticleDOI
16 Jun 2013
TL;DR: This work develops an adaptive learning rate for stochastic variational inference, which requires no tuning and is easily implemented with computations already made in the algorithm.
Abstract: Stochastic variational inference finds good posterior approximations of probabilistic models with very large data sets. It optimizes the variational objective with stochastic optimization, following noisy estimates of the natural gradient. Operationally, stochastic inference iteratively subsamples from the data, analyzes the subsample, and updates parameters with a decreasing learning rate. However, the algorithm is sensitive to that rate, which usually requires hand-tuning to each application. We solve this problem by developing an adaptive learning rate for stochastic variational inference. Our method requires no tuning and is easily implemented with computations already made in the algorithm. We demonstrate our approach with latent Dirichlet allocation applied to three large text corpora. Inference with the adaptive learning rate converges faster and to a better approximation than the best settings of hand-tuned rates.
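The adaptive rate can be sketched as the ratio between the squared norm of the averaged gradient and the average squared gradient norm: when noise dominates, the numerator shrinks and so does the rate. A simplified version with a fixed averaging memory (the paper's method also adapts the memory length online; names and constants are illustrative):

```python
import numpy as np

class AdaptiveRate:
    """Adaptive SVI learning rate: track an exponential moving average
    of the noisy gradient (g_bar) and of its squared norm (h_bar); the
    rate is ||g_bar||^2 / h_bar, which approaches 1 when gradients
    agree and shrinks toward 0 when noise dominates."""
    def __init__(self, dim, tau=10.0):
        self.g_bar = np.zeros(dim)
        self.h_bar = 1.0
        self.tau = tau

    def step(self, g):
        w = 1.0 / self.tau
        self.g_bar = (1 - w) * self.g_bar + w * g
        self.h_bar = (1 - w) * self.h_bar + w * float(g @ g)
        return float(self.g_bar @ self.g_bar) / self.h_bar

rate = AdaptiveRate(dim=3)
rhos = [rate.step(np.ones(3)) for _ in range(50)]  # noiseless gradients
```

With perfectly consistent gradients the rate climbs toward 1; feeding zero-mean noisy gradients instead would drive it toward 0.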

Journal ArticleDOI
TL;DR: The css-LDA model, an LDA model with class supervision at the level of image features, is shown to combine the labeling strength of topic-supervision with the flexibility of topics-discovery, and to outperform existing LDA-based image classification approaches.
Abstract: Two new extensions of latent Dirichlet allocation (LDA), denoted topic-supervised LDA (ts-LDA) and class-specific-simplex LDA (css-LDA), are proposed for image classification. An analysis of the supervised LDA models currently used for this task shows that the impact of class information on the topics discovered by these models is very weak in general. This implies that the discovered topics are driven by general image regularities, rather than the semantic regularities of interest for classification. To address this, ts-LDA models are introduced which replace the automated topic discovery of LDA with specified topics, identical to the classes of interest for classification. While this results in improvements in classification accuracy over existing LDA models, it compromises the ability of LDA to discover unanticipated structure of interest. This limitation is addressed by the introduction of css-LDA, an LDA model with class supervision at the level of image features. In css-LDA topics are discovered per class, i.e., a single set of topics shared across classes is replaced by multiple class-specific topic sets. The css-LDA model is shown to combine the labeling strength of topic-supervision with the flexibility of topic-discovery. Its effectiveness is demonstrated through an extensive experimental evaluation, involving multiple benchmark datasets, where it is shown to outperform existing LDA-based image classification approaches.

Proceedings Article
01 Oct 2013
TL;DR: This work improves a two-dimensional multimodal version of Latent Dirichlet Allocation and presents a novel way to integrate visual features into the LDA model using unsupervised clusters of images and provides two novel ways to extend the bimodal models to support three or more modalities.
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.

Posted Content
TL;DR: The Structural Topic Model (STM) as mentioned in this paper provides a general way to incorporate corpus structure or document metadata (e.g., author, partisan affiliation, date) into the standard topic model.
Abstract: We develop the Structural Topic Model which provides a general way to incorporate corpus structure or document metadata into the standard topic model. Document-level covariates enter the model through a simple generalized linear model framework in the prior distributions controlling either topical prevalence or topical content. We demonstrate the model’s use in two applied problems: the analysis of open-ended responses in a survey experiment about immigration policy, and understanding differing media coverage of China’s rise. 1 Topic Models and Social Science Over the last decade probabilistic topic models, such as Latent Dirichlet Allocation (LDA), have become a common tool for understanding large text corpora [1]. Although originally developed for descriptive and exploratory purposes, social scientists are increasingly seeing the value of topic models as a tool for measurement of latent linguistic, political and psychological variables [2]. The defining element of this work is the presence of additional document-level information (e.g. author, partisan affiliation, date) on which variation in either topical prevalence or topical content is of theoretic interest. As a practical matter, this generally involves running an off-the-shelf implementation of LDA and then performing a post-hoc evaluation of variation with a covariate of interest. A better alternative to post-hoc comparisons is to build the additional information about the structure of the corpus into the model itself by altering the prior distributions to partially pool information amongst similar documents. Numerous special cases of this framework have been developed for particular types of corpus structure affecting both topic prevalence (e.g. time [3], author [4]) and topical content (e.g. ideology [5], geography [6]). Applied users have been slow to adopt these models because it is often difficult to find a model that exactly fits their specific corpus.
We develop the Structural Topic Model (STM) which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content. The central idea is to build this document-level information into the model by altering the prior distributions. (Prepared for the NIPS 2013 Workshop on Topic Models: Computation, Application, and Evaluation; a forthcoming R package implements the methods described here. The authors contributed equally. We assume a general familiarity with LDA throughout; see [1] for a review. By “topical prevalence” we mean the proportion of a document devoted to a given topic; by “topical content” we mean the rate of word use within a given topic.)
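The generalized linear model framework described above can be sketched in miniature: covariates set the prior mean of a logistic-normal draw over topic proportions. This is an illustrative assumption of how such a prior could look, not the STM implementation; the function name, covariates, and coefficient values are all made up.

```python
import math
import random

def prevalence_prior_draw(x, Gamma, sigma=0.5, seed=0):
    """Sketch of covariate-driven topical prevalence: document covariates x
    set the prior mean via a linear model, eta ~ N(x @ Gamma, sigma^2),
    and topic proportions are theta = softmax(eta)."""
    rng = random.Random(seed)
    K = len(Gamma[0])  # number of topics
    mean = [sum(xj * Gamma[j][k] for j, xj in enumerate(x)) for k in range(K)]
    eta = [rng.gauss(m, sigma) for m in mean]
    # Numerically stable softmax.
    z = max(eta)
    exps = [math.exp(e - z) for e in eta]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical setup: 2 covariates (intercept, partisan affiliation), 3 topics.
Gamma = [[0.0, 0.0, 0.0],   # intercept row
         [2.0, -2.0, 0.0]]  # affiliation shifts topic 0 up, topic 1 down
theta_treated = prevalence_prior_draw([1.0, 1.0], Gamma)
theta_control = prevalence_prior_draw([1.0, 0.0], Gamma, seed=1)
```

Documents with the covariate switched on get a prior that favors topic 0, which is the sense in which covariates control topical prevalence rather than being compared post hoc.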

Proceedings ArticleDOI
13 May 2013
TL;DR: This paper proposes a probabilistic graphical model based on LDA, called Factorized LDA (FLDA), to address the cold start problem and demonstrates the improved effectiveness of the FLDA model in terms of likelihood of the held-out test set.
Abstract: Aspect-based opinion mining from online reviews has attracted a lot of attention recently. The main goal of all of the proposed methods is extracting aspects and/or estimating aspect ratings. Recent works, which are often based on Latent Dirichlet Allocation (LDA), consider both tasks simultaneously. These models are normally trained at the item level, i.e., a model is learned for each item separately. Learning a model per item is fine when the item has been reviewed extensively and has enough training data. However, in real-life data sets such as those from Epinions.com and Amazon.com more than 90% of items have less than 10 reviews, so-called cold start items. State-of-the-art LDA models for aspect-based opinion mining are trained at the item level and therefore perform poorly for cold start items due to the lack of sufficient training data. In this paper, we propose a probabilistic graphical model based on LDA, called Factorized LDA (FLDA), to address the cold start problem. The underlying assumption of FLDA is that aspects and ratings of a review are influenced not only by the item but also by the reviewer. It further assumes that both items and reviewers can be modeled by a set of latent factors which represent their aspect and rating distributions. Different from state-of-the-art LDA models, FLDA is trained at the category level and learns the latent factors using the reviews of all the items of a category, in particular the non-cold-start items, and uses them as prior for cold start items. Our experiments on three real-life data sets demonstrate the improved effectiveness of the FLDA model in terms of likelihood of the held-out test set. We also evaluate the accuracy of FLDA based on two application-oriented measures.

Journal ArticleDOI
TL;DR: This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms.
Abstract: This paper describes, evaluates and compares the use of Latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions with each topic specifying a distribution over words. Based on author profiles (aggregation of all texts written by the same writer) we suggest computing the distance with a disputed text to determine its possible writer. This distance is based on the difference between the two topic distributions. To evaluate different attribution schemes, we carried out an experiment based on 5408 newspaper articles (Glasgow Herald) written by 20 distinct authors. To complement this experiment, we used 4326 articles extracted from the Italian newspaper La Stampa and written by 20 journalists. This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback-Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms.
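The attribution step described above — compare a disputed text's topic distribution against each author profile and pick the closest — can be sketched as follows. The symmetrized Kullback-Leibler distance and the toy four-topic distributions are illustrative assumptions, not the paper's exact formula or data.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two topic distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized KL, so the distance does not depend on argument order."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def attribute(disputed, author_profiles):
    """Assign a disputed text to the author whose profile topic
    distribution is closest under the symmetric KL distance."""
    return min(author_profiles, key=lambda a: symmetric_kl(disputed, author_profiles[a]))

# Hypothetical author profiles: topic distributions over 4 topics,
# aggregated from each writer's known texts.
profiles = {
    "author_A": [0.70, 0.10, 0.10, 0.10],
    "author_B": [0.10, 0.10, 0.10, 0.70],
}
disputed = [0.60, 0.15, 0.15, 0.10]
winner = attribute(disputed, profiles)
```

In practice the distributions would come from an LDA model fit over the author profiles, and the choice of distance (Delta, χ², KL, etc.) is exactly what the paper's experiments compare.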

Proceedings ArticleDOI
11 Aug 2013
TL;DR: An efficient hierarchical document clustering method based on a new algorithm for rank-2 NMF that produces high-quality tree structures in significantly less time compared to other methods such as hierarchical K-means, standard NMF, and latent Dirichlet allocation.
Abstract: Nonnegative matrix factorization (NMF) has been successfully used as a clustering method especially for flat partitioning of documents. In this paper, we propose an efficient hierarchical document clustering method based on a new algorithm for rank-2 NMF. When the two block coordinate descent framework of nonnegative least squares is applied to computing rank-2 NMF, each subproblem requires a solution for nonnegative least squares with only two columns in the matrix. We design the algorithm for rank-2 NMF by exploiting the fact that an exhaustive search for the optimal active set can be performed extremely fast when solving these NNLS problems. In addition, we design a measure based on the results of rank-2 NMF for determining which leaf node should be further split. On a number of text data sets, our proposed method produces high-quality tree structures in significantly less time compared to other methods such as hierarchical K-means, standard NMF, and latent Dirichlet allocation.
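The recursive splitting the method relies on can be illustrated with a bare-bones rank-2 NMF on a toy term-document matrix. The multiplicative updates below are a common simple stand-in, not the paper's fast exhaustive-active-set NNLS algorithm, and the matrix and initialization are made up for the sketch.

```python
def rank2_nmf(V, iters=200):
    """Rank-2 NMF via multiplicative updates: V (m x n) ~ W (m x 2) @ H (2 x n).
    A node's documents are split into two children by the larger H coefficient."""
    m, n = len(V), len(V[0])
    # Simple deterministic initialization: seed W's two columns with the
    # first and last document (column) of V, shifted to stay positive.
    W = [[V[i][0] + 0.1, V[i][n - 1] + 0.1] for i in range(m)]
    H = [[0.5] * n for _ in range(2)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = [[sum(W[i][k] * V[i][j] for i in range(m)) for j in range(n)] for k in range(2)]
        WtWH = [[sum(sum(W[i][k] * W[i][l] for i in range(m)) * H[l][j] for l in range(2))
                 for j in range(n)] for k in range(2)]
        H = [[H[k][j] * WtV[k][j] / (WtWH[k][j] + eps) for j in range(n)] for k in range(2)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = [[sum(V[i][j] * H[k][j] for j in range(n)) for k in range(2)] for i in range(m)]
        WHHt = [[sum(W[i][l] * sum(H[l][j] * H[k][j] for j in range(n)) for l in range(2))
                 for k in range(2)] for i in range(m)]
        W = [[W[i][k] * VHt[i][k] / (WHHt[i][k] + eps) for k in range(2)] for i in range(m)]
    return W, H

def recon_error(V, W, H):
    """Squared Frobenius reconstruction error ||V - WH||_F^2."""
    return sum((V[i][j] - sum(W[i][k] * H[k][j] for k in range(2))) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

# Toy term-document matrix with two obvious topics (rows = terms, cols = docs).
V = [[5, 4, 5, 0, 0, 0],
     [4, 5, 4, 0, 0, 0],
     [0, 0, 0, 5, 4, 5],
     [0, 0, 0, 4, 5, 4]]
W, H = rank2_nmf(V)
# Each document goes to the child cluster with the larger H coefficient.
clusters = [0 if H[0][j] >= H[1][j] else 1 for j in range(6)]
```

Hierarchical clustering then recurses: apply rank-2 NMF to each child's submatrix, using a split criterion (as the paper does) to decide which leaf to divide next.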

Journal ArticleDOI
TL;DR: This paper describes research that seeks to supersede human inductive learning and reasoning in high-level scene understanding and content extraction, modeling the scene with the latent Dirichlet allocation model as a finite mixture over an underlying set of topics.
Abstract: This paper describes research that seeks to supersede human inductive learning and reasoning in high-level scene understanding and content extraction. Searching for relevant knowledge with a semantic meaning consists mostly in visual human inspection of the data, regardless of the application. The method presented in this paper is an innovation in the field of information retrieval. It aims to discover latent semantic classes containing pairs of objects characterized by a certain spatial positioning. A hierarchical structure is recommended for the image content. This approach is based on a method initially developed for topics discovery in text, applied this time to invariant descriptors of image region or objects configurations. First, invariant spatial signatures are computed for pairs of objects, based on a measure of their interaction, as attributes for describing spatial arrangements inside the scene. Spatial visual words are then defined through a simple classification, extracting new patterns of similar object configurations. Further, the scene is modeled according to these new patterns (spatial visual words) using the latent Dirichlet allocation model into a finite mixture over an underlying set of topics. In the end, some statistics are done to achieve a better understanding of the spatial distributions inside the discovered semantic classes.