Journal Article

Latent Dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
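As a rough illustration of the three-level structure described in the abstract, the sketch below samples a toy corpus: corpus-level parameters alpha and beta, a per-document topic mixture theta, and a per-word topic z. The function names, the toy parameter values, and the Gamma-normalization route to a Dirichlet draw are assumptions made for illustration. This is not the authors' code, and it covers only generation, not the variational inference or EM estimation the paper presents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents (lists of word ids) from the LDA generative process.

    alpha: length-k Dirichlet parameter (corpus level).
    beta:  k rows, each a distribution over the vocabulary (corpus level).
    """
    rng = random.Random(seed)
    k, vocab_size = len(beta), len(beta[0])
    corpus, thetas = [], []
    for n_words in doc_lengths:
        theta = sample_dirichlet(alpha, rng)  # document level: topic proportions
        doc = []
        for _ in range(n_words):
            z = rng.choices(range(k), weights=theta)[0]             # word level: topic
            w = rng.choices(range(vocab_size), weights=beta[z])[0]  # word level: word id
            doc.append(w)
        corpus.append(doc)
        thetas.append(theta)
    return corpus, thetas

# Hypothetical toy setup: 2 topics over a 5-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.5, 0.4, 0.1, 0.0, 0.0],
        [0.0, 0.0, 0.1, 0.4, 0.5]]
corpus, thetas = sample_lda_corpus(alpha, beta, doc_lengths=[8, 12, 5])
```

Each entry of thetas is the per-document vector of topic probabilities that the abstract refers to as an explicit representation of a document.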

Citations
Proceedings Article
17 Jun 2006
TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.
Abstract: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
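The partition-and-histogram step described in this abstract can be sketched as follows. The sketch assumes each local feature has already been quantized to a visual-word id and carries coordinates normalized to [0, 1); the function name and toy data are hypothetical, and any per-level weighting used when comparing pyramids is not reproduced here.

```python
def spatial_pyramid_histogram(features, num_words, num_levels):
    """Concatenate per-cell visual-word histograms over increasingly fine grids.

    features: list of (x, y, word) with x, y in [0, 1) and word in range(num_words).
    Level l splits the image into a 2**l by 2**l grid of sub-regions.
    """
    pyramid = []
    for level in range(num_levels):
        cells = 2 ** level
        hist = [0] * (cells * cells * num_words)
        for x, y, word in features:
            col = min(int(x * cells), cells - 1)  # clamp guards against x == 1.0
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * num_words + word] += 1
        pyramid.extend(hist)
    return pyramid

# Hypothetical toy input: three features, two visual words, two pyramid levels.
features = [(0.1, 0.1, 0), (0.9, 0.1, 1), (0.6, 0.7, 0)]
pyramid = spatial_pyramid_histogram(features, num_words=2, num_levels=2)
# Level 0 contributes [2, 1]; level 1 contributes one 2-bin histogram per cell.
```

With a single level the result reduces to the orderless bag-of-features histogram that the abstract describes the pyramid as extending.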

8,736 citations


Cites background from "Latent Dirichlet allocation"

  • ...To verify this, we have experimented with probabilistic latent semantic analysis (pLSA) [7], which attempts to explain the distribution of features in the image as a mixture of a few “scene topics” or “aspects” and performs very similarly to LDA in practice [17]....

    [...]

  • ...We conjecture that Li and Perona’s approach is disadvantaged by its reliance on latent Dirichlet allocation (LDA) [2], which is essentially an unsupervised dimensionality reduction technique and as such, is not necessarily conducive to achieving the highest classification accuracy....

    [...]

Book
08 Jul 2008
TL;DR: This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems and focuses on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis.
Abstract: An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

7,452 citations


Cites background from "Latent Dirichlet allocation"

  • ...Research employing probabilistic latent semantic analysis (PLSA) [125] or latent Dirichlet allocation (LDA) [39] can also be cast as language-modeling work [41, 194, 206]....

    [...]

Journal Article
01 Jun 2010
TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.
Abstract: Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

6,601 citations

Journal Article
TL;DR: Surveying a suite of algorithms that offer a solution to managing large document archives suggests they are well-suited to handle large amounts of data.
Abstract: Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems. In this tutorial, I will review the state-of-the-art in probabilistic topic models. I will describe the three components of topic modeling: (1) Topic modeling assumptions; (2) Algorithms for computing with topic models; (3) Applications of topic models. In (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and models of mixed-membership. In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and documents arriving in a stream. In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations. Finally, I will discuss some future directions and open research problems in topic models.

4,529 citations

Book
01 May 2012
TL;DR: Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining.
Abstract: Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining. In fact, this research has spread outside of computer science to the management sciences and social sciences due to its importance to business and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions, blogs, micro-blogs, Twitter, and social networks. For the first time in human history, we now have a huge volume of opinionated data recorded in digital form for analysis. Sentiment analysis systems are being applied in almost every business and social domain because opinions are central to almost all human activities and are key influencers of our behaviors. Our beliefs and perceptions of reality, and the choices we make, are largely conditioned on how others see and evaluate the world. For this reason, when we need to make a decision we often seek out the opinions of others. This is true not only for individuals but also for organizations. This book is a comprehensive introductory and survey text. It covers all important topics and the latest developments in the field with over 400 references. It is suitable for students, researchers and practitioners who are interested in social media analysis in general and sentiment analysis in particular. Lecturers can readily use it in class for courses on natural language processing, social media analysis, text mining, and data mining. Lecture slides are also available online.

4,515 citations

References
Journal Article
TL;DR: In this paper, a new integral identity is adapted from Carlson to represent the moments of quadratic forms under multivariate normal and, more generally, elliptically contoured distributions, which permits the computation of such moments by simple quadrature.
Abstract: This article reviews and interprets recent mathematics of special functions, with emphasis on integral representations of multiple hypergeometric functions. B.C. Carlson's centrally important parameterized functions R and ℛ, initially defined as Dirichlet averages, are expressed as probability-generating functions of mixed multinomial distributions. Various nested families generalizing the Dirichlet distributions are developed for Bayesian inference in multinomial sampling and contingency tables. In the case of many-way tables, this motivates a new generalization of the function ℛ. These distributions are also useful for the modeling of populations of personal probabilities evolving under the process of inference from statistical data. A remarkable new integral identity is adapted from Carlson to represent the moments of quadratic forms under multivariate normal and, more generally, elliptically contoured distributions. This permits the computation of such moments by simple quadrature.

94 citations


"Latent Dirichlet allocation" refers to this paper for background

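  • ...(3) in terms of the model parameters: $p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$, a function which is intractable due to the coupling between θ and β in the summation over latent topics (Dickey, 1983)....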

    [...]
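To make the excerpted integral concrete, the sketch below approximates p(w | α, β) for one short document by plain Monte Carlo: it draws θ from Dirichlet(α) and averages the product over words of Σ_i θ_i β_{i,w_n}, which is what the inner product over j reduces to when each w_n is a one-hot indicator. This is only an illustrative estimator under assumed toy values. It is not the variational method of the paper, and the function name and parameters are hypothetical.

```python
import random

def marginal_likelihood_mc(doc, alpha, beta, num_samples=20000, seed=0):
    """Monte Carlo estimate of p(doc | alpha, beta) for a list of word ids.

    Averages prod_n sum_i theta_i * beta[i][w_n] over theta ~ Dirichlet(alpha).
    """
    rng = random.Random(seed)
    k = len(alpha)
    total = 0.0
    for _ in range(num_samples):
        # Dirichlet draw via normalized Gamma(alpha_i, 1) variables.
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        norm = sum(draws)
        theta = [d / norm for d in draws]
        likelihood = 1.0
        for w in doc:
            # theta and beta are coupled inside this sum over topics.
            likelihood *= sum(theta[i] * beta[i][w] for i in range(k))
        total += likelihood
    return total / num_samples

# Hypothetical toy model: 2 topics, 2-word vocabulary.
alpha = [1.0, 1.0]
beta = [[0.9, 0.1],
        [0.2, 0.8]]
p_word0 = marginal_likelihood_mc([0], alpha, beta)
p_word1 = marginal_likelihood_mc([1], alpha, beta)
```

For a one-word document the integral has the closed form Σ_i (α_i / Σ_j α_j) β_{i,w}, which equals 0.55 for word 0 here, so the estimate can be checked against it. For longer documents the product of sums expands into k^N terms, which is the coupling the excerpt describes.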

Journal Article
TL;DR: In this article, the posterior moments and predictive probabilities are proportional to ratios of B. C. Carlson's multiple hypergeometric functions, and closed-form expressions are developed for nested reported sets, when Bayesian estimates can be computed easily from relative frequencies.
Abstract: Bayesian methods are given for finite-category sampling when some of the observations suffer missing category distinctions. Dickey's (1983) generalization of the Dirichlet family of prior distributions is found to be closed under such censored sampling. The posterior moments and predictive probabilities are proportional to ratios of B. C. Carlson's multiple hypergeometric functions. Closed-form expressions are developed for the case of nested reported sets, when Bayesian estimates can be computed easily from relative frequencies. Effective computational methods are also given in the general case. An example involving surveys of death-penalty attitudes is used throughout to illustrate the theory. A simple special case of categorical missing data is a two-way contingency table with cross-classified count data xij (i = 1, …, r; j = 1, …, c), together with supplementary trials counted only in the margin distinguishing the rows, yi (i = 1, …, r). There could also be further supplementary trials report...

53 citations


"Latent Dirichlet allocation" refers to this paper for methods

  • ...It has been used in a Bayesian context for censored discrete data to represent the posterior on θ which, in that setting, is a random parameter (Dickey et al., 1987)....

    [...]

Proceedings Article
01 Aug 2002
TL;DR: This article implemented a computer algorithm to generate all necessary analytic terms for the Boltzmann machine partition function thus leading to lower bounds of any order, and it turns out that the extra variational parameters can be optimized analytically.
Abstract: In this article we show the rough outline of a computer algorithm to generate lower bounds on the exponential function of (in principle) arbitrary precision. We implemented this to generate all necessary analytic terms for the Boltzmann machine partition function, thus leading to lower bounds of any order. It turns out that the extra variational parameters can be optimized analytically. We show that bounds up to ninth order are still reasonably calculable in practical situations. The generated terms can also be used as extra correction terms (beyond TAP) in mean field expansions.

6 citations


Additional excerpts

  • ...In particular, Leisink and Kappen (2002) have presented a general methodology for converting low-order variational lower bounds into higher-order variational bounds....

    [...]