Latent dirichlet allocation

doi:10.5555/944919.944937

Journal Article•DOI•

David M. Blei¹, Andrew Y. Ng², Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Stanford University²

01 Mar 2003-Journal of Machine Learning Research (JMLR.org)-Vol. 3, pp 993-1022

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
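
The three levels named in the abstract (corpus-level parameters, document-level topic proportions, word-level topic assignments) can be illustrated with a minimal sampling sketch. It is not the paper's code and it does not implement the variational inference or EM procedure mentioned above. The function names `sample_dirichlet` and `generate_document`, the parameter names `alpha` and `beta`, and the fixed document length are assumptions made for this illustration.

```python
import random

def sample_dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, alpha, beta, doc_len):
    """Sample one document: topic proportions first, then a topic and a word per position.

    alpha:   list of k positive Dirichlet parameters (corpus level).
    beta:    k rows of V word probabilities, one row per topic (corpus level).
    doc_len: number of words; fixed here for simplicity (an assumption of this sketch).
    """
    k = len(beta)
    vocab = range(len(beta[0]))
    theta = sample_dirichlet(rng, alpha)  # document-level topic mixture
    words = []
    for _ in range(doc_len):
        z = rng.choices(range(k), weights=theta)[0]  # word-level topic assignment
        w = rng.choices(vocab, weights=beta[z])[0]   # word drawn from the chosen topic
        words.append(w)
    return theta, words
```

Calling `generate_document(random.Random(0), [1.0, 1.0], beta, 50)` with a two-topic `beta` returns the sampled topic proportions together with 50 word indices; inference runs in the opposite direction, recovering the proportions and topics from observed words.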

Citations

Proceedings Article•DOI•

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik¹, Cordelia Schmid², Jean Ponce³•Institutions (3)

University of Illinois at Urbana–Champaign¹, French Institute for Research in Computer Science and Automation², École Normale Supérieure³

17 Jun 2006

TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.

Abstract: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba's "gist" and Lowe's SIFT descriptors.
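
The partition-and-histogram step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `spatial_pyramid`, the input layout (normalized coordinates plus a quantized feature index) and the unweighted concatenation of levels are all choices made for this sketch, and any per-level weighting used for matching is omitted.

```python
def spatial_pyramid(features, n_words, levels):
    """Concatenate per-cell feature histograms over grids of 1x1, 2x2, 4x4, ...

    features: list of (x, y, word) with x, y in [0, 1) and 0 <= word < n_words.
    n_words:  number of distinct quantized feature types.
    levels:   number of pyramid levels; level l uses a 2**l by 2**l grid.
    """
    pyramid = []
    for level in range(levels):
        cells = 2 ** level
        hist = [0] * (cells * cells * n_words)
        for x, y, word in features:
            # Map the normalized position to a grid cell, clamping the upper edge.
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * n_words + word] += 1
        pyramid.extend(hist)
    return pyramid
```

Level 0 reproduces the orderless bag-of-features histogram; each further level adds coarse spatial layout by counting the same features within finer cells.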

8,736 citations

Cites background from "Latent dirichlet allocation"

...To verify this, we have experimented with probabilistic latent semantic analysis (pLSA) [7], which attempts to explain the distribution of features in the image as a mixture of a few “scene topics” or “aspects” and performs very similarly to LDA in practice [17]....
...We conjecture that Li and Perona’s approach is disadvantaged by its reliance on latent Dirichlet allocation (LDA) [2], which is essentially an unsupervised dimensionality reduction technique and as such, is not necessarily conducive to achieving the highest classification accuracy....

Book•

Opinion Mining and Sentiment Analysis

Bo Pang¹, Lillian Lee²•Institutions (2)

Yahoo!¹, Cornell University²

08 Jul 2008

TL;DR: This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems and focuses on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis.

Abstract: An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

7,452 citations

Cites background from "Latent dirichlet allocation"

...Research employing probabilistic latent semantic analysis (PLSA) [125] or latent Dirichlet allocation (LDA) [39] can also be cast as language-modeling work [41, 194, 206]....

Journal Article•DOI•

Data clustering: 50 years beyond K-means

Anil K. Jain¹•Institutions (1)

Michigan State University¹

01 Jun 2010

TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.

Abstract: Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

6,601 citations

Journal Article•DOI•

Probabilistic topic models

David M. Blei¹•Institutions (1)

Princeton University¹

01 Apr 2012-Communications of The ACM

TL;DR: Surveying a suite of algorithms that offer a solution to managing large document archives suggests they are well-suited to handle large amounts of data.

Abstract: Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems. In this tutorial, I will review the state-of-the-art in probabilistic topic models. I will describe the three components of topic modeling: (1) Topic modeling assumptions; (2) Algorithms for computing with topic models; (3) Applications of topic models. In (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and models of mixed-membership. In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and documents arriving in a stream. In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations. Finally, I will discuss some future directions and open research problems in topic models.

4,529 citations

Book•

Sentiment Analysis and Opinion Mining

Bing Liu¹•Institutions (1)

University of Illinois at Chicago¹

01 May 2012

TL;DR: Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining.

Abstract: Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining. In fact, this research has spread outside of computer science to the management sciences and social sciences due to its importance to business and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions, blogs, micro-blogs, Twitter, and social networks. For the first time in human history, we now have a huge volume of opinionated data recorded in digital form for analysis. Sentiment analysis systems are being applied in almost every business and social domain because opinions are central to almost all human activities and are key influencers of our behaviors. Our beliefs and perceptions of reality, and the choices we make, are largely conditioned on how others see and evaluate the world. For this reason, when we need to make a decision we often seek out the opinions of others. This is true not only for individuals but also for organizations. This book is a comprehensive introductory and survey text. It covers all important topics and the latest developments in the field with over 400 references. It is suitable for students, researchers and practitioners who are interested in social media analysis in general and sentiment analysis in particular. Lecturers can readily use it in class for courses on natural language processing, social media analysis, text mining, and data mining. Lecture slides are also available online.

4,515 citations

References

Journal Article•DOI•

Multiple Hypergeometric Functions: Probabilistic Interpretations and Statistical Uses

James M. Dickey¹•Institutions (1)

University at Albany, SUNY¹

01 Sep 1983-Journal of the American Statistical Association

TL;DR: In this paper, a new integral identity is adapted from Carlson to represent the moments of quadratic forms under multivariate normal and, more generally, elliptically contoured distributions, which permits the computation of such moments by simple quadrature.

Abstract: This article reviews and interprets recent mathematics of special functions, with emphasis on integral representations of multiple hypergeometric functions. B.C. Carlson's centrally important parameterized functions R and ℛ, initially defined as Dirichlet averages, are expressed as probability-generating functions of mixed multinomial distributions. Various nested families generalizing the Dirichlet distributions are developed for Bayesian inference in multinomial sampling and contingency tables. In the case of many-way tables, this motivates a new generalization of the function ℛ. These distributions are also useful for the modeling of populations of personal probabilities evolving under the process of inference from statistical data. A remarkable new integral identity is adapted from Carlson to represent the moments of quadratic forms under multivariate normal and, more generally, elliptically contoured distributions. This permits the computation of such moments by simple quadrature.

94 citations

"Latent dirichlet allocation" refers background in this paper

...(3) in terms of the model parameters: $p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$, a function which is intractable due to the coupling between θ and β in the summation over latent topics (Dickey, 1983)....

Journal Article•DOI•

Bayesian Methods for Censored Categorical Data

James M. Dickey¹, Jhy-Ming Jiang², Joseph B. Kadane³•Institutions (3)

University of Minnesota¹, University of Massachusetts Lowell², Carnegie Mellon University³

01 Sep 1987-Journal of the American Statistical Association

TL;DR: In this article, the posterior moments and predictive probabilities are proportional to ratios of B. C. Carlson's multiple hypergeometric functions, and closed-form expressions are developed for nested reported sets, when Bayesian estimates can be computed easily from relative frequencies.

Abstract: Bayesian methods are given for finite-category sampling when some of the observations suffer missing category distinctions. Dickey's (1983) generalization of the Dirichlet family of prior distributions is found to be closed under such censored sampling. The posterior moments and predictive probabilities are proportional to ratios of B. C. Carlson's multiple hypergeometric functions. Closed-form expressions are developed for the case of nested reported sets, when Bayesian estimates can be computed easily from relative frequencies. Effective computational methods are also given in the general case. An example involving surveys of death-penalty attitudes is used throughout to illustrate the theory. A simple special case of categorical missing data is a two-way contingency table with cross-classified count data xij (i = 1, …, r; j = 1, …, c), together with supplementary trials counted only in the margin distinguishing the rows, yi (i = 1, …, r). There could also be further supplementary trials report...

53 citations

"Latent dirichlet allocation" refers methods in this paper

...It has been used in a Bayesian context for censored discrete data to represent the posterior on θ which, in that setting, is a random parameter (Dickey et al., 1987)....

Proceedings Article•

Computer generated higher order expansions

M.A.R. Leisink¹, Hilbert J. Kappen¹•Institutions (1)

Radboud University Nijmegen¹

01 Aug 2002

TL;DR: This article implemented a computer algorithm to generate all necessary analytic terms for the Boltzmann machine partition function thus leading to lower bounds of any order, and it turns out that the extra variational parameters can be optimized analytically.

Abstract: In this article we show the rough outline of a computer algorithm to generate lower bounds on the exponential function of (in principle) arbitrary precision. We implemented this to generate all necessary analytic terms for the Boltzmann machine partition function thus leading to lower bounds of any order. It turns out that the extra variational parameters can be optimized analytically. We show that bounds up to ninth order are still reasonably calculable in practical situations. The generated terms can also be used as extra correction terms (beyond TAP) in mean field expansions.

6 citations

Additional excerpts

...In particular, Leisink and Kappen (2002) have presented a general methodology for converting low-order variational lower bounds into higher-order variational bounds....