Journal ArticleDOI

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
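The three-level generative process described above can be made concrete in a few lines: draw per-document topic proportions from a Dirichlet, assign each word slot a topic, then draw the word from that topic's distribution over the vocabulary. A minimal NumPy sketch; all dimensions, hyperparameter values, and variable names are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model dimensions (not from the paper)
K, V, n_docs, doc_len = 3, 8, 4, 10            # topics, vocabulary size, documents, words per doc
alpha = np.full(K, 0.5)                        # Dirichlet prior over per-document topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)  # K topic-word distributions over the vocabulary

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)               # per-document topic proportions
    z = rng.choice(K, size=doc_len, p=theta)   # topic assignment for each word slot
    w = [int(rng.choice(V, p=beta[k])) for k in z]  # word drawn from its topic
    docs.append(w)

print(docs[0])  # one synthetic document as a list of word ids
```

Inference reverses this process: given only the observed words, the variational EM procedure mentioned in the abstract estimates `alpha` and `beta` and approximates each document's posterior over `theta`.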


Citations
Journal ArticleDOI
TL;DR: The organization of the cerebral cortex was similar regardless of whether a winner-take-all approach or the more relaxed constraints of LDA (or ICA) were imposed, suggesting that large-scale networks may function as partially isolated modules.

221 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...We refer interested readers to Blei et al. (2003) for the probabilistic (and more wellknown) interpretation of LDA....

  • ...LDA was first introduced in the text mining literature (Blei et al., 2003)....

  • ...Here, we address the possibility of multiple network membership by applying latent Dirichlet allocation (LDA; Blei et al., 2003) and spatial Independent Component Analysis (ICA; Calhoun et al....

  • ...First, we applied the mixture model (Yeo et al., 2011) and LDA model (Blei et al., 2003) to both the GSP and HCP group datasets, in order to examine how cortical network organization changes as regions are permitted to participate in multiple networks (Fig....

Patent
07 Oct 2009
TL;DR: In this patent, the authors present a method for computing business intelligence metrics on unstructured data. They do not specify how the extracted data and metadata are classified for each document, only that the ingested data is automatically classified into one or more relevance classes.
Abstract: Various embodiments of the present invention disclose a method for Business Intelligence (BI) metrics on unstructured data. Unstructured data is collected from numerous data sources that include unstructured data as ingested data. The ingested data is indexed and represents hyperlink and extracted data and metadata for each document. Thereafter, the ingested data is automatically classified into one or more relevance classes. Further, numerous analytics are performed on the classified data to generate business intelligence metrics that may be presented on an access device operated by a user.

220 citations

Journal ArticleDOI
TL;DR: A new model for text analysis is proposed that makes use of the sentence structure contained in the reviews and it is shown that it leads to improved inference and prediction of consumer ratings relative to existing models using data from www.expedia.com and www.we8there.com.
Abstract: Firms collect an increasing amount of consumer feedback in the form of unstructured consumer reviews. These reviews contain text about consumer experiences with products and services that are different from surveys that query consumers for specific information. A challenge in analyzing unstructured consumer reviews is in making sense of the topics that are expressed in the words used to describe these experiences. We propose a new model for text analysis that makes use of the sentence structure contained in the reviews and show that it leads to improved inference and prediction of consumer ratings relative to existing models using data from www.expedia.com and www.we8there.com. Sentence-based topics are found to be more distinguished and coherent than those identified from a word-based analysis. Data, as supplemental material, are available at https://doi.org/10.1287/mksc.2016.0993.

220 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...This is the key idea of latent topic modeling in Latent Dirichlet Allocation (Blei et al., 2003) and the Author-Topic Model (Rosen-Zvi et al....

  • ...The model and analysis presented in this paper is based on a class of models that are generally known as “topic” models (Blei et al. 2003; Rosen-Zvi et al. 2010), where the words contained in a consumer review reflect a latent set of ideas or sentiments, each of which is expressed with its own vocabulary....

  • ...The standard LDA model proposed by (Blei et al., 2003) employs a Bayesian approach to augment the unobserved topic assignments zw of the words w....

  • ...A simple model for the analysis of latent topics in text data is the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003)....

Proceedings ArticleDOI
24 Jul 2011
TL;DR: A scalable two-stage personalized news recommendation approach with a two-level representation, which considers the exclusive characteristics of news items when performing recommendation, and a principled framework for news selection based on the intrinsic property of user interest is presented.
Abstract: Recommending news articles has become a promising research direction as the Internet provides fast access to real-time information from multiple sources around the world. Traditional news recommendation systems strive to adapt their services to individual users by virtue of both user and news content information. However, the latent relationships among different news items, and the special properties of new articles, such as short shelf lives and value of immediacy, render the previous approaches inefficient. In this paper, we propose a scalable two-stage personalized news recommendation approach with a two-level representation, which considers the exclusive characteristics (e.g., news content, access patterns, named entities, popularity and recency) of news items when performing recommendation. Also, a principled framework for news selection based on the intrinsic property of user interest is presented, with a good balance between the novelty and diversity of the recommended result. Extensive empirical experiments on a collection of news articles obtained from various news websites demonstrate the efficacy and efficiency of our approach.

219 citations


Cites background from "Latent dirichlet allocation"

  • ...Generally speaking, news content is often represented using vector space model (e.g., TF-IDF) [15], or topic distributions obtained by language models (e.g., PLSI and LDA), and specific similarity measurements are adopted to evaluate the relatedness between news articles....

  • ...Blei argues that this step is cheating because the model is essentially refitted to the new data [3]....

  • ...Discussion: The PLSI model and the LDA model are similar, except that in LDA the topic distribution is assumed to have a Dirichlet prior....

  • ...Based on our analysis in Section 4.3, LDA tends to perform better than PLSI in terms of topic detection when the dataset is relatively small....

  • ...From the result, we have the following observations: (i) LDA-based recommender system has stable recommendation performance in terms of F-score, regardless of different size of news corpus; and (ii) PLSI-based recommender system has comparable results when the news corpus becomes larger....

Proceedings ArticleDOI
01 Oct 2017
TL;DR: It is shown that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics, and that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher.
Abstract: This paper assesses topic coherence and human topic ranking of uncovered latent topics from scientific publications when utilizing the topic model latent Dirichlet allocation (LDA) on abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis that states that words with similar meaning tend to co-occur within a similar context. Although LDA has gained much attention from machine-learning researchers, most notably with its adaptations and extensions, little is known about the effects of different types of textual data on generated topics. Our research is the first to explore these practical effects and shows that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher. Differences between abstract and full-text data are more apparent within small document collections, with differences as large as 90% high-quality topics for full-text data, compared to 50% high-quality topics for abstract data.
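Coherence scores of this kind are computed from co-occurrence statistics of a topic's top words. One widely used variant is the UMass measure, sketched below from scratch over a toy document-term matrix; the function name, the +1 smoothing constant, and the data are illustrative assumptions, and the paper itself may use a different coherence variant.

```python
import numpy as np

def umass_coherence(top_words, doc_term):
    """UMass coherence for one topic's top words.

    top_words: term indices, ordered by probability under the topic.
    doc_term:  boolean (n_docs x n_terms) document-term presence matrix.
    """
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_wj = doc_term[:, wj].sum()                      # docs containing w_j
            d_ij = (doc_term[:, wi] & doc_term[:, wj]).sum()  # docs containing both
            score += np.log((d_ij + 1) / d_wj)                # smoothed log co-occurrence ratio
    return score

# Toy corpus: 4 documents, 3 terms (presence/absence)
dt = np.array([[1, 1, 0],
               [1, 1, 0],
               [1, 0, 1],
               [0, 0, 1]], dtype=bool)
print(umass_coherence([0, 1], dt))  # terms 0 and 1 always co-occur -> score of 0.0
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same documents, which is the intuition behind using coherence as a proxy for topic quality.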

219 citations


Cites background from "Latent dirichlet allocation"

  • ...One of the most popular and highly researched topic models is latent Dirichlet allocation (LDA) [6]....

  • ...Unfortunately, computation of the posterior is intractable due to the denominator [6]....

References
Book
01 Jan 1995
TL;DR: Detailed notes on Bayesian Computation Basics of Markov Chain Simulation, Regression Models, and Asymptotic Theorems are provided.
Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. FUNDAMENTALS OF BAYESIAN DATA ANALYSIS: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. ADVANCED COMPUTATION: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. REGRESSION MODELS: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. NONLINEAR AND NONPARAMETRIC MODELS: Parametric Nonlinear Models; Basic Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. APPENDICES: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations


"Latent dirichlet allocation" refers background in this paper

  • ...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....

  • ...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....

Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. initial tests find this completely automatic method for retrieval to be promising.
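The pipeline in the abstract — truncate the SVD of the term-by-document matrix, fold queries in as pseudo-document vectors, and rank documents by cosine similarity — can be sketched with plain NumPy. The matrix, query, and choice of k = 2 below are toy assumptions; the paper itself uses on the order of 100 factors.

```python
import numpy as np

# Toy term-by-document matrix (5 terms x 4 docs); counts are illustrative
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2],
              [1, 0, 0, 1]], dtype=float)

k = 2                                    # number of latent factors (paper uses ~100)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :]  # rank-k truncation

doc_vecs = (np.diag(sk) @ Vk).T          # each document as a k-vector of factor weights

# Fold a query in as a pseudo-document: q_hat = q^T U_k S_k^{-1}
q = np.array([1, 1, 0, 0, 0], dtype=float)  # query containing terms 0 and 1
q_vec = q @ Uk @ np.diag(1 / sk)

# Rank documents by cosine similarity to the query
cos = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(np.argsort(-cos))  # document indices, most similar first
```

Because documents and queries live in the same low-rank factor space, a document can match a query even when they share no terms verbatim — the "semantic structure" the abstract refers to.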

12,443 citations


"Latent dirichlet allocation" refers methods in this paper

  • ...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....

Book
01 Jan 1983
TL;DR: A classic textbook on automatic information retrieval, covering indexing, retrieval models, and term-weighting schemes such as tf-idf.

12,059 citations


"Latent dirichlet allocation" refers background or methods in this paper

  • ...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....

  • ...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....

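The tf-idf scheme quoted above admits several weighting variants; a common one multiplies length-normalized term frequency by log inverse document frequency. A toy sketch — the counts and this exact formula are illustrative choices, not necessarily the variant Salton and McGill present:

```python
import numpy as np

# Toy corpus: rows are documents, columns are vocabulary terms (raw counts)
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency, normalized by doc length
df = (counts > 0).sum(axis=0)                    # number of documents containing each term
idf = np.log(n_docs / df)                        # inverse document frequency
tfidf = tf * idf                                 # tf-idf weight matrix

print(np.round(tfidf, 3))
```

Terms that occur in every document get idf = 0 and are weighted out entirely, which is the dimensionality-pressure the LDA paper contrasts against: tf-idf reweights the term space but does not reduce it.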
Book
01 Jan 1939
TL;DR: In this book, the author develops the fundamental notions of probability, direct probabilities, estimation problems, approximate methods and simplifications, and significance tests for one new parameter and for various complications.
Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations