Latent Dirichlet allocation

doi:10.5555/944919.944937

Journal Article•DOI•

David M. Blei¹, Andrew Y. Ng², Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Stanford University²

01 Mar 2003-Journal of Machine Learning Research (JMLR.org)-Vol. 3, pp 993-1022

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
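
The generative view described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the two toy topics, the symmetric Dirichlet parameter, and the fixed document length are all assumptions made for the example (the paper draws the document length from a Poisson distribution).

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topics, n_words, rng):
    """Generate one document under the LDA generative process.

    alpha   : Dirichlet parameter, one positive value per topic (assumed).
    topics  : list of dicts mapping word -> probability (toy values).
    n_words : fixed length; a simplification of the paper's Poisson draw.
    """
    # Per-document topic proportions.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
topics = [
    {"gene": 0.5, "dna": 0.4, "ball": 0.1},   # hypothetical "genetics" topic
    {"ball": 0.6, "team": 0.3, "gene": 0.1},  # hypothetical "sports" topic
]
theta, doc = generate_document([0.5, 0.5], topics, 8, rng)
print(theta, doc)
```

Here `theta` is the per-document vector of topic probabilities that the abstract calls an explicit representation of a document. Inference runs in the opposite direction, recovering `theta` and the topics from observed words.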

Citations

Proceedings Article•DOI•

Bad news travel fast: a content-based analysis of interestingness on Twitter

Nasir Naveed¹, Thomas Gottron¹, Jérôme Kunegis¹, Arifah Che Alhadi¹•Institutions (1)

University of Koblenz and Landau¹

15 Jun 2011

TL;DR: This paper analyzes a set of high- and low-level content-based features on several large collections of Twitter messages to obtain insights into what makes a message on Twitter worth retweeting and, thus, interesting.

Abstract: On the microblogging site Twitter, users can forward any message they receive to all of their followers. This is called a retweet and is usually done when users find a message particularly interesting and worth sharing with others. Thus, retweets reflect what the Twitter community considers interesting on a global scale, and can be used as a function of interestingness to generate a model to describe the content-based characteristics of retweets. In this paper, we analyze a set of high- and low-level content-based features on several large collections of Twitter messages. We train a prediction model to forecast for a given tweet its likelihood of being retweeted based on its contents. From the parameters learned by the model we deduce what are the influential content features that contribute to the likelihood of a retweet. As a result we obtain insights into what makes a message on Twitter worth retweeting and, thus, interesting.

382 citations

Proceedings Article•DOI•

Rated aspect summarization of short comments

Yue Lu¹, ChengXiang Zhai¹, Neel Sundaresan²•Institutions (2)

University of Illinois at Urbana–Champaign¹, eBay²

20 Apr 2009

TL;DR: The proposed methods are quite general and can be used to generate rated aspect summary automatically given any collection of short comments each associated with an overall rating.

Abstract: Web 2.0 technologies have enabled more and more people to freely comment on different kinds of entities (e.g. sellers, products, services). The large scale of information poses the need and challenge of automatic summarization. In many cases, each of the user-generated short comments comes with an overall rating. In this paper, we study the problem of generating a "rated aspect summary" of short comments, which is a decomposed view of the overall ratings for the major aspects so that a user could gain different perspectives towards the target entity. We formally define the problem and decompose the solution into three steps. We demonstrate the effectiveness of our methods by using eBay sellers' feedback comments. We also quantitatively evaluate each step of our methods and study how well humans agree on such a summarization task. The proposed methods are quite general and can be used to generate rated aspect summaries automatically given any collection of short comments each associated with an overall rating.

381 citations

Journal Article•DOI•

Facebook language predicts depression in medical records.

Johannes C. Eichstaedt¹, Robert J. Smith¹, Raina M. Merchant¹, Lyle H. Ungar¹, Patrick Crutchley¹, Daniel Preoţiuc-Pietro¹, David A. Asch¹,², H. Andrew Schwartz³•Institutions (3)

University of Pennsylvania¹, Veterans Health Administration², Stony Brook University³

30 Oct 2018-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: It is shown that the content shared by consenting users on Facebook can predict a future occurrence of depression in their medical records, and language predictive of depression includes references to typical symptoms, including sadness, loneliness, hostility, rumination, and increased self-reference.

Abstract: Depression, the most prevalent mental illness, is underdiagnosed and undertreated, highlighting the need to extend the scope of current screening methods. Here, we use language from Facebook posts of consenting individuals to predict depression recorded in electronic medical records. We accessed the history of Facebook statuses posted by 683 patients visiting a large urban academic emergency department, 114 of whom had a diagnosis of depression in their medical records. Using only the language preceding their first documentation of a diagnosis of depression, we could identify depressed patients with fair accuracy [area under the curve (AUC) = 0.69], approximately matching the accuracy of screening surveys benchmarked against medical records. Restricting Facebook data to only the 6 months immediately preceding the first documented diagnosis of depression yielded a higher prediction accuracy (AUC = 0.72) for those users who had sufficient Facebook data. Significant prediction of future depression status was possible as far as 3 months before its first documentation. We found that language predictors of depression include emotional (sadness), interpersonal (loneliness, hostility), and cognitive (preoccupation with the self, rumination) processes. Unobtrusive depression assessment through social media of consenting individuals may become feasible as a scalable complement to existing screening and monitoring procedures.
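
For readers unfamiliar with the AUC figures quoted above, the metric can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below illustrates that definition only; the scores and labels are made-up toy values, not data from the study.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores: 2 positive cases, 3 negative cases.
scores = [0.9, 0.4, 0.6, 0.3, 0.5]
labels = [1, 1, 0, 0, 0]
print(auc(scores, labels))
```

In this toy case 4 of the 6 positive-negative pairs are ordered correctly, so the AUC is about 0.67. A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.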

381 citations

Journal Article•DOI•

A review of natural language processing techniques for opinion mining systems

Shiliang Sun¹, Chen Luo¹, Junyu Chen¹•Institutions (1)

East China Normal University¹

01 Jul 2017-Information Fusion

TL;DR: This paper introduces general NLP techniques which are required for text preprocessing, and investigates the approaches of opinion mining for different levels and situations, and introduces comparative opinion mining and deep learning approaches for opinion mining.

381 citations

Cites methods from "Latent Dirichlet allocation"

...Li et al. [73] proposed a Dependency-Sentiment-LDA model, which assumes that the sentiments of words form a Markov chain, i.e., the sentiment of a word is dependent on previous ones....
[...]
...Probabilistic generative model based approaches LDA topic model and its variants are adopted for aspects detection [88] and jointly aspect and sentiment detection [93, 94]....
[...]
...Probabilistic generative model based approaches Inspired by the LDA topic model, some generative models were proposed, including joint sentiment topic model [72] and dependency-sentiment-LDA model [73], which model the transitions between sentiments of words with a Markov chain....
[...]
...OpenNLP (JAVA): The Apache OpenNLP is a JAVA library for the processing of natural language texts, which supports common tasks including tokenization, sentence segmentation, POS tagging, named entity recognition, parsing, and coreference resolution. https://opennlp.apache.org
CoreNLP [54] (JAVA): Stanford CoreNLP is a framework which supports not only basic NLP tasks, such as POS tagging, named entity recognition, parsing, coreference resolution, but also advanced sentiment analysis [55]. http://stanfordnlp.github.io/CoreNLP/
Gensim [56] (Python): Gensim is an open source library for topic modeling which includes online Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projection, Hierarchical Dirichlet Process and word2vec....
[...]
...Several unsupervised methods based on Latent Dirichlet Allocation (LDA) [12] have been proposed to release the dependence of annotated corpora....
[...]

Proceedings Article•DOI•

Computing a nonnegative matrix factorization -- provably

Sanjeev Arora¹, Rong Ge¹, Ravi Kannan², Ankur Moitra³•Institutions (3)

Princeton University¹, Microsoft², Institute for Advanced Study³

19 May 2012

TL;DR: This work gives an algorithm that runs in time polynomial in n, m and r under the separability condition identified by Donoho and Stodden in 2003, and is the first polynomial-time algorithm that provably works under a non-trivial condition on the input matrix.

Abstract: The Nonnegative Matrix Factorization (NMF) problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where the factorization is computed using a variety of local search heuristics. Vavasis recently proved that this problem is NP-complete. We initiate a study of when this problem is solvable in polynomial time. Consider a nonnegative $m \times n$ matrix $M$ and a target inner-dimension $r$. Our results are the following: - We give a polynomial-time algorithm for exact and approximate NMF for every constant $r$. Indeed NMF is most interesting in applications precisely when $r$ is small. We complement this with a hardness result, that if exact NMF can be solved in time $(nm)^{o(r)}$, 3-SAT has a sub-exponential time algorithm. Hence, substantial improvements to the above algorithm are unlikely. - We give an algorithm that runs in time polynomial in $n$, $m$ and $r$ under the separability condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings. To the best of our knowledge, this last result is the first polynomial-time algorithm that provably works under a non-trivial condition on the input matrix and we believe that this will be an interesting and important direction for future work.

380 citations

Cites background from "Latent Dirichlet allocation"

...For instance it is usually satisfied by derived parameters fitted to various generative models (e.g. LDA [5] in information retrieval) [4]....
[...]
...This condition is usually satisfied [4] by model parameters fitted to various generative models (e.g. LDA [5] in information retrieval) and seems to be quite a natural condition....
[...]

References

Book•

Handbook of mathematical functions : with formulas, graphs, and mathematical tables

Milton Abramowitz, Irene A. Stegun

01 Jan 1970

17,608 citations

Book•

Bayesian Data Analysis

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin¹•Institutions (1)

University of California, Irvine¹

01 Jan 1995

TL;DR: This text covers the fundamentals of Bayesian inference and Bayesian data analysis, advanced computation including Markov chain simulation, regression models, and nonlinear and nonparametric models.

Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. FUNDAMENTALS OF BAYESIAN DATA ANALYSIS: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. ADVANCED COMPUTATION: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. REGRESSION MODELS: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. NONLINEAR AND NONPARAMETRIC MODELS: Parametric Nonlinear Models; Basis Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. APPENDICES: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations

"Latent Dirichlet allocation" refers to this paper for background

...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....
[...]
...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....
[...]

Journal Article•DOI•

Indexing by Latent Semantic Analysis

Scott Deerwester¹, Susan T. Dumais², George W. Furnas², Thomas K. Landauer², Richard A. Harshman³•Institutions (3)

University of Chicago¹, Telcordia Technologies², University of Western Ontario³

01 Sep 1990-Journal of the Association for Information Science and Technology

TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.

Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

12,443 citations

"Latent Dirichlet allocation" refers to this paper for methods

...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....
[...]

Book•

Introduction to Modern Information Retrieval

Gerard Salton, Michael J. McGill

01 Jan 1983

12,059 citations

"Latent Dirichlet allocation" refers to this paper for background or methods

...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....
[...]
...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....
[...]
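
The tf-idf scheme mentioned in the first excerpt above can be illustrated with a short sketch. The weighting used here, raw term count times log(N / document frequency), is one common variant chosen as an assumption for the example; the excerpt does not specify the exact normalization, and the three toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Term counts per document (whitespace tokenization, an assumption).
    counts = [Counter(doc.split()) for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

docs = ["the topic model", "the model inference", "the topic topic"]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

A term that occurs in every document, such as "the" here, receives weight zero, while a term concentrated in few documents is weighted up. Each document is thereby reduced to a fixed-length vector of real numbers indexed by the vocabulary.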

Book•

Theory of probability

Harold Jeffreys, R. Bruce Lindsay

01 Jan 1939

TL;DR: The book covers fundamental notions of probability, direct probabilities, estimation problems, approximate methods and simplifications, significance tests, and frequency definitions and direct methods.

Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations