Latent Dirichlet Allocation

doi:10.5555/944919.944937

David M. Blei¹, Andrew Y. Ng², Michael I. Jordan¹

University of California, Berkeley¹, Stanford University²

01 Mar 2003-Journal of Machine Learning Research (JMLR.org)-Vol. 3, pp 993-1022

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
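The generative story summarized in the abstract can be illustrated with a small sketch. This is only a toy illustration, not the authors' implementation: the vocabulary, the topic-word matrix `beta`, the Dirichlet parameter `alpha`, and the fixed document length are all hypothetical values chosen for the example (the paper draws the document length from a Poisson distribution). Each document draws its own topic proportions from a Dirichlet, and each word then draws a topic followed by a word from that topic.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document under the LDA generative process.

    alpha: Dirichlet parameter, one positive value per topic (hypothetical).
    beta:  beta[k][v] = probability of vocabulary word v under topic k.
    """
    theta = sample_dirichlet(alpha, rng)  # document-level topic proportions
    topics, words = [], []
    for _ in range(n_words):
        # Word-level topic assignment drawn from the document's proportions.
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        # Word drawn from the chosen topic's word distribution.
        w = rng.choices(range(len(vocab)), weights=beta[z])[0]
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


rng = random.Random(0)
vocab = ["gene", "dna", "cell", "ball", "team", "score"]  # toy vocabulary
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "biology" topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # hypothetical "sports" topic
]
alpha = [0.5, 0.5]
theta, topics, words = generate_document(alpha, beta, vocab, 8, rng)
print(theta, topics, words)
```

The three levels show up as the corpus-level parameters (`alpha`, `beta`), the document-level variable `theta`, and the word-level variables `z` and `w`.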

Citations

Journal Article

Personalized Recommendation via Integrated Diffusion on User-Item-Tag Tripartite Graphs

Zi-Ke Zhang¹, Tao Zhou¹,², Yi-Cheng Zhang¹

University of Fribourg¹, University of Science and Technology of China²

01 Jan 2010-Physica A-statistical Mechanics and Its Applications

TL;DR: Experimental results demonstrate that the usage of tag information can significantly improve accuracy, diversification and novelty of recommendations.

Abstract: Personalized recommender systems are confronting great challenges of accuracy, diversification and novelty, especially when the data set is sparse and lacks accessorial information, such as user profiles, item attributes and explicit ratings. Collaborative tags contain rich information about personalized preferences and item contents, and are therefore potential to help in providing better recommendations. In this article, we propose a recommendation algorithm based on an integrated diffusion on user–item–tag tripartite graphs. We use three benchmark data sets, Del.icio.us, MovieLens and BibSonomy, to evaluate our algorithm. Experimental results demonstrate that the usage of tag information can significantly improve accuracy, diversification and novelty of recommendations.

231 citations

Cites methods from "Latent dirichlet allocation"

...eworks of collaborative filtering [11] and iterative diffusion algorithm [9], as well as some more complicated methods such as Probabilistic Latent Semantic Analysis [36], Latent Dirichlet Allocation [37] and Iterative Latent Semantic Analysis [38]. Systematic investigation on tag-aware recommendation algorithms must be very helpful in the future design of recommender systems. 5. Acknowledgement We ack...

Proceedings Article

Short text conceptualization using a probabilistic knowledgebase

Yangqiu Song¹, Haixun Wang¹, Zhongyuan Wang¹, Hongsong Li¹, Weizhu Chen¹

Microsoft¹

16 Jul 2011

TL;DR: This paper develops a Bayesian inference mechanism to conceptualize words and short text by using a probabilistic knowledgebase that is as rich as the authors' mental world in terms of the concepts it contains and brings significant improvements in short text understanding.

Abstract: Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclusions can be drawn easily. In this paper, we improve text understanding by using a probabilistic knowledgebase that is as rich as our mental world in terms of the concepts (of worldly facts) it contains. We then develop a Bayesian inference mechanism to conceptualize words and short text. We conducted comprehensive experiments on conceptualizing textual terms, and clustering short pieces of text such as Twitter messages. Compared to purely statistical methods such as latent semantic topic modeling or methods that use existing knowledge-bases (e.g., WordNet, Freebase and Wikipedia), our approach brings significant improvements in short text understanding as reflected by the clustering accuracy.

231 citations

Cites background from "Latent dirichlet allocation"

...Statistical topic modeling [Blei et al., 2003; Blei and Lafferty, 2009] also requires sufficient words in a document to infer the document topic distribution....
...Compared with traditional latent semantic analysis (LSA) [Deerwester et al., 1990] and topic modeling such as latent Dirichlet allocation (LDA) [Blei et al., 2003], explicit semantic analysis (ESA) has the advantage of providing semantics that are interpretable by human beings....

Journal Article

A survey on classification techniques for opinion mining and sentiment analysis

Fatemeh Hemmatian¹, Mohammad Karim Sohrabi¹

Islamic Azad University¹

01 Oct 2019-Artificial Intelligence Review

TL;DR: This paper represents a complete, multilateral and systematic review of opinion mining and sentiment analysis to classify available methods and compare their advantages and drawbacks, in order to have better understanding of available challenges and solutions to clarify the future direction.

Abstract: Opinion mining is considered as a subfield of natural language processing, information retrieval and text mining. Opinion mining is the process of extracting human thoughts and perceptions from unstructured texts, which with regard to the emergence of online social media and mass volume of users’ comments, has become to a useful, attractive and also challenging issue. There are varieties of researches with different trends and approaches in this area, but the lack of a comprehensive study to investigate them from all aspects is tangible. In this paper we represent a complete, multilateral and systematic review of opinion mining and sentiment analysis to classify available methods and compare their advantages and drawbacks, in order to have better understanding of available challenges and solutions to clarify the future direction. For this purpose, we present a proper framework of opinion mining accompanying with its steps and levels and then we completely monitor, classify, summarize and compare proposed techniques for aspect extraction, opinion classification, summary production and evaluation, based on the major validated scientific works. In order to have a better comparison, we also propose some factors in each category, which help to have a better understanding of advantages and disadvantages of different methods.

231 citations

Cites methods from "Latent dirichlet allocation"

...Their results show that systems based on LDA provide useful information about their staff members....
...Ma et al. (2015) proposed an approach of probabilistic topic model based on LDA in order to semantic search over citizens opinions about city issues on online platforms....
...Also LDA and LSA use the bag of words represented in documents, so they can be used only in document level opinion mining....
...Since this approach, uses statistical methods like latent semantic analysis (LSA) (Hofmann 1999) and latent Dirichlet allocation (LDA) (Blei et al. 2003), it is called statistical models too....

Journal Article

Automatic detection of cyberbullying in social media text

Cynthia Van Hee¹, Gilles Jacobs¹, Chris Emmery², Bart Desmet¹, Els Lefever¹, Ben Verhoeven², Guy De Pauw², Walter Daelemans², Veronique Hoste¹

Ghent University¹, University of Antwerp²

08 Oct 2018-PLOS ONE

TL;DR: This paper describes the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and performs a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection.

Abstract: While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1 score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems.

231 citations

Cites methods from "Latent dirichlet allocation"

...• Topic model features: by making use of the Gensim topic modelling library [80], several LDA [81] and LSI [82] topic models with varying granularity (k = 20, 50, 100 and 200) were trained on data corresponding to each fine-grained category of a cyberbullying event (e.g. threats, defamations, insults, defenses)....

Journal Article

Large-Scale Meta-Analysis of Human Medial Frontal Cortex Reveals Tripartite Functional Organization

Alejandro de la Vega¹, Luke J. Chang², Marie T. Banich¹, Tor D. Wager¹, Tal Yarkoni³

University of Colorado Boulder¹, Dartmouth College², University of Texas at Austin³

15 Jun 2016-The Journal of Neuroscience

TL;DR: A meta-analysis across nearly 10,000 fMRI studies to comprehensively map psychological states to discrete subregions in medial frontal cortex using relatively unbiased data-driven methods provides hypotheses about the functional organization of medial prefrontal cortex that can be tested explicitly in future studies.

Abstract: The functional organization of human medial frontal cortex (MFC) is a subject of intense study. Using fMRI, the MFC has been associated with diverse psychological processes, including motor function, cognitive control, affect, and social cognition. However, there have been few large-scale efforts to comprehensively map specific psychological functions to subregions of medial frontal anatomy. Here we applied a meta-analytic data-driven approach to nearly 10,000 fMRI studies to identify putatively separable regions of MFC and determine which psychological states preferentially recruit their activation. We identified regions at several spatial scales on the basis of meta-analytic coactivation, revealing three broad functional zones along a rostrocaudal axis composed of 2–4 smaller subregions each. Multivariate classification analyses aimed at identifying the psychological functions most strongly predictive of activity in each region revealed a tripartite division within MFC, with each zone displaying a relatively distinct functional signature. The posterior zone was associated preferentially with motor function, the middle zone with cognitive control, pain, and affect, and the anterior with reward, social processing, and episodic memory. Within each zone, the more fine-grained subregions showed distinct, but subtler, variations in psychological function. These results provide hypotheses about the functional organization of medial prefrontal cortex that can be tested explicitly in future studies.

SIGNIFICANCE STATEMENT: Activation of medial frontal cortex in fMRI studies is associated with a wide range of psychological states ranging from cognitive control to pain. However, this high rate of activation makes it challenging to determine how these various processes are topologically organized across medial frontal anatomy. We conducted a meta-analysis across nearly 10,000 studies to comprehensively map psychological states to discrete subregions in medial frontal cortex using relatively unbiased data-driven methods. This approach revealed three distinct zones that differed substantially in function, each of which were further subdivided into 2–4 smaller subregions that showed additional functional variation. Each individual region was recruited by multiple psychological states, suggesting subregions of medial frontal cortex are functionally heterogeneous.

231 citations

Cites methods from "Latent dirichlet allocation"

...To remedy this problem, we used a reduced semantic representation of the latent conceptual structure underlying the neuroimaging literature: a set of 60 topics derived using latent dirichlet allocation topic modeling (Blei et al., 2003)....

References

Book

Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables

Milton Abramowitz, Irene A. Stegun

01 Jan 1970

17,608 citations

Book

Bayesian Data Analysis

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin¹

University of California, Irvine¹

01 Jan 1995

TL;DR: Detailed notes on Bayesian Computation Basics of Markov Chain Simulation, Regression Models, and Asymptotic Theorems are provided.

Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE Probability and Inference Single-Parameter Models Introduction to Multiparameter Models Asymptotics and Connections to Non-Bayesian Approaches Hierarchical Models FUNDAMENTALS OF BAYESIAN DATA ANALYSIS Model Checking Evaluating, Comparing, and Expanding Models Modeling Accounting for Data Collection Decision Analysis ADVANCED COMPUTATION Introduction to Bayesian Computation Basics of Markov Chain Simulation Computationally Efficient Markov Chain Simulation Modal and Distributional Approximations REGRESSION MODELS Introduction to Regression Models Hierarchical Linear Models Generalized Linear Models Models for Robust Inference Models for Missing Data NONLINEAR AND NONPARAMETRIC MODELS Parametric Nonlinear Models Basic Function Models Gaussian Process Models Finite Mixture Models Dirichlet Process Models APPENDICES A: Standard Probability Distributions B: Outline of Proofs of Asymptotic Theorems C: Computation in R and Stan Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations

"Latent dirichlet allocation" refers background in this paper

...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....
...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....

Journal Article

Indexing by Latent Semantic Analysis

Scott Deerwester¹, Susan T. Dumais², George W. Furnas², Thomas K. Landauer², Richard A. Harshman³

University of Chicago¹, Telcordia Technologies², University of Western Ontario³

01 Sep 1990-Journal of the Association for Information Science and Technology

TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.

Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

12,443 citations

"Latent dirichlet allocation" refers methods in this paper

...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....

Book

Introduction to Modern Information Retrieval

Gerard Salton, Michael J. McGill

01 Jan 1983

12,059 citations

"Latent dirichlet allocation" refers background or methods in this paper

...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....
...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....
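The tf-idf scheme mentioned in the excerpt above (Salton and McGill, 1983) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact weighting used in the cited book or paper: it uses raw term counts for the term frequency and log(N / df) for the inverse document frequency, which is only one of several common variants. The function name `tfidf_matrix` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows), where rows[d][v] is the tf-idf weight of term v in document d.

    Assumed variant: tf = raw count, idf = log(number of documents / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # per-document term counts
        rows.append([counts[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, rows


docs = [
    "topic models for text",
    "text retrieval with term counts",
    "term weighting for retrieval",
]
vocab, rows = tfidf_matrix(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(x, 3) for x in row])
```

Each document becomes a fixed-length vector over the vocabulary, and terms that occur in fewer documents receive larger weights.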

Book

Theory of Probability

Harold Jeffreys, R. Bruce Lindsay

01 Jan 1939

TL;DR: In this paper, the authors introduce the concept of direct probabilities, approximate methods and simplifications, and significant importance tests for various complications, including one new parameter, and various complications for frequency definitions and direct methods.

Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations