Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published on this topic, receiving 212,555 citations. The topic is also known as: LDA.
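
For orientation, here is a minimal sketch of fitting LDA in Python with scikit-learn; the toy corpus, the choice of two topics, and all other settings are illustrative assumptions, not drawn from the papers below.

# Minimal LDA sketch (corpus and hyperparameters are illustrative assumptions).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

# LDA models word counts (bag-of-words), so vectorize with raw counts.
X = CountVectorizer(stop_words="english").fit_transform(docs)

# n_components is the number of topics K, which LDA requires up front.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row of the result is one document's distribution over the K topics.
print(lda.transform(X).round(2))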


Papers
Proceedings Article
21 Mar 2012
TL;DR: A variation of LDA in which documents' mixture-weight beliefs are replaced with squashed Gaussian distributions is studied, allowing documents to be associated with elements of a Hilbert space and admitting kernel topic models (KTM) that model temporal, spatial, hierarchical, social and other structure between documents; inference relies on an approximate algorithm cast around a Laplace approximation in a transformed basis.
Abstract: Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents’ mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.

40 citations
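
To make the "squashed Gaussian" idea concrete, the following toy numpy sketch (my own illustration, not the paper's code) contrasts standard LDA, which draws a document's topic proportions from a Dirichlet, with the KTM-style alternative, which draws a latent Gaussian vector and squashes it onto the probability simplex; correlating those Gaussian draws across documents through a kernel over document features is what yields the Hilbert-space structure.

import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of topics

# Standard LDA: topic proportions come from a Dirichlet distribution.
theta_lda = rng.dirichlet(alpha=np.ones(K))

# KTM-style: draw a latent Gaussian vector and squash it onto the simplex via
# softmax; in the full model a kernel correlates these draws across documents.
eta = rng.normal(loc=0.0, scale=1.0, size=K)
theta_ktm = np.exp(eta) / np.exp(eta).sum()

print(theta_lda.round(3), theta_ktm.round(3))  # both sum to 1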

Journal ArticleDOI
TL;DR: This article presents the results of different experiments on the model fitting process in topic modeling and its accuracy when applied to long texts, and reports a clear-cut power-law relation between the optimal number of topics and the analyzed sample size.
Abstract: The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate the data coding and analysis processes. In topic modeling, different procedures have been developed to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article is based on the following consideration: for Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS) 101(Supplement 1):5228–5235, 2004), the problem of model fitting is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the analysis results. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.

40 citations
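
The reported power-law relation, schematically T ≈ a · N^b for optimal topic count T and sample size N, can be recovered from experimental points by linear regression in log-log space, as the sketch below shows. The data points are made up purely to illustrate the fitting step; the paper's actual values and exponent are not reproduced here.

import numpy as np

# Hypothetical (sample size, optimal topic count) pairs; not the paper's data.
N = np.array([1_000, 5_000, 20_000, 100_000])
T = np.array([8, 15, 28, 55])

# log T = log a + b * log N, so fit a straight line in log-log space.
b, log_a = np.polyfit(np.log(N), np.log(T), deg=1)
print(f"T ~= {np.exp(log_a):.2f} * N^{b:.2f}")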

Proceedings ArticleDOI
13 Dec 2010
TL;DR: An unsupervised semantic labeling framework based on the Latent Dirichlet Allocation method is used to analyze over 70 images collected under different spatial and temporal settings across the globe, representing two major semantic categories: nuclear and coal power plants.
Abstract: Nuclear proliferation is a major national security concern for many countries. Existing feature extraction and classification approaches are not suitable for monitoring proliferation activity using high-resolution multi-temporal remote sensing imagery. In this paper we present an unsupervised semantic labeling framework based on the Latent Dirichlet Allocation method. This framework is used to analyze over 70 images collected under different spatial and temporal settings over the globe representing two major semantic categories: nuclear and coal power plants. Initial experimental results show a reasonable discrimination of these two categories even though they share highly overlapping and common objects. This research also identified several research challenges associated with nuclear proliferation monitoring using high-resolution remote sensing images.

40 citations
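
The abstract does not spell out the pipeline, but a common way to apply LDA to imagery is the bag-of-visual-words recipe sketched below: quantize local patch descriptors into a visual vocabulary, count visual words per image, and topic-model the counts. Every detail here (random descriptors standing in for e.g. SIFT features, vocabulary size, topic count) is an assumption for illustration only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Stand-in for local patch descriptors: 200 patches x 32 dims per "image".
descriptors = [rng.normal(size=(200, 32)) for _ in range(10)]

# Quantize all patches into a 50-word visual vocabulary.
vocab = KMeans(n_clusters=50, n_init=10, random_state=0).fit(np.vstack(descriptors))

# Each image becomes a histogram of visual-word counts: the "document" for LDA.
X = np.array([np.bincount(vocab.predict(d), minlength=50) for d in descriptors])

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)
print(lda.transform(X).shape)  # 10 images, each a distribution over 5 topics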

Proceedings ArticleDOI
13 Jun 2016
TL;DR: Experimental results indicate that document features generated by the hybrid method are useful for improving classification performance by consolidating both global and local relationships.
Abstract: Latent Dirichlet Allocation (LDA) is a probabilistic topic model to discover latent topics from documents and describe each document with a probability distribution over the discovered topics. It defines a global hierarchical relationship from words to a topic and then from topics to a document. Word2Vec is a word-embedding model to predict a target word from its surrounding contextual words. In this paper, we propose a hybrid approach to extract features from documents with bag-of-distances in a semantic space. By using both Word2Vec and LDA, our hybrid method not only generates the relationships between documents and topics, but also integrates the contextual relationships among words. Experimental results indicate that document features generated by our hybrid method are useful to improve classification performance by consolidating both global and local relationships.

40 citations
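
One plausible reading of the hybrid feature construction (a sketch under my own assumptions; the exact bag-of-distances computation is not detailed in the abstract) is to pair each document's global LDA topic distribution with a local feature derived from Word2Vec embeddings, such as the mean word vector:

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [["cats", "purr", "softly"], ["markets", "fell", "today"],
        ["dogs", "bark", "loudly"], ["investors", "sold", "shares"]]

dictionary = Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bows, num_topics=2, id2word=dictionary, random_state=0)
w2v = Word2Vec(docs, vector_size=16, min_count=1, seed=0)

def hybrid_features(tokens):
    # Global structure: the document's distribution over LDA topics.
    topic_vec = np.zeros(lda.num_topics)
    for t, p in lda.get_document_topics(dictionary.doc2bow(tokens),
                                        minimum_probability=0.0):
        topic_vec[t] = p
    # Local structure: average of the words' Word2Vec embeddings.
    word_vec = np.mean([w2v.wv[w] for w in tokens], axis=0)
    return np.concatenate([topic_vec, word_vec])

print(hybrid_features(docs[0]).shape)  # 2 topic dims + 16 embedding dims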

Journal ArticleDOI
TL;DR: This paper presents a new method to tackle the problem of data sparseness in the classification of short texts using statistical methods by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models.
Abstract: Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in SFT, the classification accuracy can be improved. In particular, our method proved more effective for fine-grained classification. Experiments on two short-text datasets demonstrate that our approach achieves improvements over state-of-the-art methods, including support vector machine (SVM) and multinomial naive Bayes.

40 citations
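
As a loose illustration of the weighting intuition only: the sketch below approximates information gain with scikit-learn's mutual information and up-weights the highest-scoring terms before classification. The paper's actual SFT additionally uses LDA topics, and every detail here (toy data, the 2x boost, the quantile threshold) is an assumption.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap meds online now", "meeting moved to friday",
        "win cash prizes now", "lunch with the team friday"]
labels = [1, 0, 1, 0]  # toy short-text classes

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Score each term by (approximate) information gain w.r.t. the class labels.
ig = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

# Give larger weights to the most informative terms, then classify.
boost = np.where(ig >= np.quantile(ig, 0.75), 2.0, 1.0)
clf = MultinomialNB().fit(X.multiply(boost), labels)
print(clf.predict(vec.transform(["cash prizes online"]).multiply(boost)))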


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations (86% related)
Support vector machine: 73.6K papers, 1.7M citations (86% related)
Deep learning: 79.8K papers, 2.1M citations (85% related)
Feature extraction: 111.8K papers, 2.1M citations (84% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446