Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Journal ArticleDOI
TL;DR: In this paper, an unsupervised machine learning technique based on the probabilistic generative model of Latent Dirichlet Allocation is proposed to learn the underlying structure of collider events directly from the data.
Abstract: We describe a technique to learn the underlying structure of collider events directly from the data, without having a particular theoretical model in mind. It allows one to infer aspects of the theoretical model that may have given rise to this structure, and can be used to cluster or classify the events for analysis purposes. The unsupervised machine-learning technique is based on the probabilistic (Bayesian) generative model of Latent Dirichlet Allocation. We pair the model with an approximate inference algorithm called Variational Inference, which we then use to extract the latent probability distributions describing the learned underlying structure of collider events. We provide a detailed systematic study of the technique using two example scenarios to learn the latent structure of di-jet event samples made up of QCD background events and either $$ t\overline{t} $$ or hypothetical W′ → (ϕ → WW)W signal events.
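A minimal sketch of this recipe, using scikit-learn's LatentDirichletAllocation (which is trained with variational Bayes): collider events are assumed to have already been reduced to counts over a vocabulary of binned observables, and the toy Poisson counts below stand in for that representation. None of this reproduces the authors' actual pipeline.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Assumption: each event is a "bag of words" of counts over a fixed
# vocabulary of binned substructure observables (toy data here).
n_events, vocab_size = 1000, 50
event_counts = rng.poisson(lam=2.0, size=(n_events, vocab_size))

# Two latent "themes" (e.g., background-like vs. signal-like structure);
# scikit-learn fits LDA with variational inference.
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
theta = lda.fit_transform(event_counts)  # per-event topic proportions
beta = lda.components_                   # per-topic word weights (unnormalized)

# Events can then be clustered/classified by their dominant latent theme.
labels = theta.argmax(axis=1)
```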

32 citations

Proceedings ArticleDOI
24 Sep 2007
TL;DR: Experimental results indicate that extending a traditional language-model-based approach to information retrieval with a grounded language model nearly doubles performance on a held-out test set.
Abstract: This paper presents a methodology for automatically indexing a large corpus of broadcast baseball games using an unsupervised content-based approach. The method relies on learning a grounded language model which maps query terms to the non-linguistic context to which they refer. Grounded language models are learned from a large, unlabeled corpus of video events. Events are represented using a codebook of automatically discovered temporal patterns of low-level features extracted from the raw video. These patterns are associated with words extracted from the closed-captioning text using a generalization of Latent Dirichlet Allocation. We evaluate the benefit of the grounded language model by extending a traditional language-model-based approach to information retrieval. Experimental results indicate that using a grounded language model nearly doubles performance on a held-out test set.
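A toy sketch of the grounding idea under one simplifying assumption: caption words and visual codewords are pooled into a single vocabulary, so a shared LDA topic space links query terms to non-linguistic context. The event strings, the vis_ token prefix, and the affinity score are all invented for illustration, not the paper's model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" mixes closed-caption words with discovered visual
# pattern IDs (prefixed vis_) from the same video event.
events = [
    "homerun crowd cheer vis_17 vis_42 vis_42",
    "strikeout pitcher vis_03 vis_08",
    "homerun swing vis_17 vis_42",
    "strikeout swing vis_03 vis_08 vis_08",
]
vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(events)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Shared topics tie query terms to visual patterns: score every vocabulary
# item against "homerun" via sum_k p(w|k) * p(homerun|k).
vocab = np.array(vec.get_feature_names_out())
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
qi = list(vocab).index("homerun")
grounding = topic_word.T @ topic_word[:, qi]
print(vocab[np.argsort(grounding)[::-1][:5]])  # patterns grounded in the query
```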

32 citations

Journal ArticleDOI
TL;DR: A new perspective in dynamic classification for SITS is offered, and several studies on forms of relief, weather forecasts, and very high resolution images are used to explain the wide range of structures responsible for influencing the dynamics inside the resolution cell.
Abstract: With a continuous increase in the number of Earth Observation satellites, leading to the development of satellite image time series (SITS), the number of algorithms for land cover analysis and monitoring has greatly expanded. This paper offers a new perspective in dynamic classification for SITS. Four similarity measures (correlation coefficient, Kullback-Leibler divergence, conditional information, and normalized compression distance) based on consecutive image pairs from the data are employed. These measures employ linear dependences, statistical measures, and spatial relationships to compute radiometric, spectral, and texture changes that offer a description of the multitemporal behavior of the SITS. During this process, the original SITS is converted to a change map time series (CMTS), which removes the static information from the data set. The CMTS is analyzed using a latent Dirichlet allocation (LDA) model capable of discovering classes with semantic meaning based on the latent information hidden in the scene. This statistical method was originally used for text classification, thus requiring a word-document-corpus analogy with the elements inside the image. The experimental results were computed using 11 Landsat images over the city of Bucharest and surrounding areas. The LDA model enables us to discover a wide range of scene evolution classes based on the various dynamic behaviors of the land cover. The results are compared with the CORINE Land Cover map; however, this is not a validation method but rather one that adds static knowledge about the general usage of the analyzed area. To help the interpretation of the results, we use several studies on forms of relief, weather forecasts, and very high resolution images that can explain the wide range of structures responsible for influencing the dynamics inside the resolution cell.
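A much-simplified sketch of the CMTS construction: change values between consecutive images are computed per patch, quantized into a tiny vocabulary of change "words", and each patch's change history becomes a document for LDA. Using only the correlation coefficient, an 8x8 patch size, and random stand-in imagery are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
sits = rng.random((11, 64, 64))  # 11 images, as in the paper's test site
P = 8                            # patch ("resolution cell") size

def patch_corr(a, b):
    """Correlation coefficient between co-located patches of two images."""
    out = []
    for i in range(0, a.shape[0], P):
        for j in range(0, a.shape[1], P):
            pa, pb = a[i:i+P, j:j+P].ravel(), b[i:i+P, j:j+P].ravel()
            out.append(np.corrcoef(pa, pb)[0, 1])
    return np.array(out)

# Change map time series: one change value per patch per consecutive pair.
cmts = np.stack([patch_corr(sits[t], sits[t + 1]) for t in range(len(sits) - 1)])

# Quantize change values into 4 "words" and count them per patch (document).
words = np.digitize(cmts, np.quantile(cmts, [0.25, 0.5, 0.75]))
counts = np.stack([np.bincount(words[:, p], minlength=4)
                   for p in range(words.shape[1])])

lda = LatentDirichletAllocation(n_components=3, random_state=0)
evolution_class = lda.fit_transform(counts).argmax(axis=1)  # per-patch class
```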

32 citations

Journal ArticleDOI
01 Mar 2014
TL;DR: The proposed sLDA not only reduces the model perplexity but also reduces memory and computation costs, and the Bayesian feature selection method effectively identifies relevant topic words for building a sparse topic model.
Abstract: This paper presents a new Bayesian sparse learning approach to select salient lexical features for sparse topic modeling. The Bayesian learning based on latent Dirichlet allocation (LDA) is performed by incorporating spike-and-slab priors. According to this sparse LDA (sLDA), the spike distribution is used to select salient words, while the slab distribution is applied to establish the latent topic model based on those selected relevant words. A variational inference procedure is developed to estimate the prior parameters for sLDA. In experiments on document modeling using LDA and sLDA, we find that the proposed sLDA not only reduces the model perplexity but also reduces memory and computation costs. The Bayesian feature selection method effectively identifies relevant topic words for building a sparse topic model.
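A crude stand-in for the sLDA mechanism, since full spike-and-slab variational inference is beyond a short sketch: here a simple corpus statistic plays the role of the spike that selects salient words, and a standard LDA over the reduced vocabulary plays the role of the slab. The saliency proxy and the toy counts are assumptions, not the paper's method.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(200, 300))  # toy document-term counts

# Crude saliency proxy: keep words whose counts vary across documents
# (above-median variance stands in for the spike indicator here).
saliency = X.var(axis=0)
selected = saliency > np.median(saliency)

# "Slab": a standard LDA over the selected (sparse) vocabulary only,
# which shrinks both memory and per-iteration cost.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topic = lda.fit_transform(X[:, selected])
print(f"vocabulary reduced from {X.shape[1]} to {selected.sum()} words")
```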

32 citations

Journal ArticleDOI
TL;DR: A probabilistic topic model is proposed, adapted from Latent Dirichlet Allocation (LDA), to discover representative and interpretable activity categorization from individual-level spatiotemporal data in an unsupervised manner and can successfully distinguish the three most basic types of activities.
Abstract: Although automatically collected human travel records can accurately capture the time and location of human movements, they do not directly explain the hidden semantic structures behind the data, e.g., activity types. This work proposes a probabilistic topic model, adapted from Latent Dirichlet Allocation (LDA), to discover representative and interpretable activity categorization from individual-level spatiotemporal data in an unsupervised manner. Specifically, the activity-travel episodes of an individual user are treated as words in a document, and each topic is a distribution over space and time that corresponds to a certain type of activity. The model accounts for a mixture of discrete and continuous attributes: the location, start time of day, start day of week, and duration of each activity episode. The proposed methodology is demonstrated using pseudonymized transit smart card data from London, U.K. The results show that the model can successfully distinguish the three most basic types of activities: home, work, and other. As the specified number of activity categories increases, more specific subpatterns for home and work emerge, and both the goodness of fit and the predictive performance for travel behavior improve. This work makes it possible to enrich human mobility data with representative and interpretable activity patterns without relying on predefined activity categories or heuristic rules.
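A rough sketch of the adaptation under one big simplification: the paper keeps start time and duration continuous, whereas here every attribute of an episode (zone, start hour, day type, duration) is discretized into a composite word, with one document per user. All names and data below are synthetic.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

def episode_word(zone, start_hour, weekday, dur_h):
    """Collapse one activity episode into a discrete composite token."""
    day = "wk" if weekday < 5 else "we"
    return f"z{zone}_h{start_hour // 4}_{day}_d{min(int(dur_h // 4), 3)}"

# One document per user: a space-joined bag of episode words.
docs = []
for _ in range(300):
    eps = [episode_word(rng.integers(20), rng.integers(24),
                        rng.integers(7), rng.uniform(0, 16))
           for _ in range(40)]
    docs.append(" ".join(eps))

X = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)  # home/work/other
user_mix = lda.fit_transform(X)  # each user's mix over activity categories
```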

32 citations


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446