Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have appeared within this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Proceedings Article
13 May 2013
TL;DR: This paper proposes a probabilistic graphical model based on LDA, called Factorized LDA (FLDA), to address the cold start problem and demonstrates the improved effectiveness of the FLDA model in terms of likelihood of the held-out test set.
Abstract: Aspect-based opinion mining from online reviews has attracted a lot of attention recently. The main goal of all of the proposed methods is extracting aspects and/or estimating aspect ratings. Recent works, which are often based on Latent Dirichlet Allocation (LDA), consider both tasks simultaneously. These models are normally trained at the item level, i.e., a model is learned for each item separately. Learning a model per item is fine when the item has been reviewed extensively and has enough training data. However, in real-life data sets such as those from Epinions.com and Amazon.com, more than 90% of items have less than 10 reviews, so-called cold start items. State-of-the-art LDA models for aspect-based opinion mining are trained at the item level and therefore perform poorly for cold start items due to the lack of sufficient training data. In this paper, we propose a probabilistic graphical model based on LDA, called Factorized LDA (FLDA), to address the cold start problem. The underlying assumption of FLDA is that aspects and ratings of a review are influenced not only by the item but also by the reviewer. It further assumes that both items and reviewers can be modeled by a set of latent factors which represent their aspect and rating distributions. Different from state-of-the-art LDA models, FLDA is trained at the category level and learns the latent factors using the reviews of all the items of a category, in particular the non cold start items, and uses them as prior for cold start items. Our experiments on three real-life data sets demonstrate the improved effectiveness of the FLDA model in terms of likelihood of the held-out test set. We also evaluate the accuracy of FLDA based on two application-oriented measures.

81 citations
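
The FLDA entry above reports effectiveness as the likelihood of a held-out test set. As a rough sketch of how such a score is commonly computed for a topic model (this is not the paper's own evaluation code), the Python below assumes the topic-word distributions `phi` and the per-document topic proportions `theta` have already been estimated. The function names and the toy numbers are illustrative assumptions.

```python
import math

def doc_log_likelihood(tokens, theta, phi):
    """Log-likelihood of one held-out document under a fitted topic model.

    tokens: list of word ids in the document
    theta:  K topic proportions for this document (assumed given, sums to 1)
    phi:    K lists of V word probabilities, one list per topic
    """
    total = 0.0
    for w in tokens:
        # Mixture probability of word w: sum over topics of theta_k * phi_k[w].
        p_w = sum(t * topic[w] for t, topic in zip(theta, phi))
        total += math.log(p_w)
    return total

def perplexity(docs, thetas, phi):
    """exp(-total log-likelihood / total token count) over held-out documents."""
    ll = sum(doc_log_likelihood(d, th, phi) for d, th in zip(docs, thetas))
    n_tokens = sum(len(d) for d in docs)
    return math.exp(-ll / n_tokens)

# Toy, made-up model: 2 topics over a 3-word vocabulary.
phi = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
thetas = [[1.0, 0.0], [0.5, 0.5]]
docs = [[0, 1], [1, 2]]

print(doc_log_likelihood(docs[0], thetas[0], phi))
print(perplexity(docs, thetas, phi))
```

A higher held-out log-likelihood (equivalently, a lower perplexity) indicates a better fit. In practice `theta` for unseen documents has to be inferred first, which this sketch leaves out.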

Journal Article
TL;DR: This work addresses the topic-based community key-members extraction problem, for which the method combines both text mining and social network analysis techniques.
Abstract: The study of extremist groups and their interaction is a crucial task in order to maintain homeland security and peace. Tools such as social network analysis and text mining have contributed to their understanding in order to develop counter-terrorism applications. This work addresses the topic-based community key-members extraction problem, for which our method combines both text mining and social network analysis techniques. This is achieved by first applying latent Dirichlet allocation to build two topic-based social networks in online forums: one oriented towards the thread creator's point of view, and the other oriented towards the repliers of the overall forum. Then, by using different network analysis measures, topic-based key members are evaluated, using as a benchmark a social network built from a plain representation of the network of posts. Experiments were successfully performed using an English-language forum available in the Dark Web portal.

81 citations

Proceedings Article
13 Oct 2015
TL;DR: A novel framework, namely multi-query expansions, is proposed to retrieve semantically robust landmarks in two steps, together with a novel technique to generate a robust yet compact pattern set from the multi-query photos.
Abstract: Given a query photo issued by a user (q-user), landmark retrieval returns a set of photos whose landmarks are similar to those of the query. Existing studies on landmark retrieval focus on exploiting the geometries of landmarks for similarity matching between candidate photos and a query photo. We observe that the same landmarks provided by different users may convey different geometry information depending on the viewpoints and/or angles, and may subsequently yield very different results. In fact, dealing with the landmarks with shapes caused by the photography of q-users is often nontrivial and has never been studied. Motivated by this, in this paper we propose a novel framework, namely multi-query expansions, to retrieve semantically robust landmarks in two steps. Firstly, we identify the top-k photos regarding the latent topics of a query landmark to construct a multi-query set so as to remedy its possible shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Secondly, we propose a novel technique to generate a robust yet compact pattern set from the multi-query photos. To eliminate redundancy and improve efficiency, we adopt the existing minimum-description-length-principle based pattern mining techniques to remove similar query photos from the (k+1) selected query photos. Then, a landmark retrieval rule is developed to calculate the ranking scores between the mined pattern set and each photo in the database, which are ranked to serve as the final ranking list of landmark retrieval. Extensive experiments are conducted on real-world landmark datasets, validating the significantly higher accuracy of our approach.

81 citations

Proceedings Article
11 Dec 2008
TL;DR: The results demonstrate the effectiveness of probabilistic topic models in automatically summarizing the temporal dynamics of software concerns, with direct application to project management and program understanding, for two large, open source Java projects, Eclipse and Argo UML.
Abstract: We develop and apply unsupervised statistical topic models, in particular latent Dirichlet allocation, to identify functional components of source code and study their evolution over multiple project versions. We present results for two large, open source Java projects, Eclipse and Argo UML, which are well-known and well-studied within the software mining community. Our results demonstrate the effectiveness of probabilistic topic models in automatically summarizing the temporal dynamics of software concerns, with direct application to project management and program understanding. In addition to detecting the emergence of topics on the release timeline which represent integration points for key source code functionality, our techniques can also be used to pinpoint refactoring events in the underlying software design, as well as to identify general programming concepts whose prevalence depends only on the size of the code base to be analyzed. Complete results are available from our supplementary materials website at http://sourcerer.ics.uci.edu/icmla2008/software_evolution.html.

80 citations
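
The entry above tracks how topics emerge across releases. One simple way to picture that idea, offered only as an assumption-laden sketch and not as the authors' method, is to average the per-file topic proportions within each version and flag the first version in which a topic's average share crosses a threshold. The function names, the threshold, and the toy data below are all hypothetical.

```python
def topic_prevalence(version_thetas):
    """Mean topic proportion per version.

    version_thetas: dict mapping a version label to a list of per-file
    topic-proportion vectors (each of length K), assumed already inferred.
    """
    prevalence = {}
    for version, thetas in version_thetas.items():
        num_topics = len(thetas[0])
        prevalence[version] = [
            sum(theta[k] for theta in thetas) / len(thetas)
            for k in range(num_topics)
        ]
    return prevalence

def emerging_topics(prevalence, versions, threshold):
    """(version, topic) pairs where a topic's mean share first reaches threshold."""
    seen = set()
    events = []
    for version in versions:  # versions must be given in release order
        for k, share in enumerate(prevalence[version]):
            if share >= threshold and k not in seen:
                seen.add(k)
                events.append((version, k))
    return events

# Toy, made-up data: two files per version, two topics.
version_thetas = {
    "v1": [[1.0, 0.0], [0.5, 0.5]],
    "v2": [[0.5, 0.5], [0.0, 1.0]],
}
prev = topic_prevalence(version_thetas)
print(prev)
print(emerging_topics(prev, ["v1", "v2"], threshold=0.5))
```

Averaging treats every file equally; weighting by file size would be an equally plausible choice.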

Journal Article
TL;DR: The key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context.
Abstract: Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.

80 citations
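
The feature location entry above retrieves source code entities for a query using LDA. The abstract does not state the scoring rule, so the sketch below assumes one common choice: rank each entity by the log-likelihood of the query words under that entity's topic mixture. The entity names, the smoothing constant `eps`, and the toy distributions are illustrative assumptions.

```python
import math

def query_log_likelihood(query, theta, phi, eps=1e-12):
    """Log-probability of the query word ids under one entity's topic mixture.

    eps is an assumed small constant that guards against log(0).
    """
    score = 0.0
    for w in query:
        p_w = sum(t * topic[w] for t, topic in zip(theta, phi))
        score += math.log(p_w + eps)
    return score

def rank_entities(query, thetas, phi):
    """Entity names sorted from most to least likely to match the query."""
    scores = {
        name: query_log_likelihood(query, theta, phi)
        for name, theta in thetas.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up model: 2 topics over a 4-word vocabulary.
phi = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
# Hypothetical source code entities with already-inferred topic proportions.
thetas = {
    "Parser.parse": [0.9, 0.1],
    "Renderer.draw": [0.1, 0.9],
    "Util.log": [0.5, 0.5],
}
query = [0, 1]  # word ids drawn mostly from topic 0

print(rank_entities(query, thetas, phi))
```

In a configuration study like the one described, the number of topics and the Dirichlet hyperparameters used to fit `phi` and `thetas` would be the values being varied.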


Network Information
Related Topics (5)
Cluster analysis
146.5K papers, 2.9M citations
86% related
Support vector machine
73.6K papers, 1.7M citations
86% related
Deep learning
79.8K papers, 2.1M citations
85% related
Feature extraction
111.8K papers, 2.1M citations
84% related
Convolutional neural network
74.7K papers, 2M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446