Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over the lifetime of the topic, 5,351 publications have been published, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) is presented; it is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data.
Abstract: We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors known as word embeddings (WE) from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be represented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR, using benchmark CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by the combination of the WE-based approach and a unigram language model. We also report significant improvements of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA) in ad-hoc IR tasks.
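
Once such a shared space exists, the retrieval step the paper builds on reduces to simple vector arithmetic. Below is a minimal sketch in Python (with an invented two-dimensional toy embedding dict standing in for trained BWESG vectors): documents and queries are embedded by averaging their word vectors, and Dutch documents are ranked against an English query by cosine similarity.

    import numpy as np

    # Toy shared English/Dutch embedding space; in the paper these
    # vectors would come from the trained BWESG model.
    emb = {
        "cat": np.array([1.0, 0.0]), "kat": np.array([0.9, 0.1]),
        "dog": np.array([0.0, 1.0]), "hond": np.array([0.1, 0.9]),
    }

    def embed(tokens):
        # Compose a document/query embedding as the mean of its word vectors.
        vecs = [emb[t] for t in tokens if t in emb]
        return np.mean(vecs, axis=0) if vecs else np.zeros(2)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    docs = {"d1": ["kat", "kat", "hond"], "d2": ["hond"]}  # Dutch documents
    q = embed(["cat"])                                     # English query
    ranked = sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)
    print(ranked)  # ['d1', 'd2']: d1's centroid lies closest to "cat"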

303 citations

Journal ArticleDOI
TL;DR: An effective static technique for automatic bug localization can be built around Latent Dirichlet allocation (LDA), and there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base.
Abstract: Context: Some recent static techniques for automatic bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI). Latent Dirichlet allocation (LDA) is a generative statistical model that has significant advantages, in modularity and extensibility, over both LSI and probabilistic LSI (pLSI). Moreover, LDA has been shown effective in topic model based information retrieval. In this paper, we present a static LDA-based technique for automatic bug localization and evaluate its effectiveness. Objective: We evaluate the accuracy and scalability of the LDA-based technique and investigate whether it is suitable for use with open-source software systems of varying size, including those developed using agile methods. Method: We present five case studies designed to determine the accuracy and scalability of the LDA-based technique, as well as its relationships to software system size and to source code stability. The studies examine over 300 bugs across more than 25 iterations of three software systems. Results: The results of the studies show that the LDA-based technique maintains sufficient accuracy across all bugs in a single iteration of a software system and is scalable to a large number of bugs across multiple revisions of two software systems. The results of the studies also indicate that the accuracy of the LDA-based technique is not affected by the size of the subject software system or by the stability of its source code base. Conclusion: We conclude that an effective static technique for automatic bug localization can be built around LDA. We also conclude that there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base. Thus, the LDA-based technique is widely applicable.
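
As a rough illustration of this kind of pipeline (not the authors' exact implementation), the sketch below uses scikit-learn's LDA with placeholder file contents: source files are treated as documents, the bug report is projected into the learned topic space, and files are ranked by cosine similarity to it.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Placeholder "source files"; real inputs would be identifiers and
    # comments extracted from the code base.
    files = {
        "parser.py": "parse token grammar syntax tree parse",
        "net.py":    "socket connect timeout retry socket",
        "render.py": "draw pixel buffer color draw frame",
    }
    bug_report = "crash when socket connection times out"

    vec = CountVectorizer()
    X = vec.fit_transform(files.values())
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

    file_topics = lda.transform(X)                        # files in topic space
    bug_topics = lda.transform(vec.transform([bug_report]))[0]

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    scores = {name: cosine(t, bug_topics) for name, t in zip(files, file_topics)}
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))  # net.py is expected to rank first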

299 citations

Proceedings ArticleDOI
Deepak Agarwal, Bee-Chung Chen
04 Feb 2010
TL;DR: The authors proposed fLDA, a novel matrix factorization method to predict ratings in recommender system applications where a "bag-of-words" representation for item meta-data is natural. Such scenarios are commonplace in web applications like content recommendation, ad targeting and web search, where items are articles, ads and web pages respectively. Because of data sparseness, regularization is key to good predictive accuracy.
Abstract: We propose fLDA, a novel matrix factorization method to predict ratings in recommender system applications where a "bag-of-words" representation for item meta-data is natural. Such scenarios are commonplace in web applications like content recommendation, ad targeting and web search, where items are articles, ads and web pages respectively. Because of data sparseness, regularization is key to good predictive accuracy. Our method works by regularizing both user and item factors simultaneously through user features and the bag of words associated with each item. Specifically, each word in an item is associated with a discrete latent factor often referred to as the topic of the word; item topics are obtained by averaging topics across all words in an item. Then, user rating on an item is modeled as the user's affinity to the item's topics, where user affinity to topics (user factors) and topic assignments to words in items (item factors) are learned jointly in a supervised fashion. To avoid overfitting, user and item factors are regularized through Gaussian linear regression and Latent Dirichlet Allocation (LDA) priors respectively. We show our model is accurate, interpretable and handles both cold-start and warm-start scenarios seamlessly through a single model. The efficacy of our method is illustrated on benchmark datasets and a new dataset from Yahoo! Buzz, where fLDA provides superior predictive accuracy in cold-start scenarios and is comparable to state-of-the-art methods in warm-start scenarios. As a by-product, fLDA also identifies interesting topics that explain user-item interactions. Our method also generalizes a recently proposed technique called supervised LDA (sLDA) to collaborative filtering applications. While sLDA estimates item topic vectors in a supervised fashion for a single regression, fLDA incorporates multiple regressions (one for each user) in estimating the item factors.
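
fLDA learns the topic assignments and the per-user regressions jointly; the sketch below (made-up data) decouples the two stages purely for illustration: item factors come from an off-the-shelf LDA over item text, and a single user's topic affinities are then fit by ridge regression on that user's observed ratings.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import Ridge

    # Invented item meta-data: items 0 and 2 are sports-like, item 1 is not.
    item_text = [
        "football match goal league score",
        "election vote senate policy debate",
        "league season coach football team",
    ]
    vec = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    item_topics = lda.fit_transform(vec.fit_transform(item_text))  # item factors

    # One user's observed ratings on items 0 and 1; the regression
    # coefficients play the role of the user's affinity to each topic.
    user_model = Ridge(alpha=1.0).fit(item_topics[[0, 1]], [5.0, 1.0])

    # Cold-start prediction for unseen item 2 from its topic vector alone;
    # it should land closer to 5 than to 1 since item 2 resembles item 0.
    print(user_model.predict(item_topics[[2]]))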

296 citations

Proceedings Article
19 Aug 2005
TL;DR: This work proposes a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account, and demonstrates the utility and practicality of the relational entity resolution approach for author resolution in two real-world bibliographic datasets.
Abstract: Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it is a) generative, b) does not make pair-wise decisions and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.
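
The model itself is generative and samples hidden group variables, but the collective flavor (using relations between references rather than attributes alone) can be seen in a much simpler toy, sketched below with invented names: same-name references are merged only when they also share a co-author, which separates two distinct authors who happen to be called "J. Smith".

    # Toy relational disambiguation; NOT the paper's LDA-based model.
    refs = [
        {"name": "J. Smith", "coauthors": {"A. Jones"}},
        {"name": "J. Smith", "coauthors": {"A. Jones", "B. Lee"}},
        {"name": "J. Smith", "coauthors": {"C. Wu"}},
    ]

    parent = list(range(len(refs)))        # union-find: one cluster per reference

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i, r in enumerate(refs):
        for j in range(i + 1, len(refs)):
            s = refs[j]
            if r["name"] == s["name"] and r["coauthors"] & s["coauthors"]:
                parent[find(j)] = find(i)  # merge: shared collaborator

    clusters = {}
    for i in range(len(refs)):
        clusters.setdefault(find(i), []).append(i)
    print(list(clusters.values()))         # [[0, 1], [2]]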

293 citations

Posted Content
TL;DR: SDA-Bayes is a framework for streaming, distributed, asynchronous computation of a Bayesian posterior; it makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive.
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data---a case where SVI may be applied---and in the streaming setting, where SVI does not apply.
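
The streaming recursion at the heart of the framework is easiest to see in a conjugate model, where the batch "approximation primitive" is exact. A minimal sketch with toy Bernoulli data (the paper's actual setting is variational Bayes on LDA): the posterior after each mini-batch becomes the prior for the next.

    import numpy as np

    rng = np.random.default_rng(0)
    stream = (rng.random(50) < 0.3).astype(int)  # Bernoulli(0.3) observations

    alpha, beta = 1.0, 1.0                       # Beta(1, 1) prior
    for batch in np.array_split(stream, 5):      # five streaming mini-batches
        alpha += batch.sum()                     # successes update alpha
        beta += len(batch) - batch.sum()         # failures update beta
        print(f"posterior mean ~ {alpha / (alpha + beta):.3f}")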

291 citations


Network Information
Related Topics (5)

Cluster analysis: 146.5K papers, 2.9M citations (86% related)
Support vector machine: 73.6K papers, 1.7M citations (86% related)
Deep learning: 79.8K papers, 2.1M citations (85% related)
Feature extraction: 111.8K papers, 2.1M citations (84% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Performance Metrics

No. of papers in the topic in previous years:

Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446