Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Journal ArticleDOI
TL;DR: This work proposes methods to construct appropriate generalized Dirichlet and Liouville priors for naive Bayesian classifiers; experimental results reveal that the generalized Dirichlet distribution has the best performance among the three distribution families.
Abstract: The prior distribution of an attribute in a naive Bayesian classifier is typically assumed to be a Dirichlet distribution, and this is called the Dirichlet assumption. The variables in a Dirichlet random vector can never be positively correlated and must have the same confidence level as measured by normalized variance. Both the generalized Dirichlet and the Liouville distributions include the Dirichlet distribution as a special case. These two multivariate distributions, also defined on the unit simplex, are employed to investigate the impact of the Dirichlet assumption in naive Bayesian classifiers. We propose methods to construct appropriate generalized Dirichlet and Liouville priors for naive Bayesian classifiers. Our experimental results on 18 data sets reveal that the generalized Dirichlet distribution has the best performance among the three distribution families. Not only is the Dirichlet assumption inappropriate, but forcing all the variables in a prior to be positively correlated can also degrade the performance of the naive Bayesian classifier.

36 citations
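
To make the contrast concrete, here is a minimal sketch of how the two priors smooth a multinomial attribute in a naive Bayesian classifier: the Dirichlet case is the familiar additive smoothing, while the generalized Dirichlet case uses its conjugacy to the multinomial and its stick-breaking means. The parameter choices (`alpha`, `a`, `b`) are illustrative assumptions, not the prior-construction method proposed in the paper.

```python
import numpy as np

def dirichlet_posterior_mean(counts, alpha=1.0):
    # Posterior mean under a symmetric Dirichlet(alpha) prior:
    # the standard additive smoothing used in naive Bayes.
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

def generalized_dirichlet_posterior_mean(counts, a, b):
    # Posterior mean under a generalized Dirichlet prior GD(a, b),
    # using its conjugacy to the multinomial:
    #   a_i' = a_i + n_i,   b_i' = b_i + (n_{i+1} + ... + n_k).
    counts = np.asarray(counts, dtype=float)
    k = len(counts)
    tail = np.concatenate([np.cumsum(counts[::-1])[::-1][1:], [0.0]])
    a_post = np.asarray(a, dtype=float) + counts[:-1]
    b_post = np.asarray(b, dtype=float) + tail[:-1]
    # Stick-breaking representation: p_i = x_i * prod_{j<i}(1 - x_j),
    # with independent x_i ~ Beta(a_i', b_i').
    x_mean = a_post / (a_post + b_post)
    stick = np.concatenate([[1.0], np.cumprod(1.0 - x_mean)])
    p = np.empty(k)
    p[:-1] = x_mean * stick[:-1]
    p[-1] = stick[-1]
    return p

counts = [7, 2, 1]  # hypothetical attribute-value counts for one class
print(dirichlet_posterior_mean(counts))
# This particular (a, b) reduces to Dirichlet(1, 1, 1), so the two
# estimates coincide; other choices decouple the component variances.
print(generalized_dirichlet_posterior_mean(counts, a=[1.0, 1.0], b=[2.0, 1.0]))
```

Unlike the Dirichlet, the GD parameters let each component carry its own Beta(a_i, b_i) confidence, which is the extra freedom the paper exploits.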

Posted Content
TL;DR: Transfer learning generally improves offensive language detection; the effect of three different strategies for mitigating the negative effects of 'catastrophic forgetting' during transfer learning is also investigated.
Abstract: We investigate different strategies for automatic offensive language classification on German Twitter data. For this, we employ a sequentially combined BiLSTM-CNN neural network. Based on this model, we test three transfer learning tasks that use background knowledge to improve classification performance. We compare 1. supervised category transfer: social media data annotated with near-offensive language categories; 2. weakly supervised category transfer: tweets annotated with the emojis they contain; 3. unsupervised category transfer: tweets annotated with topic clusters obtained by Latent Dirichlet Allocation (LDA). Further, we investigate three different strategies for mitigating the negative effects of 'catastrophic forgetting' during transfer learning. Our results indicate that transfer learning in general improves offensive language detection. The best results are achieved by pre-training our model on the unsupervised topic clustering of tweets in combination with thematic user cluster information.

36 citations
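
The unsupervised-transfer variant is easy to sketch: fit LDA on unlabeled tweets and use each tweet's dominant topic as a weak pseudo-label for pre-training. A minimal sketch using gensim follows; the toy token lists and hyperparameters are assumptions for illustration, not the paper's setup.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical tokenized tweets; real pre-training would use a large corpus.
tweets = [
    ["refugees", "border", "policy"],
    ["football", "match", "goal"],
    ["border", "policy", "debate"],
    ["goal", "match", "win"],
]

dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=0)

# The dominant topic per tweet serves as a weak "category" label on which
# the classifier can be pre-trained before fine-tuning on offensive labels.
pseudo_labels = [max(lda.get_document_topics(bow), key=lambda tp: tp[1])[0]
                 for bow in corpus]
print(pseudo_labels)
```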

Book ChapterDOI
23 Aug 2006
TL;DR: LDA is a “bag-of-words” language modeling and dimension reduction method reported to outperform related methods, Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA), in the Information Retrieval (IR) domain.
Abstract: We report experiments on automatic essay grading using Latent Dirichlet Allocation (LDA). LDA is a “bag-of-words” type of language modeling and dimension reduction method, reported to outperform other related methods, Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA), in the Information Retrieval (IR) domain. We introduce LDA in detail and compare its strengths and weaknesses to LSA and PLSA. We also compare empirically the performance of LDA to LSA and PLSA. The experiments were run with three essay sets comprising 283 essays in total from different domains. Contrary to the findings in IR, LDA achieved slightly worse results than LSA and PLSA in the experiments. We state the reasons for LSA and PLSA outperforming LDA and indicate further research directions.

36 citations
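
One common way to turn LDA into an essay grader, in the spirit of the LSA-based graders the paper compares against, is to score each essay by the similarity of its topic mixture to a reference answer. A minimal sketch assuming scikit-learn; the essays, component count, and cosine-similarity scoring are illustrative assumptions rather than the paper's exact procedure.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reference answer and student essays.
reference = ["photosynthesis converts light energy into chemical energy in plants"]
essays = ["plants use light to make sugar from carbon dioxide and water",
          "the french revolution began in 1789"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reference + essays)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)  # per-document topic proportions

# Grade each essay by how close its topic mixture is to the reference,
# analogous to how LSA-based graders compare documents in latent space.
scores = cosine_similarity(theta[1:], theta[:1]).ravel()
print(scores)
```

With corpora this small the topic estimates are noisy, which hints at the paper's finding: LDA's advantage over LSA/PLSA does not automatically carry over to small essay sets.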

Journal ArticleDOI
TL;DR: A novel ranking framework, cluster-constrained conditional Markov random walk (CCCMRW), is proposed, with two key steps: first, cluster images into topics; then perform a Markov random walk on an image graph conditioned on constraints from the image cluster information.
Abstract: In this paper, we consider the problem of clustering and re-ranking web image search results so as to improve diversity at high ranks. We propose a novel ranking framework, namely cluster-constrained conditional Markov random walk (CCCMRW), which has two key steps: first, cluster images into topics, and then perform a Markov random walk in an image graph conditioned on constraints of image cluster information. In order to cluster the retrieval results of web images, a novel graph clustering model is proposed in this paper. We explore the surrounding text to mine the correlations between words and images, and use these correlations to improve clustering results. Two kinds of correlations, namely word-to-image and word-to-word correlations, are mainly considered. As a standard text-processing technique, the tf-idf method cannot measure the word-to-image correlation directly. We therefore propose to combine tf-idf with a novel word feature, namely visibility, to infer the word-to-image correlation. Using a latent Dirichlet allocation model, we define a topic relevance function to compute the weights of word-to-word correlations. Taking word-to-image correlations as heterogeneous links and word-to-word correlations as homogeneous links, graph clustering algorithms, such as complex graph clustering and spectral co-clustering, are used to cluster images into topics. To perform CCCMRW, a two-layer image graph is constructed, with image cluster nodes as an upper layer added to a base image graph. Conditioned on the image cluster information from the upper layer, the Markov random walk is constrained to favor walking across different image clusters, so as to give high rank scores to images of different topics and thereby gain diversity. Encouraging clustering and re-ranking results on Google image search results are reported in this paper.

36 citations
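
The cross-cluster bias is the interesting part, and a toy version fits in a few lines: augment a base image-graph walk with a step that routes through the image's cluster node and lands in a different cluster. This is a hypothetical simplification for intuition, with made-up affinities and a `beta` mixing weight; it is not the paper's exact CCCMRW construction.

```python
import numpy as np

def cluster_constrained_walk(W, clusters, beta=0.3, damping=0.85, iters=100):
    # W: symmetric image-image affinity matrix (n x n).
    # clusters: cluster id per image.
    # beta: probability of routing a step through the upper layer, which
    # jumps uniformly to an image of a *different* cluster (the constraint).
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)  # base random-walk kernel
    C = np.zeros((n, n))
    for c in np.unique(clusters):
        inside = clusters == c
        outside = ~inside
        C[np.ix_(inside, outside)] = 1.0 / outside.sum()
    T = (1 - beta) * P + beta * C         # cluster-constrained kernel
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                # damped power iteration
        r = (1 - damping) / n + damping * r @ T
    return r                              # stationary scores used for ranking

# Five images in two visual clusters: {0, 1, 2} and {3, 4}.
W = np.array([[0, 3, 3, 1, 0],
              [3, 0, 3, 0, 1],
              [3, 3, 0, 1, 1],
              [1, 0, 1, 0, 3],
              [0, 1, 1, 3, 0]], dtype=float)
print(cluster_constrained_walk(W, np.array([0, 0, 0, 1, 1])))
```

Raising `beta` shifts stationary mass toward the smaller cluster, which is the diversity effect the re-ranking aims for.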

Proceedings Article
07 Dec 2009
TL;DR: A new algorithmic framework for inference in probabilistic models is described and applied to inference for latent Dirichlet allocation (LDA); it offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation.
Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a "population-based" sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost.

36 citations
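
The bias-for-variance exchange can be seen in a toy importance sampler: with a variational proposal that misses one mode of the target, tempering the importance weights interpolates between the pure variational answer (low variance, biased) and full importance sampling (less biased, higher variance). The following is a self-contained numerical illustration of that trade-off, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    # Target: symmetric two-mode Gaussian mixture, so E_p[x] = 0.
    return np.logaddexp(np.log(0.5) - 0.5 * (x + 2) ** 2,
                        np.log(0.5) - 0.5 * (x - 2) ** 2) - 0.5 * np.log(2 * np.pi)

# Proposal: single Gaussian centered on one mode (a mean-field-style fit).
mu_q, sd_q = 2.0, 2.5
x = rng.normal(mu_q, sd_q, size=5000)
log_q = -0.5 * ((x - mu_q) / sd_q) ** 2 - np.log(sd_q) - 0.5 * np.log(2 * np.pi)
log_w = log_p(x) - log_q

for tau in [0.0, 0.5, 1.0]:
    # tau = 0: ignore weights (pure variational estimate, biased toward mu_q);
    # tau = 1: full self-normalized IS (lower bias, fewer effective samples).
    w = np.exp(tau * (log_w - log_w.max()))
    w /= w.sum()
    est = np.sum(w * x)            # estimate of E_p[x]
    ess = 1.0 / np.sum(w ** 2)     # effective sample size
    print(f"tau={tau:.1f}  E[x]~{est:+.3f}  ESS={ess:.0f}")
```

As `tau` rises the estimate moves toward the true mean of 0 while the effective sample size collapses, which is exactly the variance-for-bias dial the abstract describes.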


Network Information
Related Topics (5)

Topic                          Papers    Citations   Related
Cluster analysis               146.5K    2.9M        86%
Support vector machine         73.6K     1.7M        86%
Deep learning                  79.8K     2.1M        85%
Feature extraction             111.8K    2.1M        84%
Convolutional neural network   74.7K     2M          83%
Performance Metrics

No. of papers in the topic in previous years

Year   Papers
2023   323
2022   842
2021   418
2020   429
2019   473
2018   446