Latent Dirichlet allocation

About: Latent Dirichlet allocation (LDA) is a research topic. Over its lifetime, 5,351 publications have appeared on this topic, receiving 212,555 citations in total.


Papers
Journal Article
TL;DR: This study provides a structured topography for finance researchers seeking to integrate machine learning approaches into their exploration of finance phenomena, and showcases the benefits of probabilistic topic modeling for deep comprehension of a body of literature.
Abstract: We provide a first comprehensive structuring of the literature applying machine learning to finance. We use a probabilistic topic modeling approach to make sense of this diverse body of research spanning the disciplines of finance, economics, computer science, and decision sciences. Through this approach, a Latent Dirichlet Allocation technique, we extract 14 coherent research topics from the 5,204 academic articles we analyze, published between 1990 and 2018. We first describe and structure these topics, and then show how the topic focus has evolved over the last two decades. Our study thus provides a structured topography for finance researchers seeking to integrate machine learning approaches into their exploration of finance phenomena. We also showcase the benefits to finance researchers of probabilistic topic modeling for deep comprehension of a body of literature, especially when that literature has diverse, multi-disciplinary actors.

39 citations
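The pipeline this paper describes (vectorize a corpus, fit LDA, read off the top words per topic) follows a standard recipe. Below is a minimal sketch of that recipe using scikit-learn; the toy corpus, preprocessing choices, and hyperparameters are illustrative assumptions, not the authors' setup — only the choice of 14 topics mirrors the paper.

```python
# Minimal LDA topic-extraction sketch (scikit-learn), not the paper's code.
# The toy corpus and all preprocessing choices are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "machine learning models for stock return prediction",
    "neural networks forecast asset price volatility",
    "credit risk scoring with support vector machines",
    # ... the paper analyzes 5,204 finance abstracts from 1990-2018
]

# Bag-of-words representation; stop-word removal is an illustrative choice.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# The paper extracts 14 topics; n_components mirrors that choice.
lda = LatentDirichletAllocation(n_components=14, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic mixtures

# Top words per topic, read off the topic-word weight matrix.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```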

Proceedings Article
07 Sep 2009
TL;DR: This paper introduces an approach based on Latent Dirichlet Allocation (LDA) for recommending tags for resources, and evaluates recall and precision on the BibSonomy benchmark provided within the ECML PKDD Discovery Challenge 2009.
Abstract: Tagging systems have become major infrastructures on the Web. They allow users to create tags that annotate and categorize content and to share them with other users, which is particularly helpful for searching multimedia content. However, as tagging is not constrained by a controlled vocabulary or annotation guidelines, tags tend to be noisy and sparse. New resources annotated by only a few users, in particular, often carry rather idiosyncratic tags that do not reflect a common perspective useful for search. In this paper we introduce an approach based on Latent Dirichlet Allocation (LDA) for recommending tags for resources. Resources annotated by many users, and thus equipped with a fairly stable and complete tag set, are used to elicit latent topics represented as mixtures of description tokens and tags. New resources are then mapped to these latent topics based on their content, in order to recommend the most likely tags from the latent topics. We evaluate recall and precision on the BibSonomy benchmark provided within the ECML PKDD Discovery Challenge 2009.

39 citations
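The recommendation scheme the abstract outlines — learn topics as mixtures over description tokens and tags, then score tags for a new resource through its inferred topic mixture — can be sketched as follows. This is a hedged approximation of the general recipe, not the authors' system; the toy resources, tag vocabulary, and scikit-learn machinery are all assumptions.

```python
# Hedged sketch of LDA-based tag recommendation: train on well-annotated
# resources whose "words" are content tokens plus tags, then score tags for
# a new resource via p(tag | doc) = sum_k p(k | doc) * p(tag | k).
# All data and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Training resources: description tokens with their (stable) tags appended.
train_docs = [
    "jazz saxophone recording tag_music tag_jazz",
    "piano concerto live performance tag_music tag_classical",
    "goal highlights football match tag_sports tag_football",
]
tags = ["tag_music", "tag_jazz", "tag_classical", "tag_sports", "tag_football"]

vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# New resource: content tokens only, no tags yet.
x_new = vec.transform(["saxophone solo recording"])
theta = lda.transform(x_new)[0]                      # p(topic | new doc)

# Normalize topic-word weights into p(word | topic), then score each tag.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
vocab = vec.get_feature_names_out()
tag_idx = [np.where(vocab == t)[0][0] for t in tags]
scores = theta @ phi[:, tag_idx]                     # p(tag | new doc), up to scaling

for t, s in sorted(zip(tags, scores), key=lambda p: -p[1]):
    print(f"{t}: {s:.3f}")                           # recommend the top-ranked tags
```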

Journal Article
TL;DR: An improved version of AdaBoost.MH called RFBoost is proposed, along with two methods for ranking features: One Boosting Round and Labeled Latent Dirichlet Allocation (LLDA), a supervised topic model based on Gibbs sampling.
Abstract: The AdaBoost.MH boosting algorithm is considered to be one of the most accurate algorithms for multi-label classification. AdaBoost.MH works by iteratively building a committee of weak hypotheses of decision stumps. In each round of AdaBoost.MH learning, all features are examined, but only one feature is used to build a new weak hypothesis. This learning mechanism may entail a high degree of computational time complexity, particularly in the case of a large-scale dataset. This paper describes a way to manage the learning complexity and improve the classification performance of AdaBoost.MH. We propose an improved version of AdaBoost.MH, called RFBoost. Weak learning in RFBoost is based on filtering a small, fixed number of ranked features in each boosting round rather than using all features, as AdaBoost.MH does. We propose two methods for ranking the features: One Boosting Round and Labeled Latent Dirichlet Allocation (LLDA), a supervised topic model based on Gibbs sampling. Additionally, we investigate the use of LLDA as a feature selection method for reducing the feature space based on the maximal conditional probabilities of words across labels. Our experimental results on eight well-known benchmarks for multi-label text categorisation show that RFBoost is significantly more efficient and effective than the baseline algorithms. Moreover, the LLDA-based feature ranking yields the best performance for RFBoost.

39 citations
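The LLDA-based feature selection idea — rank words by their maximal conditional probability across labels and keep only a small fixed number per boosting round — reduces to a simple ranking once per-label word distributions are available. A toy sketch follows, with label-wise count estimates standing in for the paper's Gibbs-sampled LLDA distributions; the data and the count-based estimate are assumptions made for illustration.

```python
# Hedged sketch of LLDA-style feature ranking: given p(w | label) for each
# label (in the paper, estimated by Labeled LDA via Gibbs sampling; here,
# simple label-wise counts), score each word by its maximal conditional
# probability across labels and keep the top-ranked features.
import numpy as np

vocab = ["goal", "match", "election", "vote", "the"]
# Toy word-count matrix: rows = labels, columns = words in `vocab`.
counts = np.array([
    [9, 7, 0, 1, 20],   # label "sports"
    [0, 1, 8, 9, 22],   # label "politics"
], dtype=float)

# Normalize rows into p(w | label); a smoothing prior would be added in practice.
p_w_given_label = counts / counts.sum(axis=1, keepdims=True)

# Rank features by their best conditional probability over all labels ...
scores = p_w_given_label.max(axis=0)
ranking = np.argsort(scores)[::-1]

# ... and keep a small fixed number of top features per boosting round.
top_m = 3
selected = [vocab[i] for i in ranking[:top_m]]
print(selected)   # frequent-but-uninformative words like "the" can rank high
                  # with raw counts; the LLDA estimate is what makes the
                  # ranking discriminative in the paper.
```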

Journal Article
TL;DR: A hierarchical Pitman-Yor-Dirichlet (HPYD) process is presented as a nonparametric prior for inferring the predictive probabilities of smoothed n-grams with integrated topic information, reflecting the properties of natural language in the estimated HPYD language model.
Abstract: Probabilistic models are often viewed as insufficiently expressive because of strong limitations and assumptions on the probabilistic distribution and the fixed model complexity. Bayesian nonparametric learning pursues an expressive probabilistic representation based on nonparametric prior and posterior distributions, with a less assumption-laden approach to inference. This paper presents a hierarchical Pitman-Yor-Dirichlet (HPYD) process as the nonparametric prior to infer the predictive probabilities of smoothed n-grams with integrated topic information. A hierarchical Chinese restaurant process metaphor is proposed to infer the HPYD language model (HPYD-LM) via Gibbs sampling. This process is equivalent to implementing the hierarchical Dirichlet process-latent Dirichlet allocation (HDP-LDA) with the twisted hierarchical Pitman-Yor LM (HPY-LM) as base measures. Accordingly, we produce power-law distributions and extract semantic topics to reflect the properties of natural language in the estimated HPYD-LM. The superiority of HPYD-LM over HPY-LM and other language models is demonstrated by experiments on model perplexity and speech recognition.

39 citations
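At the heart of the HPYD construction is the two-parameter Poisson-Dirichlet (Pitman-Yor) process, whose Chinese-restaurant seating rule produces the power-law behaviour the paper exploits. A toy sampler for that seating rule alone (not the full hierarchical HPYD-LM) might look like this; the parameter values are arbitrary:

```python
# Toy sampler for the two-parameter Poisson-Dirichlet (Pitman-Yor) Chinese
# restaurant process underlying the paper's hierarchical construction.
# This illustrates only the power-law seating behaviour, not the HPYD-LM.
import random

def pitman_yor_crp(n_customers, discount=0.5, concentration=1.0, seed=0):
    rng = random.Random(seed)
    tables = []                                  # tables[k] = customers at table k
    for n in range(n_customers):
        # Existing table k has weight (n_k - discount); a new table has
        # weight (concentration + discount * K). Total weight is n + concentration.
        weights = [c - discount for c in tables]
        weights.append(concentration + discount * len(tables))
        r = rng.uniform(0, n + concentration)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k == len(tables):
            tables.append(1)                     # open a new table
        else:
            tables[k] += 1
    return tables

sizes = sorted(pitman_yor_crp(10_000), reverse=True)
print(len(sizes), sizes[:10])   # many tables, heavy-tailed size distribution
```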

Proceedings Article
13 Dec 2010
TL;DR: By taking into account the sequential structure within a document, the SeqLDA model achieves higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility) and yields a more natural sequential topic structure.
Abstract: Understanding how topics within a document evolve over its structure is an interesting and important problem. In this paper, we address this problem by presenting a novel variant of Latent Dirichlet Allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated to its previous and subsequent segments. In our model, a document and its segments are modelled as random mixtures of the same set of latent topics, each of which is a distribution over words. The topic distribution of each segment depends on that of its previous segment, and that of the first segment depends on the document-level topic distribution. This progressive dependency is captured by the nested two-parameter Poisson-Dirichlet process (PDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the PDP. Our experimental results on patent documents show that, by taking into account the sequential structure within a document, our SeqLDA model achieves higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a more natural sequential topic structure than LDA, as we show in experiments on books such as Melville's "The Whale".

39 citations
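SeqLDA's key move is the progressive dependency: each segment's topic distribution is drawn around its predecessor's, with the first segment anchored to the document-level distribution. The simulation below illustrates that drift using a mean-parameterized Dirichlet chain, theta_t ~ Dirichlet(strength * theta_{t-1}); note this is a deliberately simplified stand-in for the paper's nested Poisson-Dirichlet process, chosen only to show how topic mixtures change smoothly across segments.

```python
# Toy generative simulation of SeqLDA's progressive dependency: each segment's
# topic mixture is drawn around the previous one. NOTE: the paper uses a nested
# two-parameter Poisson-Dirichlet process; the mean-parameterized Dirichlet
# chain below is a simplified stand-in for illustration, not the authors' model.
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_segments, strength = 4, 6, 50.0

# Document-level topic distribution; the first segment is drawn around it.
theta_doc = rng.dirichlet(np.ones(n_topics))

theta = theta_doc
for t in range(n_segments):
    # Higher `strength` keeps a segment's mixture closer to its predecessor's.
    theta = rng.dirichlet(strength * theta + 1e-3)   # small floor avoids zeros
    print(f"segment {t}: " + " ".join(f"{p:.2f}" for p in theta))
```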


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance Metrics
Number of papers in the topic in previous years:

Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446