Journal ArticleDOI

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
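
The generative process in this abstract is compact enough to sketch directly. A minimal illustration in numpy, with the topic count, vocabulary size, and Dirichlet parameters chosen arbitrarily for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and hyperparameters (assumptions for illustration; the paper
# estimates alpha and beta from a corpus via variational EM).
n_topics, vocab_size, doc_len = 3, 8, 20
alpha = np.full(n_topics, 0.5)                                 # Dirichlet prior on topic mixtures
beta = rng.dirichlet(np.full(vocab_size, 0.1), size=n_topics)  # per-topic word distributions

def generate_document():
    theta = rng.dirichlet(alpha)                     # document-level mixture over topics
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)            # choose a topic for this word slot
        words.append(rng.choice(vocab_size, p=beta[z]))  # draw a word id from that topic
    return words

print(generate_document())
```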


Citations
Proceedings ArticleDOI
22 Aug 2004
TL;DR: A generative probabilistic mixture model is proposed for comparative text mining that simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections.
Abstract: In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experimental results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
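
The abstract does not spell out the model's equations, but the baseline it mentions, a mixture model fit by EM, is easy to sketch. A rough mixture-of-unigrams EM in numpy (the smoothing constant and initialization are assumptions, not the authors' settings):

```python
import numpy as np

def em_mixture_of_unigrams(X, k, n_iter=50, seed=0):
    """X: (n_docs, vocab) count matrix; k: number of clusters/themes."""
    rng = np.random.default_rng(seed)
    n_docs, vocab = X.shape
    pi = np.full(k, 1.0 / k)                      # cluster priors
    phi = rng.dirichlet(np.ones(vocab), size=k)   # per-cluster word distributions
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each document
        log_r = np.log(pi) + X @ np.log(phi).T    # (n_docs, k)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors and word distributions from soft counts
        pi = r.mean(axis=0)
        phi = r.T @ X + 1e-9                      # small constant avoids log(0)
        phi /= phi.sum(axis=1, keepdims=True)
    return pi, phi, r

# tiny demo with an assumed 3-document, 3-word-vocabulary count matrix
X = np.array([[4, 1, 0],
              [3, 2, 0],
              [0, 1, 5]])
pi, phi, r = em_mixture_of_unigrams(X, k=2)
print(r.round(2))   # soft cluster assignments per document
```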

270 citations


Cites result from "Latent dirichlet allocation"

  • ...The aspect models studied in [7, 2] are also related to our work but they are closer to our baseline model and are not designed for comparing multiple collections....

Proceedings ArticleDOI
12 Aug 2012
TL;DR: A large-scale data mining approach to learning word-word relatedness in which known pairs of related words impose constraints on the learning process; the method learns for each word a low-dimensional representation that strives to maximize the likelihood of a word given the contexts in which it appears.
Abstract: Prior work on computing semantic relatedness of words focused on representing their meaning in isolation, effectively disregarding inter-word affinities. We propose a large-scale data mining approach to learning word-word relatedness, where known pairs of related words impose constraints on the learning process. We learn for each word a low-dimensional representation, which strives to maximize the likelihood of a word given the contexts in which it appears. Our method, called CLEAR, is shown to significantly outperform previously published approaches. The proposed method is based on first principles, and is generic enough to exploit diverse types of text corpora, while having the flexibility to impose constraints on the derived word similarities. We also make publicly available a new labeled dataset for evaluating word relatedness algorithms, which we believe to be the largest such dataset to date.
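
CLEAR's exact formulation is not given in the abstract, so the sketch below only illustrates the stated objective, maximizing the likelihood of a word given its contexts, with a softmax over low-dimensional vectors; every name and shape here is an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def context_log_likelihood(W, C, word_id, context_ids):
    """W: (vocab, dim) target vectors; C: (vocab, dim) context vectors."""
    h = C[context_ids].mean(axis=0)        # average context representation
    scores = W @ h                         # similarity of every word to the context
    scores -= scores.max()                 # stabilize the softmax
    log_probs = scores - np.log(np.exp(scores).sum())
    return log_probs[word_id]              # log p(word | context)

# demo with random vectors standing in for learned representations
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 16))
C = rng.normal(size=(100, 16))
print(context_log_likelihood(W, C, word_id=7, context_ids=[3, 4, 5]))
```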

270 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...Latent Dirichlet Allocation (LDA) [3] can also be used for computing word relatedness by representing words as vectors of probabilities over each topic....

  • ...LDA [3] is a generative model that explains word occurrences in documents through latent topics underlying the document and the word distributions....

Journal ArticleDOI
17 Mar 2011 - PLOS ONE
TL;DR: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts.
Abstract: Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
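
Of the five analytical techniques, the tf-idf cosine approach with top-n filtering translates most directly into code. A toy version using scikit-learn, where the three mini-documents and n = 1 stand in for the 2.15 million MEDLINE records and the paper's actual top-n setting:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["gene expression in cancer cells",
        "protein folding and gene regulation",
        "clinical trial of a new drug"]
n = 1  # neighbors to keep per document

tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, 0.0)                 # ignore self-similarity

# keep only the top-n highest similarities per document, zero out the rest
top_n = np.argsort(sim, axis=1)[:, -n:]
mask = np.zeros_like(sim, dtype=bool)
np.put_along_axis(mask, top_n, True, axis=1)
filtered = np.where(mask, sim, 0.0)
print(filtered)                            # sparse similarity graph fed to clustering
```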

270 citations


Cites background from "Latent dirichlet allocation"

  • ...The topic model – a recently-developed Bayesian model for text document collections [38] – is considered a state-of-the-art algorithm for extracting semantic structure from text collections....

Journal ArticleDOI
TL;DR: This work describes probabilistic graphical models, a language for formulating latent variable models, and describes mean field variational inference, a generic algorithm for approximating conditional distributions.
Abstract: We survey latent variable models for solving data-analysis problems. A latent variable model is a probabilistic model that encodes hidden patterns in the data. We uncover these patterns from their conditional distribution and use them to summarize data and form predictions. Latent variable models are important in many fields, including computational biology, natural language processing, and social network analysis. Our perspective is that models are developed iteratively: We build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat. We describe how new research has transformed these essential activities. First, we describe probabilistic graphical models, a language for formulating latent variable models. Second, we describe mean field variational inference, a generic algorithm for approximating conditional distributions. Third, we describe how to use our analyses to solve problems: exploring the data, forming predictions, and pointing us in the direction of improved models.
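
As a pointer for readers, the mean-field recipe the abstract refers to can be stated in a few lines: factorize the approximating distribution, then maximize the evidence lower bound (ELBO), typically by coordinate ascent. A standard statement in the usual notation (assumed here, not quoted from the survey):

```latex
% Mean-field variational inference: approximate p(z | x) by a fully
% factorized q and maximize the evidence lower bound (ELBO).
\begin{align}
q(z) &= \prod_{j} q_j(z_j) \\
\mathrm{ELBO}(q) &= \mathbb{E}_q[\log p(x, z)] - \mathbb{E}_q[\log q(z)]
  \;\le\; \log p(x) \\
q_j^*(z_j) &\propto \exp\!\left( \mathbb{E}_{q_{-j}}[\log p(x, z)] \right)
  \quad \text{(coordinate-ascent update for factor } j\text{)}
\end{align}
```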

270 citations


Cites background from "Latent dirichlet allocation"

  • ...For example, mixed-membership models of documents—where each document is a group of observed words—are called topic models (Blei et al., 2003; Erosheva et al., 2004; Steyvers and Griffiths, 2006; Blei, 2012)....

  • ...In these settings, mixed-membership models assume that there is a single set of coherent patterns underlying the data—themes in text (Blei et al., 2003), populations in genetic data (Pritchard et al., 2000), types of respondents in survey data (Erosheva et al., 2007), and communities in networks....

Proceedings Article
23 Apr 2012
TL;DR: This work proposes a simple and effective way to guide topic models to learn topics of specific interest to a user by providing sets of seed words that a user believes are representative of the underlying topics in a corpus.
Abstract: Topic models have great potential for helping users understand document corpora. This potential is stymied by their purely unsupervised nature, which often leads to topics that are neither entirely meaningful nor effective in extrinsic tasks (Chang et al., 2009). We propose a simple and effective way to guide topic models to learn topics of specific interest to a user. We achieve this by providing sets of seed words that a user believes are representative of the underlying topics in a corpus. Our model uses these seeds to improve both topic-word distributions (by biasing topics to produce appropriate seed words) and to improve document-topic distributions (by biasing documents to select topics related to the seed words they contain). Extrinsic evaluation on a document clustering task reveals a significant improvement when using seed information, even over other models that use seed information naively.
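
The abstract names two biasing mechanisms without giving the model itself; the sketch below shows only the simpler intuition, tilting a topic-word Dirichlet prior toward user-provided seed words. The vocabulary, seed sets, and boost constant are invented for illustration and are not the paper's actual model:

```python
import numpy as np

vocab = ["match", "goal", "team", "stock", "market", "price"]
seed_sets = {0: ["match", "goal"], 1: ["stock", "market"]}  # topic -> seed words
n_topics, seed_boost = 3, 5.0

# start from a symmetric Dirichlet prior over topic-word distributions
eta = np.ones((n_topics, len(vocab)))
for topic, seeds in seed_sets.items():
    for word in seeds:
        eta[topic, vocab.index(word)] += seed_boost  # bias this topic toward its seeds

print(eta)  # rows: topics; seeded topics carry extra prior mass on their seed words
```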

269 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have emerged as a powerful tool to analyze document collections in an unsupervised fashion....

  • ...Before moving on to the details of our models, we briefly recall the generative story of the LDA model and the reader is encouraged to refer to (Blei et al., 2003) for further details....

References
Book
01 Jan 1995
TL;DR: A comprehensive textbook on Bayesian inference and data analysis, covering fundamentals, Markov chain simulation and other computational methods, regression models, and nonlinear and nonparametric models.
Abstract: Fundamentals of Bayesian Inference: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. Fundamentals of Bayesian Data Analysis: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. Advanced Computation: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. Regression Models: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. Nonlinear and Nonparametric Models: Parametric Nonlinear Models; Basis Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. Appendices: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic notes and exercises appear at the end of each chapter.

16,079 citations


"Latent dirichlet allocation" refers background in this paper

  • ...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....

  • ...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....

Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
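
The pipeline described here, an SVD of the term-by-document matrix, low-rank document vectors, and queries folded in as pseudo-documents, maps almost line for line onto numpy. A toy sketch with k = 2 factors standing in for the ca. 100 used in the paper:

```python
import numpy as np

A = np.array([[2., 0., 1.],     # rows: terms, columns: documents
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
k = 2                            # number of retained factors

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T   # rank-k factors
doc_vecs = Vk * sk                        # documents as k-dim factor vectors

# fold a query in as a pseudo-document and rank documents by cosine similarity
q = np.array([1., 0., 1., 0.])            # query term weights
q_vec = q @ Uk                            # project the query into factor space
cos = (doc_vecs @ q_vec) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(cos)                                # documents above a threshold are returned
```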

12,443 citations


"Latent dirichlet allocation" refers methods in this paper

  • ...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....

Book
01 Jan 1983
TL;DR: Salton and McGill's textbook Introduction to Modern Information Retrieval, cited in "Latent dirichlet allocation" as the source of the tf-idf term-weighting scheme.
Abstract: Introduction to Modern Information Retrieval (Salton and McGill, 1983). The LDA paper cites this textbook for the tf-idf scheme, in which documents are represented by counts of term occurrences over a fixed vocabulary.

12,059 citations


"Latent dirichlet allocation" refers background or methods in this paper

  • ...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....

  • ...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....

Book
01 Jan 1939
TL;DR: This book develops the foundations of probability and Bayesian inference, covering direct probabilities, estimation problems, approximate methods and simplifications, significance tests, and frequency definitions and direct methods.
Abstract: Contents: 1. Fundamental notions; 2. Direct probabilities; 3. Estimation problems; 4. Approximate methods and simplifications; 5. Significance tests: one new parameter; 6. Significance tests: various complications; 7. Frequency definitions and direct methods; 8. General questions.

7,086 citations