Journal ArticleDOI

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
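The three-level generative process described above can be made concrete in a few lines: draw per-document topic proportions from a Dirichlet, assign each word slot a topic, then draw the word from that topic's distribution over the vocabulary. A minimal NumPy sketch; all dimensions, hyperparameter values, and variable names are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model dimensions (not from the paper)
K, V, n_docs, doc_len = 3, 8, 4, 10            # topics, vocabulary size, documents, words per doc
alpha = np.full(K, 0.5)                        # Dirichlet prior over per-document topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)  # K topic-word distributions over the vocabulary

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)               # per-document topic proportions
    z = rng.choice(K, size=doc_len, p=theta)   # topic assignment for each word slot
    w = [int(rng.choice(V, p=beta[k])) for k in z]  # word drawn from its topic
    docs.append(w)

print(docs[0])  # one synthetic document as a list of word ids
```

Inference reverses this process: given only the observed words, the variational EM procedure mentioned in the abstract estimates `alpha` and `beta` and approximates each document's posterior over `theta`.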


Citations
Journal ArticleDOI
TL;DR: The organization of the cerebral cortex was similar regardless of whether a winner-take-all approach or the more relaxed constraints of LDA (or ICA) were imposed, suggesting that large-scale networks may function as partially isolated modules.

221 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...We refer interested readers to Blei et al. (2003) for the probabilistic (and more wellknown) interpretation of LDA....

  • ...LDA was first introduced in the text mining literature (Blei et al., 2003)....

  • ...Here, we address the possibility of multiple network membership by applying latent Dirichlet allocation (LDA; Blei et al., 2003) and spatial Independent Component Analysis (ICA; Calhoun et al....

  • ...First, we applied the mixture model (Yeo et al., 2011) and LDA model (Blei et al., 2003) to both the GSP and HCP group datasets, in order to examine how cortical network organization changes as regions are permitted to participate in multiple networks (Fig....

Patent
07 Oct 2009
TL;DR: In this patent, the authors present a method for computing business intelligence metrics on unstructured data. They do not specify how the extracted data and metadata are classified for each document, only that the ingested data is automatically classified into one or more relevance classes.
Abstract: Various embodiments of the present invention disclose a method for Business Intelligence (BI) metrics on unstructured data. Unstructured data is collected from numerous data sources that include unstructured data as ingested data. The ingested data is indexed and represents hyperlink and extracted data and metadata for each document. Thereafter, the ingested data is automatically classified into one or more relevance classes. Further, numerous analytics are performed on the classified data to generate business intelligence metrics that may be presented on an access device operated by a user.

220 citations

Journal ArticleDOI
TL;DR: A new model for text analysis is proposed that makes use of the sentence structure contained in the reviews and it is shown that it leads to improved inference and prediction of consumer ratings relative to existing models using data from www.expedia.com and www.we8there.com.
Abstract: Firms collect an increasing amount of consumer feedback in the form of unstructured consumer reviews. These reviews contain text about consumer experiences with products and services that are different from surveys that query consumers for specific information. A challenge in analyzing unstructured consumer reviews is in making sense of the topics that are expressed in the words used to describe these experiences. We propose a new model for text analysis that makes use of the sentence structure contained in the reviews and show that it leads to improved inference and prediction of consumer ratings relative to existing models using data from www.expedia.com and www.we8there.com. Sentence-based topics are found to be more distinguished and coherent than those identified from a word-based analysis. Data, as supplemental material, are available at https://doi.org/10.1287/mksc.2016.0993.

220 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...This is the key idea of latent topic modeling in Latent Dirichlet Allocation (Blei et al., 2003) and the Author-Topic Model (Rosen-Zvi et al....

  • ...The model and analysis presented in this paper is based on a class of models that are generally known as “topic” models (Blei et al. 2003; Rosen-Zvi et al. 2010), where the words contained in a consumer review reflect a latent set of ideas or sentiments, each of which is expressed with its own vocabulary....

  • ...The standard LDA model proposed by (Blei et al., 2003) employs a Bayesian approach to augment the unobserved topic assignments zw of the words w....

  • ...A simple model for the analysis of latent topics in text data is the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003)....

Proceedings ArticleDOI
24 Jul 2011
TL;DR: A scalable two-stage personalized news recommendation approach with a two-level representation, which considers the exclusive characteristics of news items when performing recommendation, and a principled framework for news selection based on the intrinsic property of user interest is presented.
Abstract: Recommending news articles has become a promising research direction as the Internet provides fast access to real-time information from multiple sources around the world. Traditional news recommendation systems strive to adapt their services to individual users by virtue of both user and news content information. However, the latent relationships among different news items, and the special properties of new articles, such as short shelf lives and value of immediacy, render the previous approaches inefficient. In this paper, we propose a scalable two-stage personalized news recommendation approach with a two-level representation, which considers the exclusive characteristics (e.g., news content, access patterns, named entities, popularity and recency) of news items when performing recommendation. Also, a principled framework for news selection based on the intrinsic property of user interest is presented, with a good balance between the novelty and diversity of the recommended result. Extensive empirical experiments on a collection of news articles obtained from various news websites demonstrate the efficacy and efficiency of our approach.

219 citations


Cites background from "Latent dirichlet allocation"

  • ...Generally speaking, news content is often represented using vector space model (e.g., TF-IDF) [15], or topic distributions obtained by language models (e.g., PLSI and LDA), and specific similarity measurements are adopted to evaluate the relatedness between news articles....

  • ...Blei argues that this step is cheating because the model is essentially refitted to the new data [3]....

  • ...Discussion: The PLSI model and the LDA model are similar, except that in LDA the topic distribution is assumed to have a Dirichlet prior....

  • ...Based on our analysis in Section 4.3, LDA tends to perform better than PLSI in terms of topic detection when the dataset is relatively small....

  • ...From the result, we have the following observations: (i) LDA-based recommender system has stable recommendation performance in terms of F-score, regardless of different size of news corpus; and (ii) PLSI-based recommender system has comparable results when the news corpus becomes larger....

Proceedings ArticleDOI
01 Oct 2017
TL;DR: It is shown that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics, and that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher.
Abstract: This paper assesses topic coherence and human topic ranking of uncovered latent topics from scientific publications when utilizing the topic model latent Dirichlet allocation (LDA) on abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis that states that words with similar meaning tend to co-occur within a similar context. Although LDA has gained much attention from machine-learning researchers, most notably with its adaptations and extensions, little is known about the effects of different types of textual data on generated topics. Our research is the first to explore these practical effects and shows that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher. Differences between abstract and full-text data are more apparent within small document collections, with differences as large as 90% high-quality topics for full-text data, compared to 50% high-quality topics for abstract data.
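Coherence scores of this kind are computed from co-occurrence statistics of a topic's top words. One widely used variant is the UMass measure, sketched below from scratch over a toy document-term matrix; the function name, the +1 smoothing constant, and the data are illustrative assumptions, and the paper itself may use a different coherence variant.

```python
import numpy as np

def umass_coherence(top_words, doc_term):
    """UMass coherence for one topic's top words.

    top_words: term indices, ordered by probability under the topic.
    doc_term:  boolean (n_docs x n_terms) document-term presence matrix.
    """
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_wj = doc_term[:, wj].sum()                      # docs containing w_j
            d_ij = (doc_term[:, wi] & doc_term[:, wj]).sum()  # docs containing both
            score += np.log((d_ij + 1) / d_wj)                # smoothed log co-occurrence ratio
    return score

# Toy corpus: 4 documents, 3 terms (presence/absence)
dt = np.array([[1, 1, 0],
               [1, 1, 0],
               [1, 0, 1],
               [0, 0, 1]], dtype=bool)
print(umass_coherence([0, 1], dt))  # terms 0 and 1 always co-occur -> score of 0.0
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same documents, which is the intuition behind using coherence as a proxy for topic quality.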

219 citations


Cites background from "Latent dirichlet allocation"

  • ...One of the most popular and highly researched topic models is latent Dirichlet allocation (LDA) [6]....

  • ...Unfortunately, computation of the posterior is intractable due to the denominator [6]....

References
Book
01 Jan 1995
TL;DR: Detailed notes on Bayesian Computation Basics of Markov Chain Simulation, Regression Models, and Asymptotic Theorems are provided.
Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. FUNDAMENTALS OF BAYESIAN DATA ANALYSIS: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. ADVANCED COMPUTATION: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. REGRESSION MODELS: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. NONLINEAR AND NONPARAMETRIC MODELS: Parametric Nonlinear Models; Basic Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. APPENDICES: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations


"Latent dirichlet allocation" refers background in this paper

  • ...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....

  • ...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....

Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. initial tests find this completely automatic method for retrieval to be promising.
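The pipeline in the abstract — truncate the SVD of the term-by-document matrix, fold queries in as pseudo-document vectors, and rank documents by cosine similarity — can be sketched with plain NumPy. The matrix, query, and choice of k = 2 below are toy assumptions; the paper itself uses on the order of 100 factors.

```python
import numpy as np

# Toy term-by-document matrix (5 terms x 4 docs); counts are illustrative
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2],
              [1, 0, 0, 1]], dtype=float)

k = 2                                    # number of latent factors (paper uses ~100)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :]  # rank-k truncation

doc_vecs = (np.diag(sk) @ Vk).T          # each document as a k-vector of factor weights

# Fold a query in as a pseudo-document: q_hat = q^T U_k S_k^{-1}
q = np.array([1, 1, 0, 0, 0], dtype=float)  # query containing terms 0 and 1
q_vec = q @ Uk @ np.diag(1 / sk)

# Rank documents by cosine similarity to the query
cos = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(np.argsort(-cos))  # document indices, most similar first
```

Because documents and queries live in the same low-rank factor space, a document can match a query even when they share no terms verbatim — the "semantic structure" the abstract refers to.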

12,443 citations


"Latent dirichlet allocation" refers methods in this paper

  • ...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....

Book
01 Jan 1983
TL;DR: A classic textbook on automatic information retrieval, covering indexing, retrieval models, and term-weighting schemes such as tf-idf.

12,059 citations


"Latent dirichlet allocation" refers background or methods in this paper

  • ...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....

  • ...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....

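The tf-idf scheme quoted above admits several weighting variants; a common one multiplies length-normalized term frequency by log inverse document frequency. A toy sketch — the counts and this exact formula are illustrative choices, not necessarily the variant Salton and McGill present:

```python
import numpy as np

# Toy corpus: rows are documents, columns are vocabulary terms (raw counts)
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency, normalized by doc length
df = (counts > 0).sum(axis=0)                    # number of documents containing each term
idf = np.log(n_docs / df)                        # inverse document frequency
tfidf = tf * idf                                 # tf-idf weight matrix

print(np.round(tfidf, 3))
```

Terms that occur in every document get idf = 0 and are weighted out entirely, which is the dimensionality-pressure the LDA paper contrasts against: tf-idf reweights the term space but does not reduce it.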
Book
01 Jan 1939
TL;DR: In this book, the author develops the fundamental notions of probability, direct probabilities, estimation problems, approximate methods and simplifications, and significance tests for one new parameter and for various complications.
Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations