
Showing papers on "Latent Dirichlet allocation published in 1999"


Journal ArticleDOI
01 Aug 1999
TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Abstract: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
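For orientation, the aspect model usually associated with Probabilistic Latent Semantic Indexing can be sketched as follows; this is the standard formulation of such a latent class model, not text taken from the paper. Each observed document-word pair (d, w) is generated through a latent class z:

P(d, w) = P(d) \sum_{z} P(z \mid d) \, P(w \mid z),

and the parameters are fitted by (a generalization of) EM, alternating between the posterior over latent classes,

P(z \mid d, w) = \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')},

and the re-estimation of P(w \mid z) and P(z \mid d) from the resulting expected counts.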

4,577 citations


Proceedings Article
30 Jul 1999
TL;DR: This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model, which results in a more principled approach with a solid foundation in statistics.
Abstract: Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.
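A minimal sketch of tempered EM for such an aspect model is given below, assuming the symmetric parametrization P(d, w) = \sum_z P(z) P(d \mid z) P(w \mid z) and the commonly quoted tempered E-step in which the likelihood part of the class posterior is raised to a temperature beta; the function name, defaults, and initialization are illustrative choices, not taken from the paper.

import numpy as np

def tempered_em_aspect_model(counts, n_topics, beta=0.8, n_iter=50, seed=0):
    """Illustrative tempered EM for a PLSA-style aspect model.

    counts : (n_docs, n_words) array of term frequencies n(d, w).
    beta   : temperature; beta = 1 recovers ordinary EM.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization of P(z), P(d|z), P(w|z).
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Tempered E-step: P(z|d,w) proportional to P(z) * [P(d|z) P(w|z)]**beta.
        joint = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        post = joint / joint.sum(axis=0, keepdims=True)   # shape (z, d, w)
        # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w).
        weighted = counts[None, :, :] * post
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

With beta = 1 this is ordinary EM; in tempered EM beta is typically lowered gradually and chosen by held-out performance, which is what controls overfitting.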

2,306 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A dual probability model is constructed for the Latent Semantic Indexing using the cosine similarity measure, establishing a statistical framework for LSI and leading to a statistical criterion for the optimal semantic dimensions.
Abstract: A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statistical significance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions, answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clarified.
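As background for the "dual" construction, the standard LSI machinery (standard material, not specific to this paper) starts from a term-document matrix X and its truncated SVD X \approx U_k \Sigma_k V_k^{\top}, so that the two similarity matrices share one spectral structure:

X^{\top} X \approx V_k \Sigma_k^{2} V_k^{\top}, \qquad X X^{\top} \approx U_k \Sigma_k^{2} U_k^{\top}.

The columns of U_k and V_k are the latent semantic vectors; the contribution summarized above is that these vectors also arise as maximum likelihood solutions of a dual probability model, which in turn yields a likelihood-based criterion for choosing the number of dimensions k.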

152 citations


Journal ArticleDOI
TL;DR: In this article, flexible methods that relax restrictive conditional independence assumptions of latent class analysis (LCA) are described, and the relationship between the multivariate probit mixture model proposed here and Rost's mixed Rasch (1990, 1991) model is discussed.
Abstract: Flexible methods that relax restrictive conditional independence assumptions of latent class analysis (LCA) are described. Dichotomous and ordered category manifest variables are viewed as discretized latent continuous variables. The latent continuous variables are assumed to have a mixture-of-multivariate-normals distribution. Within a latent class, conditional dependence is modeled as the mutual association of all or some latent continuous variables with a continuous latent trait (or in special cases, multiple latent traits). The relaxation of conditional independence assumptions allows LCA to better model natural taxa. Comparisons of specific restricted and unrestricted models permit statistical tests of specific aspects of latent taxonic structure. Latent class, latent trait, and latent distribution analysis can be viewed as special cases of the mixed latent trait model. The relationship between the multivariate probit mixture model proposed here and Rost's mixed Rasch (1990, 1991) model is discussed. Two...
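A schematic version of the kind of model described here, written in notation of our own choosing rather than the paper's, treats each dichotomous manifest variable y_j as a thresholded latent continuous variable:

y_j = 1 \iff y_j^{*} > \tau_j, \qquad \mathbf{y}^{*} \mid \text{class } c \sim N(\boldsymbol{\mu}_c, \Sigma_c),

so that within class c any off-diagonal structure of \Sigma_c, for instance induced by a shared continuous latent trait, carries exactly the conditional dependence that ordinary LCA assumes away; taking \Sigma_c diagonal recovers the conditionally independent latent class model.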

96 citations


Journal ArticleDOI
TL;DR: In this paper, the main results obtained in semi- and non-parametric Bayesian analysis of duration models are reviewed in line with Ferguson's pioneering papers, and a Bayesian semiparametric version of the proportional hazards model is considered.
Abstract: The object of this paper is to review the main results obtained in semi- and non-parametric Bayesian analysis of duration models. Standard nonparametric Bayesian models for independent and identically distributed observations are reviewed in line with Ferguson's pioneering papers. Recent results on the characterization of Dirichlet processes and on nonparametric treatment of censoring and of heterogeneity in the context of mixtures of Dirichlet processes are also discussed. The final section considers a Bayesian semiparametric version of the proportional hazards model.
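For readers unfamiliar with the building blocks, the two objects underlying the review can be written in their standard forms (textbook definitions, not text from the paper). Ferguson's Dirichlet process prior G \sim \mathrm{DP}(\alpha, G_0) requires that for every finite measurable partition (B_1, \dots, B_k),

(G(B_1), \dots, G(B_k)) \sim \mathrm{Dirichlet}\bigl(\alpha G_0(B_1), \dots, \alpha G_0(B_k)\bigr),

and the proportional hazards model specifies the hazard rate as

h(t \mid x) = h_0(t) \exp(x^{\top} \beta).

A Bayesian semiparametric treatment keeps the parametric regression factor \exp(x^{\top}\beta) and places a nonparametric prior on the baseline component (for example, on the baseline cumulative hazard).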

16 citations


Proceedings Article
01 Jan 1999
TL;DR: This model is shown to have several advantages over the Bayesian models based on a single Dirichlet prior, especially when 2^q is large and many patterns are thus unobserved by design.
Abstract: Bayesian implicative analysis was proposed for summarizing the association in a 2×2 contingency table in possibly asymmetrical terms such as "presence of feature a implies, usually, presence of feature b" ("a quasi-implies b" in short). Here, we consider the multivariate version of this problem: having n units which are classified according to q binary questions, we want to summarize the association between questions in terms of quasi-implications between features. We will first show how, at a descriptive level, the notion of implication can be weakened into that of quasi-implication. The inductive step assumes that the n units are a sample from a 2^q-multinomial population. Uncertainty about the patterns' true frequencies is expressed by an imprecise Dirichlet model which yields upper and lower posterior probabilities for any quasi-implicative statement. This model is shown to have several advantages over the Bayesian models based on a single Dirichlet prior, especially when 2^q is large and many patterns are thus unobserved by design.
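To make the upper and lower posterior probabilities concrete, the imprecise Dirichlet model is commonly set up as follows (a standard formulation; the quasi-implication analysis of the paper is built on top of it): instead of a single Dirichlet prior, the multinomial parameter receives the whole family of \mathrm{Dirichlet}(s\,\mathbf{t}) priors with fixed prior strength s > 0 and mean vector \mathbf{t} ranging over the simplex. For an event A observed n_A times among N units, the posterior predictive probability of A is then only bounded,

\frac{n_A}{N + s} \;\le\; P(A \mid \text{data}) \;\le\; \frac{n_A + s}{N + s},

and the width s/(N + s) of this interval measures how much the inference still depends on the prior, which matters precisely when 2^q is large and many patterns are unobserved.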

2 citations