Showing papers on "Hierarchical Dirichlet process" published in 2004


Proceedings Article
01 Dec 2004
TL;DR: The hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data, is proposed and experimental results are reported showing the effective and superior performance of the HDP over previous models.
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
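The sharing of mixture components across groups described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' algorithm: it assumes a truncated stick-breaking approximation with a fixed maximum number of components, under which each group's mixing weights are a Dirichlet draw centred on one set of global weights. The function names, the parameters `gamma` and `alpha0`, and the truncation level are all illustrative assumptions.

```python
import random


def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights; the last stick takes the remainder."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, concentration)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)
    return weights


def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_hdp_groups(n_groups, n_per_group, gamma=1.0, alpha0=1.0,
                      truncation=20, seed=0):
    """Sample component labels for several groups that share global components.

    Assumption: the infinite model is truncated to `truncation` components.
    """
    rng = random.Random(seed)
    # Global weights over the shared components.
    beta = stick_breaking(gamma, truncation, rng)
    groups = []
    for _ in range(n_groups):
        # Group-specific weights over the same components; the small floor
        # keeps every Gamma shape parameter strictly positive.
        pi = dirichlet([max(alpha0 * b, 1e-12) for b in beta], rng)
        labels = rng.choices(range(truncation), weights=pi, k=n_per_group)
        groups.append(labels)
    return beta, groups
```

Calling `sample_hdp_groups(3, 50)` returns the global weights and one list of component labels per group. Because every group draws its weights around the same `beta`, the same component indices tend to recur across groups, which is the sharing behaviour the abstract refers to.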

474 citations


Journal ArticleDOI
TL;DR: The authors consider the Bayesian analysis of multinomial data in the presence of misclassification, addressing the resulting identifiability problems, including permutation-type nonidentifiabilities.
Abstract: The authors consider the Bayesian analysis of multinomial data in the presence of misclassification. Misclassification of the multinomial cell entries leads to problems of identifiability which are categorized into two types. The first type, referred to as the permutation-type nonidentifiabilities, may be handled with constraints that are suggested by the structure of the problem. Problems of identifiability of the second type are addressed with informative prior information via Dirichlet distributions. Computations are carried out using a Gibbs sampling algorithm.

45 citations


01 Jan 2004
TL;DR: In this article, a nonparametric Bayesian treatment for analyzing records containing occurrences of items is described, which retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records.
Abstract: This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.
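To make the finite alternative mentioned in this abstract concrete, here is a minimal sketch of sampling cluster assignments under Dirichlet-multinomial allocation. It assumes the common formulation in which K mixing weights have a symmetric Dirichlet(alpha/K) prior and are integrated out, so that each item joins component k with probability proportional to its current count plus alpha/K. This only illustrates the prior; it is not the paper's variational inference procedure, and the function name and parameters are hypothetical.

```python
import random


def sample_dma_assignments(n_items, n_components, alpha=1.0, seed=0):
    """Sequentially sample assignments from a finite Dirichlet-multinomial prior.

    Assumption: symmetric Dirichlet(alpha / n_components) mixing weights,
    marginalised out, giving weights count_k + alpha / n_components.
    """
    rng = random.Random(seed)
    counts = [0] * n_components
    assignments = []
    for _ in range(n_items):
        weights = [c + alpha / n_components for c in counts]
        k = rng.choices(range(n_components), weights=weights, k=1)[0]
        counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

With a large `n_components`, most components typically stay empty and only a few accumulate items, which is the sense in which the finite model can stand in for a Dirichlet process prior while keeping the dimension fixed.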

34 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions and naturally includes a number of features found in other popular methods.
Abstract: An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions. The resulting method naturally includes a number of features found in other popular methods. Specifically, tf.idf-like term weighting and document length normalisation are recovered. The new method is compared with Okapi BM-25 [3] and the Twenty-One model [1] on TREC data and is shown to give better performance.

20 citations


Proceedings Article
01 Jan 2004
TL;DR: An unsupervised algorithm based on the Maximum Likelihood (ML) and Fisher scoring methods is proposed for estimating the parameters of a finite mixture based on the Multinomial Dirichlet distribution (MDD), and the mixture is used to produce a new texture model.
Abstract: This paper presents a new finite mixture model based on the Multinomial Dirichlet distribution (MDD). For the estimation of the parameters of this mixture we propose an unsupervised algorithm based on the Maximum Likelihood (ML) and Fisher scoring methods. This mixture is used to produce a new texture model. Experimental results concern the summarization of texture images and are reported on the Vistex texture image database from the MIT Media Lab.

8 citations


01 Jan 2004
TL;DR: In this article, a simple distributional relationship is established between the functional variance V of a Dirichlet process P and the random variable (X-Y)^2, where X and Y are independent copies of the random mean of P.
Abstract: A fundamental problem in a nonparametric Bayesian framework is the computation of the laws of functionals of random probability measures. For instance, in this context, testing hypotheses about the variance of the frequency distribution of a characteristic in a population requires the knowledge of its posterior distribution. The aim of this paper is to show some new results concerning the law of the functional variance V of a Dirichlet process P. In particular, we establish a simple distributional relationship between V and the random variable (X-Y)^2, where X and Y are independent copies of the random mean of the Dirichlet process P. Useful expressions for some integral transforms of V are also obtained and illustrative examples are given. Moreover, we discuss the correspondence between the distribution of the variance and the parameter of the Dirichlet process with given total mass. Finally, two approximation procedures of the law of V are suggested.

7 citations


01 Jan 2004
TL;DR: A model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population, called an attribute ensemble, may depend on the cluster being considered.
Abstract: We propose a model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population, called an attribute ensemble, may depend on the cluster being considered. The model is based on a Pólya urn cluster model, which is equivalent to a Dirichlet process mixture of multivariate normal distributions. This model-based approach allows for the incorporation of application-specific data features into the clustering scheme. For example, in an analysis of genetic CGH array data we account for spatial correlation of genetic abnormalities along the genome.
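The Pólya urn scheme named in this abstract can be sketched as a sequential partition sampler. This is a minimal illustration that assumes the standard urn rule: each new object joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter `alpha`. It covers only the prior over partitions; the multivariate normal likelihood and the attribute ensembles of the paper are not modelled, and the function name is hypothetical.

```python
import random


def polya_urn_partition(n_items, alpha=1.0, seed=0):
    """Sample a random partition of n_items objects from a Pólya urn scheme."""
    rng = random.Random(seed)
    cluster_sizes = []  # cluster_sizes[k] is the number of objects in cluster k
    labels = []
    for _ in range(n_items):
        # Existing clusters are weighted by size; the final slot, weighted
        # by alpha, stands for opening a new cluster.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        labels.append(k)
    return labels, cluster_sizes
```

The number of clusters is not fixed in advance: it grows with `n_items` and with `alpha`, so larger values of `alpha` tend to produce more, smaller clusters.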

4 citations