
A constructive definition of Dirichlet priors

About: The article was published on 1991-01-01 and is currently open access. It has received 1,560 citations to date. The article focuses on the topics: Hierarchical Dirichlet process and Constructive.
Citations
Journal ArticleDOI
TL;DR: This work considers problems involving groups of data, where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups, and proposes a hierarchical model in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process.
Abstract: We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes ...
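The two-level construction described in this abstract can be written compactly. The following is a minimal sketch of that structure; the symbols H, γ, α0, F, and the group index j are notational choices for this sketch, not taken verbatim from the paper:

```latex
\begin{align*}
G_0 \mid \gamma, H &\sim \mathrm{DP}(\gamma, H)
  && \text{(global, almost surely discrete base measure)} \\
G_j \mid \alpha_0, G_0 &\sim \mathrm{DP}(\alpha_0, G_0),
  \quad j = 1, \dots, J
  && \text{(one child DP per group; atoms shared through } G_0\text{)} \\
\theta_{ji} \mid G_j &\sim G_j,
  \qquad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})
  && \text{(mixture draws and observations in group } j\text{)}
\end{align*}
```

Because G0 is itself a draw from a Dirichlet process, it is discrete, so every Gj places its mass on the same countable set of atoms, which is what lets the groups share mixture components.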

3,755 citations


Cites background from "A constructive definition of dirich..."

  • ...was given by Sethuraman (1994), who showed that if G ∼ DP(α0, G0), then with probability one:...

    [...]
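The display elided in the excerpt above is Sethuraman's stick-breaking representation. A standard statement of it, in the notation of the excerpt (G0 the base measure, α0 the concentration parameter; the atoms φk and stick fractions β′k are introduced here for the sketch), is:

```latex
\begin{align*}
G &= \sum_{k=1}^{\infty} \beta_k \,\delta_{\phi_k},
  & \phi_k &\overset{\mathrm{iid}}{\sim} G_0, \\
\beta_k &= \beta'_k \prod_{l=1}^{k-1} \bigl(1 - \beta'_l\bigr),
  & \beta'_k &\overset{\mathrm{iid}}{\sim} \mathrm{Beta}(1, \alpha_0).
\end{align*}
```

The weights βk arise from repeatedly breaking off Beta-distributed fractions of the remaining "stick", so they sum to one with probability one and G is almost surely discrete.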

Journal ArticleDOI
11 Dec 2015-Science
TL;DR: A computational model is described that learns new concepts in a human-like fashion, outperforms current deep learning algorithms, and can generate new letters of the alphabet that look "right" as judged by Turing-like tests comparing the model's output with what real humans produce.
Abstract: People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms-for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world's alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several "visual Turing tests" probing the model's creative generalization abilities, which in many cases are indistinguishable from human behavior.

2,364 citations

Journal ArticleDOI
TL;DR: Stochastic variational inference lets us apply complex Bayesian models to massive data sets, and it is shown that the Bayesian nonparametric topic model outperforms its parametric counterpart.
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.
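A minimal sketch of the generic stochastic variational inference loop described above, for a conditionally conjugate model: sample one data point, optimize its local variational parameters, form the global parameter that would be optimal if the corpus were N copies of that point, and take a Robbins-Monro step toward it. The helper names `local_step` and `intermediate_global` are hypothetical stand-ins for model-specific computations, not the paper's code:

```python
# Sketch of a generic stochastic variational inference (SVI) update loop.
import numpy as np

def svi(data, lam0, local_step, intermediate_global, n_iters=10_000,
        tau=1.0, kappa=0.7, rng=None):
    """lam0: initial global variational parameter (np.ndarray).
    local_step(x, lam): optimal local parameters for one data point x.
    intermediate_global(x, phi, N): global parameter that would be optimal
    if the whole corpus consisted of N copies of x."""
    rng = rng or np.random.default_rng(0)
    lam = lam0.copy()
    N = len(data)
    for t in range(1, n_iters + 1):
        x = data[rng.integers(N)]                 # sample one data point
        phi = local_step(x, lam)                  # local (per-point) update
        lam_hat = intermediate_global(x, phi, N)  # noisy natural-gradient target
        rho = (t + tau) ** (-kappa)               # decreasing step size
        lam = (1.0 - rho) * lam + rho * lam_hat   # convex-combination update
    return lam
```

The step-size schedule satisfies the usual Robbins-Monro conditions, which is what allows the noisy per-document gradients to stand in for the full-corpus gradient.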

2,291 citations

Journal ArticleDOI
TL;DR: Two general types of Gibbs samplers that can be used to fit posteriors of Bayesian hierarchical models based on stick-breaking priors are presented, including the blocked Gibbs sampler, which is based on an entirely different approach that works by directly sampling values from the posterior of the random measure.
Abstract: A rich and flexible class of random probability measures, which we call stick-breaking priors, can be constructed using a sequence of independent beta random variables. Examples of random measures that have this characterization include the Dirichlet process, its two-parameter extension, the two-parameter Poisson–Dirichlet process, finite dimensional Dirichlet priors, and beta two-parameter processes. The rich nature of stick-breaking priors offers Bayesians a useful class of priors for nonparametric problems, while the similar construction used in each prior can be exploited to develop a general computational procedure for fitting them. In this article we present two general types of Gibbs samplers that can be used to fit posteriors of Bayesian hierarchical models based on stick-breaking priors. The first type of Gibbs sampler, referred to as a Polya urn Gibbs sampler, is a generalized version of a widely used Gibbs sampling method currently employed for Dirichlet process computing. This method applies t...
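As a sketch of the class of priors described above, the weights of a stick-breaking prior can be drawn from independent Beta(a_k, b_k) fractions; the parameter names a, b, and the truncation level K below are assumptions for illustration, not the paper's notation:

```python
# Draw weights from a (truncated) stick-breaking prior with Beta(a_k, b_k) sticks.
import numpy as np

def stick_breaking_weights(a, b, K, rng=None):
    """Return K weights p_k = V_k * prod_{l<k} (1 - V_l); the last stick is
    set to 1 so the truncated weights sum to one."""
    rng = rng or np.random.default_rng()
    v = rng.beta(a, b, size=K)
    v[-1] = 1.0                                  # truncation: last stick takes the rest
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

# Special cases of (a_k, b_k) mentioned in the abstract:
#   Dirichlet process DP(alpha, G0):               a_k = 1,     b_k = alpha
#   Two-parameter Poisson-Dirichlet PY(d, theta):  a_k = 1 - d, b_k = theta + k*d
weights_dp = stick_breaking_weights(1.0, 5.0, K=50)
```

The same construction with different Beta parameters covers each of the random measures listed in the abstract, which is why a single computational procedure can fit all of them.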

1,701 citations

Journal ArticleDOI
TL;DR: A variational inference algorithm for DP mixtures is presented, along with experiments comparing the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and an application to a large-scale image analysis problem.
Abstract: Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of nonparametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.
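A common form of the mean-field family used for such variational DP mixture algorithms, based on a truncated stick-breaking representation, is sketched below; the truncation level T and the parameter symbols γ, τ, φ are illustrative notation:

```latex
\begin{equation*}
q(\mathbf{v}, \boldsymbol{\eta}, \mathbf{z})
  \;=\; \prod_{t=1}^{T-1} q_{\gamma_t}(v_t)\;
        \prod_{t=1}^{T} q_{\tau_t}(\eta_t)\;
        \prod_{n=1}^{N} q_{\phi_n}(z_n),
  \qquad q(v_T = 1) = 1,
\end{equation*}
```

where the q_{γ_t} are beta distributions over the stick-breaking fractions, the q_{τ_t} are exponential-family distributions over the component parameters, and the q_{φ_n} are multinomials over cluster assignments. Only the variational family is truncated; the DP mixture model itself remains infinite.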

1,471 citations

References
Journal ArticleDOI
TL;DR: In this article, a class of prior distributions, called Dirichlet process priors, is proposed for nonparametric problems; under these priors, treatment of many nonparametric statistical problems may be carried out, yielding results that are comparable to the classical theory.
Abstract: The Bayesian approach to statistical problems, though fruitful in many ways, has been rather unsuccessful in treating nonparametric problems. This is due primarily to the difficulty in finding workable prior distributions on the parameter space, which in nonparametric problems is taken to be a set of probability distributions on a given sample space. There are two desirable properties of a prior distribution for nonparametric problems. (I) The support of the prior distribution should be large--with respect to some suitable topology on the space of probability distributions on the sample space. (II) Posterior distributions given a sample of observations from the true probability distribution should be manageable analytically. These properties are antagonistic in the sense that one may be obtained at the expense of the other. This paper presents a class of prior distributions, called Dirichlet process priors, broad in the sense of (I), for which (II) is realized, and for which treatment of many nonparametric statistical problems may be carried out, yielding results that are comparable to the classical theory. In Section 2, we review the properties of the Dirichlet distribution needed for the description of the Dirichlet process given in Section 3. Briefly, this process may be described as follows. Let $\mathscr{X}$ be a space and $\mathscr{A}$ a $\sigma$-field of subsets, and let $\alpha$ be a finite non-null measure on $(\mathscr{X}, \mathscr{A})$. Then a stochastic process $P$, indexed by elements $A$ of $\mathscr{A}$, is said to be a Dirichlet process on $(\mathscr{X}, \mathscr{A})$ with parameter $\alpha$ if for any measurable partition $(A_1, \cdots, A_k)$ of $\mathscr{X}$, the random vector $(P(A_1), \cdots, P(A_k))$ has a Dirichlet distribution with parameter $(\alpha(A_1), \cdots, \alpha(A_k))$. $P$ may be considered a random probability measure on $(\mathscr{X}, \mathscr{A})$. The main theorem states that if $P$ is a Dirichlet process on $(\mathscr{X}, \mathscr{A})$ with parameter $\alpha$, and if $X_1, \cdots, X_n$ is a sample from $P$, then the posterior distribution of $P$ given $X_1, \cdots, X_n$ is also a Dirichlet process on $(\mathscr{X}, \mathscr{A})$ with parameter $\alpha + \sum^n_1 \delta_{x_i}$, where $\delta_x$ denotes the measure giving mass one to the point $x$. In Section 4, an alternative definition of the Dirichlet process is given. This definition exhibits a version of the Dirichlet process that gives probability one to the set of discrete probability measures on $(\mathscr{X}, \mathscr{A})$. This is in contrast to Dubins and Freedman [2], whose methods for choosing a distribution function on the interval [0, 1] lead with probability one to singular continuous distributions. Methods of choosing a distribution function on [0, 1] that with probability one is absolutely continuous have been described by Kraft [7]. The general method of choosing a distribution function on [0, 1], described in Section 2 of Kraft and van Eeden [10], can of course be used to define the Dirichlet process on [0, 1]. Special mention must be made of the papers of Freedman and Fabius. Freedman [5] defines a notion of tailfree for a distribution on the set of all probability measures on a countable space $\mathscr{X}$. For a tailfree prior, the posterior distribution given a sample from the true probability measure may be fairly easily computed.
Fabius [3] extends the notion of tailfree to the case where $\mathscr{X}$ is the unit interval [0, 1], but it is clear his extension may be made to cover quite general $\mathscr{X}$. With such an extension, the Dirichlet process would be a special case of a tailfree distribution for which the posterior distribution has a particularly simple form. There are disadvantages to the fact that $P$ chosen by a Dirichlet process is discrete with probability one. These appear mainly because in sampling from a $P$ chosen by a Dirichlet process, we expect eventually to see one observation exactly equal to another. For example, consider the goodness-of-fit problem of testing the hypothesis $H_0$ that a distribution on the interval [0, 1] is uniform. If on the alternative hypothesis we place a Dirichlet process prior with parameter $\alpha$ itself a uniform measure on [0, 1], and if we are given a sample of size $n \geqq 2$, the only nontrivial nonrandomized Bayes rule is to reject $H_0$ if and only if two or more of the observations are exactly equal. This is really a test of the hypothesis that a distribution is continuous against the hypothesis that it is discrete. Thus, there is still a need for a prior that chooses a continuous distribution with probability one and yet satisfies properties (I) and (II). Some applications in which the possible doubling up of the values of the observations plays no essential role are presented in Section 5. These include the estimation of a distribution function, of a mean, of quantiles, of a variance and of a covariance. A two-sample problem is considered in which the Mann-Whitney statistic, equivalent to the rank-sum statistic, appears naturally. A decision theoretic upper tolerance limit for a quantile is also treated. Finally, a hypothesis testing problem concerning a quantile is shown to yield the sign test. In each of these problems, useful ways of combining prior information with the statistical observations appear. Other applications exist. In his Ph. D. dissertation [1], Charles Antoniak finds a need to consider mixtures of Dirichlet processes. He treats several problems, including the estimation of a mixing distribution, bio-assay, empirical Bayes problems, and discrimination problems.
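The defining partition property and the posterior update stated in this abstract are easy to illustrate numerically. The sketch below uses a made-up base measure (α = c times the uniform measure on [0, 1]) and an equal-width partition purely for illustration:

```python
# Illustrate: (P(A_1),...,P(A_k)) ~ Dirichlet(alpha(A_1),...,alpha(A_k)), and the
# posterior parameter after observing X_1,...,X_n is alpha + sum_i delta_{X_i}.
import numpy as np

rng = np.random.default_rng(0)

c, k = 4.0, 5                                  # alpha = c * Uniform[0, 1], k bins
alpha_bins = np.full(k, c / k)                 # alpha(A_j) for each bin of the partition

prior_draw = rng.dirichlet(alpha_bins)         # one draw of (P(A_1), ..., P(A_k))

x = rng.uniform(size=20)                       # a sample of n = 20 observations
counts = np.histogram(x, bins=np.linspace(0, 1, k + 1))[0]

posterior_draw = rng.dirichlet(alpha_bins + counts)   # parameter alpha + sum delta_{x_i}
print(prior_draw.round(3), posterior_draw.round(3))
```

As n grows, the posterior Dirichlet parameters are dominated by the counts, so draws of (P(A_1), ..., P(A_k)) concentrate around the empirical proportions.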

5,033 citations

Journal ArticleDOI
TL;DR: In this article, it was shown that a random probability measure P* on X has a Ferguson distribution with parameter p if for every finite partition (B_1, ..., B_k) of X, the vector (P*(B_1), ..., P*(B_k)) has a Dirichlet distribution with parameter (p(B_1), ..., p(B_k)) (when p(B_j) = 0, this means P*(B_j) = 0 with probability 1).
Abstract: Let $p$ be any finite positive measure on (the Borel sets of) a complete separable metric space $X$. We shall say that a random probability measure $P^*$ on $X$ has a Ferguson distribution with parameter $p$ if for every finite partition $(B_1, \cdots, B_k)$ of $X$ the vector $(P^*(B_1), \cdots, P^*(B_k))$ has a Dirichlet distribution with parameter $(p(B_1), \cdots, p(B_k))$ (when $p(B_j) = 0$, this means $P^*(B_j) = 0$ with probability 1). Ferguson (3) has shown that, for any $p$, Ferguson $P^*$ exist and when used as prior distributions yield Bayesian counterparts to well-known classical nonparametric tests. He also shows that $P^*$ is a.s. discrete. His approach involves a rather deep study of the gamma process. One of us (1) has given a different and perhaps simpler proof that Ferguson priors concentrate on discrete distributions. In this note we give still a third approach to Ferguson distributions, exploiting their connection with generalized Polya urn schemes. We shall say that a sequence $\{X_n, n \geq 1\}$ of random variables with values in $X$ is a Polya sequence with parameter $p$ if for every $B \subset X$, (1) $P(X_1 \in B) = p(B)/p(X)$ and (2) $P\{X_{n+1} \in B \mid X_1, \cdots, X_n\} = p_n(B)/p_n(X)$, where $p_n = p + \sum_{i=1}^{n} \delta(X_i)$ and $\delta(x)$ denotes the unit measure concentrating at $x$. Note that, for finite $X$, the sequence $\{X_n\}$ represents the results of successive draws from an urn where initially the urn has $p(x)$ balls of color $x$ and, after each draw, the ball drawn is replaced and another ball of its same color is added to the urn. Note also that, without the restriction to finite $X$, for any (Borel measurable) function $\phi$ on $X$, the sequence $\{\phi(X_n)\}$ is a Polya sequence with parameter $\phi p$, where $\phi p(A) = p\{\phi \in A\}$. We now describe the connections between Polya sequences and Ferguson distributions.
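The predictive rule (2) above is straightforward to simulate. In the sketch below the total mass c = p(X) and the normalized base measure G0 (taken to be a standard normal) are illustrative choices, not part of the original note:

```python
# Simulate a Polya sequence with parameter p = c * G0: X_{n+1} is a fresh draw
# from G0 with probability c/(c+n), else a uniformly chosen earlier value.
import numpy as np

def polya_sequence(n, c=2.0, base_sampler=None, rng=None):
    rng = rng or np.random.default_rng(0)
    base_sampler = base_sampler or (lambda: rng.standard_normal())
    xs = []
    for i in range(n):
        if rng.uniform() < c / (c + i):       # mass p(X)/(p(X)+i) on the base measure
            xs.append(base_sampler())         # new value ("new color" added to the urn)
        else:
            xs.append(xs[rng.integers(i)])    # repeat a previous value ("draw a ball")
    return np.array(xs)

draws = polya_sequence(100)
print(len(np.unique(draws)), "distinct values among 100 draws")
```

The repeated values are exactly the clustering behavior that makes the limiting random measure almost surely discrete.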

1,469 citations

Journal ArticleDOI
TL;DR: For the solutions of certain random equations, or equivalently the stationary solutions of the random recurrences, the distribution tails are evaluated by renewal-theoretic methods as mentioned in this paper.
Abstract: For the solutions of certain random equations, or equivalently the stationary solutions of certain random recurrences, the distribution tails are evaluated by renewal-theoretic methods. Six such equations, including one arising in queueing theory, are studied in detail. Implications in extreme-value theory are discussed by way of an illustration from economics.
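The abstract does not list the six equations, but a canonical example of the kind of random equation and random recurrence it refers to, assumed here purely for illustration, is the affine recursion and its stationary ("perpetuity") equation:

```latex
\begin{equation*}
X_n \;=\; A_n X_{n-1} + B_n, \quad (A_n, B_n)\ \text{i.i.d.},
\qquad\Longrightarrow\qquad
X \;\overset{d}{=}\; A X + B, \quad X \ \text{independent of } (A, B).
\end{equation*}
```

Under renewal-theoretic conditions of the type used in this line of work, the stationary solution has a power-law tail, roughly $P(X > t) \sim C\,t^{-\kappa}$ as $t \to \infty$, where (for nonnegative $A$) $\kappa$ solves $E[A^{\kappa}] = 1$.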

598 citations

Book ChapterDOI
01 Jan 1983
TL;DR: In this article, a density f(x) on the real line is estimated by modeling it as a mixture of a countable number of normal distributions; asymptotic theory for kernel estimators is also discussed.
Abstract: Publisher Summary This chapter discusses Bayesian density estimation by mixtures of normal distributions and discusses the estimation of an arbitrary density f(x) on the real line. This density is modeled as a mixture of a countable number of normal distributions. Using such mixtures, any distribution on the real line can be approximated to within any preassigned accuracy in the Levy metric and any density on the real line can be approximated similarly in the L1 norm. Thus, the problem can be considered nonparametric. Asymptotic theory for kernel estimators involves the problems of letting the window size tend to zero at some rate as the sample size tends to infinity.
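One draw from the kind of model described above can be sketched as a truncated countable mixture of normals with stick-breaking weights; the truncation level, the concentration parameter, and the base measure over the component means and standard deviations below are illustrative assumptions:

```python
# One random density f(x) = sum_k w_k N(x; mu_k, s_k^2), truncated at K components.
import numpy as np

rng = np.random.default_rng(1)
K, alpha = 100, 3.0

v = rng.beta(1.0, alpha, size=K)                       # stick-breaking fractions
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
w /= w.sum()                                           # renormalize after truncation

mu = rng.normal(0.0, 2.0, size=K)                      # component means from the base measure
s = rng.gamma(2.0, 0.5, size=K)                        # component standard deviations

grid = np.linspace(-6, 6, 400)
norm_pdf = lambda x, m, sd: np.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
f = sum(w_k * norm_pdf(grid, m_k, s_k) for w_k, m_k, s_k in zip(w, mu, s))
print(np.trapz(f, grid))                               # ~1: a proper density on the grid
```

Because any density on the real line can be approximated arbitrarily well in the L1 norm by such mixtures, priors of this form make the density estimation problem effectively nonparametric.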

457 citations

Journal ArticleDOI
TL;DR: In this article, it was shown that the posterior probability converges to point mass at the true parameter value among almost all sample sequences (for short, the posterior is consistent; see Definition 1) exactly for parameter values in the topological carrier of the prior.
Abstract: Doob (1949) obtained a very general result on the consistency of Bayes' estimates. Loosely, if any consistent estimates are available, then the Bayes' estimates are consistent for almost all values of the parameter under the prior measure. If the parameter is thought of as being selected by nature through a random mechanism whose probability law is known, Doob's result is completely satisfactory. On the other hand, in some circumstances it is necessary to identify the exceptional null set. For example, if the parameter is thought of as fixed but unknown, and the prior measure is chosen as a convenient way to calculate estimates, it is important to know for which null set the method fails. In particular, it is desirable to choose the prior so that the null set is in fact empty. The problem is very delicate; considerable work [8], [9], [12] has been done on it recently, in quite general contexts and under severe regularity assumptions. It might therefore be of interest to discuss the simplest possible case, that of independent, identically distributed, discrete observations, in some detail. This will be done in Sections 3 and 4 when the observations take a finite set of possible values. Under this assumption, Section 3 shows that the posterior probability converges to point mass at the true parameter value among almost all sample sequences (for short, the posterior is consistent; see Definition 1) exactly for parameter values in the topological carrier of the prior. In Section 4, the asymptotic normality of the posterior is shown to follow from a local smoothness assumption about the prior. In both sections, results are obtained for priors which admit the possibility of an infinite number of states. The results of these sections are not entirely new; see pp. 333 ff. of [7], pp. 224 ff. of [10], [11]. They have not appeared in the literature, to the best of our knowledge, in a form as precise as Theorems 1, 3, 4. Theorem 2 is essentially the relevant special case of Theorem 7.4 of Schwartz (1961). In Sections 5 and 6, the case of a countable number of possible values is treated. We believe the results to be new. Here the general problem appears, because priors which assign positive mass near the true parameter value may lead to ridiculous estimates. The results of Section 3 (let alone 4) are false. In fact, Theorem 5 of Section 5 gives the following construction. Suppose that under the true parameter value the observations take an infinite number of values with positive probability. Then given any spurious (sub-)stochastic probability distribution, it is possible to find a prior assigning positive mass to any neighborhood of the true parameter value, but leading to a posterior probability which converges for almost all sample sequences to point mass at the spurious distribution. Indeed, there is a prior assigning positive mass to every open set of parameters, for which the posterior is consistent only at a set of parameters of the first category. To some extent, this happens because at any stage information about a finite number of stages only is available, but on the basis of this evidence, conclusions must be drawn about all states. If the prior measure has a serious prejudice about the shape of the tails, disaster ensues. In Section 6, it is shown that a simple condition on the prior measure (which serves to limit this prejudice) ensures the consistency of the posterior. 
Prior probabilities leading to posterior distributions consistent at all and asymptotically normal at essentially all (see Remark 3, Section 3) parameter values are constructed. Section 5 is independent of Sections 3 and 4; Section 6 is not. Section 6 overlaps to some extent with unpublished work of Kiefer and Wolfowitz; it has been extended in certain directions by Fabius (1963). The results of this paper were announced in [5]; some related work for continuous state space is described in [3]. It is a pleasure to thank two very helpful referees: whatever expository merit Section 5 has is due to them and to L. J. Savage.

421 citations