
Showing papers on "Posterior probability published in 1997"


Journal ArticleDOI
TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN). The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in the positive and negative time directions. The structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency. In the second part of this paper, it is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.
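
A minimal NumPy sketch of the bidirectional idea described above: two hidden state sequences, one run in the positive time direction and one in the negative direction, are concatenated at each frame before a shared output layer. The weight shapes, the tanh activation, and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def brnn_forward(x, Wf, Uf, Wb, Ub, V):
    """x: (T, d_in) input sequence; returns per-frame outputs of shape (T, d_out)."""
    T = x.shape[0]
    d_h = Wf.shape[0]
    h_f = np.zeros((T, d_h))                      # forward hidden states
    h_b = np.zeros((T, d_h))                      # backward hidden states
    for t in range(T):                            # positive time direction
        prev = h_f[t - 1] if t > 0 else np.zeros(d_h)
        h_f[t] = np.tanh(Wf @ x[t] + Uf @ prev)
    for t in reversed(range(T)):                  # negative time direction
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(d_h)
        h_b[t] = np.tanh(Wb @ x[t] + Ub @ nxt)
    h = np.concatenate([h_f, h_b], axis=1)        # (T, 2 * d_h): full past and future context
    return h @ V.T                                # linear output layer per frame
```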

7,290 citations


Journal ArticleDOI
TL;DR: In this article, the authors compare several methods of estimating Bayes factors when it is possible to simulate observations from the posterior distributions, via Markov chain Monte Carlo or other techniques, provided that each posterior distribution is well behaved in the sense of having a single dominant mode.
Abstract: The Bayes factor is a ratio of two posterior normalizing constants, which may be difficult to compute. We compare several methods of estimating Bayes factors when it is possible to simulate observations from the posterior distributions, via Markov chain Monte Carlo or other techniques. The methods that we study are all easily applied without consideration of special features of the problem, provided that each posterior distribution is well behaved in the sense of having a single dominant mode. We consider a simulated version of Laplace's method, a simulated version of Bartlett correction, importance sampling, and a reciprocal importance sampling technique. We also introduce local volume corrections for each of these. In addition, we apply the bridge sampling method of Meng and Wong. We find that a simulated version of Laplace's method, with local volume correction, furnishes an accurate approximation that is especially useful when likelihood function evaluations are costly. A simple bridge sampli...
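
One of the simplest estimators discussed, a simulated version of Laplace's method, can be sketched in a few lines: the posterior mean and covariance computed from MCMC draws stand in for the mode and curvature in the usual Laplace approximation to the marginal likelihood. This is a rough sketch under those assumptions; `log_post` (log-likelihood plus log-prior) is a user-supplied function, and the local volume correction described in the paper is omitted.

```python
import numpy as np

def log_marginal_laplace(samples, log_post):
    """samples: (n_draws, d) array of posterior draws; returns an estimate of log p(y)."""
    d = samples.shape[1]
    centre = samples.mean(axis=0)                          # stand-in for the posterior mode
    cov = np.atleast_2d(np.cov(samples, rowvar=False))     # posterior covariance estimate
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * d * np.log(2.0 * np.pi) + 0.5 * logdet + log_post(centre)

# The Bayes factor for model 1 against model 2 is then estimated by
# np.exp(log_marginal_laplace(draws1, log_post1) - log_marginal_laplace(draws2, log_post2)).
```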

2,191 citations


Journal ArticleDOI
TL;DR: In this paper, a hierarchical prior model is proposed to deal with weak prior information while avoiding the mathematical pitfalls of using improper priors in the mixture context; the sample generated from the full joint distribution of all unknown variables can be used as a basis for a thorough presentation of many aspects of the posterior distribution.
Abstract: New methodology for fully Bayesian mixture analysis is developed, making use of reversible jump Markov chain Monte Carlo methods that are capable of jumping between the parameter subspaces corresponding to different numbers of components in the mixture. A sample from the full joint distribution of all unknown variables is thereby generated, and this can be used as a basis for a thorough presentation of many aspects of the posterior distribution. The methodology is applied here to the analysis of univariate normal mixtures, using a hierarchical prior model that offers an approach to dealing with weak prior information while avoiding the mathematical pitfalls of using improper priors in the mixture context.
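
For readers unfamiliar with the reversible jump mechanism, a dimension-changing move from (k, θ_k) to (k', θ_{k'}) = g(θ_k, u), where u are auxiliary random numbers and the reverse move is deterministic, is accepted with a probability of the following general form (a standard statement of the acceptance ratio, not a detail specific to this paper's split/merge and birth/death moves):

```latex
\alpha = \min\left\{1,\;
  \frac{p(k', \theta_{k'} \mid y)\, j(k \mid k')}
       {p(k, \theta_{k} \mid y)\, j(k' \mid k)\, q(u)}
  \left| \frac{\partial \theta_{k'}}{\partial (\theta_k, u)} \right| \right\}
```

Here j(· | ·) is the probability of proposing the move type, q(u) is the density of the auxiliary variables, and the final factor is the Jacobian of the deterministic mapping between parameter spaces.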

2,018 citations


Journal ArticleDOI
TL;DR: An improved Bayesian method is presented for estimating phylogenetic trees from DNA sequence data; the posterior probabilities of phylogenies are used to estimate the maximum posterior probability (MAP) tree, which in the primate example analysed matches the maximum-likelihood topology and has a posterior probability of approximately 95%.
Abstract: An improved Bayesian method is presented for estimating phylogenetic trees using DNA sequence data. The birth-death process with species sampling is used to specify the prior distribution of phylogenies and ancestral speciation times, and the posterior probabilities of phylogenies are used to estimate the maximum posterior probability (MAP) tree. Monte Carlo integration is used to integrate over the ancestral speciation times for particular trees. A Markov Chain Monte Carlo method is used to generate the set of trees with the highest posterior probabilities. Methods are described for an empirical Bayesian analysis, in which estimates of the speciation and extinction rates are used in calculating the posterior probabilities, and a hierarchical Bayesian analysis, in which these parameters are removed from the model by an additional integration. The Markov Chain Monte Carlo method avoids the requirement of our earlier method for calculating MAP trees to sum over all possible topologies (which limited the number of taxa in an analysis to about five). The methods are applied to analyze DNA sequences for nine species of primates, and the MAP tree, which is identical to a maximum-likelihood estimate of topology, has a probability of approximately 95%.
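
In generic notation (not copied from the paper), the posterior probability of a topology τ_i given the sequence data X integrates the likelihood over the ancestral speciation times t under the birth-death prior:

```latex
P(\tau_i \mid X) =
\frac{P(\tau_i) \int P(X \mid \tau_i, t)\, p(t \mid \tau_i)\, dt}
     {\sum_j P(\tau_j) \int P(X \mid \tau_j, t)\, p(t \mid \tau_j)\, dt}
```

The inner integrals are the Monte Carlo integrations mentioned above, and the sum over topologies is handled by the Markov chain Monte Carlo sampler rather than by exhaustive enumeration.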

1,230 citations



Journal ArticleDOI
TL;DR: A new approach to shape recognition is explored, based on a virtually infinite family of binary features (queries) of the image data designed to accommodate prior information about shape invariance and regularity; a comparison with artificial neural network methods is also presented.
Abstract: We explore a new approach to shape recognition based on a virtually infinite family of binary features (queries) of the image data, designed to accommodate prior information about shape invariance and regularity. Each query corresponds to a spatial arrangement of several local topographic codes (or tags), which are in themselves too primitive and common to be informative about shape. All the discriminating power derives from relative angles and distances among the tags. The important attributes of the queries are a natural partial ordering corresponding to increasing structure and complexity; semi-invariance, meaning that most shapes of a given class will answer the same way to two queries that are successive in the ordering; and stability, since the queries are not based on distinguished points and substructures. No classifier based on the full feature set can be evaluated, and it is impossible to determine a priori which arrangements are informative. Our approach is to select informative features and build tree classifiers at the same time by inductive learning. In effect, each tree provides an approximation to the full posterior where the features chosen depend on the branch that is traversed. Due to the number and nature of the queries, standard decision tree construction based on a fixed-length feature vector is not feasible. Instead we entertain only a small random sample of queries at each node, constrain their complexity to increase with tree depth, and grow multiple trees. The terminal nodes are labeled by estimates of the corresponding posterior distribution over shape classes. An image is classified by sending it down every tree and aggregating the resulting distributions. The method is applied to classifying handwritten digits and synthetic linear and nonlinear deformations of three hundred LaTeX symbols. State-of-the-art error rates are achieved on the National Institute of Standards and Technology database of digits. The principal goal of the experiments on LaTeX symbols is to analyze invariance, generalization error and related issues, and a comparison with artificial neural network methods is presented in this context.
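
The aggregation step at test time is straightforward; the toy sketch below averages the terminal-node posterior estimates across trees and reports the modal class. The `drop_to_leaf` and `class_posterior` names are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def classify(image, trees, n_classes):
    """Average the leaf posteriors over all trees and return the modal class index."""
    total = np.zeros(n_classes)
    for tree in trees:
        leaf = tree.drop_to_leaf(image)      # answer the tree's queries until a terminal node
        total += leaf.class_posterior        # estimated P(class | leaf), length n_classes
    return int(np.argmax(total / len(trees)))
```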

1,214 citations


Journal ArticleDOI
TL;DR: A Bayesian statistical analysis of the conformations of side chains in proteins from the Protein Data Bank is presented; molecular mechanics calculations show a strong similarity with the experimental distributions, indicating that proteins attain their lowest-energy rotamers with respect to local backbone-side-chain interactions.
Abstract: We present a Bayesian statistical analysis of the conformations of side chains in proteins from the Protein Data Bank. This is an extension of the backbone-dependent rotamer library, and includes rotamer populations and average chi angles for a full range of phi, psi values. The Bayesian analysis used here provides a rigorous statistical method for taking account of varying amounts of data. Bayesian statistics requires the assumption of a prior distribution for parameters over their range of possible values. This prior distribution can be derived from previous data or from pooling some of the present data. The prior distribution is combined with the data to form the posterior distribution, which is a compromise between the prior distribution and the data. For the chi 2, chi 3, and chi 4 rotamer prior distributions, we assume that the probability of each rotamer type is dependent only on the previous chi rotamer in the chain. For the backbone-dependence of the chi 1 rotamers, we derive prior distributions from the product of the phi-dependent and psi-dependent probabilities. Molecular mechanics calculations with the CHARMM22 potential show a strong similarity with the experimental distributions, indicating that proteins attain their lowest energy rotamers with respect to local backbone-side-chain interactions. The new library is suitable for use in homology modeling, protein folding simulations, and the refinement of X-ray and NMR structures.
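
As a hedged illustration of how such a prior-data compromise is typically expressed for rotamer frequencies (the paper's actual Dirichlet-style formulation may differ in detail), the posterior probability of rotamer r in a given backbone bin can be written as

```latex
\hat{p}(r \mid \phi, \psi) =
\frac{n_r(\phi, \psi) + K\, p_{\mathrm{prior}}(r \mid \phi, \psi)}
     {n(\phi, \psi) + K}
```

where n_r is the observed count of rotamer r in the (phi, psi) bin, n the total count in the bin, and K an effective prior sample size: sparsely populated bins stay close to the prior, while well-populated bins are dominated by the data.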

814 citations


Journal ArticleDOI
TL;DR: A new estimator, called the maximum local mass (MLM) estimate, integrates local probability density and uses an optimality criterion that is appropriate for perception tasks: it finds the most probable approximately correct answer.
Abstract: The problem of color constancy may be solved if we can recover the physical properties of illuminants and surfaces from photosensor responses. We consider this problem within the framework of Bayesian decision theory. First, we model the relation among illuminants, surfaces, and photosensor responses. Second, we construct prior distributions that describe the probability that particular illuminants and surfaces exist in the world. Given a set of photosensor responses, we can then use Bayes’s rule to compute the posterior distribution for the illuminants and the surfaces in the scene. There are two widely used methods for obtaining a single best estimate from a posterior distribution. These are maximum a posteriori (MAP) and minimum mean-squared-error (MMSE) estimation. We argue that neither is appropriate for perception problems. We describe a new estimator, which we call the maximum local mass (MLM) estimate, that integrates local probability density. The new method uses an optimality criterion that is appropriate for perception tasks: It finds the most probable approximately correct answer. For the case of low observation noise, we provide an efficient approximation. We develop the MLM estimator for the color-constancy problem in which flat matte surfaces are uniformly illuminated. In simulations we show that the MLM method performs better than the MAP estimator and better than a number of standard color-constancy algorithms. We note conditions under which even the optimal estimator produces poor estimates: when the spectral properties of the surfaces in the scene are biased.
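
Schematically, and in generic notation rather than the paper's, the MLM estimate maximizes posterior mass in a small local neighbourhood rather than posterior density at a point:

```latex
\hat{x}_{\mathrm{MLM}} = \arg\max_{x} \int p(x' \mid y)\, K_\varepsilon(x' - x)\, dx'
```

where K_ε is a narrow local window. As the window shrinks to a point this reduces to the MAP estimate; a finite window rewards estimates surrounded by substantial probability mass, which is the sense in which the method seeks the most probable approximately correct answer.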

466 citations


Journal ArticleDOI
TL;DR: This paper presents a new simulation-based methodology for Bayesian inference in generalized linear mixed models, which provide a unified framework for exponential family regression models, overdispersed data and longitudinal studies; the approach involves the use of Markov chain Monte Carlo techniques.
Abstract: Generalized linear mixed models provide a unified framework for treatment of exponential family regression models, overdispersed data and longitudinal studies. These problems typically involve the presence of random effects and this paper presents a new methodology for making Bayesian inference about them. The approach is simulation-based and involves the use of Markov chain Monte Carlo techniques. The usual iterative weighted least squares algorithm is extended to include a sampling step based on the Metropolis–Hastings algorithm thus providing a unified iterative scheme. Non-normal prior distributions for the regression coefficients and for the random effects distribution are considered. Random effect structures with nesting required by longitudinal studies are also considered. Particular interests concern the significance of regression coefficients and assessment of the form of the random effects. Extensions to unknown scale parameters, unknown link functions, survival and frailty models are outlined.
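
The sampling step grafted onto the iterative weighted least squares scheme is a Metropolis-Hastings update; a generic version is sketched below, with the log target density and the proposal covariance left as user-supplied placeholders rather than the paper's specific choices.

```python
import numpy as np

def mh_step(beta, log_target, proposal_cov, rng):
    """One random-walk Metropolis-Hastings update for a block of coefficients."""
    proposal = rng.multivariate_normal(beta, proposal_cov)    # symmetric proposal
    log_alpha = log_target(proposal) - log_target(beta)       # log acceptance ratio
    if np.log(rng.uniform()) < log_alpha:
        return proposal, True
    return beta, False

# Example use: rng = np.random.default_rng(0)
# beta, accepted = mh_step(beta, log_target, 0.1 * np.eye(len(beta)), rng)
```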

329 citations



Journal ArticleDOI
TL;DR: The proposed dynamic sampling algorithms use posterior samples from previous updating stages and exploit conditional independence between groups of parameters to allow samples of parameters no longer of interest to be discarded, such as when a patient dies or is discharged.
Abstract: In dynamic statistical modeling situations, observations arise sequentially, causing the model to expand by progressive incorporation of new data items and new unknown parameters. For example, in clinical monitoring, patients and data arrive sequentially, and new patient-specific parameters are introduced with each new patient. Markov chain Monte Carlo (MCMC) might be used for continuous updating of the evolving posterior distribution, but would need to be restarted from scratch at each expansion stage. Thus MCMC methods are often too slow for real-time inference in dynamic contexts. By combining MCMC with importance resampling, we show how real-time sequential updating of posterior distributions can be effected. The proposed dynamic sampling algorithms use posterior samples from previous updating stages and exploit conditional independence between groups of parameters to allow samples of parameters no longer of interest to be discarded, such as when a patient dies or is discharged. We apply the ...
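
Stripped of the MCMC rejuvenation and of the discarding of parameters no longer of interest, the core importance-resampling update when a new observation arrives can be sketched as follows; `loglik_new` is a user-supplied stand-in for the likelihood of the newly arrived data and is not part of the paper's software.

```python
import numpy as np

def update_samples(samples, loglik_new, rng):
    """Reweight old posterior draws by the new data's likelihood, then resample."""
    logw = np.array([loglik_new(s) for s in samples])
    w = np.exp(logw - logw.max())                 # stabilised importance weights
    w /= w.sum()
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return [samples[i] for i in idx]              # approximate draws from the updated posterior
```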

Journal ArticleDOI
TL;DR: For the Cardiovascular Health Study, Bayesian model averaging predictively outperforms standard model selection and does a better job of assessing who is at high risk for a stroke.
Abstract: SUMMARY In the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for strokes, we apply Bayesian model averaging to the selection of variables in Cox proportional hazard models. We use an extension of the leaps-and-bounds algorithm for locating the models that are to be averaged over and make available S-PLUS software to implement the methods. Bayesian model averaging provides a posterior probability that each variable belongs in the model, a more directly interpretable measure of variable importance than a P-value. P-values from models preferred by stepwise methods tend to overstate the evidence for the predictive value of a variable and do not account for model uncertainty. We introduce the partial predictive score to evaluate predictive performance. For the Cardiovascular Health Study, Bayesian model averaging predictively outperforms standard model selection and does a better job of assessing who is at high risk for a stroke.
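
The posterior inclusion probability referred to above is simply a sum of posterior model probabilities over the models that contain the variable; in generic notation,

```latex
P(x_j \in \text{model} \mid D) = \sum_{k:\, x_j \in M_k} P(M_k \mid D),
\qquad
P(M_k \mid D) = \frac{P(D \mid M_k)\, P(M_k)}{\sum_l P(D \mid M_l)\, P(M_l)}
```

with the sum in practice restricted to the models located by the leaps-and-bounds search.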

Journal ArticleDOI
TL;DR: Bayes/frequentist correspondences between the p-value and the posterior probability of the null hypothesis are extended to multiple testing, in particular to the Bonferroni method, in which p-values are adjusted by multiplying by k, the number of tests considered.
Abstract: SUMMARY Bayes/frequentist correspondences between the p-value and the posterior probability of the null hypothesis have been studied in univariate hypothesis testing situations. This paper extends these comparisons to multiple testing and in particular to the Bonferroni multiple testing method, in which p-values are adjusted by multiplying by k, the number of tests considered. In the Bayesian setting, prior assessments may need to be adjusted to account for multiple hypotheses, resulting in corresponding adjustments to the posterior probabilities. Conditions are given for which the adjusted posterior probabilities roughly correspond to Bonferroni adjusted p-values.
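
For reference, the two quantities being matched are, on the frequentist side, the Bonferroni-adjusted p-value and, on the Bayesian side, the posterior probability of an individual null hypothesis; the paper's specific prior adjustments for k hypotheses are not reproduced here.

```latex
\tilde{p}_i = \min(1,\; k\, p_i),
\qquad
P(H_{0i} \mid \text{data}) =
\frac{\pi_{0i}\, f(\text{data} \mid H_{0i})}
     {\pi_{0i}\, f(\text{data} \mid H_{0i}) + (1 - \pi_{0i})\, f(\text{data} \mid H_{1i})}
```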

Journal ArticleDOI
TL;DR: In this article, a Bayesian analysis of the stochastic frontier model with composed error is presented, and the existence of the posterior distribution and posterior moments is examined under a commonly used class of (partly) noninformative prior distributions.

Journal ArticleDOI
TL;DR: Evidence is presented indicating that, in some domains, normal (Gaussian) distributions are more accurate than uniform distributions for modeling feature fluctuations, which motivates the development of new maximum-likelihood and MAP recognition formulations which are based on normal feature models.
Abstract: This paper examines statistical approaches to model-based object recognition. Evidence is presented indicating that, in some domains, normal (Gaussian) distributions are more accurate than uniform distributions for modeling feature fluctuations. This motivates the development of new maximum-likelihood and MAP recognition formulations which are based on normal feature models. These formulations lead to an expression for the posterior probability of the pose and correspondences given an image. Several avenues are explored for specifying a recognition hypothesis. In the first approach, correspondences are included as a part of the hypotheses. Search for solutions may be ordered as a combinatorial search in correspondence space, or as a search over pose space, where the same criterion can equivalently be viewed as a robust variant of chamfer matching. In the second approach, correspondences are not viewed as being a part of the hypotheses. This leads to a criterion that is a smooth function of pose that is amenable to local search by continuous optimization methods. The criterion is also suitable for optimization via the Expectation-Maximization (EM) algorithm, which alternates between pose refinement and re-estimation of correspondence probabilities until convergence is obtained. Recognition experiments are described using the criteria with features derived from video images and from synthetic range images.

Proceedings Article
01 Aug 1997
TL;DR: A new method is developed to represent probabilistic relations on multiple random events by using a powerful way of specifying conditional probability distributions in Bayesian networks, which provides for constraints on equalities of events and allows complex, nested combination functions to be defined.
Abstract: A new method is developed to represent probabilistic relations on multiple random events. Where previously knowledge bases containing probabilistic rules were used for this purpose, here a probability distribution over the relations is directly represented by a Bayesian network. By using a powerful way of specifying conditional probability distributions in these networks, the resulting formalism is more expressive than the previous ones. In particular, it provides for constraints on equalities of events, and it allows complex, nested combination functions to be defined.

Journal ArticleDOI
TL;DR: It is shown that the assignment of penalties in the puzzling step of the QP algorithm is a special case of a more general Bayesian weighting scheme for quartet topologies, which leads to an improvement in the efficiency of QP at recovering the true tree as well as to better theoretical understanding of the method itself.
Abstract: Quartet puzzling (QP), a heuristic tree search procedure for maximum-likelihood trees, has recently been introduced (Strimmer and von Haeseler 1996). This method uses maximum-likelihood criteria for quartets of taxa which are then combined to form trees based on larger numbers of taxa. Thus, QP can be practically applied to data sets comprising a much greater number of taxa than can other search algorithms such as stepwise addition and subsequent branch swapping as implemented, e.g., in DNAML (Felsenstein 1993). However, its ability to reconstruct the true tree is less than that of DNAML (Strimmer and von Haeseler 1996). Here, we show that the assignment of penalties in the puzzling step of the QP algorithm is a special case of a more general Bayesian weighting scheme for quartet topologies. Application of this general framework leads to an improvement in the efficiency of QP at recovering the true tree as well as to better theoretical understanding of the method itself. On average, the accuracy of QP increases by 10% over all cases studied, without compromising speed or requiring more computer memory. Consider the three different fully-bifurcating tree topologies Q1, Q2, and Q3 for four taxa (fig. 1). Denote by m1, m2, and m3 their corresponding maximum-likelihood (not log-likelihood) values. Note that m1 + m2 + m3 << 1. Evaluation via Bayes’ theorem of the three tree topologies given uniform prior information leads to posterior probabilities
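
With uniform prior probabilities on the three topologies and the maximum-likelihood values standing in for the likelihoods, those posterior probabilities take the simple form below; the paper uses weights of this kind to generalize the penalty assignment in the puzzling step.

```latex
p_i = \frac{m_i}{m_1 + m_2 + m_3}, \qquad i = 1, 2, 3
```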

Journal ArticleDOI
TL;DR: The authors use recent results in Markov chain Monte Carlo (MCMC) sampling to estimate the relative values of Gibbs partition functions and using these values, sample from joint posterior distributions on image scenes to enable a fully Bayesian procedure which does not fix the hyperparameters at some estimated or specified value, but enables uncertainty about these values to be propagated through to the estimated intensities.
Abstract: In recent years, many investigators have proposed Gibbs prior models to regularize images reconstructed from emission computed tomography data. Unfortunately, hyperparameters used to specify Gibbs priors can greatly influence the degree of regularity imposed by such priors and, as a result, numerous procedures have been proposed to estimate hyperparameter values from observed image data. Many of these procedures attempt to maximize the joint posterior distribution on the image scene. To implement these methods, approximations to the joint posterior densities are required, because the dependence of the Gibbs partition function on the hyperparameter values is unknown. Here, the authors use recent results in Markov chain Monte Carlo (MCMC) sampling to estimate the relative values of Gibbs partition functions and, using these values, sample from joint posterior distributions on image scenes. This allows for a fully Bayesian procedure which does not fix the hyperparameters at some estimated or specified value, but enables uncertainty about these values to be propagated through to the estimated intensities. The authors utilize realizations from the posterior distribution for determining credible regions for the intensity of the emission source. The authors consider two different Markov random field (MRF) models: the power model and a line-site model. As applications they estimate the posterior distribution of source intensities from computer simulated data as well as data collected from a physical single photon emission computed tomography (SPECT) phantom.

Journal ArticleDOI
TL;DR: A framework of quasi-Bayes (QB) learning of the parameters of the continuous density hidden Markov model (CDHMM) with Gaussian mixture state observation densities is presented, with a simple forgetting mechanism to adjust the contribution of previously observed sample utterances.
Abstract: We present a framework of quasi-Bayes (QB) learning of the parameters of the continuous density hidden Markov model (CDHMM) with Gaussian mixture state observation densities. The QB formulation is based on the theory of recursive Bayesian inference. The QB algorithm is designed to incrementally update the hyperparameters of the approximate posterior distribution and the CDHMM parameters simultaneously. By further introducing a simple forgetting mechanism to adjust the contribution of previously observed sample utterances, the algorithm is adaptive in nature and capable of performing an online adaptive learning using only the current sample utterance. It can, thus, be used to cope with the time-varying nature of some acoustic and environmental variabilities, including mismatches caused by changing speakers, channels, and transducers. As an example, the QB learning framework is applied to on-line speaker adaptation and its viability is confirmed in a series of comparative experiments using a 26-letter English alphabet vocabulary.

Journal ArticleDOI
TL;DR: In this paper, the authors consider priors obtained by ensuring approximate frequentist validity of posterior quantiles and the posterior distribution function, and show that, at the second order of approximation, the two approaches do not necessarily lead to identical conclusions.
Abstract: SUMMARY The paper considers priors obtained by ensuring approximate frequentist validity of (a) posterior quantiles, and of (b) the posterior distribution function. It is seen that, at the second order of approximation, the two approaches do not necessarily lead to identical conclusions. Examples are given to illustrate this. The role of invariance in the context of probability matching is also discussed.

Journal ArticleDOI
TL;DR: This work proposes an approach that dramatically decreases the computation time of the standard bootstrap filter and at the same time preserves its excellent performance.
Abstract: In discrete-time system analysis, nonlinear recursive state estimation is often addressed by a Bayesian approach using a resampling technique called the weighted bootstrap. Bayesian bootstrap filtering is a very powerful technique since it is not restricted by model assumptions of linearity and/or Gaussian noise. The standard implementation of the bootstrap filter, however, is not time efficient for large sample sizes, which often precludes its utilization. We propose an approach that dramatically decreases the computation time of the standard bootstrap filter and at the same time preserves its excellent performance. The time decrease is realized by resampling the prior into the posterior distribution at time instant k by using sampling blocks of varying size, rather than a sample at a time as in the standard approach. The size of each block resampled into the posterior in the algorithm proposed here depends on the product of the normalized weight determined by the likelihood function for each prior sample and the sample size N under consideration.
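
A sketch of the block-resampling idea: each prior sample is copied into the posterior in a block whose size is driven by N times its normalized likelihood weight. The residual-style allocation below is one concrete way to do this and may differ from the paper's exact scheme.

```python
import numpy as np

def block_resample(prior_samples, weights, rng):
    """Return indices into prior_samples, with counts proportional to the weights."""
    N = len(prior_samples)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    counts = np.floor(N * w).astype(int)           # deterministic block sizes
    shortfall = N - counts.sum()
    if shortfall > 0:                               # fill the remaining slots at random
        resid = N * w - counts
        extra = rng.choice(N, size=shortfall, p=resid / resid.sum())
        for i in extra:
            counts[i] += 1
    return np.repeat(np.arange(N), counts)          # resampled posterior indices
```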

Journal ArticleDOI
TL;DR: An alternative design is proposed that bases decisions on the ability of the data to persuade either a sceptic or an enthusiast, and calls for termination at any interim analysis at which an observed persuasion probability exceeds its critical value.
Abstract: Many popular sequential phase II clinical trial designs optimize some criterion subject to constraints on the error probabilities at null and alternative values of the response rate. Such designs may forfeit optimality if one fails to conduct analyses strictly according to plan. Moreover, a decision, say, to accept the experimental therapy at one interim analysis does not necessarily imply the same degree of evidence as the same decision when made at another analysis. I propose an alternative design that bases decisions on the ability of the data to persuade either a sceptic or an enthusiast. My standard of evidence, called the persuasion probability, is based on the Bayesian posterior probability that the experimental treatment is superior to the standard. The design calls for termination at any interim analysis at which an observed persuasion probability exceeds its critical value. I investigate the standards of evidence implied by some frequentist procedures and calculate frequentist properties of persuasion-probability designs.
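
A small, self-contained illustration of the kind of posterior probability involved, with made-up response counts and flat Beta priors rather than the sceptic and enthusiast priors of the paper: the probability that the experimental response rate exceeds the standard one is computed by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
p_exp = rng.beta(1 + 14, 1 + 6, size=100_000)    # 14/20 responses on the experimental arm, Beta(1, 1) prior
p_std = rng.beta(1 + 9, 1 + 11, size=100_000)    # 9/20 responses on the standard arm, Beta(1, 1) prior
persuasion = np.mean(p_exp > p_std)              # posterior P(experimental rate > standard rate)
print(round(float(persuasion), 3))
```

In the proposed design, a quantity of this form computed under a sceptical prior and under an enthusiastic prior is compared with a pre-specified critical value at each interim analysis.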

Journal ArticleDOI
TL;DR: In this paper, the author considers probability models for the estimation of normal means that allow some of the means to be equal; these models, called product partition models, specify prior probabilities for a random partition, and the posterior probability of the partition given the observations has the same form.
Abstract: I consider probability models for the estimation of normal means that allow for some of the means to be equal. These probability models, called product partition models, specify prior probabilities for a random partition. The posterior probability of the partition given the observations has the same form. The resulting estimate of the means—the product estimate—is obtained by conditioning on the partition and summing over all possible partitions. The large number of computations involved leads to the use of Markov sampling to compute the product estimate. I compare the product estimate to other estimates of normal means both in a simulation study and in the prediction of baseball batting averages.
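
In the standard product partition form (generic notation; the paper's particular cohesion functions are not spelled out here), both the prior over a partition ρ = {S_1, ..., S_b} of the means and its posterior factor over blocks:

```latex
P(\rho) \propto \prod_{i=1}^{b} c(S_i),
\qquad
P(\rho \mid y) \propto \prod_{i=1}^{b} c(S_i)\, f\!\left(y_{S_i}\right)
```

where c(S_i) is the cohesion of block S_i and f(y_{S_i}) is the marginal density of the observations in that block. The product estimate of each mean then averages the within-partition estimates, weighted by P(ρ | y), which is the sum over partitions that motivates the Markov sampling.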

Journal ArticleDOI
TL;DR: This paper investigates the derivation of SIC to identify terms that are discarded as asymptotically negligible but may be significant in small to moderate sample-size applications, and suggests several SIC variants based on the inclusion of these terms.
Abstract: The Schwarz (1978) information criterion, SIC, is a widely-used tool in model selection, largely due to its computational simplicity and effective performance in many modeling frameworks. The derivation of SIC (Schwarz, 1978) establishes the criterion as an asymptotic approximation to a transformation of the Bayesian posterior probability of a candidate model. In this paper, we investigate the derivation for the identification of terms which are discarded as being asymptotically negligible, but which may be significant in small to moderate sample-size applications. We suggest several SIC variants based on the inclusion of these terms. The results of a simulation study show that the variants improve upon the performance of SIC in two important areas of application: multiple linear regression and time series analysis.
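
For concreteness, the criterion in question has the familiar form (standard statement; the paper's variants reinstate lower-order terms that this expression drops):

```latex
\mathrm{SIC} = -2 \log L(\hat{\theta}) + k \log n
```

where L(θ̂) is the maximized likelihood, k the number of free parameters, and n the sample size; Schwarz's derivation shows this to be an asymptotic approximation to −2 times the log of a quantity proportional to the posterior probability of the candidate model.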

Journal ArticleDOI
TL;DR: Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
Abstract: ID3's information gain heuristic is well-known to be biased towards multi-valued attributes. This bias is only partially compensated for by C4.5's gain ratio. Several alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). All of these metrics are biased towards splits with smaller branches, where low-entropy splits are likely to occur by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis that the class distribution is independent of the split. Both gain and the chi-squared tests arise in asymptotic approximations to the hypergeometric, with similar criteria for their admissibility. Previous failures of pre-pruning are traced in large part to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. The choice of split-selection metric typically has little effect on accuracy, but can profoundly affect complexity and the effectiveness and efficiency of pruning. Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
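
A minimal illustration of the exact null distribution in the simplest case, a binary split with two classes and hypothetical counts: under the null hypothesis that the class distribution is independent of the split, the class count in one branch is hypergeometric.

```python
from scipy.stats import hypergeom

# Hypothetical numbers: 40 training cases, 16 of class 1; the left branch
# receives 15 cases, 10 of which are class 1.
N_total, K_class1, n_left, k_left = 40, 16, 15, 10

print(hypergeom.pmf(k_left, N_total, K_class1, n_left))     # exact probability of this branch count
print(hypergeom.sf(k_left - 1, N_total, K_class1, n_left))  # P(count >= k_left), a one-sided p-value
```

The multiple hypergeometric referred to in the abstract generalizes this to splits with more than two branches and more than two classes.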

Proceedings Article
01 Dec 1997
TL;DR: This paper shows how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable, and extends the framework to deal with hyperparameters, leading to a simple re-estimation procedure.
Abstract: Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.
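
The objective being minimized is the Kullback-Leibler divergence from the approximating distribution to the true posterior; with a full-covariance Gaussian family (generic notation, not the paper's exact parameterization):

```latex
q(\mathbf{w}) = \mathcal{N}(\mathbf{w};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}),
\qquad
\mathrm{KL}\!\left(q \,\middle\|\, p(\cdot \mid D)\right)
= \int q(\mathbf{w}) \log \frac{q(\mathbf{w})}{p(\mathbf{w} \mid D)}\, d\mathbf{w}
```

minimized over μ and the full (rather than diagonal) covariance Σ, which is what allows posterior correlations between weights to be captured.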

Journal ArticleDOI
TL;DR: Failure times that are grouped according to shared environments arise commonly in statistical practice, where multiple responses may be observed for each of many units and unit-specific parameters are modelled parametrically.
Abstract: Failure times that are grouped according to shared environments arise commonly in statistical practice. That is, multiple responses may be observed for each of many units. For instance, the units might be patients or centers in a clinical trial setting. Bayesian hierarchical models are appropriate for data analysis in this context. At the first stage of the model, survival times can be modelled via the Cox partial likelihood, using a justification due to Kalbfleisch (1978, Journal of the Royal Statistical Society, Series B 40, 214-221). Thus, questionable parametric assumptions are avoided. Conventional wisdom dictates that it is comparatively safe to make parametric assumptions at subsequent stages. Thus, unit-specific parameters are modelled parametrically. The posterior distribution of parameters given observed data is examined using Markov chain Monte Carlo methods. Specifically, the hybrid Monte Carlo method, as described by Neal (1993a, in Advances in Neural Information Processing 5, 475-482; 1993b, Probabilistic inference using Markov chain Monte Carlo methods), is utilized.

Journal ArticleDOI
TL;DR: The exponential decay of the minimum error probability is extended to the general M-hypothesis Bayesian detection problem where zero cost is assigned to correct decisions, and the Bayesian cost function's exponential decay constant is found to equal the minimum Chernoff distance among all distinct pairs of hypothesized probability distributions.
Abstract: In two-hypothesis detection problems with i.i.d. observations, the minimum error probability decays exponentially with the amount of data, with the constant in the exponent equal to the Chernoff distance between the probability distributions characterizing the hypotheses. We extend this result to the general M-hypothesis Bayesian detection problem where zero cost is assigned to correct decisions, and find that the Bayesian cost function's exponential decay constant equals the minimum Chernoff distance among all distinct pairs of hypothesized probability distributions.
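
For two hypotheses with i.i.d. observations, the standard statement from which the paper generalizes is that the minimum error probability behaves like exp(−n C(p_0, p_1)), with the Chernoff distance

```latex
C(p_0, p_1) = -\min_{0 \le s \le 1} \log \sum_{x} p_0(x)^{s}\, p_1(x)^{1-s}
```

and the result above says that in the M-hypothesis problem with zero cost for correct decisions, the Bayesian cost decays with constant equal to the minimum of C(p_i, p_j) over distinct pairs (i, j).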

Journal ArticleDOI
TL;DR: In this paper, the question of whether, and when, the Bayesian approach produces worthwhile answers is investigated conditionally, given the information provided by the experiment, and an important initial insight on the matter is that posterior estimates of a non-identifiable parameter can actually be inferior to the prior (no-data) estimate of that parameter, even as the sample size grows to infinity.
Abstract: Although classical statistical methods are inapplicable in point estimation problems involving nonidentifiable parameters, a Bayesian analysis using proper priors can produce a closed form, interpretable point estimate in such problems. The question of whether, and when, the Bayesian approach produces worthwhile answers is investigated. In contrast to the preposterior analysis of this question offered by Kadane, we examine the question conditionally, given the information provided by the experiment. An important initial insight on the matter is that posterior estimates of a nonidentifiable parameter can actually be inferior to the prior (no-data) estimate of that parameter, even as the sample size grows to infinity. In general, our goal is to characterize, within the space of prior distributions, classes of priors that lead to posterior estimates that are superior, in some reasonable sense, to one's prior estimate. This goal is shown to be feasible through a detailed examination of a particular t...

Journal ArticleDOI
R. van Engelen
TL;DR: A general framework for approximating Bayesian belief networks through model simplification by arc removal is proposed: given an upper bound on the absolute error allowed on the prior and posterior probability distributions of the approximated network, a subset of arcs is removed, thereby speeding up probabilistic inference.
Abstract: I propose a general framework for approximating Bayesian belief networks through model simplification by arc removal. Given an upper bound on the absolute error allowed on the prior and posterior probability distributions of the approximated network, a subset of arcs is removed, thereby speeding up probabilistic inference.