Journal ArticleDOI

Comparison of Bayesian predictive methods for model selection

01 May 2017-Statistics and Computing (Springer US)-Vol. 27, Iss: 3, pp 711-735
TL;DR: The study demonstrates that model selection can greatly benefit from using cross-validation outside the searching process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.
Abstract: The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on the variable subset selection for regression and classification and perform several numerical experiments using both simulated and real world data. The results show that the optimization of a utility estimate such as the cross-validation (CV) score is liable to find overfitted models due to relatively high variance in the utility estimates when the data is scarce. This can also lead to substantial selection induced bias and optimism in the performance evaluation for the selected model. From a predictive viewpoint, best results are obtained by accounting for model uncertainty by forming the full encompassing model, such as the Bayesian model averaging solution over the candidate models. If the encompassing model is too complex, it can be robustly simplified by the projection method, in which the information of the full model is projected onto the submodels. This approach is substantially less prone to overfitting than selection based on the CV score. Overall, the projection method also appears to outperform the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that model selection can greatly benefit from using cross-validation outside the searching process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.
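As a rough illustration of the selection-induced bias and optimism discussed in the abstract, the following sketch (not from the paper; all numbers are made up) compares the utility estimate of the candidate picked by maximizing a noisy CV-like score with an independent re-evaluation of the same candidate outside the search:

```python
# Minimal sketch: optimizing a noisy utility estimate over many candidate
# models gives an optimistically biased estimate for the selected model,
# whereas re-evaluating the winner on data held outside the search does not.
import numpy as np

rng = np.random.default_rng(0)
n_models = 100        # number of candidate models (hypothetical)
true_utility = 0.0    # all candidates are assumed equally (un)predictive
noise_sd = 0.3        # sampling variability of the utility estimate (assumed)
n_rep = 2000          # Monte Carlo repetitions

inside, outside = [], []
for _ in range(n_rep):
    # Utility estimated on the same data that guides the search.
    search_estimates = true_utility + noise_sd * rng.standard_normal(n_models)
    best = np.argmax(search_estimates)
    inside.append(search_estimates[best])
    # Independent estimate for the selected model (CV outside the search).
    outside.append(true_utility + noise_sd * rng.standard_normal())

print(f"selection-induced optimism: {np.mean(inside):+.3f}")   # clearly > 0
print(f"independent re-evaluation:  {np.mean(outside):+.3f}")  # close to 0
```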


Citations
Journal ArticleDOI
TL;DR: In this paper, leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are used to estimate pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values.
Abstract: Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparison of predictive errors between two models. We implement the computations in an R package called loo and demonstrate using models fit with the Bayesian inference package Stan.
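As a sketch of the quantities described above, the WAIC part of the computation can be written directly from an S x n matrix of pointwise log-likelihood draws (the same input the loo R package expects); the Pareto-smoothing step needed for PSIS-LOO is omitted here, and in practice one would use the loo package itself:

```python
# Hedged sketch of the WAIC computation from posterior simulation draws.
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """log_lik: array of shape (S, n) with log p(y_i | theta_s)."""
    S, n = log_lik.shape
    lppd = logsumexp(log_lik, axis=0) - np.log(S)   # log pointwise predictive density
    p_waic = np.var(log_lik, axis=0, ddof=1)        # effective number of parameters
    elpd_i = lppd - p_waic                          # pointwise elpd contributions
    return {"elpd_waic": elpd_i.sum(),
            "p_waic": p_waic.sum(),
            "waic": -2.0 * elpd_i.sum(),
            "se_elpd": np.sqrt(n * np.var(elpd_i, ddof=1))}  # approximate standard error
```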

1,533 citations

Book
01 Jan 2009
TL;DR: The book develops singular learning theory, covering singularity theory, algebraic geometry, zeta functions and singular integrals, empirical processes, singular learning machines, and singular information science.
Abstract (table of contents): Preface; 1. Introduction; 2. Singularity theory; 3. Algebraic geometry; 4. Zeta functions and singular integrals; 5. Empirical processes; 6. Singular learning theory; 7. Singular learning machines; 8. Singular information science; Bibliography; Index.

177 citations

Journal ArticleDOI
TL;DR: A concept of effective number of nonzero parameters is introduced, an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions is shown, and the previous default choices are argued to be dubious based on their tendency to favor solutions with more unshrunk parameters than the authors typically expect a priori.
Abstract: The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but has previously suffered from two problems. First, there has been no systematic way of specifying a prior for the global shrinkage hyperparameter based on the prior information about the degree of sparsity in the parameter vector. Second, the horseshoe prior has the undesired property that there is no possibility of specifying separately information about sparsity and the amount of regularization for the largest coefficients, which can be problematic with weakly identified parameters, such as the logistic regression coefficients in the case of data separation. This paper proposes solutions to both of these problems. We introduce a concept of effective number of nonzero parameters, show an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions, and argue that the previous default choices are dubious based on their tendency to favor solutions with more unshrunk parameters than we typically expect a priori. Moreover, we introduce a generalization to the horseshoe prior, called the regularized horseshoe, that allows us to specify a minimum level of regularization to the largest values. We show that the new prior can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spike-and-slab with an infinitely wide slab. Numerical experiments on synthetic and real world data illustrate the benefit of both of these theoretical advances.
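The following prior-predictive sketch (an illustration only, not the paper's code; the noise scale, the prior guess p0, the slab-width prior, and the assumption of standardized predictors are all assumptions for the sketch) draws coefficients from a regularized horseshoe of the kind described above and computes the implied effective number of nonzero parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, sigma = 1000, 100, 1.0   # predictors, observations, noise scale (assumed)
p0 = 10                        # prior guess for the number of relevant predictors
tau0 = p0 / (D - p0) * sigma / np.sqrt(n)   # global scale from the sparsity assumption

S = 4000
tau = tau0 * np.abs(rng.standard_cauchy(S))        # tau ~ half-Cauchy(0, tau0)
lam = np.abs(rng.standard_cauchy((S, D)))          # local scales, half-Cauchy(0, 1)
c2 = 1.0 / rng.gamma(2.0, 0.5, S)                  # slab width c^2 ~ Inv-Gamma(2, 2) (assumed)
# Regularized local scales: the slab caps the largest coefficients.
lam_tilde2 = (c2[:, None] * lam**2) / (c2[:, None] + tau[:, None]**2 * lam**2)
beta = rng.standard_normal((S, D)) * tau[:, None] * np.sqrt(lam_tilde2)

# Shrinkage factors and the effective number of nonzero coefficients
# (assuming predictors standardized to unit variance).
kappa = 1.0 / (1.0 + n * sigma**-2 * tau[:, None]**2 * lam_tilde2)
m_eff = (1.0 - kappa).sum(axis=1)
print(f"prior median of m_eff: {np.median(m_eff):.1f}  (prior guess p0 = {p0})")
```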

151 citations


Cites methods or results from "Comparison of Bayesian predictive methods for model selection"

  • ...Below this threshold the nonzero components in β are too… (Footnote: the experiments with the original horseshoe on the real world data are taken from Piironen and Vehtari (2017a) and were run using Stan version 2.12.0, whereas the experiments with the regularized horseshoe were run using the newer version 2.15.1.)


  • ...After fitting the full model, a truly sparse solution without losing predictive accuracy could be obtained using the projective variable selection (Piironen and Vehtari, 2017b)....


  • ...We use the four microarray cancer classification datasets from our earlier paper (Piironen and Vehtari, 2017a)....


  • ...For instance, in binary classification, if we have the same number of observations from both classes, then µ = 0.5, yielding σ̃2 = 4 which was observed to give good results in our earlier study (Piironen and Vehtari, 2017a)....


  • ...This paper deals with sparse Bayesian estimation and is an extension to our earlier work (Piironen and Vehtari, 2017a)....


Journal ArticleDOI
TL;DR: An informal introduction to frequentist hypothesis testing, building up the one-sample t-test from first principles, clarifying what a p-value does and does not tell us, discussing Type I/II and Type S/M errors and statistical power, and recommending appropriately powered studies with exploratory analyses clearly separated from planned comparisons.
Abstract: We present the fundamental ideas underlying statistical hypothesis testing using the frequentist framework. We start with a simple example that builds up the one-sample t-test from the beginning, explaining important concepts such as the sampling distribution of the sample mean, and the iid assumption. Then, we examine the meaning of the p-value in detail and discuss several important misconceptions about what a p-value does and does not tell us. This leads to a discussion of Type I, II error and power, and Type S and M error. An important conclusion from this discussion is that one should aim to carry out appropriately powered studies. Next, we discuss two common issues that we have encountered in psycholinguistics and linguistics: running experiments until significance is reached and the ‘garden-of-forking-paths’ problem discussed by Gelman and others. The best way to use frequentist methods is to run appropriately powered studies, check model assumptions, clearly separate exploratory data analysis from planned comparisons decided upon before the study was run, and always attempt to replicate results.
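As a small illustration of the "appropriately powered studies" recommendation above, here is a hedged simulation-based power calculation for a one-sample t-test; the sample size, effect size, noise level, and alpha are arbitrary assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, true_effect, sd, alpha = 30, 0.3, 1.0, 0.05   # all assumed for the demo
n_sim = 10_000

rejections = 0
for _ in range(n_sim):
    y = rng.normal(true_effect, sd, size=n)                  # simulated experiment
    if stats.ttest_1samp(y, popmean=0.0).pvalue < alpha:     # standard one-sample t-test
        rejections += 1
print(f"estimated power with n={n}: {rejections / n_sim:.2f}")
```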

133 citations

Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organize, and evaluate forecasts.

119 citations

References
Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined and derive a measure pD for the effective number of parameters in a model as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest, which is related to other information criteria and has an approximate decision-theoretic justification.
Abstract: Summary. We consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined. Using an information theoretic argument we derive a measure pD for the effective number of parameters in a model as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest. In general pD approximately corresponds to the trace of the product of Fisher's information and the posterior covariance, which in normal models is the trace of the ‘hat’ matrix projecting observations onto fitted values. Its properties in exponential families are explored. The posterior mean deviance is suggested as a Bayesian measure of fit or adequacy, and the contributions of individual observations to the fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. Adding pD to the posterior mean deviance gives a deviance information criterion for comparing models, which is related to other information criteria and has an approximate decision theoretic justification. The procedure is illustrated in some examples, and comparisons are drawn with alternative Bayesian and classical proposals. Throughout it is emphasized that the quantities required are trivial to compute in a Markov chain Monte Carlo analysis.
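A minimal sketch of the quantities in this abstract computed from MCMC output, assuming a simple i.i.d. normal likelihood (the model choice and input shapes are assumptions for illustration only):

```python
import numpy as np

def normal_deviance(y, mu, sigma):
    """D(theta) = -2 log p(y | theta) for an i.i.d. normal likelihood."""
    return -2.0 * np.sum(-0.5 * np.log(2.0 * np.pi * sigma[:, None] ** 2)
                         - (y - mu) ** 2 / (2.0 * sigma[:, None] ** 2), axis=1)

def dic(y, mu_draws, sigma_draws):
    """y: (n,) data; mu_draws: (S, n) fitted means; sigma_draws: (S,) noise scales."""
    D = normal_deviance(y, mu_draws, sigma_draws)            # deviance at each draw
    D_at_mean = normal_deviance(y, mu_draws.mean(axis=0, keepdims=True),
                                np.atleast_1d(sigma_draws.mean()))[0]
    p_D = D.mean() - D_at_mean                               # effective number of parameters
    return {"pD": p_D, "DIC": D.mean() + p_D}                # posterior mean deviance + pD
```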

11,691 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a new framework for the construction of reversible Markov chain samplers that jump between parameter subspaces of differing dimensionality, which is flexible and entirely constructive.
Abstract: Markov chain Monte Carlo methods for Bayesian computation have until recently been restricted to problems where the joint distribution of all variables has a density with respect to some fixed standard underlying measure. They have therefore not been available for application to Bayesian model determination, where the dimensionality of the parameter vector is typically not fixed. This paper proposes a new framework for the construction of reversible Markov chain samplers that jump between parameter subspaces of differing dimensionality, which is flexible and entirely constructive. It should therefore have wide applicability in model determination problems. The methodology is illustrated with applications to multiple change-point analysis in one and two dimensions, and to a Bayesian comparison of binomial experiments.
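The following toy sketch (not from the paper) jumps between a fixed-mean model and a free-mean model for normal data, showing the dimension-matching acceptance ratio in its simplest form (identity mapping, Jacobian 1); the priors, proposal scales, and data are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(0.4, 1.0, size=50)      # toy data, true mean 0.4 (assumed)

def loglik(theta):                     # y_i ~ N(theta, 1)
    return np.sum(stats.norm.logpdf(y, loc=theta, scale=1.0))

prior_sd = 1.0       # theta ~ N(0, prior_sd^2) under the free-mean model (assumed)
q_sd = 0.5           # proposal for the "birth" move (assumed)
n_iter = 20_000
model, theta = 0, 0.0                  # start in M0 (theta fixed at 0); equal model priors
visits = np.zeros(2)

for _ in range(n_iter):
    if rng.random() < 0.5:             # attempt a between-model (trans-dimensional) move
        if model == 0:                 # birth: M0 -> M1, draw the new parameter theta*
            theta_star = rng.normal(0.0, q_sd)
            log_r = (loglik(theta_star) + stats.norm.logpdf(theta_star, 0, prior_sd)
                     - loglik(0.0) - stats.norm.logpdf(theta_star, 0, q_sd))
            if np.log(rng.random()) < log_r:
                model, theta = 1, theta_star
        else:                          # death: M1 -> M0, discard theta
            log_r = (loglik(0.0) + stats.norm.logpdf(theta, 0, q_sd)
                     - loglik(theta) - stats.norm.logpdf(theta, 0, prior_sd))
            if np.log(rng.random()) < log_r:
                model, theta = 0, 0.0
    elif model == 1:                   # within-model random-walk update for theta
        prop = theta + 0.3 * rng.standard_normal()
        log_r = (loglik(prop) + stats.norm.logpdf(prop, 0, prior_sd)
                 - loglik(theta) - stats.norm.logpdf(theta, 0, prior_sd))
        if np.log(rng.random()) < log_r:
            theta = prop
    visits[model] += 1

print("posterior model probabilities:", visits / n_iter)
```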

6,188 citations

Journal ArticleDOI
TL;DR: Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty and provides improved out-of-sample predictive performance.
Abstract: Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.
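A minimal sketch of the BMA mixture described above, assuming the log marginal likelihoods of the candidate models are already available (computing them is the hard part in practice); the numbers and predictive means are hypothetical:

```python
import numpy as np

def bma_weights(log_marginal_liks, log_prior_probs=None):
    """Posterior model probabilities p(M_k | y) from log p(y | M_k)."""
    log_marginal_liks = np.asarray(log_marginal_liks, dtype=float)
    if log_prior_probs is None:        # uniform prior over the candidate models
        log_prior_probs = np.full(len(log_marginal_liks),
                                  -np.log(len(log_marginal_liks)))
    log_w = log_marginal_liks + log_prior_probs
    log_w -= np.logaddexp.reduce(log_w)          # normalize in log space
    return np.exp(log_w)

# Hypothetical numbers: three candidate models and their predictive means.
w = bma_weights([-105.2, -103.8, -104.6])
pred_means = np.array([1.9, 2.3, 2.1])
print("weights:", np.round(w, 3), " BMA prediction:", float(w @ pred_means))
```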

3,942 citations


"Comparison of Bayesian predictive m..." refers methods or result in this paper

  • ...Our results agree with what is known about the good performance of the BMA (Hoeting et al. 1999; Raftery and Zheng 2003)....


  • ...See the review by Hoeting et al. (1999) for a thorough discussion of Bayesian model averaging....


  • ...This agrees with what is known about the good performance of the BMA (Hoeting et al. 1999; Raftery and Zheng 2003)....


  • ...This result is in perfect accordance with what is known about the good performance of the BMA (Hoeting et al. 1999; Raftery and Zheng 2003)....


Journal ArticleDOI
TL;DR: In this paper, the Gibbs sampler is used to indirectly sample from the multinomial posterior distribution on the set of possible subset choices to identify the promising subsets by their more frequent appearance in the Gibbs sample.
Abstract: A crucial problem in building a multiple regression model is the selection of predictors to include. The main thrust of this article is to propose and develop a procedure that uses probabilistic considerations for selecting promising subsets. This procedure entails embedding the regression setup in a hierarchical normal mixture model where latent variables are used to identify subset choices. In this framework the promising subsets of predictors can be identified as those with higher posterior probability. The computational burden is then alleviated by using the Gibbs sampler to indirectly sample from this multinomial posterior distribution on the set of possible subset choices. Those subsets with higher probability—the promising ones—can then be identified by their more frequent appearance in the Gibbs sample.
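A compact, illustrative Gibbs sampler in the spirit of the procedure described above, with the noise variance held fixed for brevity; the spike and slab scales, inclusion probability, and simulated data are assumptions for the sketch:

```python
import numpy as np
from scipy import stats

def ssvs_gibbs(X, y, n_iter=2000, tau=0.05, c=10.0, p_incl=0.5, sigma=1.0, seed=0):
    """Toy SSVS sampler: beta_j ~ (1 - g_j) N(0, tau^2) + g_j N(0, (c*tau)^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    gamma = np.ones(d, dtype=int)
    gamma_draws = np.zeros((n_iter, d))
    for it in range(n_iter):
        # beta | gamma, y  (conjugate multivariate normal)
        prior_var = np.where(gamma == 1, (c * tau) ** 2, tau ** 2)
        A = np.linalg.inv(XtX / sigma**2 + np.diag(1.0 / prior_var))
        beta = rng.multivariate_normal(A @ Xty / sigma**2, A)
        # gamma_j | beta_j  (Bernoulli; ratio of slab vs. spike densities)
        a = p_incl * stats.norm.pdf(beta, 0.0, c * tau)
        b = (1 - p_incl) * stats.norm.pdf(beta, 0.0, tau)
        gamma = rng.binomial(1, a / (a + b))
        gamma_draws[it] = gamma
    return gamma_draws

# Hypothetical usage: posterior inclusion frequencies identify promising subsets.
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 8))
y = X[:, :2] @ np.array([1.5, -1.0]) + rng.standard_normal(100)
print(ssvs_gibbs(X, y)[500:].mean(axis=0).round(2))   # drop burn-in draws
```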

2,780 citations

Journal ArticleDOI
TL;DR: In this article, leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are used to estimate pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values.
Abstract: Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparison of predictive errors between two models. We implement the computations in an R package called 'loo' and demonstrate using models fit with the Bayesian inference package Stan.

2,455 citations