Posted Content

Approximate Laplace approximations for scalable model selection

TL;DR: It is proved that in generalized (possibly non‐linear) models ALA achieves a strong form of model selection consistency for a suitably‐defined optimal model, at the same functional rates as exact computation.
Abstract: We propose the approximate Laplace approximation (ALA) to evaluate integrated likelihoods, a bottleneck in Bayesian model selection. The Laplace approximation (LA) is a popular tool that speeds up such computation and enjoys strong model selection properties. However, when the sample size is large or one considers many models, the cost of the required optimizations becomes impractical. ALA reduces the cost to that of solving a least-squares problem for each model. Further, it enables efficient computation across models, such as sharing pre-computed sufficient statistics and certain operations in matrix decompositions. We prove that in generalized (possibly non-linear) models ALA achieves a strong form of model selection consistency for a suitably-defined optimal model, at the same functional rates as exact computation. We consider fixed- and high-dimensional problems, group and hierarchical constraints, and the possibility that all models are misspecified. We also obtain ALA rates for Gaussian regression under non-local priors, an important example where the LA can be costly and does not consistently estimate the integrated likelihood. Our examples include non-linear regression, logistic, Poisson and survival models. We implement the methodology in the R package mombf.
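
To make the least-squares reduction concrete, here is a minimal sketch of the ALA idea for logistic regression with a N(0, gI) prior: the log joint is expanded to second order at beta = 0 rather than at its mode, so the expansion's maximiser is a single weighted least-squares step and no iterative optimisation is needed. Function and argument names are illustrative, not the mombf API.

  ala_logistic <- function(y, X, g = 1) {
    ## h(beta) = loglik(beta) + logprior(beta), prior beta ~ N(0, g*I);
    ## ALA expands h quadratically at beta0 = 0 instead of at its mode
    n <- nrow(X); p <- ncol(X)
    mu0 <- rep(0.5, n)                         # plogis(X %*% 0)
    grad <- crossprod(X, y - mu0)              # gradient of h at 0 (prior term is 0)
    H <- 0.25 * crossprod(X) + diag(1 / g, p)  # negative Hessian of h at 0
    h0 <- sum(dbinom(y, 1, mu0, log = TRUE)) - (p / 2) * log(2 * pi * g)
    delta <- solve(H, grad)                    # the single least-squares step
    ## standard Laplace formula, applied to the expansion at 0
    as.numeric(h0 + 0.5 * sum(grad * delta) + (p / 2) * log(2 * pi) -
                 0.5 * determinant(H, logarithm = TRUE)$modulus)
  }

  set.seed(1)
  n <- 100; X <- cbind(1, rnorm(n))
  y <- rbinom(n, 1, plogis(X %*% c(0, 1)))
  ala_logistic(y, X)   # ALA log integrated likelihood for this model
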
Citations
Posted Content
TL;DR: It is shown that asymptotically BMS keeps any covariate with predictive power for either the outcome or the censoring times and discards all others; the authors argue for simple models that are computationally practical yet attain good power to detect potentially complex effects, despite misspecification.
Abstract: We discuss the role of misspecification and censoring in Bayesian model selection in the contexts of right-censored survival and concave log-likelihood regression. Misspecification includes wrongly assuming the censoring mechanism to be non-informative. Emphasis is placed on additive accelerated failure time, Cox proportional hazards and probit models. We offer a theoretical treatment that includes local and non-local priors, and a general non-linear effect decomposition to improve power-sparsity trade-offs. We discuss a fundamental question: what solution can one hope to obtain when (inevitably) models are misspecified, and how should one interpret it? Asymptotically, covariates that have predictive power for neither the outcome nor (for survival data) the censoring times, in the sense of reducing a likelihood-associated loss, are discarded. Misspecification and censoring have an asymptotically negligible effect on false positives, but their impact on power is exponential. We show that it can be advantageous to consider simple models that are computationally practical yet attain good power to detect potentially complex effects, including the use of finite-dimensional bases to detect truly non-parametric effects. We also discuss algorithms to capitalize on sufficient statistics and fast likelihood approximations for Gaussian-based survival and binary models.

20 citations


Cites methods from "Approximate Laplace approximations ..."

  • ...See also [37] for an approach based on approximate Laplace approximations that bypasses the optimization exercise altogether....

    [...]
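
One concrete reading of the finite-dimensional-basis idea in this entry: expand a covariate into a small spline basis and let the whole block of columns enter or leave the model together, so a smooth non-parametric effect can be detected without a fully non-parametric fit. A minimal sketch with a Cox model and a natural-spline basis; the simulated data, basis size and test are illustrative assumptions, not the paper's exact decomposition.

  library(survival)
  library(splines)
  set.seed(1)
  n <- 200
  x <- runif(n)
  t_true <- rexp(n, rate = exp(sin(2 * pi * x)))  # truly non-linear effect
  cens <- rexp(n, rate = 0.2)                     # independent censoring times
  y <- pmin(t_true, cens)
  status <- as.numeric(t_true <= cens)
  ## ns(x, df = 4) is a finite-dimensional basis: its 4 columns enter or
  ## leave the model together, as one block, in the model-selection sense
  fit <- coxph(Surv(y, status) ~ ns(x, df = 4))
  anova(fit)   # likelihood-ratio test for the whole basis block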

Journal ArticleDOI
TL;DR: A particular finding is that one may use less sparse formulations than would be asymptotically optimal, but still attain consistency and often also significantly better finite-sample performance.
Abstract: We study frequentist properties of Bayesian and L0 model selection, with a focus on (potentially non-linear) high-dimensional regression. We propose a construction to study how posterior probabilities and normalized L0 criteria concentrate on the (Kullback-Leibler) optimal model and other subsets of the model space. When such concentration occurs, one also bounds the frequentist probabilities of selecting the correct model, type I and type II errors. These results hold generally, and help validate the use of posterior probabilities and L0 criteria to control frequentist error probabilities associated with model selection and hypothesis tests. Regarding regression, we help understand the effect of the sparsity imposed by the prior or the L0 penalty, and of problem characteristics such as the sample size, signal-to-noise, dimension and true sparsity. A particular finding is that one may use less sparse formulations than would be asymptotically optimal, but still attain consistency and often also significantly better finite-sample performance. We also prove new results related to misspecifying the mean or covariance structures, and give tighter rates for certain non-local priors than currently available.

5 citations


Cites background from "Approximate Laplace approximations ..."

  • ...Further extensions are possible, e.g. Propositions S1–S2 in Rossell et al. (2020) deploy Lemma 1 to the case where f∗ has sub-Gaussian errors, e.g. when y is a binary outcome....

    [...]

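A minimal sketch of the construction this paper studies: turn an L0-type criterion into normalized weights over the model space, here BIC plus a complexity penalty 2c|γ|log(p) mimicking a sparsity prior p(γ) ∝ p^(-c|γ|), and check that the weight concentrates on the data-generating model. The constant c and the simulated data are assumptions for illustration.

  set.seed(1)
  n <- 500; p <- 3; c_pen <- 1
  X <- matrix(rnorm(n * p), n, p)
  y <- 0.5 * X[, 1] + rnorm(n)                      # data-generating model: {1}
  models <- expand.grid(rep(list(c(FALSE, TRUE)), p))
  crit <- apply(models, 1, function(g) {
    fit <- if (any(g)) lm(y ~ X[, g, drop = FALSE]) else lm(y ~ 1)
    BIC(fit) + 2 * c_pen * sum(g) * log(p)          # L0 criterion + sparsity penalty
  })
  w <- exp(-(crit - min(crit)) / 2)
  w <- w / sum(w)                                   # normalized model weights
  cbind(models, weight = round(w, 3))               # mass concentrates on {1}
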

Journal ArticleDOI
14 Jan 2022
TL;DR: In this article, Bayesian Specification Curve Analysis (BSCA) uses Bayesian Model Averaging to incorporate covariates and heterogeneous effects across treatments, outcomes and subpopulations.
Abstract: A key issue in science is assessing robustness to data analysis choices, while avoiding selective reporting and providing valid inference. Specification Curve Analysis is a tool intended to prevent selective reporting. Alas, when used for inference it can create severe biases and false positives, due to wrongly adjusting for covariates, and mask important treatment effect heterogeneity. As our motivating application, it led an influential study to conclude there is no relevant association between technology use and teenager mental well‐being. We discuss these issues and propose a strategy for valid inference. Bayesian Specification Curve Analysis (BSCA) uses Bayesian Model Averaging to incorporate covariates and heterogeneous effects across treatments, outcomes and subpopulations. BSCA gives significantly different insights into teenager well‐being, revealing that the association with technology differs by device, gender and who assesses well‐being (teenagers or their parents).

2 citations
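
To illustrate the BMA mechanics that BSCA builds on, here is a sketch that enumerates all covariate-adjustment specifications, weights each by an approximate posterior model probability (BIC-based here, a simplifying assumption; BSCA itself is Bayesian throughout), and reports the model-averaged treatment coefficient. All names are illustrative.

  bma_treatment <- function(y, treat, controls) {
    p <- ncol(controls)
    specs <- expand.grid(rep(list(c(FALSE, TRUE)), p))  # all adjustment sets
    out <- apply(specs, 1, function(s) {
      X <- cbind(treat = treat, as.matrix(controls)[, s, drop = FALSE])
      fit <- lm(y ~ X)
      c(bic = BIC(fit), effect = coef(fit)[["Xtreat"]])
    })
    w <- exp(-(out["bic", ] - min(out["bic", ])) / 2)
    w <- w / sum(w)                   # approximate posterior model weights
    sum(w * out["effect", ])          # model-averaged treatment effect
  }

  set.seed(1); n <- 200
  controls <- data.frame(z1 = rnorm(n), z2 = rnorm(n))
  treat <- controls$z1 + rnorm(n)     # treatment confounded by z1
  y <- 0.5 * treat + controls$z1 + rnorm(n)
  bma_treatment(y, treat, controls)   # near the true effect 0.5
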

25 Jan 2023

1 citation

31 Jan 2023
TL;DR: In this article, the stability of posterior predictive inferences to the specification of the likelihood model and perturbations of the data generating process is studied, showing that traditional Bayesian updating provides stability across a very strict class of likelihood models and DGPs, while a generalised Bayesian alternative using the β-divergence loss function is stable across practical and interpretable neighbourhoods.
Abstract: We study the stability of posterior predictive inferences to the specification of the likelihood model and perturbations of the data generating process. In modern big data analyses, the decision-maker may elicit useful broad structural judgements but a level of interpolation is required to arrive at a likelihood model. One model, often a computationally convenient canonical form, is chosen, when many alternatives would have been equally consistent with the elicited judgements. Equally, observational datasets often contain unforeseen heterogeneities and recording errors. Acknowledging such imprecisions, a faithful Bayesian analysis should be stable across reasonable equivalence classes for these inputs. We show that traditional Bayesian updating provides stability across a very strict class of likelihood models and DGPs, while a generalised Bayesian alternative using the β-divergence loss function is shown to be stable across practical and interpretable neighbourhoods. We illustrate this in linear regression, binary classification, and mixture modelling examples, showing that stable updating does not compromise the ability to learn about the DGP. These stability results provide a compelling justification for using generalised Bayes to facilitate inference under simplified canonical models.

1 citation
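
A minimal sketch of the generalised update described above, for a N(theta, 1) location model evaluated on a grid: each observation contributes the β-divergence loss instead of the negative log-likelihood, so gross outliers have bounded influence. The choice beta = 0.5, the N(0, 10) prior and the grid are assumptions for illustration.

  gen_posterior <- function(x, grid, beta = 0.5, sigma = 1) {
    ## closed form of  integral f(z | theta)^(1 + beta) dz  for a Gaussian
    int_f <- (2 * pi * sigma^2)^(-beta / 2) / sqrt(1 + beta)
    logw <- sapply(grid, function(th) {
      f <- dnorm(x, th, sigma)
      ## beta-divergence loss: -(1/beta) f^beta + (1/(1+beta)) int f^(1+beta)
      -sum(-f^beta / beta + int_f / (1 + beta))
    })
    w <- exp(logw - max(logw)) * dnorm(grid, 0, sqrt(10))  # times the prior
    w / sum(w)
  }

  set.seed(1)
  x <- c(rnorm(50, mean = 2), 20)     # one gross outlier
  grid <- seq(0, 6, by = 0.01)
  post <- gen_posterior(x, grid)
  grid[which.max(post)]               # stays near 2 despite the outlier
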

References
Journal ArticleDOI
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.

38,681 citations


"Approximate Laplace approximations ..." refers methods in this paper

  • ...The strategy builds upon the unit information prior, a popular default leading to the Bayesian information criterion (Schwarz, 1978), the difference being that we account for the presence of groups in Zγ....

    [...]

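The leading terms referred to in this entry give the familiar criterion BIC = -2·loglik + (number of parameters)·log(n), which is prior-free to this order; the unit-information-prior connection quoted above is why it doubles as a default Bayesian criterion. A quick sketch, checking a hand-rolled version against R's built-in:

  bic <- function(m) {
    ll <- logLik(m)
    -2 * as.numeric(ll) + attr(ll, "df") * log(nobs(m))
  }
  set.seed(1)
  n <- 100; x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
  m1 <- lm(y ~ 1); m2 <- lm(y ~ x)
  c(bic(m1), bic(m2))          # smaller is better: m2 wins here
  all.equal(bic(m2), BIC(m2))  # matches R's built-in
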

Journal ArticleDOI
TL;DR: In this article, penalized likelihood approaches are proposed to handle variable selection problems, and it is shown that the newly proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.
Abstract: Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of ...

8,314 citations


"Approximate Laplace approximations ..." refers methods in this paper

  • ...$E_{F_0}\big(\tilde{p}_L(S\mid y)\big) \le \frac{(|\tilde{\gamma}^*|+1)(J-|\tilde{\gamma}^*|)\big[\log\big((ng_L)^{q/2}\big)+(c+1)\log(p)+\epsilon\big]}{p^{a(c+1)-1}(ng_L)^{aq/2}}$, $E_{F_0}\big(\tilde{p}_L(S^c\mid y)\big) \le \frac{(|\tilde{\gamma}^*|+1)\,e^{(|\tilde{\gamma}^*|+2)\log J}}{\big[e\,p^{c}(ng)^{q/2}\big]^{b}} + e^{|\tilde{\gamma}^*|\log J}e^{-\epsilon} + e^{|\tilde{\gamma}^*|\log J}e^{(|\tilde{\gamma}^*|+1)[-c\log(p)-(q/2)\log(ng)]}$. gZellner calculations, the group LASSO (Bakin, 1999), group SCAD (Fan & Li, 2001) and group MCP (Zhang, 2010)....

    [...]

  • ...We also compare the gMOM ALA model selection performance in a non-linear regression example to exact gZellner calculations, the group LASSO (Bakin, 1999), group SCAD (Fan and Li, 2001)...

    [...]


  • ...Packages grplasso and grpreg (Breheny & Huang, 2015) were used to implement group LASSO, group SCAD and group MCP....

    [...]
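
For reference, the SCAD penalty cited in these excerpts is the quadratic spline of Fan & Li (2001): LASSO-like near zero and flat beyond a·lambda, so large coefficients are left essentially unshrunk; a = 3.7 is their suggested default. A direct transcription of the penalty function:

  scad <- function(t, lambda, a = 3.7) {
    t <- abs(t)
    ifelse(t <= lambda,
           lambda * t,                                     # LASSO-like near 0
           ifelse(t <= a * lambda,
                  (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                  lambda^2 * (a + 1) / 2))                 # flat: no shrinkage far out
  }
  scad(0:5, lambda = 1)   # rises, then levels off at (a + 1) / 2 = 2.35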

Journal ArticleDOI
TL;DR: In this article, an easily interpretable index of predictive discrimination and methods for assessing the calibration of predicted survival probabilities are discussed; both are particularly needed for binary, ordinal, and time-to-event outcomes.
Abstract: Multivariable regression models are powerful tools that are used frequently in studies of clinical outcomes. These models can use a mixture of categorical and continuous variables and can handle partially observed (censored) responses. However, uncritical application of modelling techniques can result in models that poorly fit the dataset at hand, or, even more likely, inaccurately predict outcomes on new subjects. One must know how to measure qualities of a model's fit in order to avoid poorly fitted or overfitted models. Measurement of predictive accuracy can be difficult for survival time data in the presence of censoring. We discuss an easily interpretable index of predictive discrimination as well as methods for assessing calibration of predicted survival probabilities. Both types of predictive accuracy should be unbiasedly validated using bootstrapping or cross-validation, before using predictions in a new data series. We discuss some of the hazards of poorly fitted and overfitted regression models and present one modelling strategy that avoids many of the problems discussed. The methods described are applicable to all regression models, but are particularly needed for binary, ordinal, and time-to-event outcomes. Methods are illustrated with a survival analysis in prostate cancer using Cox regression.

7,879 citations
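
A sketch of the validation workflow this entry recommends: Harrell's concordance index for a Cox model, with the over-optimism of the apparent estimate corrected by bootstrapping. The survival package's lung data and the chosen covariates are illustrative.

  library(survival)
  lung2 <- na.omit(lung[, c("time", "status", "age", "ph.ecog")])
  fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung2)
  c_apparent <- concordance(fit)$concordance   # apparent (optimistic) c-index

  set.seed(1)
  optimism <- replicate(200, {
    b <- lung2[sample(nrow(lung2), replace = TRUE), ]
    fb <- coxph(Surv(time, status) ~ age + ph.ecog, data = b)
    ## optimism = bootstrap-sample performance minus the bootstrap model's
    ## performance on the original data
    concordance(fb)$concordance - concordance(fb, newdata = lung2)$concordance
  })
  c_apparent - mean(optimism)                  # optimism-corrected c-index
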
