
Showing papers on "Model selection" published in 1999


Journal ArticleDOI
TL;DR: Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty and yields improved out-of-sample predictive performance.
Abstract: Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.
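
As a concrete illustration of the mechanism (a minimal sketch, not code from the paper), the snippet below approximates BMA posterior model probabilities over linear-regression variable subsets using the common BIC approximation; the data-generating process and model space are invented.

```python
# Sketch: BMA over variable subsets via BIC weights (illustrative data).
import numpy as np
import itertools

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)    # only predictor 0 matters

def bic(subset):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

models = [s for k in range(4) for s in itertools.combinations(range(3), k)]
b = np.array([bic(s) for s in models])
w = np.exp(-0.5 * (b - b.min()))
w /= w.sum()                                    # approx. posterior model probabilities
for s, wi in sorted(zip(models, w), key=lambda t: -t[1])[:4]:
    print(s, round(float(wi), 3))
```

A BMA prediction is then the probability-weighted average of the individual models' predictions, rather than the prediction of the single selected model.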

3,942 citations


Proceedings Article
29 Nov 1999
TL;DR: This paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models that approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner.
Abstract: This paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models. Our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. These posteriors fall out of a free-form optimization procedure, which naturally incorporates conjugate priors. Unlike in large sample approximations, the posteriors are generally non-Gaussian and no Hessian needs to be computed. Predictive quantities are obtained analytically. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. We demonstrate that this approach can be applied to a large class of models in several domains, including mixture models and source separation.
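
As a hands-on stand-in for this framework (an assumption on my part, not the paper's software), scikit-learn's variational Gaussian mixture exposes the fitted variational objective as lower_bound_, which can be compared across structure choices such as the number of components:

```python
# Sketch: variational-Bayes mixture fitting; compare variational objectives
# across component counts (scikit-learn used as an illustrative tool).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (150, 2)), rng.normal(3, 1, (150, 2))])

for k in (1, 2, 3, 5):
    vb = BayesianGaussianMixture(n_components=k, max_iter=500, random_state=0).fit(X)
    # lower_bound_ is the fitted variational objective; higher is better
    print(k, round(vb.lower_bound_, 3), np.round(vb.weights_, 2))
```

Note how the variational posterior tends to switch off superfluous components (near-zero weights), which is the automatic Occam effect the abstract alludes to.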

870 citations


Journal ArticleDOI
TL;DR: It is shown that the quadratic risk of the minimum penalized empirical contrast estimator is bounded by an index of the accuracy of the sieve, which quantifies the trade-off among the candidate models between the approximation error and parameter dimension relative to sample size.
Abstract: Performance bounds for criteria for model selection are developed using recent theory for sieves. The model selection criteria are based on an empirical loss or contrast function with an added penalty term motivated by empirical process theory and roughly proportional to the number of parameters needed to describe the model divided by the number of observations. Most of our examples involve density or regression estimation settings and we focus on the problem of estimating the unknown density or regression function. We show that the quadratic risk of the minimum penalized empirical contrast estimator is bounded by an index of the accuracy of the sieve. This accuracy index quantifies the trade-off among the candidate models between the approximation error and parameter dimension relative to sample size. If we choose a list of models which exhibit good approximation properties with respect to different classes of smoothness, the estimator can be simultaneously minimax rate optimal in each of those classes. This is what is usually called adaptation. The type of classes of smoothness in which one gets adaptation depends heavily on the list of models. If too many models are involved in order to get accurate approximation of many wide classes of functions simultaneously, it may happen that the estimator is only approx...
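
A minimal sketch of minimum penalized empirical contrast selection, assuming a squared-error contrast and a penalty proportional to (model dimension / sample size); the penalty constant and noise plug-in below are illustrative choices, not the paper's.

```python
# Sketch: choose a polynomial degree by empirical contrast + dim/n penalty.
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-1, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

degrees = range(1, 11)
# plug-in noise estimate from the largest candidate model
sigma2 = np.mean((y - np.polyval(np.polyfit(x, y, max(degrees)), x)) ** 2)

def penalized_contrast(d):
    coef = np.polyfit(x, y, d)
    contrast = np.mean((y - np.polyval(coef, x)) ** 2)   # empirical contrast
    return contrast + 2.0 * sigma2 * (d + 1) / n         # penalty ~ dim / n

print("selected degree:", min(degrees, key=penalized_contrast))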

801 citations


Journal ArticleDOI
TL;DR: In this article, the authors used model selection criteria in order to verify recent evidence of predictability in excess stock returns and to determine which variables are valuable predictors, and they found that even the best prediction models have no out-of-sample forecasting power.
Abstract: Statistical model selection criteria provide an informed choice of the model with best external (i.e., out-of-sample) validity. Therefore they guard against overfitting ('data snooping'). We implement several model selection criteria in order to verify recent evidence of predictability in excess stock returns and to determine which variables are valuable predictors. We confirm the presence of in-sample predictability in an international stock market dataset, but discover that even the best prediction models have no out-of-sample forecasting power. The failure to detect out-of-sample predictability is not due to lack of power.
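
The in-sample versus out-of-sample distinction the paper stresses can be made concrete with a rolling one-step forecast check against the historical-mean benchmark; the sketch below uses simulated noise, not actual stock returns.

```python
# Sketch: rolling out-of-sample evaluation of a predictive regression.
import numpy as np

rng = np.random.default_rng(3)
T = 400
z = rng.normal(size=T)                  # lagged predictor (pure noise here)
r = 0.01 + 0.05 * rng.normal(size=T)    # "excess returns", unpredictable by design

se_model, se_mean = [], []
for t in range(200, T - 1):
    coef = np.polyfit(z[:t], r[1:t + 1], 1)    # fit r_{s+1} on z_s for s < t
    se_model.append((r[t + 1] - np.polyval(coef, z[t])) ** 2)
    se_mean.append((r[t + 1] - r[1:t + 1].mean()) ** 2)

print("model out-of-sample MSE:", round(float(np.mean(se_model)), 6))
print("mean benchmark MSE:     ", round(float(np.mean(se_mean)), 6))
```

If the model's out-of-sample MSE does not beat the unconditional mean, apparent in-sample predictability has no forecasting value, which is the paper's finding.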

578 citations


Proceedings Article
29 Nov 1999
TL;DR: New functionals for parameter (model) selection of Support Vector Machines are introduced based on the concepts of the span of support vectors and rescaling of the feature space and it is shown that using these functionals one can both predict the best choice of parameters of the model and the relative quality of performance for any value of parameter.
Abstract: New functionals for parameter (model) selection of Support Vector Machines are introduced based on the concepts of the span of support vectors and rescaling of the feature space. It is shown that using these functionals, one can both predict the best choice of parameters of the model and the relative quality of performance for any value of parameter.

392 citations


Journal ArticleDOI
TL;DR: In this paper, the authors studied the structural properties of stationary variable length Markov chains (VLMCs) on a finite space and proposed a new bootstrap scheme based on fitted VLMCs.
Abstract: We study estimation in the class of stationary variable length Markov chains (VLMC) on a finite space. The processes in this class are still Markovian of high order, but with memory of variable length yielding a much bigger and structurally richer class of models than ordinary high-order Markov chains. From an algorithmic view, the VLMC model class has attracted interest in information theory and machine learning, but statistical properties have not yet been explored. Provided that good estimation is available, the additional structural richness of the model class enhances predictive power by finding a better trade-off between model bias and variance and allowing better structural description which can be of specific interest. The latter is exemplified with some DNA data. A version of the tree-structured context algorithm, proposed by Rissanen in an information-theoretic set-up, is shown to have new, good asymptotic properties for estimation in the class of VLMCs. This remains true even when the underlying model increases in dimensionality. Furthermore, consistent estimation of minimal state spaces and mixing properties of fitted models are given. We also propose a new bootstrap scheme based on fitted VLMCs. We show its validity for quite general stationary categorical time series and for a broad range of statistical procedures.

369 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider a generalized method of moments (GMM) estimation problem in which one has a vector of moment conditions, some of which are correct and some incorrect.
Abstract: This paper considers a generalized method of moments (GMM) estimation problem in which one has a vector of moment conditions, some of which are correct and some incorrect. The paper introduces several procedures for consistently selecting the correct moment conditions. The procedures also can consistently determine whether there is a sufficient number of correct moment conditions to identify the unknown parameters of interest. The paper specifies moment selection criteria that are GMM analogues of the widely used BIC and AIC model selection criteria. (The latter is not consistent.) The paper also considers downward and upward testing procedures. All of the moment selection procedures discussed in this paper are based on the minimized values of the GMM criterion function for different vectors of moment conditions. The procedures are applicable in time-series and cross-sectional contexts. Application of the results of the paper to instrumental variables estimation problems yields consistent procedures for selecting instrumental variables.
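
To make the idea tangible, here is a hedged sketch of an MMSC-BIC-style moment selection criterion for linear IV: for each instrument subset, compute the two-step GMM J statistic and subtract a BIC-type bonus proportional to the number of overidentifying restrictions; the data-generating process (one invalid instrument) is invented.

```python
# Sketch: BIC-analogue moment selection for one endogenous regressor.
import numpy as np
import itertools

rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
x = z[:, 0] + z[:, 1] + 0.5 * u + rng.normal(size=n)  # endogenous regressor
z[:, 2] += u                                          # instrument 2 is invalid
y = 2.0 * x + u

def msc_bic(cols):
    Z = z[:, list(cols)]
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    b1 = (x @ PZ @ y) / (x @ PZ @ x)            # first step: 2SLS
    uh = y - x * b1
    S = (Z * uh[:, None] ** 2).T @ Z / n        # efficient weight matrix
    Si = np.linalg.inv(S)
    zx, zy = Z.T @ x / n, Z.T @ y / n
    b2 = (zx @ Si @ zy) / (zx @ Si @ zx)        # second step
    g = Z.T @ (y - x * b2) / n                  # average moments
    J = n * g @ Si @ g
    return J - (len(cols) - 1) * np.log(n)      # penalize by (#moments - #params) ln n

for k in (1, 2, 3):
    for cols in itertools.combinations(range(3), k):
        print(cols, round(float(msc_bic(cols)), 2))
```

The subset minimizing the criterion should be the valid, just-large-enough instrument set {0, 1}: the invalid instrument inflates J faster than the bonus for an extra moment condition.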

345 citations


Journal ArticleDOI
TL;DR: This article examines how model selection in neural networks can be guided by statistical procedures such as hypothesis tests, information criteria and cross validation, and proposes five specification strategies based on different statistical procedures.

340 citations


Journal ArticleDOI
TL;DR: The Bayesian information criterion (BIC) has become a popular criterion for model selection in recent years; it is intended to provide a measure of the weight of evidence favoring one model over another.
Abstract: The Bayesian information criterion (BIC) has become a popular criterion for model selection in recent years. The BIC is intended to provide a measure of the weight of evidence favoring one model ov...
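
A small worked example of the BIC as a weight-of-evidence measure, using the standard identity that exp(ΔBIC/2) approximates the Bayes factor; the models and data below are invented for illustration.

```python
# Sketch: BIC comparison of N(0,1) vs N(mu,1) with mu estimated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0.3, 1.0, size=100)

ll0 = stats.norm.logpdf(x, 0.0, 1.0).sum()       # model 0: no free parameters
ll1 = stats.norm.logpdf(x, x.mean(), 1.0).sum()  # model 1: one free parameter
bic0 = -2 * ll0
bic1 = -2 * ll1 + 1 * np.log(len(x))
print("Delta BIC:", round(bic0 - bic1, 2))
print("approx. Bayes factor (M1 vs M0):", round(float(np.exp((bic0 - bic1) / 2)), 2))
```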

313 citations


Journal ArticleDOI
TL;DR: In this paper, a memory-based technique for local modeling and control of unknown non-linear dynamical systems is proposed, which uses a query-based approach to select the best model configuration by assessing and comparing different alternatives.
Abstract: This paper presents local methods for modelling and control of discrete-time unknown non-linear dynamical systems, when only input-output data are available. We propose the adoption of lazy learning, a memory-based technique for local modelling. The modelling procedure uses a query-based approach to select the best model configuration by assessing and comparing different alternatives. A new recursive technique for local model identification and validation is presented, together with an enhanced statistical method for model selection. Also, three methods to design controllers based on the local linearization provided by the lazy learning algorithm are described. In the first method the lazy technique returns the forward and inverse models of the system which are used to compute the control action to take. The second is an indirect method inspired by self-tuning regulators where recursive least squares estimation is replaced by a local approximator. The third method combines the linearization provided by t...
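
A sketch of the lazy-learning idea (not the authors' algorithm): defer all fitting to query time and answer each query with a local linear model estimated on the nearest stored input-output pairs. The one-dimensional setup below is invented.

```python
# Sketch: query-based local linear prediction from memorized data.
import numpy as np

def lazy_predict(Xmem, ymem, xq, k=20):
    """Fit a local linear model on the k nearest neighbours of query xq."""
    idx = np.argsort(np.abs(Xmem - xq))[:k]          # k nearest (1-D distance)
    A = np.column_stack([np.ones(k), Xmem[idx]])
    coef, *_ = np.linalg.lstsq(A, ymem[idx], rcond=None)
    return coef[0] + coef[1] * xq

rng = np.random.default_rng(6)
Xmem = rng.uniform(-3, 3, 500)
ymem = np.sin(Xmem) + 0.1 * rng.normal(size=500)
print(lazy_predict(Xmem, ymem, 1.0))                 # close to sin(1.0) ~ 0.841
```

In the paper's spirit, the neighbourhood size k itself is a model configuration selected per query by comparing validated alternatives.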

248 citations


Journal ArticleDOI
TL;DR: This paper addresses the problem of testing hypotheses using the likelihood ratio test statistic in nonidentifiable models, with application to model selection in situations where the parametrization for the larger model leads to nonidentifiability in the smaller model.
Abstract: In this paper, we address the problem of testing hypotheses using the likelihood ratio test statistic in nonidentifiable models, with application to model selection in situations where the parametrization for the larger model leads to nonidentifiability in the smaller model. We give two major applications: the case where the number of populations has to be tested in a mixture and the case of stationary ARMA$(p, q)$ processes where the order $(p, q)$ has to be tested. We give the asymptotic distribution for the likelihood ratio test statistic when testing the order of the model. In the case of order selection for ARMAs, the asymptotic distribution is invariant with respect to the parameters generating the process. A locally conic parametrization is a key tool in deriving the limiting distributions; it allows one to discover the deep similarity between the two problems.


Journal ArticleDOI
TL;DR: Some aspects of signal detection theory relevant to FNI and, in addition, some common approaches to statistical inference used in FNI are discussed; also covered are low-pass filtering in relation to functional-anatomical variability and some effects of filtering on signal detection of interest to FNI.
Abstract: The field of functional neuroimaging (FNI) methodology has developed into a mature but evolving area of knowledge and its applications have been extensive. A general problem in the analysis of FNI data is finding a signal embedded in noise. This is sometimes called signal detection. Signal detection theory focuses in general on issues relating to the optimization of conditions for separating the signal from noise. When methods from probability theory and mathematical statistics are directly applied in this procedure it is also called statistical inference. In this paper we briefly discuss some aspects of signal detection theory relevant to FNI and, in addition, some common approaches to statistical inference used in FNI. Low-pass filtering in relation to functional-anatomical variability and some effects of filtering on signal detection of interest to FNI are discussed. Also, some general aspects of hypothesis testing and statistical inference are discussed. This includes the need for characterizing the signal in data when the null hypothesis is rejected, the problem of multiple comparisons that is central to FNI data analysis, omnibus tests and some issues related to statistical power in the context of FNI. In turn, random field, scale space, non-parametric and Monte Carlo approaches are reviewed, representing the most common approaches to statistical inference used in FNI. Complementary to these issues an overview and discussion of non-inferential descriptive methods, common statistical models and the problem of model selection is given in a companion paper. In general, model selection is an important prelude to subsequent statistical inference. The emphasis in both papers is on the assumptions and inherent limitations of the methods presented. Most of the methods described here generally serve their purposes well when the inherent assumptions and limitations are taken into account. Significant differences in results between different methods are most apparent in extreme parameter ranges, for example at low effective degrees of freedom or at small spatial autocorrelation. In such situations or in situations when assumptions and approximations are seriously violated it is of central importance to choose the most suitable method in order to obtain valid results.

Journal ArticleDOI
TL;DR: A model selection criterion is proposed which serves as an asymptotically unbiased estimator of a variant of the symmetric divergence between the true model and a fitted approximating model.

Journal ArticleDOI
TL;DR: Empirical comparisons between model selection using VC-bounds and classical methods are performed for various noise levels, sample size, target functions and types of approximating functions, demonstrating the advantages of VC-based complexity control with finite samples.
Abstract: It is well known that for a given sample size there exists a model of optimal complexity corresponding to the smallest prediction (generalization) error. Hence, any method for learning from finite samples needs to have some provisions for complexity control. Existing implementations of complexity control include penalization (or regularization), weight decay (in neural networks), and various greedy procedures (aka constructive, growing, or pruning methods). There are numerous proposals for determining optimal model complexity (aka model selection) based on various (asymptotic) analytic estimates of the prediction risk and on resampling approaches. Nonasymptotic bounds on the prediction risk based on Vapnik-Chervonenkis (VC)-theory have been proposed by Vapnik. This paper describes application of VC-bounds to regression problems with the usual squared loss. An empirical study is performed for settings where the VC-bounds can be rigorously applied, i.e., linear models and penalized linear models where the VC-dimension can be accurately estimated, and the empirical risk can be reliably minimized. Empirical comparisons between model selection using VC-bounds and classical methods are performed for various noise levels, sample size, target functions and types of approximating functions. Our results demonstrate the advantages of VC-based complexity control with finite samples.
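
A hedged sketch of VC-based complexity control for linear-in-parameters regression: penalize the empirical risk by a VC penalization factor and select the model with the smallest bound. The factor below is one practical form used in this literature; treat its exact constants as an assumption rather than the paper's prescription.

```python
# Sketch: model selection by a practical VC generalization bound.
import numpy as np

def vc_bound(emp_risk, h, n):
    """Penalize empirical risk; h ~ VC dimension (number of parameters here)."""
    p = h / n
    denom = 1 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
    return np.inf if denom <= 0 else emp_risk / denom

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(-1, 1, n)
y = x ** 2 + 0.2 * rng.normal(size=n)

for d in range(1, 10):
    coef = np.polyfit(x, y, d)
    emp = np.mean((y - np.polyval(coef, x)) ** 2)
    print(d, round(float(vc_bound(emp, d + 1, n)), 4))
```

The bound typically bottoms out near the true complexity (degree 2 here), while the raw empirical risk keeps decreasing with degree.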

Journal ArticleDOI
TL;DR: Different methods are suggested, such as a rationalized OAT screening test, a regression-based method, and two implementations of global quantitative sensitivity analysis measures, and an example is offered.
Abstract: Some recent articles are reviewed where sensitivity analysis (SA) is implemented via either an elementary “one factor at a time” (OAT) approach or via a derivative-based method. In these works, as customary, SA is used for mechanism identification and/or model selection. OAT and derivative based methods have important limitations: (1) Only a reduced portion of the space of the input factors is explored, (2) the possibility that factors might interact is discounted, (3) the methods do not allow self-verification. Given that all models involved are highly nonlinear and potentially nonadditive, the adopted methods might fail to provide the full effect of any given factor on the output. This could deceive the analyst, unless the analysis were really meant to focus on a narrow range around the nominal value, where linearity may be assumed. Different methods are suggested, such as a rationalized OAT screening test, a regression-based method, and two implementations of global quantitative sensitivity analysis measures. Computational cost, efficiency, and limitations of the proposed strategies are discussed, and an example is offered.
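
As an example of the global quantitative measures advocated over OAT, here is a Saltelli-style pick-and-freeze Monte Carlo estimator of first-order Sobol indices; it is a standard technique of the kind the paper recommends, applied to a toy nonadditive model invented for illustration.

```python
# Sketch: first-order Sobol indices for y = x0 + x1**2 + x0*x2, xi ~ U(-1,1).
import numpy as np

def model(X):
    return X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 2]

rng = np.random.default_rng(8)
N, d = 100_000, 3
A = rng.uniform(-1, 1, (N, d))
B = rng.uniform(-1, 1, (N, d))
fA, fB = model(A), model(B)
var = np.var(np.concatenate([fA, fB]))

for i in range(d):
    ABi = A.copy()
    ABi[:, i] = B[:, i]                      # replace factor i with its B sample
    Si = np.mean(fB * (model(ABi) - fA)) / var
    print(f"S_{i} ~ {Si:.3f}")
```

For this toy model the exact values are S_0 = 0.625, S_1 ≈ 0.167, S_2 = 0; the indices sum to less than one precisely because of the x0·x2 interaction, the kind of effect OAT designs cannot detect.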

Journal Article
TL;DR: An improved risk bound for ARM is obtained and it is demonstrated that when AIC and BIC are combined, the mixed estimator automatically behaves like the better one, and ARM also performs better than BMA techniques based on BIC approximation.
Abstract: Model combining (mixing) provides an alternative to model selection. An algorithm ARM was recently proposed by the author to combine different regression models/methods. In this work, an improved risk bound for ARM is obtained. In addition to some theoretical observations on the issue of selection versus combining, simulations are conducted in the context of linear regression to compare performance of ARM with the familiar model selection criteria AIC and BIC, and also with some Bayesian model averaging (BMA) methods. The simulation suggests the following. Selection can yield a smaller risk when the random error is weak relative to the signal. However, when the random noise level gets higher, ARM produces a better or even much better estimator. That is, mixing appropriately is advantageous when there is a certain degree of uncertainty in choosing the best model. In addition, it is demonstrated that when AIC and BIC are combined, the mixed estimator automatically behaves like the better one. A comparison with bagging (Breiman (1996)) suggests that ARM does better than simply stabilizing model selection estimators. In our simulation, ARM also performs better than BMA techniques based on BIC approximation.
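
A simplified sketch of ARM-style mixing, under the assumption that Gaussian predictive likelihood on a held-out half is an adequate weighting rule: fit candidate models on one half of the data, weight them by their predictive fit on the other half, and average predictions. The candidates and constants are invented.

```python
# Sketch: data-split model mixing with likelihood-based weights.
import numpy as np

rng = np.random.default_rng(9)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + 0.5 * x + 2.0 * rng.normal(size=n)     # high noise: mixing should help

half = n // 2
fits = [np.polyfit(x[:half], y[:half], d) for d in (1, 2, 5)]

logw = []
for coef in fits:
    resid = y[half:] - np.polyval(coef, x[half:])
    sigma2 = max(np.mean((y[:half] - np.polyval(coef, x[:half])) ** 2), 1e-12)
    logw.append(-0.5 * np.sum(resid ** 2) / sigma2 - 0.5 * len(resid) * np.log(sigma2))
w = np.exp(np.array(logw) - max(logw))
w /= w.sum()

xq = 0.3
pred = sum(wi * np.polyval(c, xq) for wi, c in zip(w, fits))
print("weights:", np.round(w, 3), " mixed prediction at 0.3:", round(float(pred), 3))
```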

Journal ArticleDOI
TL;DR: Some non-inferential descriptive methods and common statistical models used in FNI are discussed and issues relating to the complex problem of model selection are discussed.
Abstract: Functional neuroimaging (FNI) provides experimental access to the intact living brain making it possible to study higher cognitive functions in humans. In this review and in a companion paper in this issue, we discuss some common methods used to analyse FNI data. The emphasis in both papers is on assumptions and limitations of the methods reviewed. There are several methods available to analyse FNI data indicating that none is optimal for all purposes. In order to make optimal use of the methods available it is important to know the limits of applicability. For the interpretation of FNI results it is also important to take into account the assumptions, approximations and inherent limitations of the methods used. This paper gives a brief overview over some non-inferential descriptive methods and common statistical models used in FNI. Issues relating to the complex problem of model selection are discussed. In general, proper model selection is a necessary prerequisite for the validity of the subsequent statistical inference. The non-inferential section describes methods that, combined with inspection of parameter estimates and other simple measures, can aid in the process of model selection and verification of assumptions. The section on statistical models covers approaches to global normalization and some aspects of univariate, multivariate, and Bayesian models. Finally, approaches to functional connectivity and effective connectivity are discussed. In the companion paper we review issues related to signal detection and statistical inference.

Journal ArticleDOI
TL;DR: In this article, the authors extend the analysis of Phillips and Ploberger (1996) on the Posterior Information Criterion (PIC) to a partially nonstationary vector autoregressive process with reduced rank structure.

Journal ArticleDOI
TL;DR: Based on a review of criteria, it is recommended that use of criteria that are based on Kullback-Leibler information in the biolog...
Abstract: We provide background information to allow a heuristic understanding of two types of criteria used in selecting a model for making inferences from ringing data. The first type of criteria (e.g. AIC, AICc, QAICc and TIC) are estimates of (relative) Kullback-Leibler information or distance and attempt to select a good approximating model for inference, based on the principle of parsimony. The second type of criteria (e.g. BIC, MDL, HQ) are 'dimension consistent' in that they attempt to consistently estimate the dimension of the true model. These latter criteria assume that a true model exists, that it is in the set of candidate models and that the goal of model selection is to find the true model, which in turn requires that the sample size is very large. The Kullback-Leibler based criteria do not assume a true model exists, let alone that it is in the set of models being considered. Based on a review of these criteria, we recommend use of criteria that are based on Kullback-Leibler information in the biolog...
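
A short worked example of the Kullback-Leibler-based criteria discussed here: AICc (the small-sample correction of AIC) plus Akaike weights for ranking candidate models. The log-likelihoods and parameter counts below are made-up numbers.

```python
# Sketch: AICc and Akaike weights for three hypothetical candidate models.
import numpy as np

n = 40                                         # sample size
loglik = np.array([-120.3, -118.9, -118.5])    # fitted log-likelihoods
k = np.array([2, 3, 5])                        # parameters per model

aic = -2 * loglik + 2 * k
aicc = aic + 2 * k * (k + 1) / (n - k - 1)     # small-sample correction
delta = aicc - aicc.min()
w = np.exp(-delta / 2)
w /= w.sum()                                   # Akaike weights
for i in range(3):
    print(f"model {i}: AICc={aicc[i]:.1f}  weight={w[i]:.2f}")
```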

Journal ArticleDOI
TL;DR: In this article, the covariance inflation criterion adjusts the training error by the average covariance of the predictions and responses, when the prediction rule is applied to permuted versions of the data set.
Abstract: We propose a new criterion for model selection in prediction problems. The covariance inflation criterion adjusts the training error by the average covariance of the predictions and responses, when the prediction rule is applied to permuted versions of the data set. This criterion can be applied to general prediction problems (e.g. regression or classification) and to general prediction rules (e.g. stepwise regression, tree-based models and neural nets). As a by-product we obtain a measure of the effective number of parameters used by an adaptive procedure. We relate the covariance inflation criterion to other model selection procedures and illustrate its use in some regression and classification problems. We also revisit the conditional bootstrap approach to model selection.
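
A hedged sketch of the covariance inflation idea: estimate the optimism of an adaptive rule (here, best single-predictor selection) by refitting on permuted responses and averaging the covariance between the fitted values and the permuted responses. Details and constants are simplified relative to the paper.

```python
# Sketch: covariance-inflated training error for an adaptive selection rule.
import numpy as np

rng = np.random.default_rng(10)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)

def fit_best_single(X, y):
    """Adaptive rule: regress on the single predictor most correlated with y."""
    j = np.argmax(np.abs(X.T @ (y - y.mean())))
    b = np.polyfit(X[:, j], y, 1)
    return np.polyval(b, X[:, j])

train_err = np.mean((y - fit_best_single(X, y)) ** 2)

B = 200
cov = 0.0
for _ in range(B):
    yp = rng.permutation(y)                    # permuted responses
    yhat = fit_best_single(X, yp)
    cov += np.mean((yhat - yhat.mean()) * (yp - yp.mean()))
cov /= B

cic = train_err + 2 * cov                      # covariance-inflated error
print(round(float(train_err), 3), round(float(cic), 3))
```

The permutation covariance measures how much the adaptive rule can chase noise, which is exactly the optimism the raw training error hides.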

Proceedings Article
29 Nov 1999
TL;DR: A variational Bayesian method for model selection over families of kernel classifiers like Support Vector machines or Gaussian processes that needs no user interaction and is able to adapt a large number of kernel parameters to given data without having to sacrifice training cases for validation.
Abstract: We present a variational Bayesian method for model selection over families of kernel classifiers like Support Vector machines or Gaussian processes. The algorithm needs no user interaction and is able to adapt a large number of kernel parameters to given data without having to sacrifice training cases for validation. This opens the possibility to use sophisticated families of kernels in situations where the small "standard kernel" classes are clearly inappropriate. We relate the method to other work done on Gaussian processes and clarify the relation between Support Vector machines and certain Gaussian process models.

Journal ArticleDOI
TL;DR: To better justify the widespread applicability of SIC, the criterion is derived in a very general framework: one which does not assume any specific form for the likelihood function, but only requires that it satisfies certain non-restrictive regularity conditions.
Abstract: The Schwarz information criterion (SIC, BIC, SBC) is one of the most widely known and used tools in statistical model selection. The criterion was derived by Schwarz (1978) to serve as an asymptotic approximation to a transformation of the Bayesian posterior probability of a candidate model. Although the original derivation assumes that the observed data is independent, identically distributed, and arising from a probability distribution in the regular exponential family, SIC has traditionally been used in a much larger scope of model selection problems. To better justify the widespread applicability of SIC, we derive the criterion in a very general framework: one which does not assume any specific form for the likelihood function, but only requires that it satisfies certain non-restrictive regularity conditions.

Journal ArticleDOI
TL;DR: Results on applying the evidence framework to the real-world data sets showed that committees of Bayesian networks achieved classification accuracies similar to the best alternative methods with a minimum of human intervention.

Journal ArticleDOI
TL;DR: A general, consistent strategy for data analysis is outlined, based on information and likelihood theory, which indicates that model selection uncertainty can be quantified and should be incorporated into estimators of precision.
Abstract: A general, consistent strategy for data analysis is outlined, based on information and likelihood theory. A priori considerations lead to the definition of a set of candidate models; simple criteria are useful in ranking and calibrating the models based on estimates of (relative) Kullback-Leibler information; inference can be based on either the best model or a weighted average of several models. Model selection uncertainty can be quantified and should be incorporated into estimators of precision. Some comments are offered on statistical hypothesis testing and data dredging.

Journal ArticleDOI
TL;DR: Some Bayesian discretized semiparametric models, incorporating proportional and nonproportional hazards structures, along with associated statistical analyses and tools for model selection using sampling-based methods are presented.
Abstract: Summary. Interval-censored data occur in survival analysis when the survival time of each patient is only known to be within an interval and these censoring intervals differ from patient to patient. For such data, we present some Bayesian discretized semiparametric models, incorporating proportional and nonproportional hazards structures, along with associated statistical analyses and tools for model selection using sampling-based methods. The scope of these methodologies is illustrated through a reanalysis of a breast cancer data set (Finkelstein, 1986, Biometrics 42, 845–854) to test whether the effect of a covariate on survival changes over time.

Journal ArticleDOI
TL;DR: In this paper, the general utility of parsimony in structural equation model selection is discussed, with emphasis on the extent to which one may be willing to routinely use parsimony as the only principle to follow in structural model selection.
Abstract: This article is concerned with issues in structural equation model selection that pertain to the general utility of the well‐known principle of parsimony. An example is provided using data generated by a relatively nonparsimonious simplex model and fitted rather well by a parsimonious growth curve model that belongs to a different class of models. Implications for empirical research are subsequently discussed, with emphasis on the extent to which one may be willing to routinely use parsimony as the only principle to follow in structural model selection.

Journal ArticleDOI
TL;DR: In this paper, general methods for testing the fit of a parametric function are proposed, and several different selection criteria are considered, including one based on a modified version of the Akaike information criterion and others based on various score statistics.
Abstract: General methods for testing the fit of a parametric function are proposed. The idea underlying each method is to “accept” the prescribed parametric model if and only if it is chosen by a model selection criterion. Several different selection criteria are considered, including one based on a modified version of the Akaike information criterion and others based on various score statistics. The tests have a connection with nonparametric smoothing because they use orthogonal series estimators to detect departures from a parametric model. An important aspect of the tests is that they can be applied in a wide variety of settings, including generalized linear models, spectral analysis, the goodness-of-fit problem, and longitudinal data analysis. Implementation using standard statistical software is straightforward. Asymptotic distribution theory for several test statistics is described, and the tests are shown to be consistent against essentially any alternative hypothesis. Simulations and a data exampl...
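
A minimal sketch of the underlying idea (not the paper's exact test statistics): "accept" the parametric model iff a selection criterion never prefers adding orthogonal series terms. Here the null is a linear fit, the alternatives append cosine terms, and a BIC-type score stands in for the paper's criteria; the data are invented with deliberate lack of fit.

```python
# Sketch: goodness-of-fit test by model selection over series expansions.
import numpy as np

def bic(y, yhat, k):
    n = len(y)
    return n * np.log(np.mean((y - yhat) ** 2)) + k * np.log(n)

rng = np.random.default_rng(11)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = 1 + 2 * x + 0.3 * np.sin(4 * np.pi * x) + 0.2 * rng.normal(size=n)

def design(x, m):
    cols = [np.ones_like(x), x]                          # null: linear model
    cols += [np.cos(j * np.pi * x) for j in range(1, m + 1)]
    return np.column_stack(cols)

scores = []
for m in range(0, 8):                  # m = number of extra series terms
    A = design(x, m)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    scores.append(bic(y, A @ coef, A.shape[1]))
print("reject parametric model:", int(np.argmin(scores)) > 0)
```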

Journal ArticleDOI
TL;DR: In this article, a general-to-specific model selection framework for testing the data admissibility of the principal models in current use is presented, with the market model generally outperforming the capital asset pricing model.
Abstract: The choice of model of normal returns in event studies has been widely discussed in the literature. While researchers frequently continue to use an array of alternatives, there is currently some tendency to favour cruder but simpler mean- or market-adjusted returns models. This paper presents a general-to-specific model selection framework for testing the data admissibility of the principal models in current use. Results from a pilot study indicate a strong preliminary preference in favour of the regression-based models, with the market model generally outperforming the capital asset pricing model.

Journal ArticleDOI
TL;DR: A coherent view of the two recent models used for multiple sequence alignment—the hidden Markov model (HMM) and the block-based motif model—is provided to develop a set of new algorithms that have both the sensitivity and flexibility of the HMM.
Abstract: The alignment of multiple homologous biopolymer sequences is crucial in research on protein modeling and engineering, molecular evolution, and prediction in terms of both gene function and gene product structure. In this article we provide a coherent view of the two recent models used for multiple sequence alignment—the hidden Markov model (HMM) and the block-based motif model—to develop a set of new algorithms that have both the sensitivity of the block-based model and the flexibility of the HMM. In particular, we decompose the standard HMM into two components: the insertion component, which is captured by the so-called “propagation model,” and the deletion component, which is described by a deletion vector. Such a decomposition serves as a basis for rational compromise between biological specificity and model flexibility. Furthermore, we introduce a Bayesian model selection criterion that—in combination with the propagation model, genetic algorithm, and other computational aspects—forms the cor...