
Showing papers on "Model selection published in 2003"


Journal ArticleDOI
TL;DR: The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
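
The three objectives above are easiest to see on a toy example. Below is a minimal sketch (synthetic NumPy data and function names of my own, not anything from the special issue) of the simplest family of methods the issue covers: univariate feature ranking, here by absolute correlation with the target.

```python
import numpy as np

def rank_features(X, y):
    """Rank features by absolute Pearson correlation with the target.

    A minimal univariate filter: fast and interpretable, but blind to feature
    interactions, which is why multivariate selection and search-based wrappers
    are also covered in the special issue.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    )
    return np.argsort(-np.abs(corr)), corr

# Toy data: only the first two of 100 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
order, corr = rank_features(X, y)
print("top five features:", order[:5])
```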

14,509 citations


Journal ArticleDOI
TL;DR: The authors consider estimation of the break dates and the number of breaks in a linear model with multiple structural changes and propose an efficient algorithm, based on the principle of dynamic programming, for obtaining global minimizers of the sum of squared residuals.
Abstract: In a recent paper, Bai and Perron (1998) considered theoretical issues related to the limiting distribution of estimators and test statistics in the linear model with multiple structural changes. In this companion paper, we consider practical issues for the empirical applications of the procedures. We first address the problem of estimation of the break dates and present an efficient algorithm to obtain global minimizers of the sum of squared residuals. This algorithm is based on the principle of dynamic programming and requires at most least-squares operations of order O(T²) for any number of breaks. Our method can be applied to both pure and partial structural change models. Second, we consider the problem of forming confidence intervals for the break dates under various hypotheses about the structure of the data and the errors across segments. Third, we address the issue of testing for structural changes under very general conditions on the data and the errors. Fourth, we address the issue of estimating the number of breaks. Finally, a few empirical applications are presented to illustrate the usefulness of the procedures. All methods discussed are implemented in a GAUSS program. Copyright © 2002 John Wiley & Sons, Ltd.
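
The dynamic-programming step is concrete enough to sketch. The toy code below only illustrates the recursion under simplifying assumptions of my own (mean-shift segments rather than the paper's general regression setup, and a brute-force table of segment costs); it is not Bai and Perron's GAUSS implementation.

```python
import numpy as np

def segment_ssr(y, i, j):
    """SSR from fitting a constant mean to the segment y[i:j] (mean-shift model)."""
    seg = y[i:j]
    return float(np.sum((seg - seg.mean()) ** 2))

def optimal_breaks(y, m, h=5):
    """Break dates minimizing total SSR with m breaks and minimum segment length h."""
    T = len(y)
    ssr = {(i, j): segment_ssr(y, i, j)
           for i in range(T) for j in range(i + h, T + 1)}
    # dp[k, j] = minimal SSR of partitioning y[:j] into k+1 segments (k breaks)
    dp = np.full((m + 1, T + 1), np.inf)
    back = np.zeros((m + 1, T + 1), dtype=int)
    for j in range(h, T + 1):
        dp[0, j] = ssr[(0, j)]
    for k in range(1, m + 1):
        for j in range((k + 1) * h, T + 1):
            for s in range(k * h, j - h + 1):
                cand = dp[k - 1, s] + ssr[(s, j)]
                if cand < dp[k, j]:
                    dp[k, j], back[k, j] = cand, s
    breaks, j = [], T
    for k in range(m, 0, -1):          # trace the optimal partition backwards
        j = back[k, j]
        breaks.append(j)
    return sorted(breaks), float(dp[m, T])

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 80), rng.normal(3, 1, 70), rng.normal(-1, 1, 50)])
print(optimal_breaks(y, m=2))   # true break dates are 80 and 150
```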

4,026 citations



Journal ArticleDOI
TL;DR: The production of low rank smoothers for d ≥ 1 dimensional data, which can be fitted by regression or penalized regression methods, is discussed; these smoothers allow the use of approximate thin plate spline models with large data sets and provide a sensible way of modelling interaction terms in generalized additive models.
Abstract: We discuss the production of low rank smoothers for d ≥ 1 dimensional data, which can be fitted by regression or penalized regression methods. The smoothers are constructed by a simple transformation and truncation of the basis that arises from the solution of the thin plate spline smoothing problem and are optimal in the sense that the truncation is designed to result in the minimum possible perturbation of the thin plate spline smoothing problem given the dimension of the basis used to construct the smoother. By making use of Lanczos iteration the basis change and truncation are computationally efficient. The smoothers allow the use of approximate thin plate spline models with large data sets, avoid the problems that are associated with 'knot placement' that usually complicate modelling with regression splines or penalized regression splines, provide a sensible way of modelling interaction terms in generalized additive models, provide low rank approximations to generalized smoothing spline models, appropriate for use with large data sets, provide a means for incorporating smooth functions of more than one variable into non-linear models and improve the computational efficiency of penalized likelihood models incorporating thin plate splines. Given that the approach produces spline-like models with a sparse basis, it also provides a natural way of incorporating unpenalized spline-like terms in linear and generalized linear models, and these can be treated just like any other model terms from the point of view of model selection, inference and diagnostics.
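
As a rough illustration of what a low-rank penalized smoother looks like in practice, here is a sketch using a plain truncated-power cubic basis with a ridge penalty on the knot coefficients. This is a deliberately simpler stand-in of my own, not the paper's construction, which truncates the thin plate spline basis via an eigen-decomposition computed with Lanczos iteration.

```python
import numpy as np

def penalized_spline(x, y, n_knots=20, lam=1.0):
    """Low-rank penalized regression smoother: cubic truncated-power basis with a
    ridge penalty on the knot coefficients only (the cubic polynomial part is
    unpenalized, playing the role of the spline's null space)."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3] +
                        [np.maximum(x - k, 0.0) ** 3 for k in knots])
    D = np.diag([0.0] * 4 + [1.0] * n_knots)        # penalize only the knot terms
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return X @ beta

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=300)
fit = penalized_spline(x, y, lam=0.1)    # smooth curve evaluated at the data points
```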

1,948 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used intrinsic and extrinsic measures of model performance to determine whether optimal models can be identified based on objective intrinsic criteria, without resorting to an independent test data set.

1,138 citations


Proceedings Article
09 Dec 2003
TL;DR: A Bayesian approach is taken to generate an appropriate prior via a distribution on partitions that allows arbitrarily large branching factors and readily accommodates growing data collections.
Abstract: We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting—which of the large collection of possible trees to use? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts.
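
A sketch of the nested Chinese restaurant process prior on its own (path sampling only, without the hierarchical LDA likelihood or the posterior inference the paper builds on top of it; function and parameter names are mine):

```python
import numpy as np

def ncrp_paths(n_docs, depth, gamma, seed=0):
    """Sample one tree path per document from a nested Chinese restaurant process.

    At every node, an existing child is chosen with probability proportional to the
    number of earlier documents that passed through it, and a brand-new child with
    probability proportional to gamma, so the branching factor is unbounded."""
    rng = np.random.default_rng(seed)
    counts = {}                     # node (tuple of child ids) -> {child id: count}
    paths = []
    for _ in range(n_docs):
        node, path = (), []
        for _ in range(depth):
            children = counts.setdefault(node, {})
            labels = list(children)
            weights = np.array([children[c] for c in labels] + [gamma], dtype=float)
            pick = rng.choice(len(weights), p=weights / weights.sum())
            child = labels[pick] if pick < len(labels) else len(labels)   # new table
            children[child] = children.get(child, 0) + 1
            node += (child,)
            path.append(child)
        paths.append(tuple(path))
    return paths

print(ncrp_paths(n_docs=10, depth=3, gamma=1.0))
```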

1,055 citations


Journal ArticleDOI
TL;DR: In this paper, a general large-sample likelihood apparatus is presented, in which limiting distributions and risk properties of estimators post-selection as well as of model average estimators are precisely described, also explicitly taking modeling bias into account.
Abstract: The traditional use of model selection methods in practice is to proceed as if the final selected model had been chosen in advance, without acknowledging the additional uncertainty introduced by model selection. This often means underreporting of variability and too optimistic confidence intervals. We build a general large-sample likelihood apparatus in which limiting distributions and risk properties of estimators post-selection as well as of model average estimators are precisely described, also explicitly taking modeling bias into account. This allows a drastic reduction in complexity, as competing model averaging schemes may be developed, discussed, and compared inside a statistical prototype experiment where only a few crucial quantities matter. In particular, we offer a frequentist view on Bayesian model averaging methods and give a link to generalized ridge estimators. Our work also leads to new model selection criteria. The methods are illustrated with real data applications.

662 citations


01 Jan 2003
TL;DR: It is shown that in a rigorous sense, even in the setting that the true model is included in the candidates, the above mentioned main strengths of AIC and BIC cannot be shared.
Abstract: It is well known that AIC and BIC have different properties in model selection. BIC is consistent in the sense that if the true model is among the candidates, the probability of selecting the true model approaches 1. On the other hand, AIC is minimax-rate optimal for both parametric and nonparametric cases for estimating the regression function. There are several successful results on constructing new model selection criteria to share some strengths of AIC and BIC. However, we show that in a rigorous sense, even in the setting that the true model is included in the candidates, the above-mentioned main strengths of AIC and BIC cannot be shared. That is, for any model selection criterion to be consistent, it must behave sub-optimally compared to AIC in terms of mean average squared error.
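
For concreteness: with Gaussian errors and additive constants dropped, the two criteria are AIC = n·log(RSS/n) + 2k and BIC = n·log(RSS/n) + k·log(n) for a model with k parameters. A toy comparison on simulated polynomial data (my own example, not from the paper) looks like this:

```python
import numpy as np

def information_criteria(x, y, max_degree=6):
    """AIC and BIC (Gaussian likelihood, constants dropped) for polynomial fits."""
    n = len(y)
    rows = []
    for d in range(max_degree + 1):
        X = np.vander(x, d + 1, increasing=True)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        k = d + 1
        rows.append((d, n * np.log(rss / n) + 2 * k, n * np.log(rss / n) + np.log(n) * k))
    return rows

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = 1.0 + 0.5 * x - 0.3 * x ** 2 + rng.normal(scale=1.0, size=500)   # true degree: 2
for d, aic, bic in information_criteria(x, y):
    print(f"degree {d}: AIC {aic:8.1f}   BIC {bic:8.1f}")
# BIC's heavier log(n) penalty is what drives its consistency; AIC's lighter penalty
# is tied to its minimax-rate optimality for estimating the regression function.
```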

554 citations


Journal Article
TL;DR: This paper addresses a common methodological flaw in the comparison of variable selection methods: cross-validation performance estimates of the different variable subsets, when used to guide computationally intensive search algorithms, can overfit and yield biased comparisons between selection methods.
Abstract: This paper addresses a common methodological flaw in the comparison of variable selection methods. A practical approach to guide the search or the selection process is to compute cross-validation performance estimates of the different variable subsets. Used with computationally intensive search algorithms, these estimates may overfit and yield biased predictions. Therefore, they cannot be used reliably to compare two selection methods, as is shown by the empirical results of this paper. Instead, like in other instances of the model selection problem, independent test sets should be used for determining the final performance. The claims made in the literature about the superiority of more exhaustive search algorithms over simpler ones are also revisited, and some of them infirmed.
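
A minimal sketch of the flaw and its remedy, on synthetic pure-noise data (scikit-learn is used purely for convenience here and is not prescribed by the paper): selecting features on the full data before cross-validating yields optimistic scores, whereas refitting the selection inside every training fold, or keeping an independent test set, does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))      # pure noise features
y = rng.integers(0, 2, size=50)      # labels unrelated to X: true accuracy is ~0.5

# Flawed protocol: pick the 20 "best" features on ALL the data, then cross-validate.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5).mean()

# Sound protocol: the feature selection step is refit inside every training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection outside CV: {biased:.2f}   selection inside CV: {honest:.2f}")
```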

512 citations


Proceedings Article
01 Jan 2003
TL;DR: A method for the sparse greedy approximation of Bayesian Gaussian process regression, featuring a novel heuristic for very fast forward selection, which leads to a sufficiently stable approximation of the log marginal likelihood of the training data, which can be optimised to adjust a large number of hyperparameters automatically.
Abstract: We present a method for the sparse greedy approximation of Bayesian Gaussian process regression, featuring a novel heuristic for very fast forward selection. Our method is essentially as fast as an equivalent one which selects the "support" patterns at random, yet it can outperform random selection on hard curve fitting tasks. More importantly, it leads to a sufficiently stable approximation of the log marginal likelihood of the training data, which can be optimised to adjust a large number of hyperparameters automatically. We demonstrate the model selection capabilities of the algorithm in a range of experiments. In line with the development of our method, we present a simple view on sparse approximations for GP models and their underlying assumptions and show relations to other methods.
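
The hyperparameter adjustment mentioned above rests on the Gaussian process log marginal likelihood, log p(y | X) = -½ yᵀK⁻¹y - ½ log|K| - (n/2) log 2π. The sketch below computes it exactly for an RBF kernel and grid-searches a lengthscale; it uses the full O(n³) computation rather than the paper's sparse greedy approximation, and all names are my own.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale, variance=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def log_marginal_likelihood(x, y, lengthscale, noise=0.1):
    """Exact GP evidence: -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi),
    where K is the kernel matrix plus noise variance on the diagonal."""
    n = len(y)
    K = rbf_kernel(x, x, lengthscale) + noise ** 2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # K^-1 y via Cholesky
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 60))
y = np.sin(2 * x) + rng.normal(scale=0.1, size=60)
for ls in (0.1, 0.3, 1.0, 3.0):
    print(f"lengthscale {ls:4.1f}: log marginal likelihood {log_marginal_likelihood(x, y, ls):8.2f}")
```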

487 citations


Journal ArticleDOI
TL;DR: This work develops a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework.
Abstract: Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that that choice be justified. To date, justification has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel. (Bayesian model selection; decision theory; incorrect models; likelihood ratio test; maximum likelihood; nucleotide-substitution model; phylogeny.)

Journal ArticleDOI
TL;DR: The authors argue that a model selector should instead focus on the parameter singled out for interest; in particular, a model that gives good precision for one estimand may be worse when used for inference on another estimand.
Abstract: A variety of model selection criteria have been developed, of general and specific types. Most of these aim at selecting a single model with good overall properties, for example, formulated via average prediction quality or shortest estimated overall distance to the true model. The Akaike, the Bayesian, and the deviance information criteria, along with many suitable variations, are examples of such methods. These methods are not concerned, however, with the actual use of the selected model, which varies with context and application. The present article takes the view that the model selector should instead focus on the parameter singled out for interest; in particular, a model that gives good precision for one estimand may be worse when used for inference for another estimand. We develop a method that, for a given focus parameter, estimates the precision of any submodel-based estimator. The framework is that of large-sample likelihood inference. Using an unbiased estimate of limiting risk, we propose a foc...

Journal ArticleDOI
TL;DR: By applying the PAC-Bayesian theorem of McAllester (1999a), this paper proves distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques, giving a strong learning-theoretical justification for the use of these techniques.
Abstract: Approximate Bayesian Gaussian process (GP) classification techniques are powerful non-parametric learning methods, similar in appearance and performance to support vector machines. Based on simple probabilistic models, they render interpretable results and can be embedded in Bayesian frameworks for model selection, feature selection, etc. In this paper, by applying the PAC-Bayesian theorem of McAllester (1999a), we prove distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques. We also provide a new and much simplified proof for this powerful theorem, making use of the concept of convex duality which is a backbone of many machine learning techniques. We instantiate and test our bounds for two particular GPC techniques, including a recent sparse method which circumvents the unfavourable scaling of standard GP algorithms. As is shown in experiments on a real-world task, the bounds can be very tight for moderate training sample sizes. To the best of our knowledge, these results provide the tightest known distribution-free error bounds for approximate Bayesian GPC methods, giving a strong learning-theoretical justification for the use of these techniques.

Book ChapterDOI
01 Jan 2003
TL;DR: The notion of optimal rate of aggregation is defined in an abstract context and lower bounds valid for any method of aggregation are proved, thus establishing optimal rates of linear, convex and model selection type aggregation.
Abstract: We study the problem of aggregation of M arbitrary estimators of a regression function with respect to the mean squared risk. Three main types of aggregation are considered: model selection, convex and linear aggregation. We define the notion of optimal rate of aggregation in an abstract context and prove lower bounds valid for any method of aggregation. We then construct procedures that attain these bounds, thus establishing optimal rates of linear, convex and model selection type aggregation.

ReportDOI
TL;DR: The authors argue that policy evaluation can be conducted on the basis of two factors: a policymaker's preferences and the conditional distribution of the outcomes of interest given a policy and available information.
Abstract: It will be remembered that the seventy translators of the Septuagint were shut up in seventy separate rooms with the Hebrew text and brought out with them, when they emerged, seventy identical translations. Would the same miracle be vouchsafed if seventy multiple correlators were shut up with the same statistical material? And anyhow, I suppose, if each had a different economist perched on his a priori, that would make a difference to the outcome. (1) This paper describes some approaches to macroeconomic policy evaluation in the presence of uncertainty about the structure of the economic environment under study. The perspective we discuss is designed to facilitate policy evaluation for several forms of uncertainty. For example, our approach may be used when an analyst is unsure about the appropriate economic theory that should be assumed to apply, or about the particular functional forms that translate a general theory into a form amenable to statistical analysis. As such, the methods we describe are, we believe, particularly useful in a range of macroeconomic contexts where fundamental disagreements exist as to the determinants of the problem under study. In addition, this approach recognizes that even if economists agree on the underlying economic theory that describes a phenomenon, policy evaluation often requires taking a stance on details of the economic environment, such as lag lengths and functional form, that the theory does not specify. Hence our analysis is motivated by concerns similar to those that led to the development of model calibration methods. Unlike in the usual calibration approach, however, we do not reject formal statistical inference methods but rather incorporate model uncertainty into them. The key intuition underlying our analysis is that, for a broad range of contexts, policy evaluation can be conducted on the basis of two factors: a policymaker's preferences, and the conditional distribution of the outcomes of interest given a policy and available information. What this means is that one of the main objects of interest to scholarly researchers, namely, identification of the true or best model of the economy, is of no intrinsic importance in the policy evaluation context, even though knowledge of this model would, were it available, be very relevant in policy evaluation. Hence model selection, a major endeavor in much empirical macroeconomic research, is not a necessary component of policy evaluation. To the contrary: our argument is that, in many cases, model selection is actually inappropriate, because conditioning policy evaluation on a particular model ignores the role of model uncertainty in the overall uncertainty that surrounds the effects of a given policy choice. This is true both in the sense that many statistical analyses of policies do not systematically evaluate the robustness of policies across different model specifications, and in the sense that many analyses fail to adequately account for the effects of model selection on statistical inference. In contrast, we advocate the use of model averaging methods, which represent a formal way through which one can avoid policy evaluation that is conditional on a particular economic model. From a theoretical perspective, model uncertainty has important implications for the evaluation of policies.
This was originally recognized in William Brainard's classic analysis, (2) where model uncertainty occurs in the sense that the effects of a policy on a macroeconomic outcome of interest are unknown, but may be described by the distribution of a parameter that measures the marginal effect of the policy on the outcome. Much of what we argue in terms of theory may be interpreted as a generalization of Brainard's original framework and associated insights to a broader class of model uncertainty. An additional advantage of our approach is that it provides a firm foundation for integrating empirical analysis with policy evaluation. …
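
A stylized sketch of that argument (entirely made-up toy models, data, and names): instead of conditioning policy evaluation on a single selected model, weight each candidate model's prediction of the outcome under a policy by an approximate posterior model probability, here crude BIC weights.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "historical" data: an outcome responding mildly nonlinearly to a policy instrument.
policy = rng.uniform(0, 2, 80)
outcome = 1.0 + 0.8 * policy - 0.15 * policy ** 2 + rng.normal(scale=0.3, size=80)

def fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, float(np.sum((y - X @ beta) ** 2))

n = len(outcome)
designs = {"linear": np.column_stack([np.ones(n), policy]),
           "quadratic": np.column_stack([np.ones(n), policy, policy ** 2])}

# Approximate posterior model probabilities via BIC weights.
fits, bic = {}, {}
for name, X in designs.items():
    fits[name], rss = fit(X, outcome)
    bic[name] = n * np.log(rss / n) + X.shape[1] * np.log(n)
rel = np.exp(-0.5 * (np.array(list(bic.values())) - min(bic.values())))
weights = dict(zip(bic, rel / rel.sum()))
print({m: round(float(p), 3) for m, p in weights.items()})

# Evaluate candidate policies by their model-averaged predicted outcome,
# rather than conditioning on whichever single model "won".
for p_new in (0.5, 1.0, 1.5):
    preds = {"linear": fits["linear"] @ np.array([1.0, p_new]),
             "quadratic": fits["quadratic"] @ np.array([1.0, p_new, p_new ** 2])}
    avg = sum(weights[m] * preds[m] for m in preds)
    print(f"policy {p_new:.1f}: model-averaged outcome {avg:.2f}")
```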

01 Jan 2003
TL;DR: Under general conditions, the optimality results now show that the corresponding cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true full data distribution).
Abstract: In Part I of this article we propose a general cross-validation criterion for selecting among a collection of estimators of a particular parameter of interest based on n i.i.d. observations. It is assumed that the parameter of interest minimizes the expectation (w.r.t. the distribution of the observed data structure) of a particular loss function of a candidate parameter value and the observed data structure, possibly indexed by a nuisance parameter. The proposed cross-validation criterion is defined as the empirical mean over the validation sample of the loss function at the parameter estimate based on the training sample, averaged over random splits of the observed sample. The cross-validation selector is now the estimator which minimizes this cross-validation criterion. We illustrate that this general methodology covers, in particular, the selection problems in the current literature, but results in a wide range of new selection methods. We prove a finite sample oracle inequality, and asymptotic optimality of the cross-validated selector under general conditions. The asymptotic optimality states that the cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true data generating distribution). Our general framework allows, in particular, the situation in which the observed data structure is a censored version of the full data structure of interest, and where the parameter of interest is a parameter of the full data structure distribution. As examples of the parameter of the full data distribution we consider a density of (a part of) the full data structure, a conditional expectation of an outcome, given explanatory variables, a marginal survival function of a failure time, and a multivariate conditional expectation of an outcome vector, given covariates. In Part II of this article we show that the general estimating function methodology for censored data structures as provided in van der Laan and Robins (2002) yields the desired loss functions for the selection among estimators of a full-data distribution parameter of interest based on censored data. The corresponding cross-validation selector generalizes any of the existing selection methods in regression and density estimation (including model selection) to the censored data case. Under general conditions, our optimality results now show that the corresponding cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true full data distribution). In Part III of this article we propose a general estimator which is defined as follows. For a collection of subspaces and the complete parameter space, one defines an epsilon-net (i.e., a finite set of points whose epsilon-spheres cover the complete parameter space). For each epsilon and subspace one now defines a corresponding minimum cross-validated empirical risk estimator as the minimizer of cross-validated risk over the subspace-specific epsilon-net. In the special case that the loss function has no nuisance parameter, which thus covers the classical regression and density estimation cases, this epsilon- and subspace-specific minimum risk estimator reduces to the minimizer of the empirical risk over the corresponding epsilon-net. Finally, one selects epsilon and the subspace with the cross-validation selector. We refer to the resulting estimator as the cross-validated adaptive epsilon-net estimator. We prove an oracle inequality for this estimator which implies that the estimator is minimax adaptive in the sense that it achieves the minimax optimal rate of convergence for the smallest of the guessed subspaces containing the true parameter value.

Cross-Validation for Estimator Selection. 1. Stating the Selection Problem. Let O₁, …, Oₙ be n i.i.d. observations of O ∼ P₀, where P₀ is known to be an element of a statistical model M. Let ψ₀(·) = ψ(· | P₀) be a parameter (function) of P₀ of interest, with parameter set Ψ = {ψ(· | P) : P ∈ M}. Let (O, ψ) → L(O, ψ | η₀) ∈ ℝ be a loss function, possibly depending on a nuisance parameter η₀ = η(P₀), which maps a candidate parameter value ψ and observation O into a real number whose expectation is minimized at ψ₀:

ψ₀ = argmin_{ψ ∈ Ψ} ∫ L(o, ψ | η₀) dP₀(o) = argmin_{ψ ∈ Ψ} E₀ L(O, ψ | η₀).   (1)

Let Pₙ be the empirical distribution of O₁, …, Oₙ, and let ψₖ(·) = ψₖ(· | Pₙ) ∈ Ψ, k = 1, …, K(n), be a collection of estimators (i.e., algorithms one can apply to data) of ψ₀(·). The choice of loss function: different choices of loss function can satisfy (1); in fact, (1) can define a class of possible loss functions, and different choices result in estimators of ψ₀ with different behavior, so the choice of loss function is itself an issue to be addressed. We suggest the following strategy for selecting a loss function. First, among the loss functions identifying ψ₀ as the minimizer of its risk (i.e., satisfying (1)), one chooses a loss function which identifies the desired measure of performance (risk), θ(ψ | P₀) ≡ ∫ L(o, ψ | η₀) dP₀(o), for a candidate ψ ∈ Ψ. Identifying such a function θ(ψ | P₀) on the parameter set Ψ still does not uniquely identify the loss function L(O, ψ | η₀). Second, given this function θ(ψ | P₀), we choose the loss function so that, for a locally consistent estimator ηₙ of η₀, (1/n) Σᵢ L(Oᵢ, ψ | ηₙ) is a locally efficient estimator of θ(ψ | P₀). That is, let L(O, ψ | η₀) be a parametrization of
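
In the classical squared-error case with no nuisance parameter, the cross-validation selector reduces to something very simple; the sketch below (candidate estimators, data, and names of my own choosing) selects a kernel-regression bandwidth by V-fold cross-validated empirical risk.

```python
import numpy as np

def nw_predict(x_train, y_train, x_test, bandwidth):
    """Nadaraya-Watson kernel regression with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_test[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

def cv_risk(x, y, bandwidth, folds=5, seed=0):
    """V-fold cross-validated empirical risk under squared-error loss."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    losses = []
    for chunk in np.array_split(idx, folds):
        train = np.setdiff1d(idx, chunk)
        pred = nw_predict(x[train], y[train], x[chunk], bandwidth)
        losses.append(np.mean((y[chunk] - pred) ** 2))
    return float(np.mean(losses))

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 200)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=200)
candidates = [0.05, 0.1, 0.2, 0.5, 1.0]
risks = {h: cv_risk(x, y, h) for h in candidates}
print("cross-validation selector picks bandwidth", min(risks, key=risks.get))
```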

Posted ContentDOI
TL;DR: The authors examine the roles played by the propensity score (probability of selection) in matching, instrumental variable, and control function methods; contrast the roles of exclusion restrictions in matching and selection models; characterize the sensitivity of matching to the choice of conditioning variables; and demonstrate the greater robustness of control function methods to misspecification of the conditioning variables.
Abstract: This paper investigates four topics. (1) It examines the different roles played by the propensity score (probability of selection) in matching, instrumental variable and control function methods. (2) It contrasts the roles of exclusion restrictions in matching and selection models. (3) It characterizes the sensitivity of matching to the choice of conditioning variables and demonstrates the greater robustness of control function methods to misspecification of the conditioning variables. (4) It demonstrates the problem of choosing the conditioning variables in matching and the failure of conventional model selection criteria when candidate conditioning variables are not exogenous.

Journal ArticleDOI
TL;DR: Two modifications are suggested: namely, a new index, based on the simulation error, is employed as the regressor selection criterion and a pruning mechanism is introduced in the model selection algorithm, which is shown to be effective in the identification of compact and robust models.
Abstract: Classical prediction error approaches for the identification of non-linear polynomial NARX/NARMAX models often yield unsatisfactory results for long-range prediction or simulation purposes, mainly due to incorrect or redundant model structure selection. The paper discusses some limitations of the standard approach and suggests two modifications: namely, a new index, based on the simulation error, is employed as the regressor selection criterion and a pruning mechanism is introduced in the model selection algorithm. The resulting algorithm is shown to be effective in the identification of compact and robust models, generally yielding model structures closer to the correct ones. Computational issues are also discussed. Finally, the identification algorithm is tested on a long-range prediction benchmark application.
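
The distinction the paper exploits can be made concrete on a toy ARX example (my own, much simpler than the paper's polynomial NARX setting): the one-step prediction error evaluates the model with measured past outputs, whereas the free-run simulation error feeds the model's own outputs back in, which is the kind of quantity proposed as the regressor selection criterion.

```python
import numpy as np

def fit_arx(y, u, na, nb):
    """Least-squares fit of an ARX model: y[t] = sum_i a_i y[t-i] + sum_j b_j u[t-j]."""
    start = max(na, nb)
    rows = [np.concatenate([y[t - na:t][::-1], u[t - nb:t][::-1]])
            for t in range(start, len(y))]
    theta, *_ = np.linalg.lstsq(np.array(rows), y[start:], rcond=None)
    return theta

def one_step_error(theta, y, u, na, nb):
    """Mean squared one-step prediction error using measured past outputs."""
    start = max(na, nb)
    preds = [np.concatenate([y[t - na:t][::-1], u[t - nb:t][::-1]]) @ theta
             for t in range(start, len(y))]
    return float(np.mean((y[start:] - np.array(preds)) ** 2))

def simulation_error(theta, y, u, na, nb):
    """Mean squared free-run simulation error: past outputs are the model's own."""
    start = max(na, nb)
    ysim = y.copy()
    for t in range(start, len(y)):
        ysim[t] = np.concatenate([ysim[t - na:t][::-1], u[t - nb:t][::-1]]) @ theta
    return float(np.mean((y[start:] - ysim[start:]) ** 2))

rng = np.random.default_rng(0)
u = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):                       # true system: first-order ARX
    y[t] = 0.7 * y[t - 1] + 0.5 * u[t - 1] + 0.1 * rng.normal()

for na in (1, 3):                             # correct vs over-parameterized order
    theta = fit_arx(y, u, na, nb=1)
    pe, se = one_step_error(theta, y, u, na, 1), simulation_error(theta, y, u, na, 1)
    print(f"na={na}: one-step MSE {pe:.4f}, simulation MSE {se:.4f}")
```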

Posted Content
TL;DR: It is shown that under suitable conditions the IC method will be consistent for the best approximating model among the candidate models, while under standard assumptions the SOOS method will select over-parameterized models with positive probability, resulting in excessive finite-sample PMSEs.
Abstract: It is standard in applied work to select forecasting models by ranking candidate models by their prediction mean squared error (PMSE) in simulated out-of-sample (SOOS) forecasts. Alternatively, forecast models may be selected using information criteria (IC). We compare the asymptotic and finite-sample properties of these methods in terms of their ability to minimize the true out-of-sample PMSE, allowing for possible misspecification of the forecast models under consideration. We first study a covariance stationary environment. We show that under suitable conditions the IC method will be consistent for the best approximating model among the candidate models. In contrast, under standard assumptions the SOOS method will select over-parameterized models with positive probability, resulting in excessive finite-sample PMSEs. We also show that in the presence of unmodelled structural change both methods will be inadmissible in the sense that they may select a model with strictly higher PMSE than the best approximating model among the candidate models.

Journal ArticleDOI
TL;DR: In this article, the authors employ a two-stage model selection procedure for the S&P 500 index and India's NSE-50 index at the 95% and 99% levels.
Abstract: Value-at-Risk (VaR) is widely used as a tool for measuring the market risk of asset portfolios. However, alternative VaR implementations are known to yield fairly different VaR forecasts. Hence, every use of VaR requires choosing among alternative forecasting models. This paper undertakes two case studies in model selection, for the S&P 500 index and India's NSE-50 index, at the 95% and 99% levels. We employ a two-stage model selection procedure. In the first stage we test a class of models for statistical accuracy. If multiple models survive rejection with the tests, we perform a second stage filtering of the surviving models using subjective loss functions. This two-stage model selection procedure does prove to be useful in choosing a VaR model, while only incompletely addressing the problem. These case studies give us some evidence about the strengths and limitations of present knowledge on estimation and testing for VaR. Copyright © 2003 John Wiley & Sons, Ltd.
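
The first-stage test of statistical accuracy in such a two-stage procedure is typically a coverage backtest. One standard example is Kupiec's proportion-of-failures likelihood-ratio test, sketched below with illustrative numbers of my own (the paper's exact battery of tests may differ).

```python
import numpy as np
from scipy.stats import chi2

def kupiec_pof(n_obs, n_exceptions, coverage=0.99):
    """Kupiec proportion-of-failures test for a VaR model.

    H0: the observed exceedance rate equals the nominal rate p = 1 - coverage.
    Returns the LR statistic (asymptotically chi-square, 1 df) and its p-value."""
    p = 1.0 - coverage
    x, n = n_exceptions, n_obs
    pihat = x / n
    ll0 = (n - x) * np.log(1 - p) + x * np.log(p)                     # restricted
    ll1 = (n - x) * np.log(1 - pihat) + x * np.log(pihat) if 0 < x < n else 0.0
    lr = -2.0 * (ll0 - ll1)
    return lr, float(chi2.sf(lr, df=1))

# 250 trading days at 99% VaR: about 2.5 exceptions expected; 8 observed is rejected at 5%.
print(kupiec_pof(250, 8, coverage=0.99))
```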

Journal ArticleDOI
TL;DR: In this paper, an extension to the case of SVMs with quadratic slack penalties is given and a simple approximation for the evidence is derived, which can be used as a criterion for model selection.

Journal ArticleDOI
TL;DR: In this paper, the authors apply the model confidence set (MCS) procedure to a set of volatility models; an MCS is analogous to a confidence interval for a parameter in the sense that it contains the best forecasting model with a certain probability.
Abstract: This paper applies the model confidence set (MCS) procedure to a set of volatility models. An MCS is analogous to a confidence interval for a parameter in the sense that the former contains the best forecasting model with a certain probability. The key to the MCS is that it acknowledges the limitations of the information in the data. The empirical exercise is based on fifty-five volatility models, and the MCS includes about a third of these when evaluated by mean square error, whereas the MCS contains only a VGARCH model when the mean absolute deviation criterion is used. We conduct a simulation study that shows the MCS captures the superior models across a range of significance levels. When we benchmark the MCS relative to a Bonferroni bound, this bound delivers inferior performance.

Journal ArticleDOI
TL;DR: In this paper, the mean and covariance structures of longitudinal data are modelled jointly in terms of three polynomial functions of time, and model selection procedures based on regressogram estimation are compared with those based on a global search of the model space.
Abstract: We exploit a reparameterisation of the marginal covariance matrix arising in longitudinal studies (Pourahmadi, 1999, 2000) to model, jointly, the mean and covariance structures in terms of three polynomial functions of time. By reanalysing Kenward's (1987) cattle data, we compare model selection procedures based on regressogram estimation with those based on a global search of the model space. Using a BIC-based model selection criterion to identify the optimum degree triple of the three polynomials, we show that the use of a saturated mean model is not optimal and explain why regressogram-based model estimation may be misleading. We also suggest a new computational method for finding the global optimum based on a criterion involving three pairwise saturated profile likelihoods.

Journal ArticleDOI
TL;DR: In this article, the role of a set of variables as leading indicators for Euro-area inflation and GDP growth is evaluated using both recursive and rolling estimation, and three different approaches to combine the information from several indicators.
Abstract: In this paper we evaluate the role of a set of variables as leading indicators for Euro-area inflation and GDP growth. Our evaluation is based on using the variables in the ECB Euro-area model database, plus a set of similar variables for the US. We compare the forecasting performance of each indicator with that of purely autoregressive models, using an evaluation procedure that is particularly relevant for policy making. The evaluation is conducted both ex post and in a pseudo real-time context, for several forecast horizons, and using both recursive and rolling estimation. We also analyze three different approaches to combining the information from several indicators. First, we discuss the use as indicators of the estimated factors from a dynamic factor model for all the indicators. Second, an automated model selection procedure is applied to models with a large set of indicators. Third, we consider pooling the single indicator forecasts. The results indicate that single indicator forecasts are on average better than those derived from more complicated methods, but for them to beat the autoregression a different indicator has to be used in each period. A simple real-time procedure for indicator selection produces good results.

Journal ArticleDOI
TL;DR: It turns out that the construction of the proposed multivariate time series model allows easy maximum likelihood estimation and construction of well-mixing Markov chain Monte Carlo (MCMC) algorithms.
Abstract: Summary A new multivariate time series model with time varying conditional variances and covariances is presented and analysed. A complete analysis of the proposed model is presented consisting of parameter estimation, model selection and volatility prediction. Classical and Bayesian techniques are used for the estimation of the model parameters. It turns out that the construction of our proposed model allows easy maximum likelihood estimation and construction of well-mixing Markov chain Monte Carlo (MCMC) algorithms. Bayesian model selection is addressed using MCMC model composition. The problem of accounting for model uncertainty is considered using Bayesian model averaging. We provide implementation details and illustrations using daily rates of return on eight stocks of the US market.

Journal ArticleDOI
TL;DR: A Bayesian framework is introduced to carry out Automatic Relevance Determination (ARD) in feedforward neural networks to model censored data and the regularised neural network is more conservative than the default stepwise forward selection procedure implemented by SPSS with the Akaike Information Criterion.

Journal ArticleDOI
TL;DR: The method of averaging over the entire model set uses Akaike coefficients as measures of an individual model's likelihood, which reduces the “generalization error,” the error introduced when the model selected over a particular data set is applied to different conditions.
Abstract: This article deals with the problem of model selection for the mathematical description of tracer kinetics in nuclear medicine. It stems from the consideration of some specific data sets where different models have similar performances. In these situations, it is shown that considerate averaging of a parameter's estimates over the entire model set is better than obtaining the estimates from one model only. Furthermore, it is also shown that the procedure of averaging over a small number of "good" models reduces the "generalization error," the error introduced when the model selected over a particular data set is applied to different conditions, such as subject populations with altered physiologic parameters, modified acquisition protocols, and different signal-to-noise ratios. The method of averaging over the entire model set uses Akaike coefficients as measures of an individual model's likelihood. To facilitate the understanding of these statistical tools, the authors provide an introduction to model selection criteria and a short technical treatment of Akaike's information-theoretic approach. The new method is illustrated and epitomized by a case example on the modeling of [11C]flumazenil kinetics in the brain, containing both real and simulated data.
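
The Akaike coefficients mentioned in the TL;DR are usually called Akaike weights; a minimal sketch of the averaging step with generic AIC values (not the flumazenil data) is:

```python
import numpy as np

def akaike_weights(aic_values):
    """Akaike weights: each model's relative likelihood exp(-0.5 * delta_AIC),
    normalized to sum to one over the model set."""
    aic = np.asarray(aic_values, dtype=float)
    rel = np.exp(-0.5 * (aic - aic.min()))
    return rel / rel.sum()

# Three candidate kinetic models with similar fits, each yielding an estimate
# of the same physiologic parameter (illustrative numbers only).
aics = [102.3, 103.1, 105.8]
estimates = np.array([0.041, 0.046, 0.052])
w = akaike_weights(aics)
print(np.round(w, 3), round(float(w @ estimates), 4))   # weights and averaged estimate
```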

Journal ArticleDOI
TL;DR: This work extends the classical algorithms of Valiant and Haussler for learning compact conjunctions and disjunctions of Boolean attributes to allow features that are constructed from the data and to allow a trade-off between accuracy and complexity.
Abstract: We extend the classical algorithms of Valiant and Haussler for learning compact conjunctions and disjunctions of Boolean attributes to allow features that are constructed from the data and to allow a trade-off between accuracy and complexity. The result is a general-purpose learning machine, suitable for practical learning tasks, that we call the set covering machine. We present a version of the set covering machine that uses data-dependent balls for its set of features and compare its performance with the support vector machine. By extending a technique pioneered by Littlestone and Warmuth, we bound its generalization error as a function of the amount of data compression it achieves during training. In experiments with real-world learning tasks, the bound is shown to be extremely tight and to provide an effective guide for model selection.

Journal ArticleDOI
TL;DR: A multi-equation regression model with a diagonal first-order stationary vector autoregression (VAR) is proposed for modeling and forecasting intraday electricity load.
Abstract: The advent of wholesale electricity markets has brought renewed focus on intraday electricity load forecasting. This article proposes a multi-equation regression model with a diagonal first-order stationary vector autoregression (VAR) for modeling and forecasting intraday electricity load. The correlation structure of the disturbances to the VAR and the appropriate subset of regressors are explored using Bayesian model selection methodology. The full spectrum of finite-sample inference is obtained using a Bayesian Markov chain Monte Carlo sampling scheme. This includes the predictive distribution of load and the distribution of the time and level of daily peak load, something that is difficult to obtain with other methods of inference. The method is applied to several multi-equation models of half-hourly total system load in New South Wales, Australia. A detailed model based on 3 years of data reveals trend, seasonal, bivariate temperature/humidity, and serial correlation components that all vary intraday, ...

Journal ArticleDOI
TL;DR: The results demonstrate the practical advantages of VC-based model selection, which consistently outperforms AIC for all data sets; a new practical estimate of model complexity for k-nearest neighbors regression is also proposed.
Abstract: We discuss empirical comparison of analytical methods for model selection. Currently, there is no consensus on the best method for finite-sample estimation problems, even for the simple case of linear estimators. This article presents empirical comparisons between classical statistical methods--Akaike information criterion (AIC) and Bayesian information criterion (BIC)--and the structural risk minimization (SRM) method, based on Vapnik-Chervonenkis (VC) theory, for regression problems. Our study is motivated by empirical comparisons in Hastie, Tibshirani, and Friedman (2001), which claims that the SRM method performs poorly for model selection and suggests that AIC yields superior predictive performance. Hence, we present empirical comparisons for various data sets and different types of estimators (linear, subset selection, and k-nearest neighbor regression). Our results demonstrate the practical advantages of VC-based model selection; it consistently outperforms AIC for all data sets. In our study, SRM and BIC methods show similar predictive performance. This discrepancy (between empirical results obtained using the same data) is caused by methodological drawbacks in Hastie et al. (2001), especially in their loose interpretation and application of SRM method. Hence, we discuss methodological issues important for meaningful comparisons and practical application of SRM method. We also point out the importance of accurate estimation of model complexity (VC-dimension) for empirical comparisons and propose a new practical estimate of model complexity for k-nearest neighbors regression.