
Showing papers on "Model selection" published in 1998


Book
01 Jan 1998
Abstract: K.P. Burnham and D.R. Anderson, Model Selection and Multimodel Inference.

4,406 citations


Book
01 Nov 1998
TL;DR: Covers information theory and log-likelihood models as a basis for model selection and inference, practical use of the information-theoretic approach, model selection uncertainty, Monte Carlo insights and extended examples, and the underlying statistical theory.
Abstract: Information theory and log-likelihood models - a basis for model selection and inference; practical use of the information-theoretic approach; model selection uncertainty, with examples; Monte Carlo insights and extended examples; statistical theory.

4,340 citations


Posted Content
TL;DR: Practical issues in estimating linear models with multiple structural changes are addressed, including an efficient dynamic-programming algorithm for obtaining global minimizers of the sum of squared residuals and procedures for estimating the number of breaks.
Abstract: In a recent paper, Bai and Perron (1998) considered theoretical issues related to the limiting distribution of estimators and test statistics in the linear model with multiple structural changes. In this companion paper, we consider practical issues for the empirical applications of the procedures. We first address the problem of estimation of the break dates and present an efficient algorithm to obtain global minimizers of the sum of squared residuals. This algorithm is based on the principle of dynamic programming and requires least-squares operations of order at most O(T^2) for any number of breaks. Our method can be applied to both pure and partial structural-change models. Second, we consider the problem of forming confidence intervals for the break dates under various hypotheses about the structure of the data and the errors across segments. Third, we address the issue of testing for structural changes under very general conditions on the data and the errors. Fourth, we address the issue of estimating the number of breaks. We present simulation results pertaining to the behavior of the estimators and tests in finite samples. Finally, a few empirical applications are presented to illustrate the usefulness of the procedures. All methods discussed are implemented in a GAUSS program available upon request for non-profit academic use.
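
The dynamic-programming step described above can be sketched compactly: precompute the sum of squared residuals (SSR) of every admissible segment, which accounts for the O(T^2) least-squares operations, then combine segments recursively. The following Python sketch is illustrative only; it assumes a pure structural-change model with a known number of breaks m and a minimum segment length h, and the function names are hypothetical rather than taken from the authors' GAUSS program.

```python
import numpy as np

def segment_ssr(y, X):
    """Sum of squared residuals from an OLS fit of y on X over one segment."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def break_dates(y, X, m, h):
    """SSR-minimizing break dates for m breaks (minimum segment length h),
    combined by dynamic programming in the spirit of Bai and Perron."""
    T = len(y)
    # ssr[i, j] = SSR of a regression using observations i..j (inclusive)
    ssr = np.full((T, T), np.inf)
    for i in range(T):
        for j in range(i + h - 1, T):
            ssr[i, j] = segment_ssr(y[i:j + 1], X[i:j + 1])

    # cost[k, t] = minimal SSR of fitting k+1 segments to observations 0..t
    cost = np.full((m + 1, T), np.inf)
    last = np.zeros((m + 1, T), dtype=int)   # last break date used
    cost[0] = ssr[0]
    for k in range(1, m + 1):
        for t in range((k + 1) * h - 1, T):
            # candidate position s of the k-th break
            cands = [(cost[k - 1, s] + ssr[s + 1, t], s)
                     for s in range(k * h - 1, t - h + 1)]
            cost[k, t], last[k, t] = min(cands)

    # backtrack the break dates
    breaks, t = [], T - 1
    for k in range(m, 0, -1):
        s = last[k, t]
        breaks.append(s)
        t = s
    return sorted(breaks), cost[m, T - 1]
```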

3,836 citations


Book
29 Jan 1998
TL;DR: In this paper, the authors present actuarial loss models - frequency, severity, aggregate loss, and discrete- and continuous-time ruin models - along with estimation for complete and modified data, model selection, and comprehensive inventories of continuous and discrete distributions.
Abstract: Preface. Acknowledgments. PART I: INTRODUCTION. 1. Modeling. PART II: ACTUARIAL MODELS. 2. Random Variables. 3. Basic Distributional Quantities. 4. Classifying and Creating Distributions. 5. Frequency and Severity with Coverage Modifications. 6. Aggregate Loss Models. 7. Discrete Time Ruin Models. 8. Continuous Time Ruin Models. PART III: CONSTRUCTION OF EMPIRICAL MODELS. 9. Review of Mathematical Statistics. 10. Estimation for Complete Data. 11. Estimation for Modified Data. PART IV: PARAMETRIC STATISTICAL METHODS. 12. Parameter Estimation. 13. Model Selection. 14. Five Examples. PART V: ADJUSTED ESTIMATES AND SIMULATION. 15. Interpolation and Smoothing. 16. Credibility. 17. Simulation. Appendix A: An Inventory of Continuous Distributions. Appendix B: An Inventory of Discrete Distributions. Appendix C: Frequency and Severity Relationships. Appendix D: The Recursive Formula. Appendix E: Discretization of the Severity Distribution. Appendix F: Numerical Optimization and Solution of Systems. References. Index.

1,276 citations


Journal ArticleDOI
TL;DR: The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms.
Abstract: We review the principles of minimum description length and stochastic complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon's basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples.
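
For reference, the normalized maximum likelihood (NML) coding mentioned above defines the stochastic complexity of a sample x^n relative to a parametric class {p(. | theta)}; a commonly quoted form, stated here as a sketch without the regularity conditions spelled out in the paper, is:

```latex
% Normalized maximum likelihood and stochastic complexity (sketch)
\hat{p}_{\mathrm{NML}}(x^n)
  = \frac{p\bigl(x^n \mid \hat{\theta}(x^n)\bigr)}
         {\int p\bigl(y^n \mid \hat{\theta}(y^n)\bigr)\,dy^n},
\qquad
\mathrm{SC}(x^n)
  = -\log p\bigl(x^n \mid \hat{\theta}(x^n)\bigr)
    + \log \int p\bigl(y^n \mid \hat{\theta}(y^n)\bigr)\,dy^n .
```

The second term, the parametric complexity, is the price of not knowing theta; the mixture and predictive codes achieve the same value up to the asymptotically vanishing terms referred to in the abstract.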

1,140 citations


Journal ArticleDOI
TL;DR: A Bayesian approach for finding classification and regression tree (CART) models by having the prior induce a posterior distribution that will guide the stochastic search toward more promising CART models.
Abstract: In this article we put forward a Bayesian approach for finding classification and regression tree (CART) models. The two basic components of this approach consist of prior specification and stochastic search. The basic idea is to have the prior induce a posterior distribution that will guide the stochastic search toward more promising CART models. As the search proceeds, such models can then be selected with a variety of criteria, such as posterior probability, marginal likelihood, residual sum of squares or misclassification rates. Examples are used to illustrate the potential superiority of this approach over alternative methods.
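
A minimal sketch of the kind of posterior-guided stochastic search described above, written as a Metropolis-Hastings walk over trees. The callables `propose`, `log_marginal_lik`, and `log_prior` are hypothetical stand-ins, and the move set (grow, prune, change-split, swap) is only indicated in a comment, so this is an illustration of the idea rather than the authors' implementation.

```python
import math
import random

def metropolis_tree_search(tree0, log_marginal_lik, log_prior, propose, n_iter=5000):
    """Stochastic search over CART models guided by the posterior (sketch).

    propose(tree) should return a candidate tree built by a random local move
    (e.g. grow, prune, change-split, swap) together with the log proposal ratio
    log q(tree | cand) - log q(cand | tree).
    """
    tree = tree0
    best = (log_marginal_lik(tree0) + log_prior(tree0), tree0)
    for _ in range(n_iter):
        cand, log_q_ratio = propose(tree)
        log_alpha = (log_marginal_lik(cand) + log_prior(cand)
                     - log_marginal_lik(tree) - log_prior(tree)
                     + log_q_ratio)
        if math.log(random.random()) < log_alpha:
            tree = cand
            score = log_marginal_lik(tree) + log_prior(tree)
            if score > best[0]:
                best = (score, tree)
    return best[1]   # highest-scoring tree visited by the chain
```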

749 citations


Proceedings Article
24 Jul 1998
TL;DR: This paper extends Structural EM to deal directly with Bayesian model selection and proves the convergence of the resulting algorithm and shows how to apply it for learning a large class of probabilistic models, including Bayesian networks and some variants thereof.
Abstract: In recent years there has been a flurry of works on learning Bayesian networks from data. One of the hard problems in this area is how to effectively learn the structure of a belief network from incomplete data--that is, in the presence of missing values or hidden variables. In a recent paper, I introduced an algorithm called Structural EM that combines the standard Expectation Maximization (EM) algorithm, which optimizes parameters, with structure search for model selection. That algorithm learns networks based on penalized likelihood scores, which include the BIC/MDL score and various approximations to the Bayesian score. In this paper, I extend Structural EM to deal directly with Bayesian model selection. I prove the convergence of the resulting algorithm and show how to apply it for learning a large class of probabilistic models, including Bayesian networks and some variants thereof.
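
The outer loop of a Structural-EM-style procedure can be summarized as below. This is only an illustrative sketch under the penalized-likelihood view described in the abstract; the model-specific callables passed in (`e_step`, `search_structures`, `fit_parameters`) are hypothetical and must be supplied by the caller.

```python
def structural_em(data, structure, params, e_step, search_structures,
                  fit_parameters, n_iter=20):
    """Illustrative outer loop of a Structural-EM-style search (sketch only).

    e_step(data, structure, params)   -> expected sufficient statistics
    search_structures(ess, start)     -> structure maximizing the expected score
    fit_parameters(structure, ess)    -> parameters maximizing the expected score
    """
    for _ in range(n_iter):
        # "E-step": complete the data in expectation under the current network
        ess = e_step(data, structure, params)
        # "M-step": structure search and parameter fitting on the expected
        # (completed-data) score, e.g. an expected BIC/MDL or Bayesian score
        structure = search_structures(ess, start=structure)
        params = fit_parameters(structure, ess)
    return structure, params
```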

637 citations


Journal ArticleDOI
TL;DR: The concept of GDF offers a unified framework under which complex and highly irregular modeling procedures can be analyzed in the same way as classical linear models and many difficult problems can be solved easily.
Abstract: In the theory of linear models, the concept of degrees of freedom plays an important role. This concept is often used for measurement of model complexity, for obtaining an unbiased estimate of the error variance, and for comparison of different models. I have developed a concept of generalized degrees of freedom (GDF) that is applicable to complex modeling procedures. The definition is based on the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value. The concept is nonasymptotic in nature and does not require analytic knowledge of the modeling procedures. The concept of GDF offers a unified framework under which complex and highly irregular modeling procedures can be analyzed in the same way as classical linear models. By using this framework, many difficult problems can be solved easily. For example, one can now measure the number of observations used in a variable selection process. Different modeling procedures, such as a tree-based regression and a ...
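
In symbols, the generalized degrees of freedom described above aggregates the sensitivity of each fitted value to its own observation. A simple one-at-a-time finite-difference version is shown for illustration; the paper's Monte Carlo estimator remains the authoritative definition.

```latex
% Generalized degrees of freedom as aggregate sensitivity of the fit
\mathrm{GDF} \;=\; \sum_{i=1}^{n} \frac{\partial \hat{\mu}_i(\mathbf{y})}{\partial y_i}
\;\approx\; \sum_{i=1}^{n}
  \frac{\hat{\mu}_i(\mathbf{y} + \tau \mathbf{e}_i) - \hat{\mu}_i(\mathbf{y})}{\tau},
\qquad \tau \ \text{small}.
```

For ordinary linear regression with hat matrix H this reduces to tr(H), the usual degrees of freedom, which is why GDF can be read as the effective number of parameters consumed by a procedure, including, for example, a variable selection step.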

525 citations


Journal ArticleDOI
TL;DR: This work presents three model selection criteria, using information theoretic entropy in the spirit of the minimum description length principle, based on the principle of indifference combined with the maximum entropy principle, thus keeping external model assumptions to a minimum.

403 citations


Journal ArticleDOI
TL;DR: A Bayesian-based methodology is presented which automatically penalizes overcomplex models being fitted to unknown data and is able to select an "optimal" number of components in the model and so partition data sets.
Abstract: A Bayesian-based methodology is presented which automatically penalizes overcomplex models being fitted to unknown data. We show that, with a Gaussian mixture model, the approach is able to select an "optimal" number of components in the model and so partition data sets. The performance of the Bayesian method is compared to other methods of optimal model selection and found to give good results. The methods are tested on synthetic and real data sets.
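
As a rough practical analogue of choosing the number of mixture components by a penalized criterion (not the paper's Bayesian machinery), one can score Gaussian mixtures of increasing size and keep the best. The scikit-learn calls are standard; the synthetic two-cluster data below are an assumption made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-component data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(200, 2)),
               rng.normal(2.0, 0.7, size=(300, 2))])

# Fit mixtures of increasing size and keep the one with the best penalized score
scores = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    scores[k] = gmm.bic(X)          # lower = better penalized fit

best_k = min(scores, key=scores.get)
print(best_k)                        # typically 2 for this data
```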

319 citations


Journal ArticleDOI
TL;DR: In this article, the authors compare properties of parameter estimators under Akaike information criterion (AIC) and consistent AIC (CAIC) model selection in a nested sequence of open population capture-recapture models.
Abstract: Summary. We compare properties of parameter estimators under Akaike information criterion (AIC) and 'consistent' AIC (CAIC) model selection in a nested sequence of open population capture-recapture models. These models consist of product multinomials, where the cell probabilities are parameterized in terms of survival (φ_i) and capture (p_i) probabilities for each time interval i. The sequence of models is derived from 'treatment' effects that might be (1) absent, model H_0; (2) only acute, model H_2p; or (3) acute and chronic, lasting several time intervals, model H_3. Using a 3^5 factorial design, 1000 repetitions were simulated for each of 243 cases. The true number of parameters ranged from 7 to 42, and the sample size ranged from approximately 470 to 55 000 per case. We focus on the quality of the inference about the model parameters and model structure that results from the two selection criteria. We use achieved confidence interval coverage as an integrating metric to judge what constitutes a ...
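
For reference, the two criteria being compared have the standard forms below, with log L the maximized log-likelihood, K the number of parameters, and n the sample size; the 'consistent' AIC is quoted in its usual Bozdogan form, and the paper's exact variant may differ slightly.

```latex
\mathrm{AIC}  = -2\log L(\hat{\theta}) + 2K,
\qquad
\mathrm{CAIC} = -2\log L(\hat{\theta}) + K\bigl(\log n + 1\bigr).
```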

Proceedings ArticleDOI
12 May 1998
TL;DR: This work develops a termination criterion for the hierarchical clustering methods which optimizes the Bayesian information criterion in a greedy fashion, and demonstrates that the BIC criterion is able to choose the number of clusters according to the intrinsic complexity present in the data.
Abstract: One difficult problem we are often faced with in clustering analysis is how to choose the number of clusters. We propose to choose the number of clusters by optimizing the Bayesian information criterion (BIC), a model selection criterion in the statistics literature. We develop a termination criterion for the hierarchical clustering methods which optimizes the BIC criterion in a greedy fashion. The resulting algorithms are fully automatic. Our experiments on Gaussian mixture modeling and speaker clustering demonstrate that the BIC criterion is able to choose the number of clusters according to the intrinsic complexity present in the data.
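
With d-dimensional features and full-covariance Gaussian cluster models, the greedy termination rule sketched above amounts to merging two clusters only while the merge does not decrease the BIC. The commonly quoted form of this test, stated here as a hedged summary rather than a quotation from the paper, uses sample counts n_1, n_2, cluster covariances Sigma_1, Sigma_2, the covariance Sigma of the pooled cluster, and a penalty weight lambda:

```latex
\Delta\mathrm{BIC}
  = \frac{n_1+n_2}{2}\log\lvert\Sigma\rvert
    - \frac{n_1}{2}\log\lvert\Sigma_1\rvert
    - \frac{n_2}{2}\log\lvert\Sigma_2\rvert
    - \lambda\,\frac{1}{2}\Bigl(d + \frac{d(d+1)}{2}\Bigr)\log\bigl(n_1+n_2\bigr),
```

with the merge accepted only when ΔBIC < 0; clustering terminates once every candidate merge would increase the BIC.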

Journal ArticleDOI
TL;DR: This paper identifies 29 models that the literature suggests are appropriate for technological forecasting and divides them into three classes according to the timing of the point of inflexion in the innovation or substitution process.
Abstract: The paper identifies 29 models that the literature suggests are appropriate for technological forecasting. These models are divided into three classes according to the timing of the point of inflexion in the innovation or substitution process. Faced with a given data set and such a choice, the issue of model selection needs to be addressed. Evidence used to aid model selection is drawn from measures of model fit and model stability. An analysis of the forecasting performance of these models using simulated data sets shows that it is easier to identify a class of possible models rather than the "best" model. This leads to the combining of model forecasts. The performance of the combined forecasts appears promising with a tendency to outperform the individual component models.

Journal ArticleDOI
TL;DR: In this paper, the Akaike information criterion (AIC) and related model selection criteria for comparing generalized linear models are applied to check for differences in T4 cell counts between two disease groups.
Abstract: When testing for a treatment effect or a difference among groups, the distributional assumptions made about the response variable can have a critical impact on the conclusions drawn. For example, controversy has arisen over transformations of the response (Keene). An alternative approach is to use some member of the family of generalized linear models. However, this raises the issue of selecting the appropriate member, a problem of testing non-nested hypotheses. Standard model selection criteria, such as the Akaike information criterion (AIC), can be used to resolve such problems. These procedures for comparing generalized linear models are applied to checking for differences in T4 cell counts between two disease groups. We conclude that appropriate model selection criteria should be specified in the protocol for any study, including clinical trials, in order that optimal inferences can be drawn about treatment differences.
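
The comparison of non-nested generalized linear models by AIC can be mimicked in a few lines. The statsmodels calls are standard, but the synthetic "T4 count" data and the particular candidate families below are assumptions for illustration, not the paper's analysis.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical count data: y = counts, group = disease-group indicator
rng = np.random.default_rng(1)
group = np.repeat([0, 1], 50)
y = rng.poisson(lam=np.where(group == 0, 700, 500))
X = sm.add_constant(group)

# Fit several candidate (non-nested) GLM families and compare their AIC values
candidates = {
    "gaussian": sm.families.Gaussian(),
    "poisson":  sm.families.Poisson(),
    "gamma":    sm.families.Gamma(),
}
aics = {name: sm.GLM(y, X, family=fam).fit().aic
        for name, fam in candidates.items()}
print(min(aics, key=aics.get), aics)   # smallest AIC = preferred family
```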

Journal ArticleDOI
TL;DR: The generalised F mixture model can relax the usual stronger distributional assumptions and allow the analyst to uncover structure in the data that might otherwise have been missed, illustrated by fitting the model to data from large-scale clinical trials with long follow-up of lymphoma patients.
Abstract: Cure rate estimation is an important issue in clinical trials for diseases such as lymphoma and breast cancer and mixture models are the main statistical methods. In the last decade, mixture models under different distributions, such as exponential, Weibull, log-normal and Gompertz, have been discussed and used. However, these models involve stronger distributional assumptions than is desirable and inferences may not be robust to departures from these assumptions. In this paper, a mixture model is proposed using the generalized F distribution family. Although this family is seldom used because of computational difficulties, it has the advantage of being very flexible and including many commonly used distributions as special cases. The generalised F mixture model can relax the usual stronger distributional assumptions and allow the analyst to uncover structure in the data that might otherwise have been missed. This is illustrated by fitting the model to data from large-scale clinical trials with long follow-up of lymphoma patients. Computational problems with the model and model selection methods are discussed. Comparison of maximum likelihood estimates with those obtained from mixture models under other distributions are included.
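
The mixture (cure-rate) structure referred to above is, in its generic form, a two-component survival model. Writing pi for the cured fraction and S_u for the survival function of the uncured (here a member of the generalized F family), the population survival function is:

```latex
S(t) \;=\; \pi \;+\; (1 - \pi)\, S_u\!\left(t \mid \boldsymbol{\theta}\right),
\qquad 0 \le \pi \le 1,\quad t \ge 0 .
```

Model selection then concerns the choice of S_u within the generalized F family, which nests the exponential, Weibull, log-normal, log-logistic and generalized gamma distributions as special or limiting cases.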

Proceedings Article
01 Dec 1998
TL;DR: In this procedure model selection and learning are not separate, but kernels are dynamically adjusted during the learning process to find the kernel parameter which provides the best possible upper bound on the generalisation error.
Abstract: The kernel-parameter is one of the few tunable parameters in Support Vector machines, controlling the complexity of the resulting hypothesis. Its choice amounts to model selection and its value is usually found by means of a validation set. We present an algorithm which can automatically perform model selection with little additional computational cost and with no need of a validation set. In this procedure model selection and learning are not separate, but kernels are dynamically adjusted during the learning process to find the kernel parameter which provides the best possible upper bound on the generalisation error. Theoretical results motivating the approach and experimental results confirming its validity are presented.
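
The "upper bound on the generalisation error" that drives the kernel adaptation is, in the hard-margin case, of the radius-margin type; the commonly quoted form, which is not necessarily the exact bound used in the paper, is

```latex
\mathbb{E}\left[\,\mathrm{err}\,\right]
  \;\le\; \frac{1}{n}\,\mathbb{E}\!\left[\frac{R^{2}}{\gamma^{2}}\right]
  \;=\; \frac{1}{n}\,\mathbb{E}\!\left[R^{2}\,\lVert w\rVert^{2}\right],
```

where R is the radius of the smallest sphere containing the training points in feature space and gamma = 1/||w|| is the margin. Both quantities depend on the kernel parameter, so the bound can be tracked and minimized as that parameter is adjusted during training.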

Journal ArticleDOI
TL;DR: This paper derives maximum a posteriori (MAP) rules for several different families of competing models and obtain forms that are similar to AIC and naive MDL, but for some families, however, it is found that the derived penalties are different.
Abstract: The two most popular model selection rules in the signal processing literature have been Akaike's (1974) criterion (AIC) and Rissanen's (1978) principle of minimum description length (MDL). These rules are similar in form in that they both consist of data and penalty terms. Their data terms are identical, but the penalties are different, MDL being more stringent toward overparameterization. AIC penalizes each additional model parameter with an equal incremental amount of penalty, regardless of the parameter's role in the model. In most of the literature on model selection, MDL appears in a form that also suggests equal penalty for every unknown parameter; we refer to this MDL criterion as naive MDL. In this paper, we show that identical penalization for every parameter is not appropriate and that the penalty has to depend on the model structure and type of model parameters. The approach to showing this is Bayesian, and it relies on large sample theory. We derive maximum a posteriori (MAP) rules for several different families of competing models and obtain forms that are similar to AIC and naive MDL. For some families, however, we find that the derived penalties are different. In those cases, our extensive simulations show that the MAP rule outperforms AIC and naive MDL.
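
For a model with k unknown parameters fitted to N data points, the two rules discussed above are usually written as follows (standard forms; the paper's point is that MAP-derived penalties need not charge every parameter the same amount):

```latex
\mathrm{AIC}(k) = -2\log p\bigl(\mathbf{x}\mid\hat{\theta}_k\bigr) + 2k,
\qquad
\mathrm{MDL}_{\text{naive}}(k) = -\log p\bigl(\mathbf{x}\mid\hat{\theta}_k\bigr) + \frac{k}{2}\log N .
```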

Journal ArticleDOI
TL;DR: In this paper, a finite mixed Poisson regression model with covariates in both Poisson rates and mixing probabilities is used to analyze the relationship between patents and research and development spending at the firm level.
Abstract: Count-data models are used to analyze the relationship between patents and research and development spending at the firm level, accounting for overdispersion using a finite mixed Poisson regression model with covariates in both Poisson rates and mixing probabilities. Maximum likelihood estimation using the EM and quasi-Newton algorithms is discussed. Monte Carlo studies suggest that (a) penalized likelihood criteria are a reliable basis for model selection and can be used to determine whether continuous or finite support for the mixing distribution is more appropriate and (b) when the mixing distribution is incorrectly specified, parameter estimates remain unbiased but have inflated variances.
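
Written out, the model described above mixes J Poisson regressions, with covariates entering both the component rates and the mixing probabilities; the multinomial-logit form of the weights below is an assumption of this sketch rather than a quotation from the paper:

```latex
P(Y_i = y \mid x_i, z_i)
  = \sum_{j=1}^{J} \pi_j(z_i)\,
    \frac{e^{-\lambda_j(x_i)}\,\lambda_j(x_i)^{y}}{y!},
\qquad
\log \lambda_j(x_i) = x_i^{\top}\beta_j,
\qquad
\pi_j(z_i) = \frac{\exp\!\bigl(z_i^{\top}\gamma_j\bigr)}{\sum_{l=1}^{J}\exp\!\bigl(z_i^{\top}\gamma_l\bigr)} .
```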

Proceedings Article
01 Jan 1998
TL;DR: In this paper, the authors propose a Bayesian approach to extract structural information from remote-sensing images by selecting from a library of a priori models those which best explain the structures within an image.
Abstract: Automatic interpretation of remote-sensing (RS) images and the growing interest in query by image content from large remote-sensing image archives rely on the ability and robustness of information extraction from observed data. In Parts I and II of this article, we turn our attention to the modern Bayesian way of thinking and introduce a pragmatic approach to extract structural information from RS images by selecting from a library of a priori models those which best explain the structures within an image. Part I introduces the Bayesian approach and defines the information extraction as a two-level procedure: 1) model fitting, which is the incertitude alleviation over the model parameters, and 2) model selection, which is the incertitude alleviation over the class of models. The superiority of the Bayesian results is commented on from an information-theoretic perspective. The theoretical assay concludes with the proposal of a new systematic method for scene understanding from RS images: search for the scene that best explains the observed data. The method is demonstrated for high-accuracy restoration of synthetic aperture radar (SAR) images with emphasis on new optimization algorithms for simultaneous model selection and parameter estimation. Examples are given for three families of Gibbs random fields (GRF) used as prior model libraries. Part II expands in detail on the information extraction using GRFs at one and at multiple scales. Based on the Bayesian approach, a new method for optimal joint scale and model selection is demonstrated. Examples are given using a nested family of GRFs utilized as prior models for information extraction with applications both to SAR and optical images.
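
The two levels of inference named in the abstract correspond to the standard Bayesian decomposition: parameter fitting within a model class M, and comparison of model classes through their evidence, the prior-weighted likelihood integrated over the parameters:

```latex
\underbrace{p(\theta \mid D, M)
  = \frac{p(D \mid \theta, M)\,p(\theta \mid M)}{p(D \mid M)}}_{\text{level 1: model fitting}},
\qquad
\underbrace{p(M \mid D) \;\propto\; p(M)\int p(D \mid \theta, M)\,p(\theta \mid M)\,d\theta}_{\text{level 2: model selection}} .
```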

Journal ArticleDOI
TL;DR: This paper presents a statistical framework for detecting degeneracies of a geometric model by evaluating its predictive capability in terms of the expected residual and derive the geometric AIC, which allows us to detect singularities in a structure-from-motion analysis without introducing any empirically adjustable thresholds.
Abstract: In building a 3-D model of the environment from image and sensor data, one must fit to the data an appropriate class of models, which can be regarded as a parametrized manifold, or geometric model, defined in the data space. In this paper, we present a statistical framework for detecting degeneracies of a geometric model by evaluating its predictive capability in terms of the expected residual and derive the geometric AIC. We show that it allows us to detect singularities in a structure-from-motion analysis without introducing any empirically adjustable thresholds. We illustrate our approach by simulation examples. We also discuss the application potential of this theory for a wide range of computer vision and robotics problems.
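
The geometric AIC has a commonly quoted form, reproduced here as a hedged summary rather than a quotation from the paper: for a model manifold of dimension d fitted to N data points with p parameters, residual \hat{J}, and noise level epsilon,

```latex
\text{G-AIC} \;=\; \hat{J} \;+\; 2\bigl(dN + p\bigr)\varepsilon^{2},
```

so two geometric models (for example, a general motion model and a degenerate one) can be compared without an empirically tuned threshold: the candidate with the smaller G-AIC is preferred.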

Journal ArticleDOI
TL;DR: The authors introduce a pragmatic approach to extract structural information from RS images by selecting from a library of a priori models those which best explain the structures within an image.
Abstract: Automatic interpretation of remote-sensing (RS) images and the growing interest in query by image content from large remote-sensing image archives rely on the ability and robustness of information extraction from observed data. In Parts I and II of this article, the authors turn their attention to the modern Bayesian way of thinking and introduce a pragmatic approach to extract structural information from RS images by selecting from a library of a priori models those which best explain the structures within an image. Part I introduces the Bayesian approach and defines the information extraction as a two-level procedure: 1) model fitting, which is the incertitude alleviation over the model parameters, and 2) model selection, which is the incertitude alleviation over the class of models. The superiority of the Bayesian results is commented on from an information-theoretic perspective. The theoretical assay concludes with the proposal of a new systematic method for scene understanding from RS images: search for the scene that best explains the observed data. The method is demonstrated for high-accuracy restoration of synthetic aperture radar (SAR) images with emphasis on new optimization algorithms for simultaneous model selection and parameter estimation. Examples are given for three families of Gibbs random fields (GRF) used as prior model libraries. Based on the Bayesian approach, a new method for optimal joint scale and model selection is demonstrated. Examples are given using a nested family of GRFs utilized as prior models for information extraction with applications both to SAR and optical images.

Journal ArticleDOI
TL;DR: It is shown that the optimal rate of convergence is simultaneously achieved for log-densities in Sobolev spaces W_2^s(U) without knowing the smoothness parameter s and norm parameter U in advance.
Abstract: Probability models are estimated by use of penalized log-likelihood criteria related to the Akaike (1973) information criterion (AIC) and minimum description length (MDL). The accuracies of the density estimators are shown to be related to the tradeoff between three terms: the accuracy of approximation, the model dimension, and the descriptive complexity of the model classes. The asymptotic risk is determined under conditions on the penalty term, and is shown to be minimax optimal for some cases. As an application, we show that the optimal rate of convergence is simultaneously achieved for log-densities in Sobolev spaces W_2^s(U) without knowing the smoothness parameter s and norm parameter U in advance. Applications to neural network models and sparse density function estimation are also provided.

Journal ArticleDOI
TL;DR: The connections of the alternative model for mixtures of experts (ME) to normalized radial basis function (NRBF) nets and extended normalized RBF (ENRBF) nets are established, and the well-known expectation-maximization (EM) algorithm for maximum likelihood learning is suggested for the two types of RBF nets.

Proceedings Article
01 Dec 1998
TL;DR: A data-driven method to select on a query-by-query basis the optimal number of neighbors to be considered for each prediction, and a local combination of the most promising models is explored.
Abstract: Lazy learning is a memory-based technique that, once a query is received, extracts a prediction interpolating locally the neighboring examples of the query which are considered relevant according to a distance measure. In this paper we propose a data-driven method to select on a query-by-query basis the optimal number of neighbors to be considered for each prediction. As an efficient way to identify and validate local models, the recursive least squares algorithm is introduced in the context of local approximation and lazy learning. Furthermore, beside the winner-takes-all strategy for model selection, a local combination of the most promising models is explored. The method proposed is tested on six different datasets and compared with a state-of-the-art approach.
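
A much-simplified sketch of query-by-query selection of the number of neighbours: constant local models are scored by closed-form leave-one-out error and the best one wins. The paper's recursive-least-squares machinery and local linear models are replaced here by a local mean, so this illustrates the selection idea only.

```python
import numpy as np

def lazy_predict(Xq, X, y, k_candidates=range(3, 16)):
    """For each query, score local constant models with different numbers of
    neighbours k by leave-one-out error and predict with the winner."""
    preds = []
    for xq in np.atleast_2d(Xq):
        order = np.argsort(np.linalg.norm(X - xq, axis=1))
        best = None
        for k in k_candidates:
            yk = y[order[:k]]
            mu = yk.mean()
            # closed-form leave-one-out residuals of a local mean
            loo = (yk - mu) * k / (k - 1)
            score = np.mean(loo ** 2)
            if best is None or score < best[0]:
                best = (score, mu)
        preds.append(best[1])
    return np.array(preds)
```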

Journal ArticleDOI
TL;DR: This paper pursues three objectives in the context of multiple regression models, among them to give a rationale for model selection criteria that combine a badness-of-fit term with a measure of the complexity of a model.

Journal ArticleDOI
TL;DR: One-sided cross-validation (OSCV) is a new method for selecting the smoothing parameters of nonparametric regression estimators; it can be viewed as an application of the prequential model selection method of Dawid.
Abstract: A new method of selecting the smoothing parameters of nonparametric regression estimators is introduced. The method, termed one-sided cross-validation (OSCV), has the objectivity of cross-validation and statistical properties comparable to those of a plug-in rule. The new method may be viewed as an application of the prequential model selection method of Dawid. As such, our results identify a situation in which the prequential method is a more efficient model selector than cross-validation. An example, simulations, and theoretical results demonstrate the utility of OSCV when used with local linear and kernel estimators.

Journal ArticleDOI
TL;DR: In this article, it is shown that the conditions required to justify use of the weak-heredity principle are so restrictive as to make it unusable in practice, and an example is given to illustrate this.
Abstract: Model selection under the weak-heredity principle allows models that contain a compound term such as x1x2 to include only one of the corresponding x1 and x2 terms in the model. It is shown that the conditions required to justify use of the principle are so restrictive as to make it unusable in practice. An example is given to illustrate this.

Journal ArticleDOI
TL;DR: The authors showed that the number of separately identifiable parameters in a capture-recapture model is equal to the rank of the Hessian matrix (second derivatives of the maximum likelihood relative to the parameters).
Abstract: Capture-recapture models are a powerful tool for estimating and comparing survival probabilities among groups of individuals in wild animal populations. One of the remaining problems is the calculation of the number of independently estimable parameters in the models, which is necessary when using model selection tools such as likelihood ratio tests or Akaike's information criterion. We show that the number of separately identifiable parameters in a model is equal to the rank of the Hessian matrix (the matrix of second derivatives of the log-likelihood with respect to the parameters, evaluated at the maximum). We present the numerical problems involved in computing the Hessian and its numerical rank, and we apply the technique to data on nesting swifts (Apus apus).
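
The rank computation described above can be done numerically in a few lines; the finite-difference step and the rank tolerance below are ad hoc choices for illustration, not values from the paper.

```python
import numpy as np

def numerical_hessian(f, theta, eps=1e-5):
    """Central finite-difference Hessian of a scalar function f at theta."""
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k); e_i[i] = eps
            e_j = np.zeros(k); e_j[j] = eps
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * eps**2)
    return H

def n_estimable_parameters(neg_log_lik, mle, tol=1e-6):
    """Numerical rank of the Hessian at the MLE, i.e. the number of separately
    identifiable parameters in the sense described in the abstract."""
    H = numerical_hessian(neg_log_lik, mle)
    s = np.linalg.svd(H, compute_uv=False)
    return int(np.sum(s > tol * s.max()))
```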

Journal ArticleDOI
TL;DR: A model selection approach for the specification of the cointegrating rank in the VECM representation of VAR models is proposed and asymptotic properties of estimates are derived and their features compared with the traditional likelihood ratio based approach.

Journal ArticleDOI
TL;DR: In this paper, three approaches to modelling spatial data in which simulation plays a vital role are described and illustrated with examples. But none of these approaches are appropriate for binary data, and none of them are suitable for the analysis of spatio-temporal data.
Abstract: Three approaches to modelling spatial data in which simulation plays a vital role are described and illustrated with examples. The first approach uses flexible regression models, such as generalized additive models, together with locational covariates to fit a surface to spatial data. We show how the bootstrap can be used to quantify the effects of model selection uncertainty and to avoid oversmoothing. The second approach, which is appropriate for binary data, allows for local spatial correlation by the inclusion in a logistic regression model of a covariate derived from neighbouring values of the response variable. The resulting autologistic model can be fitted to survey data obtained from a random sample of sites by incorporating the Gibbs sampler into the modelling procedure. We show how this modelling strategy can be used not only to fit the autologistic model to sites included in the survey, but also to estimate the probability that a certain species is present in the unsurveyed sites. Our third approach relates to the analysis of spatio-temporal data. Here we model the distribution of a plant or animal species as a function of the distribution at an earlier time point. The bootstrap is used to estimate parameters and quantify their precision. © 1998 John Wiley & Sons, Ltd.
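
The autologistic model mentioned in the second approach augments an ordinary logistic regression with a neighbourhood covariate; in its usual form, with N(i) the set of neighbours of site i,

```latex
\operatorname{logit}\,P\bigl(y_i = 1 \mid y_{N(i)}\bigr)
  \;=\; x_i^{\top}\beta \;+\; \gamma \sum_{j \in N(i)} y_j ,
```

which is why the Gibbs sampler fits naturally into the estimation: each y_i can be resampled from its full conditional distribution given the current values at neighbouring sites.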