
Showing papers on "Cross-validation published in 1995"


Proceedings Article
Ron Kohavi1
20 Aug 1995
TL;DR: The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
Abstract: We review accuracy estimation methods and compare the two most common methods, cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment--over half a million runs of C4.5 and a Naive-Bayes algorithm--to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
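For readers who want to see the recommended procedure in concrete form, here is a minimal sketch (not from the paper) of stratified ten-fold cross-validation in plain NumPy; the nearest-centroid classifier and the toy data are illustrative placeholders, not the C4.5 or Naive-Bayes learners used in the experiment.

```python
import numpy as np

def stratified_kfold_indices(y, n_folds=10, seed=0):
    """Assign each sample to a fold so that class proportions are
    roughly preserved in every fold (stratification)."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        # deal the shuffled class members out to the folds in turn
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds

def cv_accuracy(fit, predict, X, y, n_folds=10, seed=0):
    """Average held-out accuracy over the stratified folds."""
    folds = stratified_kfold_indices(y, n_folds, seed)
    scores = []
    for k in range(n_folds):
        train, test = folds != k, folds == k
        model = fit(X[train], y[train])
        scores.append(np.mean(predict(model, X[test]) == y[test]))
    return np.mean(scores)

# Toy usage with a nearest-centroid "classifier" (illustrative only).
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = np.array(list(model))
    centroids = np.stack([model[c] for c in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([0] * 60 + [1] * 40)
print(cv_accuracy(fit, predict, X, y))
```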

11,185 citations


Proceedings Article
27 Nov 1995
TL;DR: A constructive, incremental learning system for regression problems that models data by means of locally linear experts that do not compete for data during learning; asymptotic results are derived for this method.
Abstract: We introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Each expert is trained by minimizing a penalized local cross validation error using second order methods. In this way, an expert is able to find a local distance metric by adjusting the size and shape of the receptive field in which its predictions are valid, and also to detect relevant input features by adjusting its bias on the importance of individual input dimensions. We derive asymptotic results for our method. In a variety of simulations the properties of the algorithm are demonstrated with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning.
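The following is a minimal sketch (with made-up expert parameters) of the query-time blending step described above: each local linear expert predicts independently, and the predictions are mixed according to the activation of each expert's Gaussian receptive field. The training procedure (penalized local cross-validation, receptive-field adaptation) is not reproduced here.

```python
import numpy as np

# Illustrative sketch only: hand-picked experts, each a local linear model
# (offset b0, slope b1) with a Gaussian receptive field centred at c.
experts = [
    {"c": -1.0, "sigma": 0.6, "b0": 0.5, "b1": -1.2},
    {"c":  0.5, "sigma": 0.6, "b0": 0.1, "b1":  0.8},
    {"c":  2.0, "sigma": 0.6, "b0": 1.0, "b1":  0.3},
]

def blended_prediction(x):
    """Experts cooperate only at query time: each expert's local linear
    prediction is weighted by the activation of its receptive field at x."""
    w = np.array([np.exp(-0.5 * ((x - e["c"]) / e["sigma"]) ** 2) for e in experts])
    preds = np.array([e["b0"] + e["b1"] * (x - e["c"]) for e in experts])
    return float(np.sum(w * preds) / np.sum(w))

print(blended_prediction(0.0))
```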

249 citations


Journal ArticleDOI
TL;DR: In this article, a cross-validation method is proposed, compared with a range of other methods and found to be an improvement when the actual risk is close to constant, and theoretical and empirical comparisons demonstrate the advantage of choosing the smoothing parameters jointly.
Abstract: Estimation of a relative risk function using a ratio of two kernel density estimates is considered, concentrating on the problem of choosing the smoothing parameters. A cross-validation method is proposed, compared with a range of other methods and found to be an improvement when the actual risk is close to constant. In particular, theoretical and empirical comparisons demonstrate the advantage of choosing the smoothing parameters jointly. The methodology was motivated by a class of problems in environmental epidemiology, and an application in this area is described.
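As a rough illustration of the ratio-of-densities estimator the abstract refers to, the sketch below forms a relative risk estimate from two Gaussian kernel density estimates on a one-dimensional toy example; the data are simulated and the bandwidths are fixed by hand rather than chosen by the cross-validation method the paper proposes.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
cases = rng.normal(loc=0.3, scale=1.0, size=200)     # hypothetical case locations
controls = rng.normal(loc=0.0, scale=1.0, size=400)  # hypothetical control locations

# Relative risk estimate: ratio of the two kernel density estimates.
# bw_method plays the role of the smoothing parameters that the paper
# selects jointly by cross-validation; here they are simply fixed by hand.
f_cases = gaussian_kde(cases, bw_method=0.3)
f_controls = gaussian_kde(controls, bw_method=0.3)

grid = np.linspace(-3, 3, 7)
relative_risk = f_cases(grid) / f_controls(grid)
print(np.round(relative_risk, 2))
```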

179 citations


Journal ArticleDOI
TL;DR: The PRESS statistic and associated residuals do not require the data to be split, yield alternative unbiased estimates of R2 and SEE, and provide useful case diagnostics.
Abstract: In the health science literature, a common approach of validating a regression equation is data-splitting, where a portion of the data fits the model (fitting sample) and the remainder (validation sample) estimates future performance. The R2 and SEE obtained by predicting the validation sample with the fitting sample equation is a proper estimate of future performance, tending to correct for the natural upward bias of the R2 and SEE obtained from fitting sample alone. Data-splitting has several disadvantages, however. These include: 1) difficulty, arbitrariness, and inconvenience of matching samples; 2) the need to report two sets of statistics to determine homogeneity; and 3) the lack of equation stability due to diluted sample size. The PRESS statistic and associated residuals do not require the data to be split, yield alternative unbiased estimates of R2 and SEE, and provide useful case diagnostics. This procedure is easy to use, is widely available in modern statistical packages, but is rarely utilized. The two methods are contrasted here using a simulation from original data for predicting body density from anthropometric measurements of a group of 117 women. The PRESS approach is particularly appropriate for smaller datasets; methods of reporting these statistics are recommended.
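A minimal sketch (not taken from the paper) of how PRESS can be computed without splitting or refitting, using the standard leverage identity that the leave-one-out residual equals the ordinary residual divided by (1 − h_ii); the simulated data and variable names are illustrative only.

```python
import numpy as np

def press_statistics(X, y):
    """PRESS and a predicted R^2 for ordinary least squares.

    Uses the identity that the leave-one-out (PRESS) residual equals the
    ordinary residual divided by (1 - h_ii), where h_ii is the i-th diagonal
    element of the hat matrix X (X'X)^{-1} X'.
    """
    X = np.column_stack([np.ones(len(y)), X])       # add intercept
    hat = X @ np.linalg.solve(X.T @ X, X.T)
    resid = y - hat @ y
    press_resid = resid / (1.0 - np.diag(hat))
    press = np.sum(press_resid ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return press, 1.0 - press / sst                 # PRESS, predicted R^2

# Toy usage on simulated data (illustrative, not the paper's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(117, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=117)
press, r2_pred = press_statistics(X, y)
print(press, r2_pred)
```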

141 citations


Journal ArticleDOI
TL;DR: In this paper, the asymptotic distributions of estimators of global integral functionals of the regression surface were derived for nonparametric regression with multiple random predictor variables, and the results were applied to the problem of obtaining reliable estimators for the non-parametric coefficient of determination, which is also called Pearson's correlation ratio.
Abstract: In a nonparametric regression setting with multiple random predictor variables, we give the asymptotic distributions of estimators of global integral functionals of the regression surface. We apply the results to the problem of obtaining reliable estimators for the nonparametric coefficient of determination. This coefficient, which is also called Pearson's correlation ratio, gives the fraction of the total variability of a response that can be explained by a given set of covariates. It can be used to construct measures of nonlinearity of regression and relative importance of subsets of regressors, and to assess the validity of other model restrictions. In addition to providing asymptotic results, we propose several data-based bandwidth selection rules and carry out a Monte Carlo simulation study of finite sample properties of these rules and associated estimators of explanatory power. We also provide two real data examples.

130 citations


Journal ArticleDOI
Michael Kearns1
27 Nov 1995
TL;DR: It is argued that the following qualitative properties of cross-validation behavior should be quite robust to significant changes in the underlying model selection problem: when the target function complexity is small compared to the sample size, the performance of cross validation is relatively insensitive to the choice of γ.
Abstract: We give a theoretical and experimental analysis of the generalization error of cross validation using two natural measures of the problem under consideration. The approximation rate measures the accuracy to which the target function can be ideally approximated as a function of the number of parameters, and thus captures the complexity of the target function with respect to the hypothesis model. The estimation rate measures the deviation between the training and generalization errors as a function of the number of parameters, and thus captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of the simplest form of cross validation. The bound clearly shows the dangers of making γ —the fraction of data saved for testing—too large or too small. By optimizing the bound with respect to γ, we then argue that the following qualitative properties of cross-validation behavior should be quite robust to significant changes ...
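The sketch below illustrates the "simplest form of cross validation" that the bound addresses: a single hold-out split with a fraction γ of the data reserved for testing, used to pick one model from a candidate set. The candidate polynomial models and the data are made up; the sketch only shows where γ enters, not the bound itself.

```python
import numpy as np

def holdout_select(models, X, y, gamma=0.2, seed=0):
    """Hold out a fraction gamma of the data, fit every candidate model on
    the rest, and pick the one with the smallest held-out squared error.
    Too small a gamma gives a noisy test estimate; too large a gamma
    starves the fits of training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = max(1, int(round(gamma * len(y))))
    test, train = idx[:n_test], idx[n_test:]
    best, best_err = None, np.inf
    for name, fit, predict in models:
        m = fit(X[train], y[train])
        err = np.mean((predict(m, X[test]) - y[test]) ** 2)
        if err < best_err:
            best, best_err = name, err
    return best, best_err

# Candidate models: polynomial fits of increasing degree (illustrative).
def poly_model(deg):
    return (f"degree {deg}",
            lambda X, y: np.polyfit(X, y, deg),
            lambda coef, X: np.polyval(coef, X))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 60)
y = np.sin(3 * X) + rng.normal(scale=0.2, size=60)
print(holdout_select([poly_model(d) for d in range(1, 9)], X, y, gamma=0.25))
```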

105 citations


Journal ArticleDOI
TL;DR: The equations developed in this study predict BMR for Chinese adults more accurately than currently available predictive equations, which overestimated BMR for healthy Chinese adults.
Abstract: Objective To develop predictive equations for basal metabolic rate (BMR) in healthy Chinese adults and to evaluate factors that may influence BMR. Design Measuring the BMR of Chinese adults by indirect calorimetry. Selected subjects were randomly assigned to a validation sample and a cross-validation sample. The validation sample was used to develop predictive equations that were cross-validated using the other sample. Subjects Two hundred twenty-three healthy Chinese adults (102 men and 121 women) participated in the study. Their ages ranged from 20 to 78 years (mean=43.8±14.3 years). Measures BMR was measured by indirect calorimetry. Body composition was assessed by skinfold fat thicknesses, bioelectrical impedance analysis, and urinary creatinine excretion. Statistical analyses performed Student's t test, Pearson correlation coefficients, linear regression analyses, and best-subset regression were used for statistical analyses. Results The better-fitting equation for predicting BMR in Chinese adults is BMR=13.88 × weight (kg) + 4.16 × height (cm) − 3.43 × age (years) − 112.40 × sex (men=0; women=1) + 54.34. Men had higher BMR values than women because men had greater fat-free mass, body cell mass, and muscle mass. BMR correlated best with body composition, which correlated highly with anthropometric measurements. Therefore, accurate determination of body weight and body height is beneficial in predicting a person's BMR. All of the currently available predictive equations overestimated BMR ( P =.0001) for healthy Chinese adults. The equations developed in this study predict BMR more accurately for Chinese adults. Applications The equations developed in this study are recommended for clinical use in healthy Chinese adults who are within normal limits for body weight. J Am Diet Assoc. 1995; 95:1403-1408.
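A small worked example applying the equation reported above to a hypothetical subject (the abstract does not state the units of BMR; kcal/day is assumed here):

```python
def predicted_bmr(weight_kg, height_cm, age_yr, female):
    """Predicted BMR from the equation reported in the abstract:
    BMR = 13.88*weight + 4.16*height - 3.43*age - 112.40*sex + 54.34,
    with sex coded 0 for men and 1 for women (units assumed to be kcal/day)."""
    return (13.88 * weight_kg + 4.16 * height_cm
            - 3.43 * age_yr - 112.40 * (1 if female else 0) + 54.34)

# Hypothetical 44-year-old woman, 58 kg, 160 cm.
print(round(predicted_bmr(58, 160, 44, female=True), 1))
```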

104 citations


Journal ArticleDOI
TL;DR: This work has shown that generalized cross validation is well suited for the adaptive choice of certain parameters that occur in variational objective analysis and for data assimilation problems that are mathematically equivalent to variational problems.
Abstract: In variational data assimilation, optimal ingestion of the observational data, and optimal use of prior physical and statistical information involve the choice of numerous weighting, smoothing, and tuning parameters that control the filtering and merging of diverse sources of information. Generally these weights must be obtained from a partial and imperfect understanding of various sources of errors and are frequently chosen by a combination of historical information, physical reasoning, and trial and error. Generalized cross validation (GCV) has long been one of the methods of choice for choosing certain tuning, smoothing, regularization parameters in ill-posed inverse problems, smoothing, and filtering problems. In theory, it is well suited for the adaptive choice of certain parameters that occur in variational objective analysis and for data assimilation problems that are mathematically equivalent to variational problems. The main drawback of the use of GCV in data assimilation problems was th...
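To make the role of GCV concrete, here is a minimal sketch (not from the paper) of GCV-based selection of a ridge-regression regularization parameter; the influence matrix and single tuning parameter stand in for the far more elaborate weighting and smoothing parameters of a variational assimilation system.

```python
import numpy as np

def gcv_score(X, y, lam):
    """Generalized cross-validation score for ridge regression:
    GCV(lam) = n * ||(I - A)y||^2 / tr(I - A)^2,
    where A = X (X'X + lam I)^{-1} X' is the influence matrix."""
    n, p = X.shape
    A = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - A @ y
    return n * np.sum(resid ** 2) / np.trace(np.eye(n) - A) ** 2

# Pick the parameter that minimizes GCV over a grid (toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=80)
grid = np.logspace(-4, 3, 50)
best_lam = min(grid, key=lambda lam: gcv_score(X, y, lam))
print(best_lam)
```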

94 citations


Journal ArticleDOI
TL;DR: It is suggested that general model comparison, model selection, and model probability estimation be performed using the Schwarz criterion, which can be implemented given the model log likelihoods using only a hand calculator.
Abstract: We investigate the performance of empirical criteria for comparing and selecting quantitative models from among a candidate set. A simulation based on empirically observed parameter values is used to determine which criterion is the most accurate at identifying the correct model specification. The simulation is composed of both nested and nonnested linear regression models. We then derive posterior probability estimates of the superiority of the alternative models from each of the criteria and evaluate the relative accuracy, bias, and information content of these probabilities. To investigate whether additional accuracy can be derived from combining criteria, a method for obtaining a joint prediction from combinations of the criteria is proposed and the incremental improvement in selection accuracy considered. Based on the simulation, we conclude that most leading criteria perform well in selecting the best model, and several criteria also produce accurate probabilities of model superiority. Computationally intensive criteria failed to perform better than criteria which were computationally simpler. Also, the use of several criteria in combination failed to appreciably outperform the use of one model. The Schwarz criterion performed best overall in terms of selection accuracy, accuracy of posterior probabilities, and ease of use. Thus, we suggest that general model comparison, model selection, and model probability estimation be performed using the Schwarz criterion, which can be implemented given the model log likelihoods using only a hand calculator.

89 citations


Journal ArticleDOI
TL;DR: This is Part II of a series concerning the PLS kernel algorithm for data sets with many variables and few objects where the issues of cross‐validation and missing data are investigated.
Abstract: This is Part II of a series concerning the PLS kernel algorithm for data sets with many variables and few objects. Here the issues of cross-validation and missing data are investigated. Both partial and full cross-validation are evaluated in terms of predictive residuals and speed and are illustrated on real examples. Two related approaches to the solution of the missing data problem are presented. One is a full EM algorithm and the second a reduced EM algorithm which applies when the number of missing values is small. The two examples are multivariate calibration data sets. The first set consists of UV-visible data measured on mixtures of four metal ions. The second example consists of FT-IR measurements on mixtures consisting of four different organic substances.

56 citations


Journal ArticleDOI
TL;DR: In this article, different estimation methods are compared: regression, regression with residual simple kriging, kriging with an external drift, and cokriging; for correlations under 0.4, kriging and cokriging are slightly superior to the other approaches in terms of minimizing estimation error.
Abstract: The problem of estimating a regionalized variable in the presence of other secondary variables is encountered in spatial investigations. Given a context in which the secondary variable is known everywhere (or can be estimated with great precision), different estimation methods are compared: regression, regression with residual simple kriging, kriging, simple kriging with a mean obtained by regression, kriging with an external drift, and cokriging. The study focuses on 19 pairs of regionalized variables from five different datasets representing different domains (geochemical, environmental, geotechnical). The methods are compared by cross-validation using the mean absolute error as criterion. For correlations between the principal and secondary variable under 0.4, similar results are obtained using kriging and cokriging, and these methods are slightly superior to the other approaches in terms of minimizing estimation error. For correlations greater than 0.4, cokriging generally performs better than other methods, with a reduction in mean absolute errors that can reach 46% when there is a high degree of correlation between the variables. Kriging with an external drift or kriging the residuals of a regression (SKR) are almost as precise as cokriging.


Proceedings ArticleDOI
10 Jul 1995
TL;DR: A solution to the overfitting problem is proposed that involves pre-processing the training data and relies on increasing the spectral coherence of individual training classes by applying k-nearest-neighbour filtering.
Abstract: The authors study neural network overfitting on synthetically generated and real remote sensing data. The effect of overfitting is shown by: 1) visualising the shape of the decision boundaries in feature space during the learning process, and 2) plotting the classification accuracy of independent test sets versus the number of training cycles. A solution to the overfitting problem is proposed that involves pre-processing the training data. The method relies on increasing the spectral coherence of individual training classes by applying k-nearest neighbour filtering. Points in feature space with class labels inconsistent with those of the majority of their neighbours are removed. This effectively simplifies the training data, and removes outliers and local inconsistencies. It is shown that using this approach can reduce the overfitting effect and increase the resulting classification accuracy.
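A rough sketch of the kind of k-nearest-neighbour label filtering described above: training samples whose class label disagrees with the majority label of their k nearest neighbours in feature space are removed before training. The value of k, the toy data, and the Euclidean metric are assumptions for illustration.

```python
import numpy as np

def knn_filter(X, y, k=5):
    """Return a mask keeping only samples whose class label agrees with
    the majority label of their k nearest neighbours (excluding self)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # ignore the sample itself
    neighbours = np.argsort(d, axis=1)[:, :k]
    keep = np.empty(len(y), dtype=bool)
    for i, nb in enumerate(neighbours):
        labels, counts = np.unique(y[nb], return_counts=True)
        keep[i] = labels[counts.argmax()] == y[i]
    return keep

# Toy usage: two noisy Gaussian classes in feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.repeat([0, 1], 100)
mask = knn_filter(X, y, k=5)
X_clean, y_clean = X[mask], y[mask]
print(len(y), "->", mask.sum(), "training samples after filtering")
```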

Journal ArticleDOI
TL;DR: A seed-propagated sampling approach is proposed that can be used to generate any number of simulated proteins with a desired type based on a given training set database and may provide a more objective estimation for various protein-folding-type prediction methods.
Abstract: In the development of methodology for statistical prediction of protein folding types, how to test the predicted results is a crucial problem. In addition to the resubstitution test in which the folding type of each protein from a training set is predicted based on the rules derived from the same set, cross-validation tests are needed. Among them, the single-testset method seems to be least reliable due to the arbitrariness in selecting the test set. Although the leaving-one-out (or jackknife) test is more objective and hence more reliable, it may cause a severe information loss by leaving a protein in turn out of the training set when its size is not large enough. In order to overcome the above drawback, a seed-propagated sampling approach is proposed that can be used to generate any number of simulated proteins with a desired type based on a given training set database. There is no need to make any predetermined assumption about the statistical distribution function of the amino acid frequencies. Combined with the existing cross-validation methods, the new technique may provide a more objective estimation for various protein-folding-type prediction methods.

Journal ArticleDOI
Yong Liu1
TL;DR: To estimate the generalization error of a model from the training data, the method of cross-validation and the asymptotic form of the jackknife estimator are used; the average of the predictive errors serves as the estimate of the generalization error.

Journal ArticleDOI
TL;DR: A weighted cross-validation technique, known in the spline literature as generalized cross-validation (GCV), is proposed for covariance model selection and parameter estimation; it significantly simplifies the computation of the cross-validation mean square error of prediction.
Abstract: A weighted cross-validation technique, known in the spline literature as generalized cross-validation (GCV), is proposed for covariance model selection and parameter estimation. Weights for prediction errors are selected to give more importance to a cluster of points than isolated points. Clustered points are estimated better by their neighbors and are more sensitive to model parameters. This rational weighting scheme also significantly simplifies the computation of the cross-validation mean square error of prediction. With small- to medium-size datasets, GCV is performed in a global neighborhood. Optimization of usual isotropic models requires only a small number of matrix inversions. A small dataset and a simulation are used to compare performances of GCV to ordinary cross-validation (OCV) and least-squares fitting (LS).

Journal ArticleDOI
TL;DR: In this article, a new approach to discriminant analysis based on projection pursuit density estimation is proposed, where projections are chosen to minimize estimates of the expected overall loss in each projection pursuit stage.

Journal ArticleDOI
TL;DR: A practical method is proposed here by which the trade-off between the mean-square residuals and the signal roughness can incorporate the user's prior knowledge of the spatial characteristics and error characteristics of the signal surface.
Abstract: The thin-plate smoothing spline model is a mathematically elegant method for surface estimation that has been progressively developed over the last decade. A summary description of the method is given. The model smooths the data according to the criterion of minimizing a functional combining the mean-square residuals and the roughness of a signal surface. In the traditional use of the model, the trade-off between the mean-square residuals and the signal roughness is internally estimated by minimizing the generalized cross validation. However, in the case of meteorological and climatological datasets, which are often sparse and noisy, the traditional fitting approach can result in unrealistically smooth maps. To address this, a practical method is proposed here by which the above-mentioned trade-off can incorporate the user's prior knowledge of the spatial characteristics and error characteristics of the signal surface. The approach is illustrated by application to island rainfall datasets for the tr...

Journal ArticleDOI
TL;DR: In this paper, the accuracy of four empirical techniques (simple cross-validation, multi-cross-validation, jackknife, and bootstrap) was investigated in a Monte Carlo study.
Abstract: Empirical techniques to estimate the shrinkage of the sample R2 have been advocated as alternatives to analytical formulae. Although such techniques may be appropriate for estimating the coefficient of cross-validation, they do not provide accurate estimates of the population multiple correlation. The accuracy of four empirical techniques (simple cross-validation, multi-cross-validation, jackknife, and bootstrap) were investigated in a Monte Carlo study. Random samples of size 20 to 200 were drawn from a pseudopopulation of actual field data. Regression models were investigated with population coefficients of determination ranging from .04 to .50 and with numbers of regressors ranging from 2 to 10. Substantial statistical bias was evident when the shrunken R2 values were used to estimate the population squared multiple correlation. Researchers are advised to avoid the empirical techniques when the parameter of interest is the population coefficient of determination rather than the coefficient of cross-val...

Journal ArticleDOI
TL;DR: In this article, an artificial neural network model is used to predict the dynamic coefficient of friction (DCOF) as measured by a slip resistance testing device, and the model predicts the DCOF as a function of six independent variables over a wide range of conditions.
Abstract: This paper describes the formulation, building and validation of an artificial neural network model of the dynamic coefficient of friction (DCOF) as measured by a slip resistance testing device. The model predicts the DCOF as a function of six independent variables over a wide range of conditions. A grouped cross validation method is used to show the consistent performance of the model in predicting the DCOF for new values of the independent variables.
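For readers unfamiliar with grouped cross-validation, the sketch below shows the idea using scikit-learn's GroupKFold (a modern convenience that postdates the paper): all observations sharing a group label are kept in the same fold, so the model is always scored on conditions it has never seen. The linear model and toy data are placeholders for the paper's neural network and friction measurements.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Toy data: 6 measurement conditions ("groups") with 20 trials each.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))            # six illustrative predictors
y = X @ rng.normal(size=6) + rng.normal(scale=0.2, size=120)
groups = np.repeat(np.arange(6), 20)

# Grouped CV: every fold holds out whole groups, never individual trials.
scores = cross_val_score(LinearRegression(), X, y,
                         groups=groups, cv=GroupKFold(n_splits=6),
                         scoring="r2")
print(scores.mean())
```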

Journal ArticleDOI
TL;DR: The final classification algorithm referred to as APML for approximate penalized maximum likelihood compares favourably in terms of error rate and time efficiency with other algorithms tested, including multinormal, nearest neighbour and convex hull classifiers.
Abstract: SUMMARY A new theoretical point of view is discussed in the framework of density estimation. The multivariate true density, viewed as a prior or penalizing factor in a Bayesian framework, is modelled by a Gibbs potential. Estimating the density consists in maximizing the posterior. For efficiency of time, we are interested in an approximate estimator f̂ = Bπ of the true density f, where B is a stochastic operator and π is the raw histogram. Then, we investigate the discrimination problem, introducing an adaptive bandwidth depending on the k nearest neighbours and chosen to optimize the cross-validation criterion. Our final classification algorithm referred to as APML for approximate penalized maximum likelihood compares favourably in terms of error rate and time efficiency with other algorithms tested, including multinormal, nearest neighbour and convex hull classifiers.

Proceedings ArticleDOI
27 Nov 1995
TL;DR: This paper applies a statistical resampling technique, called the bootstrap method, to this estimation problem and shows that the variance of the bootstrap estimates can be smaller than those of the cross-validated estimates.
Abstract: We compare the cross-validation and bootstrap methods for estimating the expected error rates of feedforward neural network classifiers in small sample size situations. The cross-validation method, a commonly applied method, provides nearly unbiased classification error rates, using only the original samples. The cross-validated estimates, however, may suffer from a large variance. In this paper, we apply a statistical resampling technique, called the bootstrap method, to this estimation problem and compare the performances of these methods. Our results show that the variance of the bootstrap estimates can be smaller than those of the cross-validated estimates.
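A rough sketch (not the paper's experiment) contrasting the two estimators discussed above: a k-fold cross-validation estimate of classification error and a bootstrap estimate scored on the out-of-bag samples of each resample. The nearest-mean classifier and simulated data are placeholders.

```python
import numpy as np

def cv_error(fit, predict, X, y, k=10, seed=0):
    """k-fold cross-validation estimate of the classification error rate."""
    folds = np.random.default_rng(seed).permutation(len(y)) % k
    errs = []
    for f in range(k):
        m = fit(X[folds != f], y[folds != f])
        errs.append(np.mean(predict(m, X[folds == f]) != y[folds == f]))
    return np.mean(errs)

def bootstrap_error(fit, predict, X, y, n_boot=200, seed=0):
    """Average out-of-bag error over bootstrap resamples of the data."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_boot):
        boot = rng.integers(0, len(y), len(y))       # sample with replacement
        oob = np.setdiff1d(np.arange(len(y)), boot)  # samples left out
        if len(oob):
            m = fit(X[boot], y[boot])
            errs.append(np.mean(predict(m, X[oob]) != y[oob]))
    return np.mean(errs)

# Minimal nearest-mean classifier, used only to make the sketch runnable.
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(m, X):
    cs = np.array(list(m))
    d = np.stack([np.linalg.norm(X - m[c], axis=1) for c in cs], axis=1)
    return cs[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 4)), rng.normal(0.8, 1, (25, 4))])
y = np.repeat([0, 1], 25)
print("CV error:", cv_error(fit, predict, X, y))
print("Bootstrap error:", bootstrap_error(fit, predict, X, y))
```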


Journal ArticleDOI
TL;DR: It is shown that, in terms of the classification error, overfitting does occur for certain representations used to encode the discrete attributes in neural networks.

Journal ArticleDOI
TL;DR: In this article, a generalised cross-validation criterion for linear model selection is proposed and shown to contain the ordinary and the existing generalized crossvalidation criteria as special cases.
Abstract: SUMMARY Cross-validation criteria in linear model selection are approached from a coordinate free point of view. A new generalised cross-validation criterion is derived and shown to contain the ordinary and the existing generalised cross-validation criteria as special cases.

Journal ArticleDOI
Wei-Liem Loh1
TL;DR: In this article, a large sample study of the relationship between the optimal ridge parameter and the population parameters was conducted and the results suggest a new linear adaptive ridge classification procedure which has a simple closed-form expression for the ridge parameter.

Journal ArticleDOI
TL;DR: In this paper, a generalized regression setup with a likelihood-based generalization of the usual kernel and nearest-neighbor type smoothing techniques and a related extension of the least-squares leave-one-out cross-validation are explored.

Journal ArticleDOI
15 Nov 1995
TL;DR: In this article, the deflation techniques described by Burrage et al. (1994) were adapted to the problem of minimizing the generalized cross validation (GCV) function, allowing vector and parallel architectures to be exploited in an efficient manner.
Abstract: The fitting of a thin plate smoothing spline to noisy data using the method of minimizing the Generalized Cross Validation (GCV) function is computationally intensive involving the repeated solution of sets of linear systems of equations as part of a minimization routine. In the case of a data set of more than a few hundred points, implementation on a workstation can become unrealistic and it is then desirable to exploit high performance computing. The usual implementation of the GCV algorithm performs Householder reductions to tridiagonalize the influence matrix and then solves a sequence of tridiagonal linear systems which are updated only by a scalar value (the minimization parameter) on the diagonal. However, this approach is not readily parallelizable. In this paper the deflation techniques described by Burrage et al. (1994), which are used to accelerate the convergence of iterative schemes applied to linear systems, will be adapted to the problem of minimizing the GCV function. This approach will allow vector and parallel architectures to be exploited in an efficient manner.

Journal ArticleDOI
TL;DR: It is demonstrated that cross validation is superior to likelihood function maximization and modeling error minimization, since both have bias for over-parameterized models and may not be generally applicable to reliability models with wear and shock variables.

Journal ArticleDOI
TL;DR: A learning model in which a Dempster-Shafer measure was learned by a genetic algorithm was constructed, and its main advantage was that its predictive faculty was compensated by Bayesian probabilities.