
Showing papers on "Cross-validation published in 2004"


Journal ArticleDOI
TL;DR: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data.
Abstract: Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.

598 citations
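
As a rough illustration of the kind of comparison reported above (a minimal sketch on synthetic two-class Gaussian data, not the authors' simulation design or their breast-cancer data), the snippet below contrasts resubstitution and leave-one-out estimates of the 3-nearest-neighbor error over repeated samples of size 30 and summarizes the deviation of each estimator from a large-test-set approximation of the true error.

    # Sketch only: compare bias and spread of resubstitution vs. leave-one-out
    # error estimates for 3-NN on small samples from two Gaussian classes.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)

    def draw_sample(n):
        # two equal-sized classes: N((0,0), I) versus N((1,1), I)
        n0 = n // 2
        X = np.vstack([rng.normal(0.0, 1.0, (n0, 2)),
                       rng.normal(1.0, 1.0, (n - n0, 2))])
        y = np.r_[np.zeros(n0, dtype=int), np.ones(n - n0, dtype=int)]
        return X, y

    deltas_resub, deltas_loo = [], []
    X_test, y_test = draw_sample(5000)          # large sample approximates the true error
    for _ in range(200):                        # repeated small training samples
        X, y = draw_sample(30)
        clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
        true_err = 1 - clf.score(X_test, y_test)
        resub_err = 1 - clf.score(X, y)
        loo_err = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=3),
                                      X, y, cv=LeaveOneOut()).mean()
        deltas_resub.append(resub_err - true_err)
        deltas_loo.append(loo_err - true_err)

    for name, d in [("resubstitution", deltas_resub), ("leave-one-out", deltas_loo)]:
        d = np.array(d)
        print(f"{name}: bias = {d.mean():+.3f}, sd = {d.std():.3f}")

On a typical run the resubstitution deviations are strongly negative (optimistic) while the leave-one-out deviations are roughly centred on zero but noticeably more spread out, which is the qualitative pattern the paper quantifies in detail.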


Journal ArticleDOI
TL;DR: A Rao–Blackwell type of relation is derived in which nonparametric methods such as cross-validation are seen to be randomized versions of their covariance penalty counterparts.
Abstract: Having constructed a data-based estimation rule, perhaps a logistic regression or a classification tree, the statistician would like to know its performance as a predictor of future cases. There are two main theories concerning prediction error: (1) penalty methods such as Cp, Akaike's information criterion, and Stein's unbiased risk estimate that depend on the covariance between data points and their corresponding predictions; and (2) cross-validation and related nonparametric bootstrap techniques. This article concerns the connection between the two theories. A Rao–Blackwell type of relation is derived in which nonparametric methods such as cross-validation are seen to be randomized versions of their covariance penalty counterparts. The model-based penalty methods offer substantially better accuracy, assuming that the model is believable.

465 citations
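
For orientation, the covariance-penalty side of this connection can be stated compactly; the display below is a standard form of the optimism identity for squared-error loss, written in generic notation rather than copied from the article.

    \widehat{\mathrm{Err}}
      \;=\; \sum_{i=1}^{n} \bigl(y_i - \hat{\mu}_i\bigr)^{2}
      \;+\; 2 \sum_{i=1}^{n} \operatorname{cov}\bigl(\hat{\mu}_i,\, y_i\bigr)

For a linear rule \hat{\mu} = M y with \operatorname{Var}(y_i) = \sigma^2, the penalty term reduces to 2\sigma^2 \operatorname{tr}(M), recovering Mallows' Cp and Stein's unbiased risk estimate, while cross-validation can be read as a nonparametric, randomized estimate of the same covariance term.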


Journal ArticleDOI
TL;DR: In this paper, the authors used the neural network method to estimate foliar biochemical concentrations from remote sensing data of eucalypt tree canopies, whose spectra are more complex than those of many coniferous canopies and much more complex than those of dried ground leaves.

335 citations


Journal ArticleDOI
TL;DR: It is shown that exact leave-one-out cross-validation of sparse Least-Squares Support Vector Machines (LS-SVMs) can be implemented with a computational complexity of only O(ln2) floating point operations, rather than the O(l2n2) operations of a naïve implementation.

325 citations
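
The computational saving rests on the fact that, for models that are linear in the training targets, leave-one-out residuals are available in closed form from a single fit. The identity below is the standard one for a linear smoother with fixed hyperparameters; it is given for orientation only and is not the paper's LS-SVM-specific derivation, which works instead with the inverse of the kernel system matrix.

    y_i - \hat{y}_i^{(-i)} \;=\; \frac{y_i - \hat{y}_i}{1 - h_{ii}},
    \qquad \hat{\mathbf{y}} = H \mathbf{y}, \quad h_{ii} = [H]_{ii}

All n leave-one-out residuals therefore follow from the diagonal of the smoother matrix H, rather than from n separate refits.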


Journal ArticleDOI
01 Apr 2004
TL;DR: An efficient construction algorithm for obtaining sparse linear-in-the-weights regression models based on an approach of directly optimizing model generalization capability is introduced by utilizing the delete-1 cross validation concept and the associated leave-one-out test error.
Abstract: The paper introduces an efficient construction algorithm for obtaining sparse linear-in-the-weights regression models based on an approach of directly optimizing model generalization capability. This is achieved by utilizing the delete-1 cross validation concept and the associated leave-one-out test error, also known as the predicted residual sums of squares (PRESS) statistic, without resorting to any other validation data set for model evaluation in the model construction process. Computational efficiency is ensured using an orthogonal forward regression, but the algorithm incrementally minimizes the PRESS statistic instead of the usual sum of the squared training errors. A local regularization method can naturally be incorporated into the model selection procedure to further enforce model sparsity. The proposed algorithm is fully automatic, and the user is not required to specify any criterion to terminate the model construction procedure. Comparisons with some of the existing state-of-the-art modeling methods are given, and several examples are included to demonstrate the ability of the proposed algorithm to effectively construct sparse models that generalize well.

246 citations
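
A minimal sketch of the PRESS computation for a model that is linear in its weights, using the leave-one-out identity quoted above; the paper embeds this inside an orthogonal forward-regression loop with local regularization, which is not reproduced here.

    # Sketch: the PRESS (delete-1 cross-validation) statistic of a linear-in-the-
    # weights model can be computed from one fit via the hat matrix, without n refits.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 80, 5
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)

    # single least-squares fit
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix, y_hat = H y
    resid = y - H @ y
    press_fast = np.sum((resid / (1 - np.diag(H))) ** 2)

    # naive delete-1 computation for comparison
    press_naive = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        w = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        press_naive += (y[i] - X[i] @ w) ** 2

    print(press_fast, press_naive)   # agree to numerical precision

The fast and naive values agree to numerical precision, which is what makes PRESS cheap enough to drive term selection inside the construction loop.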


Journal ArticleDOI
TL;DR: In this paper, the authors compared several competing mean squared error of prediction (MSEP) estimators on principal components regression (PCR) and partial least squares regression (PLSR): leave-one-out cross-validation, and K-fold and adjusted K-fold cross-validation.
Abstract: The paper presents results from simulations based on real data, comparing several competing mean squared error of prediction (MSEP) estimators on principal components regression (PCR) and partial least squares regression (PLSR): leave-one-out cross-validation, K-fold and adjusted K-fold cross-validation, the ordinary bootstrap estimate, the bootstrap smoothed cross-validation (BCV) estimate and the 0.632 bootstrap estimate. The overall performance of the estimators is compared in terms of their bias, variance and squared error. The results indicate that the 0.632 estimate and leave-one-out cross-validation are preferable when one can afford the computation. Otherwise, adjusted 5- or 10-fold cross-validation are good candidates because of their computational efficiency.

222 citations
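
A minimal sketch of two of the estimators compared above, 10-fold cross-validation and the 0.632 bootstrap, applied to a PLS regression with a fixed number of components. The data set, the number of components and the number of bootstrap replicates are illustrative choices only, and the adjusted K-fold and BCV variants are not shown.

    # Sketch: 10-fold CV and 0.632 bootstrap estimates of MSEP for a 3-component PLS fit.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(2)
    n, p = 60, 20
    X = rng.normal(size=(n, p))
    y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.5 * rng.normal(size=n)

    def mse(model, X_tr, y_tr, X_te, y_te):
        model.fit(X_tr, y_tr)
        return np.mean((y_te - model.predict(X_te).ravel()) ** 2)

    # 10-fold cross-validation estimate of MSEP
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    cv_msep = np.mean([mse(PLSRegression(n_components=3), X[tr], y[tr], X[te], y[te])
                       for tr, te in kf.split(X)])

    # 0.632 bootstrap: 0.368 * resubstitution error + 0.632 * out-of-bag error
    resub = mse(PLSRegression(n_components=3), X, y, X, y)
    oob_errs = []
    for _ in range(100):
        idx = rng.integers(0, n, n)               # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)     # observations left out of the sample
        if oob.size:
            oob_errs.append(mse(PLSRegression(n_components=3), X[idx], y[idx], X[oob], y[oob]))
    boot632 = 0.368 * resub + 0.632 * np.mean(oob_errs)

    print(f"10-fold CV MSEP: {cv_msep:.3f}   0.632 bootstrap MSEP: {boot632:.3f}")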


Journal ArticleDOI
TL;DR: A finite sample result is established for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation) that implies that the cross-validation selector performs asymptotically as well as a benchmark model selector which is optimal for each given dataset and depends on the true density.
Abstract: Likelihood-based cross-validation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection of a bandwidth indexing a nonparametric (e.g. kernel) density estimator. In this article, we establish a finite sample result for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation). This result implies that the cross-validation selector performs asymptotically as well (w.r.t. the Kullback-Leibler distance to the true density) as a benchmark model selector which is optimal for each given dataset and depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study. Moreover, we use likelihood-based cross-validation in the context of regulatory motif detection in DNA sequences.

218 citations
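
The selection principle is easy to state concretely: choose the candidate (here, a kernel bandwidth) that maximizes the average held-out log-likelihood over V folds. The sketch below uses a hand-rolled Gaussian kernel density estimate, an arbitrary bandwidth grid and synthetic data; it illustrates the general idea only and is not taken from the paper's simulation study.

    # Sketch: 5-fold likelihood-based cross-validation for KDE bandwidth selection.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1, 1.0, 150)])

    def kde_logpdf(train, query, h):
        # Gaussian kernel density estimate evaluated at the query points
        z = (query[:, None] - train[None, :]) / h
        dens = np.exp(-0.5 * z ** 2).sum(axis=1) / (train.size * h * np.sqrt(2 * np.pi))
        return np.log(np.maximum(dens, 1e-300))   # keep the estimate away from zero

    V = 5
    folds = np.array_split(rng.permutation(x.size), V)
    bandwidths = np.linspace(0.05, 1.0, 20)
    scores = []
    for h in bandwidths:
        ll = 0.0
        for k in range(V):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(V) if j != k])
            ll += kde_logpdf(x[tr], x[val], h).sum()
        scores.append(ll / x.size)

    best = bandwidths[int(np.argmax(scores))]
    print(f"bandwidth selected by 5-fold likelihood CV: {best:.2f}")

The floor placed on the density in the code loosely mirrors the paper's condition that candidate density estimates be bounded away from zero.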


Journal ArticleDOI
TL;DR: SORB outperforms the two other ANN algorithms, the well-known Multi-layer Feedforward Network (MFN) and the Self-Organizing Linear Output map (SOLO) neural network, for simulation of daily streamflow in the semi-arid Salt River basin.

200 citations


Journal ArticleDOI
TL;DR: In this article, a maximum likelihood version (MLBMA) of BMA is applied to seven alternative variogram models of log air permeability data from single-hole pneumatic injection tests in six boreholes at the Apache Leap Research Site (ALRS) in central Arizona.
Abstract: [1] Hydrologic analyses typically rely on a single conceptual-mathematical model. Yet hydrologic environments are open and complex, rendering them prone to multiple interpretations and mathematical descriptions. Adopting only one of these may lead to statistical bias and underestimation of uncertainty. Bayesian model averaging (BMA) [Hoeting et al., 1999] provides an optimal way to combine the predictions of several competing models and to assess their joint predictive uncertainty. However, it tends to be computationally demanding and relies heavily on prior information about model parameters. Neuman [2002, 2003] proposed a maximum likelihood version (MLBMA) of BMA to render it computationally feasible and to allow dealing with cases where reliable prior information is lacking. We apply MLBMA to seven alternative variogram models of log air permeability data from single-hole pneumatic injection tests in six boreholes at the Apache Leap Research Site (ALRS) in central Arizona. Unbiased ML estimates of variogram and drift parameters are obtained using adjoint state maximum likelihood cross validation [Samper and Neuman, 1989a] in conjunction with universal kriging and generalized least squares. Standard information criteria provide an ambiguous ranking of the models, which does not justify selecting one of them and discarding all others as is commonly done in practice. Instead, we eliminate some of the models based on their negligibly small posterior probabilities and use the rest to project the measured log permeabilities by kriging onto a rock volume containing the six boreholes. We then average these four projections and associated kriging variances, using the posterior probability of each model as weight. Finally, we cross validate the results by eliminating from consideration all data from one borehole at a time, repeating the above process and comparing the predictive capability of MLBMA with that of each individual model. We find that MLBMA is superior to any individual geostatistical model of log permeability among those we consider at the ALRS.

196 citations
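
The averaging step in (ML)BMA can be summarized as follows; this is the standard form of the posterior-weighted combination, with weights approximated from a likelihood-based information criterion, written in generic notation rather than copied from the paper.

    w_k \;=\; \frac{\exp(-\tfrac{1}{2}\Delta_k)\, p(M_k)}
                   {\sum_j \exp(-\tfrac{1}{2}\Delta_j)\, p(M_j)},
    \qquad \Delta_k = \mathrm{IC}_k - \min_j \mathrm{IC}_j

    \mathrm{E}[Y \mid D] \;=\; \sum_k w_k\, \mathrm{E}[Y \mid D, M_k],
    \qquad
    \operatorname{Var}[Y \mid D] \;=\; \sum_k w_k \Bigl( \operatorname{Var}[Y \mid D, M_k]
        + \bigl(\mathrm{E}[Y \mid D, M_k] - \mathrm{E}[Y \mid D]\bigr)^2 \Bigr)

The second term in the variance is the between-model component, which a single-model analysis necessarily omits.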


Journal ArticleDOI
TL;DR: A combinational feature selection method in conjunction with ensemble neural networks is introduced to generally improve the accuracy and robustness of sample classification and help to extract the latent marker genes of the diseases for better diagnosis and treatment.
Abstract: Microarray experiments are becoming a powerful tool for clinical diagnosis, as they have the potential to discover gene expression patterns that are characteristic of a particular disease. To date, this problem has received most attention in the context of cancer research, especially in tumor classification. A variety of feature selection methods and classifier design strategies have been used and compared. However, most published articles on tumor classification apply a single technique to a single dataset, and although several researchers have recently compared these techniques on public datasets, it has been shown that different feature selection methods capture different aspects of a dataset and that particular selections work better on particular problems. At the same time, given the volume of microarray data and the limited prior knowledge available, it is difficult to identify the intrinsic characteristics of the data with traditional methods. In this paper, we introduce a combinational feature selection method in conjunction with ensemble neural networks to improve the accuracy and robustness of sample classification. We validate the new method on several recent publicly available datasets, both through the predictive accuracy on test samples and through cross validation. Compared with the best reported performance of other current methods, markedly improved results are obtained with the new strategy on a wide range of datasets. We conclude that the method extracts more information from microarray data, yielding more accurate classification, and can also help to extract the latent marker genes of the diseases for better diagnosis and treatment.

161 citations


Journal ArticleDOI
TL;DR: The results indicate the proposed method vastly improves on resubstitution and cross-validation, especially for small samples, in terms of bias and variance, while being tens to hundreds of times faster.

Journal ArticleDOI
TL;DR: Adjustments are introduced for two of the characteristic values produced by a progressive scrambling analysis, the deprecated predictivity (Q*²_s) and standard error of prediction (SDEP_s*), that correct for the effect of introduced perturbation.
Abstract: The two methods most often used to evaluate the robustness and predictivity of partial least squares (PLS) models are cross-validation and response randomization. Both methods may be overly optimistic for data sets that contain redundant observations, however. The kinds of perturbation analysis widely used for evaluating model stability in the context of ordinary least squares regression are only applicable when the descriptors are independent of each other and errors are independent and normally distributed; neither assumption holds for QSAR in general and for PLS in particular. Progressive scrambling is a novel, nonparametric approach to perturbing models in the response space in a way that does not disturb the underlying covariance structure of the data. Here, we introduce adjustments for two of the characteristic values produced by a progressive scrambling analysis, the deprecated predictivity (Q*²_s) and standard error of prediction (SDEP_s*), that correct for the effect of introduced perturbation. We also explore the statistical behavior of the adjusted values (Q*²_0 and SDEP_0*) and the sensitivity to perturbation (dq²/dr_yy'²). It is shown that the three statistics are all robust for stable PLS models, in terms of the stochastic component of their determination and of their variation due to sampling effects involved in training set selection.

Journal ArticleDOI
TL;DR: In this article, the authors considered the problem of how to choose the bandwidth parameter in practice, and proposed a bootstrap procedure to estimate the optimal bandwidth, and showed its consistency.
Abstract: In this paper we consider kernel estimation of a density when the data are contaminated by random noise. More specifically we deal with the problem of how to choose the bandwidth parameter in practice. A theoretical optimal bandwidth is defined as the minimizer of the mean integrated squared error. We propose a bootstrap procedure to estimate this optimal bandwidth, and show its consistency. These results remain valid for the case of no measurement error, and hence also summarize part of the theory of bootstrap bandwidth selection in ordinary kernel density estimation. The finite sample performance of the proposed bootstrap selection procedure is demonstrated with a simulation study. An application to a real data example illustrates the use of the method.

Journal ArticleDOI
TL;DR: Several encoding schemes, including orthogonal matrix, hydrophobicity matrix, BLOSUM62 substitution matrix, and combined matrix of these, are applied and optimized to improve the prediction accuracy in this study.
Abstract: Prediction of protein secondary structures is an important problem in bioinformatics and has many applications. The recent trend of secondary structure prediction studies is mostly based on the neural network or the support vector machine (SVM). The SVM method is a comparatively new learning system which has mostly been used in pattern recognition problems. In this study, SVM is used as a machine learning tool for the prediction of secondary structure and several encoding schemes, including orthogonal matrix, hydrophobicity matrix, BLOSUM62 substitution matrix, and combined matrix of these, are applied and optimized to improve the prediction accuracy. Also, the optimal window length for six SVM binary classifiers is established by testing different window sizes and our new encoding scheme is tested based on this optimal window size via sevenfold cross validation tests. The results show a 2% increase in the accuracy of the binary classifiers when compared with the instances in which the classical orthogonal matrix is used. Finally, to combine the results of the six SVM binary classifiers, a new tertiary classifier which combines the results of one-versus-one binary classifiers is introduced and the performance is compared with those of existing tertiary classifiers. According to the results, the Q3 prediction accuracy of the new tertiary classifier reaches 78.8% and this is better than the best result reported in the literature.
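
A small sketch of the final combination step described above: three one-versus-one binary SVMs are merged into a three-class predictor by voting. The features here are random placeholders rather than the paper's window-based sequence encodings, and plain majority voting is only one of the possible tertiary-classifier designs.

    # Sketch: combine one-versus-one binary SVMs into a three-class predictor by voting
    # (e.g. helix / strand / coil).  Features and labels are invented placeholders.
    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    classes = np.array([0, 1, 2])
    X = rng.normal(size=(300, 40))
    y = np.digitize(X[:, 0] + 0.3 * rng.normal(size=300), [-0.5, 0.5])   # 3 loose classes

    # train one binary SVM per unordered class pair
    pair_models = {}
    for a, b in combinations(classes, 2):
        mask = np.isin(y, [a, b])
        pair_models[(a, b)] = SVC(kernel="rbf", C=1.0).fit(X[mask], y[mask])

    def predict_ovo(X_new):
        votes = np.zeros((X_new.shape[0], classes.size), dtype=int)
        for (a, b), model in pair_models.items():
            pred = model.predict(X_new)
            for cls in (a, b):
                votes[:, cls] += (pred == cls)
        return classes[np.argmax(votes, axis=1)]   # ties broken by lowest class index

    print(predict_ovo(X[:5]), y[:5])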

Journal ArticleDOI
TL;DR: In this article, the authors compared several statistical approaches that combine a simple AVHRR split window algorithm with ground meteorological station observations in the prediction of air temperature, along with their non-spatial counterparts (multiple linear regressions).
Abstract: Ground station temperature data are not commonly used simultaneously with the Advanced Very High Resolution Radiometer (AVHRR) to model and predict air temperature or land surface temperature. Technology was developed to acquire near-synchronous datasets over a 1 000 000 km² region with the goal of improving the measurement of air temperature at the surface. This study compares several statistical approaches that combine a simple AVHRR split window algorithm with ground meteorological station observations in the prediction of air temperature. Three spatially dependent (kriging) models were examined, along with their non-spatial counterparts (multiple linear regressions). Cross-validation showed that the kriging models predicted temperature better (an average of 0.9°C error) than the multiple regression models (an average of 1.4°C error). The three different kriging strategies performed similarly when compared to each other. Errors from kriging models were unbiased while regression models tended to give biased...

Journal ArticleDOI
TL;DR: This model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest, and is more appropriate for the gene expression domain than other structurally similar Bayesian network classification models.
Abstract: We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.

Journal ArticleDOI
TL;DR: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs).
Abstract: Motivation: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance. Results: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-state distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.

Journal ArticleDOI
TL;DR: A model based on the skew Gaussian distribution is presented to handle skewed spatial data and compares favorably with several kriging variants in the spatial prediction of weekly rainfall.

Journal ArticleDOI
TL;DR: This paper proposes a method whereby height–diameter regression from an inventory can be incorporated into a height imputation algorithm, implying that substantial environmental variation existed in the height–diameter relationship...
Abstract: This paper proposes a method whereby height–diameter regression from an inventory can be incorporated into a height imputation algorithm. Point-level subsampling is often employed in forest inventory for efficiency. Some trees will be measured for diameter and species, while others will be measured for height and 10-year increment. Predictions of these missing measures would be useful for estimating volume and growth, respectively, so they are often imputed. We present and compare three imputation strategies: using a published model, using a localized version of a published model, and using best linear unbiased predictions from a mixed-effects model. The bases of our comparison are four-fold: minimum fitted root mean squared error and minimum predicted root mean squared error under a 2000-fold cross-validation for tree-level height and volume imputations. In each case the mixed-effects model proved superior. This result implies that substantial environmental variation existed in the height–diameter relationship...

Journal ArticleDOI
TL;DR: A bootstrap procedure for selecting the bandwidth parameters in a nonparametric two-step estimation method results in a fully data-driven procedure for estimating a finite (but possibly unknown) number of changepoints in a regression function.
Abstract: Nonparametric estimation of abrupt changes in a regression function involves choosing smoothing (bandwidth) parameters. The performance of estimation procedures depends heavily on this choice. So far, little attention has been paid to the crucial issue of choosing appropriate bandwidth parameters in practice. In this article we propose a bootstrap procedure for selecting the bandwidth parameters in a nonparametric two-step estimation method. This method results in a fully data-driven procedure for estimating a finite (but possibly unknown) number of changepoints in a regression function. We evaluate the performance of the data-driven procedure via a simulation study, which reveals that the fully automatic procedure performs quite well. As an illustration, we apply the procedure to some real data.

Journal ArticleDOI
TL;DR: The March-INSIDE methodology was applied to the prediction of the bitter-tasting threshold of 48 dipeptides by means of pattern recognition techniques, in this case linear discriminant analysis (LDA), and regression methods, and yielded a percentage of good classification higher than 80% with the two main families of descriptors generated by this methodology.

Journal ArticleDOI
TL;DR: Flexible parametric models based on the Weibull, loglogistic and lognormal distributions with spline smoothing of the baseline log cumulative hazard function are used to fit a set of candidate prognostic models across k data sets and show that the deviance statistic is able to discriminate between quite similar models and can be used to choose a prognostic model that generalizes well to new data.
Abstract: The process of developing and validating a prognostic model for survival time data has been much discussed in the literature. Assessment of the performance of candidate prognostic models on data other than that used to fit the models is essential for choosing a model that will generalize well to independent data. However, there remain difficulties in current methods of measuring the accuracy of predictions of prognostic models for censored survival time data. In this paper, flexible parametric models based on the Weibull, loglogistic and lognormal distributions with spline smoothing of the baseline log cumulative hazard function are used to fit a set of candidate prognostic models across k data sets. The model that generalizes best to new data is chosen using a cross-validation scheme which fits the model on k-1 data sets and tests the predictive accuracy on the omitted data set. The procedure is repeated, omitting each data set in turn. The quality of the predictions is measured using three different methods: two commonly proposed validation methods, Harrell's concordance statistic and the Brier statistic, and a novel method using deviance differences. The results show that the deviance statistic is able to discriminate between quite similar models and can be used to choose a prognostic model that generalizes well to new data. The methods are illustrated by using a model developed to predict progression to a new AIDS event or death in HIV-1 positive patients starting antiretroviral therapy.

Journal ArticleDOI
TL;DR: The results indicate that logic regression used in combination with cluster/group operating binding site identification methods or with the proposed method MFURE is a powerful and flexible alternative to linear regression based motif finding methods.
Abstract: Motivation: Multiple transcription factors coordinately control transcriptional regulation of genes in eukaryotes. Although many computational methods consider the identification of individual transcription factor binding sites (TFBSs), very few focus on the interactions between these sites. We consider finding TFBSs and their context-specific interactions using microarray gene expression data. We devise a hybrid approach called LogicMotif composed of a TFBS identification method combined with the new regression methodology logic regression. LogicMotif has two steps: First, potential binding sites are identified from transcription control regions of genes of interest. Various available methods can be used in this step when the genes of interest can be divided into groups such as up- and down-regulated. For this step, we also develop a simple univariate regression and extension method MFURE to extract candidate TFBSs from a large number of genes when microarray gene expression data are available. MFURE provides an alternative method for this step when partitioning of the genes into disjoint groups is not preferred. This first step aims to identify individual sites within gene groups of interest or sites that are correlated with the gene expression outcome. In the second step, logic regression is used to build a predictive model of outcome of interest (either gene expression or up- and down-regulation) using these potential sites. This 2-fold approach creates a rich, diverse set of potential binding sites in the first step and builds regression or classification models in the second step using logic regression that is particularly good at identifying complex interactions. Results: LogicMotif is applied to two publicly available datasets. A genome-wide gene expression data set of Saccharomyces cerevisiae is used for validation. The regression models obtained are interpretable and the biological implications are in agreement with the known results. This analysis suggests that LogicMotif provides biologically more reasonable regression models than a previous analysis of this dataset with standard linear regression methods. Another dataset of S. cerevisiae illustrates the use of LogicMotif in classification questions by building a model that discriminates between up- and down-regulated genes in iron copper deficiency. LogicMotif identifies an inductive and two repressor motifs in this dataset. The inductive motif matches the binding site of the transcription factor Aft1p that has a key role in regulation of the uptake process. One of the novel repressor sites is highly present in transcription control regions of FeS genes. This site could represent a TFBS for an unknown transcription factor involved in repression of genes encoding FeS proteins in iron deficiency. We establish the robustness of the method to the type of outcome variable used by considering both continuous and binary outcome variables for this dataset. Our results indicate that logic regression used in combination with cluster/group operating binding site identification methods or with our proposed method MFURE is a powerful and flexible alternative to linear regression based motif finding methods. Availability: Source code for logic regression is freely available as a package of the R programming language by Ruczinski et al. (2003) and can be downloaded at http://bear.fhcrc.org/~ingor/logic/download/download.html. An R package for MFURE is available at http://www.stat.berkeley.edu/~sunduz/software.html.

Journal ArticleDOI
TL;DR: A validation protocol is presented briefly and two of the tools that form part of this protocol are introduced in more detail; one of them can be used to determine the complexity, and with it the stability, of models generated by variable selection.
Abstract: Variable selection is applied frequently in QSAR research. Since the selection process influences the characteristics of the finally chosen model, thorough validation of the selection technique is very important. Here, a validation protocol is presented briefly and two of the tools which are part of this protocol are introduced in more detail. The first tool, which is based on permutation testing, makes it possible to assess the inflation of internal figures of merit (such as the cross-validated prediction error). The other tool, based on noise addition, can be used to determine the complexity, and with it the stability, of models generated by variable selection. The obtained statistical information is important in deciding whether or not to trust the predictive abilities of a specific model. The graphical output of the validation tools is easily accessible and provides a reliable impression of model performance. Among other things, the tools were employed to study the influence of leave-one-out and leave-multiple-out cross-validation on model characteristics. Here, it was confirmed that leave-multiple-out cross-validation yields more stable models. To study the performance of the entire validation protocol, it was applied to eight different QSAR data sets with default settings. In all cases internal and external model performance was good, indicating that the protocol serves its purpose quite well.
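
A toy version of the first validation tool (permutation testing of the response), under assumed settings: the response is pure noise, descriptors are selected on the full data set, and the cross-validated q² of the reduced model is compared with the values obtained after scrambling the response. This is a sketch of the idea, not the authors' protocol or software.

    # Sketch: permutation (response-scrambling) check of an internally cross-validated q2
    # when variable selection is performed outside the cross-validation loop.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score, KFold

    rng = np.random.default_rng(5)
    n, p, keep = 40, 200, 5
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                       # no real structure at all

    def selected_q2(X, y):
        # select the 'keep' descriptors most correlated with y on the FULL data set
        corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        cols = np.argsort(corr)[-keep:]
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        return cross_val_score(LinearRegression(), X[:, cols], y, cv=cv, scoring="r2").mean()

    q2_real = selected_q2(X, y)
    q2_perm = np.mean([selected_q2(X, rng.permutation(y)) for _ in range(20)])
    print(f"q2 with selection on unpermuted noise: {q2_real:.2f}")
    print(f"mean q2 over 20 response permutations: {q2_perm:.2f}")

Because the selection step sees the held-out samples, both q² values are typically inflated, and the unpermuted value sits well inside the permutation distribution; that is exactly the warning signal such a protocol is designed to raise.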

Journal ArticleDOI
TL;DR: Overall, partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM) outperform all other methods, and a practical approach is suggested for taking advantage of multiple methods in biomarker applications.
Abstract: An attractive application of expression technologies is to predict drug efficacy or safety using expression data of biomarkers. To evaluate the performance of various classification methods for building predictive models, we applied these methods on six expression datasets. These datasets were from studies using microarray technologies and had two or more classes. From each of the original datasets, two subsets were generated to simulate two scenarios in biomarker applications. First, a 50-gene subset was used to simulate a candidate gene approach when it might not be practical to measure a large number of genes/biomarkers. Next, a 2000-gene subset was used to simulate a whole genome approach. We evaluated the relative performance of several classification methods by using leave-one-out cross-validation and bootstrap cross-validation. Although all methods perform well in both subsets for a relatively easy dataset with two classes, differences in performance do exist among methods for other datasets...

Proceedings ArticleDOI
16 Aug 2004
TL;DR: This work empirically evaluates the performance of the Relief-F algorithm on three published cancer classification data sets, and compares it with other feature filtering methods, including Information Gain, Gain Ratio, and the χ²-statistic.
Abstract: Numerous recent studies have shown that microarray gene expression data is useful for cancer classification. Classification based on microarray data is very different from previous classification problems in that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. It is thus important to first apply feature selection methods prior to classification. In the machine learning field, one of the most successful feature filtering algorithms is the Relief-F algorithm. In this work, we empirically evaluate its performance on three published cancer classification data sets. We use the linear SVM and the k-NN as classifiers in the experiments, and compare the performance of Relief-F with other feature filtering methods, including Information Gain, Gain Ratio, and the χ²-statistic. Using leave-one-out cross validation, experimental results show that the performance of Relief-F is comparable with other methods.
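
For readers unfamiliar with the algorithm, the sketch below implements a simplified two-class, single-nearest-neighbour Relief weighting scheme; Relief-F as evaluated in the paper additionally handles multiple classes, k nearest neighbours and missing values, none of which is reproduced here. Data and settings are invented for illustration.

    # Simplified two-class Relief-style feature weighting: features that separate
    # nearest "misses" from nearest "hits" receive larger weights.
    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 200, 30
    X = rng.normal(size=(n, p))
    y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)   # only features 0, 1 matter
    X = (X - X.min(0)) / (X.max(0) - X.min(0))     # scale to [0, 1] so feature diffs are comparable

    def relief_weights(X, y, n_iter=200):
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            i = rng.integers(X.shape[0])
            d = np.abs(X - X[i]).sum(axis=1)
            d[i] = np.inf                                     # exclude the instance itself
            same, diff = (y == y[i]), (y != y[i])
            hit = np.argmin(np.where(same, d, np.inf))        # nearest neighbour, same class
            miss = np.argmin(np.where(diff, d, np.inf))       # nearest neighbour, other class
            w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_iter
        return w

    w = relief_weights(X, y)
    print("top-ranked features:", np.argsort(w)[::-1][:5])    # features 0 and 1 typically rank near the top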

Journal ArticleDOI
TL;DR: In this paper, the authors show that the modern analogue technique using a similarity index (SIMMAX) and the revised analogue method (RAM), both derived from the modern analog technique, achieve apparently lower root mean square error of prediction (RMSEP) by failing to ensure statistical independence of samples during cross validation.
Abstract: [1] In the quest for more precise sea-surface temperature reconstructions from microfossil assemblages, large modern training sets and new transfer function methods have been developed. Realistic estimates of the predictive power of a transfer function can only be calculated from an independent test set. If the test set is not fully independent, the error estimate will be artificially low. We show that the modern analogue technique using a similarity index (SIMMAX) and the revised analogue method (RAM), both derived from the modern analogue technique, achieve apparently lower root mean square error of prediction (RMSEP) by failing to ensure statistical independence of samples during cross validation. We also show that when cross validation is used to select the best artificial neural network or modern analogue model, the RMSEP based on cross validation is lower than that for a fully independent test set.
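
The statistical point is easy to reproduce in caricature: if the sample being predicted is allowed to act as one of its own analogues during "cross-validation", the apparent RMSEP drops without any genuine gain in predictive skill. The sketch below uses a fabricated taxa-versus-temperature data set and a plain k-nearest-analogue reconstruction; it is not an implementation of SIMMAX or RAM, whose leakage mechanisms (similarity weighting and duplicated nearby samples) are subtler.

    # Caricature of cross-validation leakage in analogue-based SST reconstruction.
    import numpy as np

    rng = np.random.default_rng(7)
    n, taxa, k = 300, 25, 5
    sst = rng.uniform(0, 28, n)                                 # "true" sea-surface temperatures
    fauna = np.array([np.exp(-0.5 * ((sst - mu) / 4.0) ** 2)    # assemblages respond to SST
                      for mu in np.linspace(0, 28, taxa)]).T
    fauna += 0.05 * rng.normal(size=fauna.shape)

    def rmsep(exclude_self):
        preds = np.empty(n)
        for i in range(n):
            d = np.sqrt(((fauna - fauna[i]) ** 2).sum(axis=1))  # dissimilarity to every sample
            if exclude_self:
                d[i] = np.inf                                    # proper leave-one-out
            nearest = np.argsort(d)[:k]
            preds[i] = sst[nearest].mean()                       # mean SST of the k analogues
        return np.sqrt(np.mean((preds - sst) ** 2))

    print(f"RMSEP, test sample excluded from analogues: {rmsep(True):.2f} degC")
    print(f"RMSEP, test sample allowed as an analogue:  {rmsep(False):.2f} degC")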

Journal ArticleDOI
TL;DR: In this paper, a random varying-coefficient model for longitudinal data is proposed, where the time- varying coefficients are assumed to be subject-specific, and can be considered as realizations of stochastic processes.
Abstract: In this paper, we propose a random varying-coefficient model for longitudinal data. This model is different from the standard varying-coefficient model in the sense that the time-varying coefficients are assumed to be subject-specific, and can be considered as realizations of stochastic processes. This modelling strategy allows us to employ powerful mixed-effects modelling techniques to efficiently incorporate the within-subject and between-subject variations in the estimators of time-varying coefficients. Thus, the subject-specific feature of longitudinal data is effectively considered in the proposed model. A backfitting algorithm is proposed to estimate the coefficient functions. Simulation studies show that the proposed estimation methods are more efficient in finite-sample performance compared with the standard local least squares method. An application to an AIDS clinical study is presented to illustrate the proposed methodologies.

Journal ArticleDOI
TL;DR: In this paper, the authors used a smoothing spline regression to estimate the light curve given a period and then found the period which minimizes the generalized cross-validation (GCV) score.
Abstract: Summary. The objective is to estimate the period and the light curve (or periodic function) of a variable star. Previously, several methods have been proposed to estimate the period of a variable star, but they are inaccurate especially when a data set contains outliers. We use a smoothing spline regression to estimate the light curve given a period and then find the period which minimizes the generalized cross-validation (GCV). The GCV method works well, matching an intensive visual examination of a few hundred stars, but the GCV score is still sensitive to outliers. Handling outliers in an automatic way is important when this method is applied in a ‘data mining’ context to a very large star survey. Therefore, we suggest a robust method which minimizes a robust cross-validation criterion induced by a robust smoothing spline regression. Once the period has been determined, a nonparametric method is used to estimate the light curve. A real example and a simulation study suggest that the robust cross-validation and GCV methods are superior to existing methods.
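
A compact sketch of the non-robust part of the procedure: fold the light curve on each trial period, fit a periodic regression, and keep the period with the smallest generalized cross-validation score. To keep the example short it uses a small fixed Fourier basis fitted by least squares instead of the paper's smoothing spline, and synthetic observations; the GCV formula is the usual RSS-based one with the trace of the hat matrix as the effective degrees of freedom.

    # Sketch: GCV-based period search with a fixed harmonic basis (not a smoothing spline).
    import numpy as np

    rng = np.random.default_rng(8)
    true_period, n_harm = 0.37, 3
    t = np.sort(rng.uniform(0, 30, 250))                      # irregular observation times
    mag = np.sin(2 * np.pi * t / true_period) + 0.4 * np.cos(4 * np.pi * t / true_period)
    mag += 0.15 * rng.normal(size=t.size)                     # photometric noise

    def gcv_score(period):
        phase = 2 * np.pi * (t % period) / period
        cols = [np.ones_like(phase)]
        for k in range(1, n_harm + 1):
            cols += [np.sin(k * phase), np.cos(k * phase)]
        B = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(B, mag, rcond=None)
        rss = np.sum((mag - B @ coef) ** 2)
        df = B.shape[1]                                       # trace of the hat matrix for OLS
        return (rss / t.size) / (1 - df / t.size) ** 2

    periods = np.linspace(0.2, 1.0, 4001)
    best = periods[int(np.argmin([gcv_score(p) for p in periods]))]
    print(f"period minimizing GCV: {best:.4f} (true value {true_period})")

With a robust loss in place of least squares and a robust cross-validation criterion, the same search becomes the outlier-resistant procedure the paper proposes.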

Journal ArticleDOI
TL;DR: An analysis of measurement invariance in a multigroup confirmatory factor model shows that in cases in which models without mean restrictions are compared to models with restricted means, one should take account of the presence of means, even if the model is saturated with respect to the means.
Abstract: Information fit indexes such as Akaike Information Criterion, Consistent Akaike Information Criterion, Bayesian Information Criterion, and the expected cross validation index can be valuable in assessing the relative fit of structural equation models that differ regarding restrictiveness. In cases in which models without mean restrictions (i.e., saturated mean structure) are compared to models with restricted (i.e., modeled) means, one should take account of the presence of means, even if the model is saturated with respect to the means. The failure to do this can result in an incorrect rank order of models in terms of the information fit indexes. We demonstrate this point by an analysis of measurement invariance in a multigroup confirmatory factor model.