Showing papers on "Cross-validation published in 2008"

PDF

Open Access

Journal Article•DOI•

[...]

Johan A. Westerhuis¹, Huub C. J. Hoefsloot¹, Suzanne Smit¹, Daniel J. Vis¹, Age K. Smilde¹, Ewoud J. J. van Velzen¹, John P. M. van Duijnhoven, Ferdi A. van Dorsten - Show less +4 more•Institutions (1)

University of Amsterdam¹

24 Jan 2008-Metabolomics

TL;DR: A strategy based on cross model validation and permutation testing to validate the classification models and advocate against the use of PLSDA score plots for inference of class differences is discussed.

...read moreread less

Abstract: Classifying groups of individuals based on their metabolic profile is one of the main topics in metabolomics research. Due to the low number of individuals compared to the large number of variables, this is not an easy task. PLSDA is one of the data analysis methods used for the classification. Unfortunately this method eagerly overfits the data and rigorous validation is necessary. The validation however is far from straightforward. Is this paper we will discuss a strategy based on cross model validation and permutation testing to validate the classification models. It is also shown that too optimistic results are obtained when the validation is not done properly. Furthermore, we advocate against the use of PLSDA score plots for inference of class differences.

...read moreread less

1,216 citations

Journal Article•DOI•

An adaptive inverse-distance weighting spatial interpolation technique

[...]

George Y. Lu¹, David Wong¹•Institutions (1)

George Mason University¹

01 Sep 2008-Computers & Geosciences

TL;DR: Adaptive IDW performs better than the constant parameter method in most cases, and better than ordinary kriging in one of the authors' empirical studies when the spatial structure in the data could not be modeled effectively by typical variogram functions.

...read moreread less

1,002 citations

Journal Article•DOI•

Coordinate descent algorithms for lasso penalized regression

[...]

Tong Tong Wu, Kenneth Lange

01 Mar 2008-The Annals of Applied Statistics

TL;DR: In this article, the authors proposed two algorithms for estimating regression coefficients with a lasso penalty, one based on greedy coordinate descent and another based on Edgeworth's algorithm for ordinary l1 regression.

...read moreread less

Abstract: Imposition of a lasso penalty shrinks parameter estimates toward zero and performs continuous model selection. Lasso penalized regression is capable of handling linear regression problems where the number of predictors far exceeds the number of cases. This paper tests two exceptionally fast algorithms for estimating regression coefficients with a lasso penalty. The previously known l2 algorithm is based on cyclic coordinate descent. Our new l1 algorithm is based on greedy coordinate descent and Edgeworth’s algorithm for ordinary l1 regression. Each algorithm relies on a tuning constant that can be chosen by cross-validation. In some regression problems it is natural to group parameters and penalize parameters group by group rather than separately. If the group penalty is proportional to the Euclidean norm of the parameters of the group, then it is possible to majorize the norm and reduce parameter estimation to l2 regression with a lasso penalty. Thus, the existing algorithm can be extended to novel settings. Each of the algorithms discussed is tested via either simulated or real data or both. The Appendix proves that a greedy form of the l2 algorithm converges to the minimum value of the objective function.

...read moreread less

821 citations

Journal Article•DOI•

Direct importance estimation for covariate shift adaptation

[...]

Masashi Sugiyama¹, Taiji Suzuki², Shinichi Nakajima³, Hisashi Kashima, Paul von Bünau⁴, Motoaki Kawanabe - Show less +2 more•Institutions (4)

Tokyo Institute of Technology¹, University of Tokyo², Nikon³, Technical University of Berlin⁴

30 Aug 2008-Annals of the Institute of Statistical Mathematics

TL;DR: This paper proposes a direct importance estimation method that does not involve density estimation and is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized.

...read moreread less

Abstract: A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent—weighted variants according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly since density estimation is a hard task particularly in high dimensional cases. In this paper, we propose a direct importance estimation method that does not involve density estimation. Our method is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized. Furthermore, we give rigorous mathematical proofs for the convergence of the proposed algorithm. Simulations illustrate the usefulness of our approach.

...read moreread less

418 citations

Journal Article•DOI•

A cascade learning system for classification of diabetes disease: Generalized Discriminant Analysis and Least Square Support Vector Machine

[...]

Kemal Polat¹, Salih Güneş¹, Ahmet Arslan¹•Institutions (1)

Selçuk University¹

01 Jan 2008-Expert Systems With Applications

TL;DR: The aim of this study is to diagnosis of diabetes disease, which is one of the most important diseases in medical field using Generalized Discriminant Analysis (GDA) and Le least Square Support Vector Machine (LS-SVM) and a new cascade learning system based on Generalizeddiscriminant analysis and Least Square support Vector Machine is proposed.

...read moreread less

Abstract: The aim of this study is to diagnosis of diabetes disease, which is one of the most important diseases in medical field using Generalized Discriminant Analysis (GDA) and Least Square Support Vector Machine (LS-SVM). Also, we proposed a new cascade learning system based on Generalized Discriminant Analysis and Least Square Support Vector Machine. The proposed system consists of two stages. The first stage, we have used Generalized Discriminant Analysis to discriminant feature variables between healthy and patient (diabetes) data as pre-processing process. The second stage, we have used LS-SVM in order to classification of diabetes dataset. While LS-SVM obtained 78.21% classification accuracy using 10-fold cross validation, the proposed system called GDA-LS-SVM obtained 82.05% classification accuracy using 10-fold cross validation. The robustness of the proposed system is examined using classification accuracy, k-fold cross-validation method and confusion matrix. The obtained classification accuracy is 82.05% and it is very promising compared to the previously reported classification techniques.

...read moreread less

299 citations

Journal Article•DOI•

Cross-validation of component models: A critical look at current methods

[...]

Rasmus Bro¹, Karin Kjeldahl¹, Age K. Smilde, Henk A.L. Kiers²•Institutions (2)

University of Copenhagen¹, University of Groningen²

24 Jan 2008-Analytical and Bioanalytical Chemistry

TL;DR: In this paper, the most commonly used generic PCA cross-validation schemes are reviewed and how well they work in various scenarios are assessed.

...read moreread less

Abstract: In regression, cross-validation is an effective and popular approach that is used to decide, for example, the number of underlying features, and to estimate the average prediction error. The basic principle of cross-validation is to leave out part of the data, build a model, and then predict the left-out samples. While such an approach can also be envisioned for component models such as principal component analysis (PCA), most current implementations do not comply with the essential requirement that the predictions should be independent of the entity being predicted. Further, these methods have not been properly reviewed in the literature. In this paper, we review the most commonly used generic PCA cross-validation schemes and assess how well they work in various scenarios.

...read moreread less

293 citations

Journal Article•DOI•

Geostatistical interpolation using copulas

[...]

András Bárdossy¹, Jing Li¹•Institutions (1)

University of Stuttgart¹

01 Jul 2008-Water Resources Research

TL;DR: In this paper, the authors used copulas to estimate the dependence structure of groundwater quality parameters without the influence of the marginal distribution, which can be used to define confidence intervals which depend on both the observation geometry and values.

...read moreread less

Abstract: [1] In many applications of geostatistical methods, the dependence structure of the investigated parameter is described solely with the variogram or covariance functions, which are susceptible to measurement anomalies and implies the assumption of Gaussian dependence Moreover the kriging variance respects only observation density, data geometry and the variogram model To address these problems, we borrow the idea from copulas, to depict the dependence structure without the influence of the marginal distribution The methodology and basic hypotheses for application of copulas as geostatistical methods are discussed and the Gaussian copula as well as a non-Gaussian copula are used in this paper Copula parameters are estimated using a division of the observations into multipoint subsets and a subsequent maximization of the corresponding likelihood function The interpolation is carried out with two different copulas, where the expected and median values are calculated from the copulas conditioned with the nearby observations The full conditional copulas provide the estimation distributions for the unobserved locations and can be used to define confidence intervals which depend on both the observation geometry and values Observations of a large scale groundwater quality measurement network in Baden-Wurttemberg are used to demonstrate the methodology Five groundwater quality parameters: chloride, nitrate, pH, sulfate and dissolved oxygen are investigated All five parameters show non-Gaussian dependence The copula-based interpolation results of the five parameters are compared to the results of conventional ordinary and indicator kriging Different statistical measures including mean squared error, relative differences and probability scores are used to compare cross validation and split sampling results of the interpolation methods The non-Gaussian copulas give better results than the geostatistical interpolations Validation of the confidence intervals shows that they are more realistic than the estimation variances obtained by ordinary kriging

...read moreread less

254 citations

Journal Article•DOI•

A comparative study of different machine learning methods on microarray gene expression data.

[...]

Mehdi Pirooznia¹, Jack Y. Yang², Mary Qu Yang³, Youping Deng¹•Institutions (3)

University of Southern Mississippi¹, Harvard University², United States Department of Health and Human Services³

20 Mar 2008-BMC Genomics

TL;DR: The importance of feature selection in accurately classifying new samples and how an integrated feature selection and classification algorithm is performing and is capable of identifying significant genes are revealed.

...read moreread less

Abstract: Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forrest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However there is lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results. In this study, we compared the efficiency of the classification methods including; SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forrest methods. The v-fold cross validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods including K-means, DBC, and EM clustering were applied to the datasets and the efficiency of these methods have been analysed. Further the efficiency of the feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF were compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers. We presented a study in which we compared some of the common used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection methods, the number of genes in the gene list, the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples and how an integrated feature selection and classification algorithm is performing and is capable of identifying significant genes.

...read moreread less

244 citations

Journal Article•DOI•

Model selection approaches for non-linear system identification: a review

[...]

Xia Hong¹, Richard Mitchell¹, Sheng Chen², Chris Harris², Kang Li³, George W. Irwin³ - Show less +2 more•Institutions (3)

University of Reading¹, University of Southampton², Queen's University Belfast³

01 Oct 2008-International Journal of Systems Science

TL;DR: A systematic overview of basic research on model selection approaches for linear-in-the-parameter models, including Bayesian parameter regularisation and models selective criteria based on the cross validation and experimental design is presented.

...read moreread less

Abstract: The identification of non-linear systems using only observed finite datasets has become a mature research area over the last two decades. A class of linear-in-the-parameter models with universal approximation capabilities have been intensively studied and widely used due to the availability of many linear-learning algorithms and their inherent convergence conditions. This article presents a systematic overview of basic research on model selection approaches for linear-in-the-parameter models. One of the fundamental problems in non-linear system identification is to find the minimal model with the best model generalisation performance from observational data only. The important concepts in achieving good model generalisation used in various non-linear system-identification algorithms are first reviewed, including Bayesian parameter regularisation and models selective criteria based on the cross validation and experimental design. A significant advance in machine learning has been the development of the support vector machine as a means for identifying kernel models based on the structural risk minimisation principle. The developments on the convex optimisation-based model construction algorithms including the support vector regression algorithms are outlined. Input selection algorithms and on-line system identification algorithms are also included in this review. Finally, some industrial applications of non-linear models are discussed.

...read moreread less

223 citations

Posted Content•

Optimal Bandwidth Selection for Conditional Efficiency Measures: a Data-driven Approach

[...]

Luiza Badin¹, Cinzia Daraio², Léopold Simar³, Léopold Simar⁴•Institutions (4)

Bucharest University of Economic Studies¹, University of Bologna², Université catholique de Louvain³, University of Toulouse⁴

01 Jan 2008

TL;DR: The approach, based on a Least Squares Cross Validation procedure (LSCV), provides an optimal bandwidth that minimizes an appropriate (weighted) integrated Squared Error (ISE) and is exemplified with some simulated data with univariate and multivariate environmental factors.

...read moreread less

Abstract: In productivity analysis an important issue is to detect how external (environmental) factors, exogenous to the production process and not under the control of the producer, might influence the production process and the resulting efficiency of the firms. Most of the traditional approaches proposed in the literature have serious drawbacks. An alternative approach is to describe the production process as being conditioned by a given value of the environmental variables (Cazals, Florens and Simar, 2002, Daraio and Simar, 2005). This defines conditional efficiency measures where the production set in the input × output space may depend on the value of the external variables. The statistical properties of nonparametric estimators of these conditional measures are now established (Jeong, Park and Simar, 2008). These involve the estimation of a nonstandard conditional distribution function which requires the specification of a smoothing parameter (a bandwidth). So far, only the asymptotic optimal order of this bandwidth has been established. This is of little interest for the practitioner. In this paper we fill this gap and we propose a data-driven technique for selecting this parameter in practice. The approach, based on a Least Squares Cross Validation procedure (LSCV), provides an optimal bandwidth that minimizes an appropriate integrated Mean Squared Error (MSE). The method is carefully described and exemplified with some simulated data with univariate and multivariate environmental factors. An application on real data (performances of Mutual Funds) illustrates how this new optimal method of bandwidth selection outperforms former methods.

...read moreread less

186 citations

Proceedings Article•DOI•

On the dangers of cross-validation. An experimental evaluation

[...]

R. Bharat Rao, Glenn Fung

01 Jan 2008

TL;DR: It is empirically show how under such large number of models the risk for overfitting increases and the performance estimated by cross validation is no longer an effective estimate of generalization; hence, this paper provides an empirical reminder of the dangers of cross validation.

...read moreread less

Abstract: Cross validation allows models to be tested using the full training set by means of repeated resampling; thus, maximizing the total number of points used for testing and potentially, helping to protect against overfitting. Improvements in computational power, recent reductions in the (computational) cost of classification algorithms, and the development of closed-form solutions (for performing cross validation in certain classes of learning algorithms) makes it possible to test thousand or millions of variants of learning models on the data. Thus, it is now possible to calculate cross validation performance on a much larger number of tuned models than would have been possible otherwise. However, we empirically show how under such large number of models the risk for overfitting increases and the performance estimated by cross validation is no longer an effective estimate of generalization; hence, this paper provides an empirical reminder of the dangers of cross validation. We use a closed-form solution that makes this evaluation possible for the cross validation problem of interest. In addition, through extensive experiments we expose and discuss the effects of the overuse/misuse of cross validation in various aspects, including model selection, feature selection, and data dimensionality. This is illustrated on synthetic, benchmark, and real-world data sets.

...read moreread less

Journal Article•DOI•

Generalized Linear Discriminant Analysis: A Unified Framework and Efficient Model Selection

[...]

Shuiwang Ji¹, Jieping Ye¹•Institutions (1)

Arizona State University¹

01 Oct 2008-IEEE Transactions on Neural Networks

TL;DR: A unified framework for generalized LDA is proposed, which elucidates the properties of various algorithms and their relationships, and shows that the matrix computations involved in LDA-based algorithms can be simplified so that the cross-validation procedure for model selection can be performed efficiently.

...read moreread less

Abstract: High-dimensional data are common in many domains, and dimensionality reduction is the key to cope with the curse-of-dimensionality. Linear discriminant analysis (LDA) is a well-known method for supervised dimensionality reduction. When dealing with high-dimensional and low sample size data, classical LDA suffers from the singularity problem. Over the years, many algorithms have been developed to overcome this problem, and they have been applied successfully in various applications. However, there is a lack of a systematic study of the commonalities and differences of these algorithms, as well as their intrinsic relationships. In this paper, a unified framework for generalized LDA is proposed, which elucidates the properties of various algorithms and their relationships. Based on the proposed framework, we show that the matrix computations involved in LDA-based algorithms can be simplified so that the cross-validation procedure for model selection can be performed efficiently. We conduct extensive experiments using a collection of high-dimensional data sets, including text documents, face images, gene expression data, and gene expression pattern images, to evaluate the proposed theories and algorithms.

...read moreread less

Journal Article•DOI•

Three way k-fold cross-validation of resource selection functions

[...]

Trevor S. Wiens, Brenda C. Dale¹, Mark S. Boyce², G. Peter Kershaw²•Institutions (2)

Canadian Wildlife Service¹, University of Alberta²

10 Apr 2008-Ecological Modelling

TL;DR: The proposed 3-way method provides the means to not only select the model with the best predictive power, but to understand the limitations of all models under consideration, and results of best models using an independent field season are presented.

...read moreread less

Journal Article•DOI•

Automated Diagnosis of Coronary Artery Disease Based on Data Mining and Fuzzy Modeling

[...]

Markos G. Tsipouras, Themis P. Exarchos¹, Dimitrios I. Fotiadis¹, Anna Kotsia¹, Konstantinos Vakalis¹, Katerina K. Naka¹, Lampros K. Michalis¹ - Show less +3 more•Institutions (1)

University of Ioannina¹

01 Jul 2008

TL;DR: The system offers several advantages since it is automatically generated, it provides CAD diagnosis based on easily and noninvasively acquired features, and is able to provide interpretation for the decisions made.

...read moreread less

Abstract: A fuzzy rule-based decision support system (DSS) is presented for the diagnosis of coronary artery disease (CAD). The system is automatically generated from an initial annotated dataset, using a four stage methodology: 1) induction of a decision tree from the data; 2) extraction of a set of rules from the decision tree, in disjunctive normal form and formulation of a crisp model; 3) transformation of the crisp set of rules into a fuzzy model; and 4) optimization of the parameters of the fuzzy model. The dataset used for the DSS generation and evaluation consists of 199 subjects, each one characterized by 19 features, including demographic and history data, as well as laboratory examinations. Tenfold cross validation is employed, and the average sensitivity and specificity obtained is 62% and 54%, respectively, using the set of rules extracted from the decision tree (first and second stages), while the average sensitivity and specificity increase to 80% and 65%, respectively, when the fuzzification and optimization stages are used. The system offers several advantages since it is automatically generated, it provides CAD diagnosis based on easily and noninvasively acquired features, and is able to provide interpretation for the decisions made.

...read moreread less

Journal Article•DOI•

A resampling scheme for regional climate simulations and its performance compared to a dynamical RCM

[...]

B. Orlowsky¹, Friedrich-Wilhelm Gerstengarbe¹, Peter C. Werner¹•Institutions (1)

Potsdam Institute for Climate Impact Research¹

01 May 2008-Theoretical and Applied Climatology

TL;DR: In this article, a new statistical method for regional climate simulations is introduced, which is constrained only by the parameters of a linear regression line for a characteristic climatological variable, and is evaluated by means of a cross validation experiment for the Elbe river basin.

...read moreread less

Abstract: A new statistical method for regional climate simulations is introduced. Its simulations are constrained only by the parameters of a linear regression line for a characteristic climatological variable. Simulated series are generated by resampling from segments of observation series such that the resulting series comply with the prescribed regression parameters and possess realistic annual cycles and persistence. The resampling guarantees that the simulated series are physically consistent both with respect to the combinations of different meteorological variables and to their spatial distribution at each time step. The resampling approach is evaluated by means of a cross validation experiment for the Elbe river basin: Its simulations are compared both to an observed climatology and to data simulated by a dynamical RCM. This cross validation shows that the approach is able to reproduce the observed climatology with respect to statistics such as long-term means, persistence features (e.g., dry spells) and extreme events. The agreement of its simulations with the observational data is much closer than for the RCM data.

...read moreread less

Journal Article•DOI•

Sliced Regression for Dimension Reduction

[...]

Hansheng Wang¹, Yingcun Xia•Institutions (1)

Peking University¹

01 Jun 2008-Journal of the American Statistical Association

TL;DR: In this article, a new dimension reduction method involving slicing the region of the response and applying local kernel regression to each slice is proposed, which is free of the linearity condition and has much better estimation accuracy.

...read moreread less

Abstract: A new dimension-reduction method involving slicing the region of the response and applying local kernel regression to each slice is proposed. Compared with the traditional inverse regression methods [e.g., sliced inverse regression (SIR)], the new method is free of the linearity condition and has much better estimation accuracy. Compared with the direct estimation methods (e.g., MAVE), the new method is much more robust against extreme values and can capture the entire central subspace (CS) exhaustively. To determine the CS dimension, a consistent cross-validation criterion is developed. Extensive numerical studies, including a real example, confirm our theoretical findings.

...read moreread less

Journal Article•DOI•

Class-modeling techniques, classic and new, for old and new problems

[...]

Michele Forina¹, Paolo Oliveri¹, Silvia Lanteri¹, Monica Casale¹•Institutions (1)

University of Genoa¹

15 Oct 2008-Chemometrics and Intelligent Laboratory Systems

TL;DR: The performances of class-modeling techniques, both in classification and in modeling, have been evaluated on real data sets, with the original variables and on subsets of variables obtained after elimination of non-discriminant variables.

...read moreread less

Journal Article•DOI•

Consistency of cross validation for comparing regression procedures

[...]

Yuhong Yang

20 Mar 2008-arXiv: Statistics Theory

TL;DR: It is shown that under some conditions, with an appropriate choice of data splitting ratio, cross validation is consistent in the sense of selecting the better procedure with probability approaching 1.

...read moreread less

Abstract: Theoretical developments on cross validation (CV) have mainly focused on selecting one among a list of finite-dimensional models (e.g., subset or order selection in linear regression) or selecting a smoothing parameter (e.g., bandwidth for kernel smoothing). However, little is known about consistency of cross validation when applied to compare between parametric and nonparametric methods or within nonparametric methods. We show that under some conditions, with an appropriate choice of data splitting ratio, cross validation is consistent in the sense of selecting the better procedure with probability approaching 1. Our results reveal interesting behavior of cross validation. When comparing two models (procedures) converging at the same nonparametric rate, in contrast to the parametric case, it turns out that the proportion of data used for evaluation in CV does not need to be dominating in size. Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property.

...read moreread less

Journal Article•DOI•

Bayes factors: Prior sensitivity and model generalizability

[...]

Charles Liu¹, Murray Aitkin²•Institutions (2)

Monash University¹, University of Melbourne²

01 Dec 2008-Journal of Mathematical Psychology

TL;DR: This paper argued that the Bayes factor can be used for assessing model generalizability and that the generalization criterion has also been proposed for the same purpose, but they argued that these two methods address different levels of model generalizeability (local and global), and will often produce divergent conclusions.

...read moreread less

Journal Article•DOI•

Model complexity control for hydrologic prediction

[...]

Gerrit Schoups¹, N.C. Van de Giesen¹, Hubert H. G. Savenije¹•Institutions (1)

Delft University of Technology¹

01 Dec 2008-Water Resources Research

TL;DR: In this paper, the authors compare three model complexity control methods for hydrologic prediction, namely, cross validation (CV), Akaike's information criterion (AIC), and structural risk minimization (SRM).

...read moreread less

Abstract: A common concern in hydrologic modeling is overparameterization of complex models given limited and noisy data. This leads to problems of parameter nonuniqueness and equifinality, which may negatively affect prediction uncertainties. A systematic way of controlling model complexity is therefore needed. We compare three model complexity control methods for hydrologic prediction, namely, cross validation (CV), Akaike's information criterion (AIC), and structural risk minimization (SRM). Results show that simulation of water flow using non-physically-based models (polynomials in this case) leads to increasingly better calibration fits as the model complexity (polynomial order) increases. However, prediction uncertainty worsens for complex non-physically-based models because of overfitting of noisy data. Incorporation of physically based constraints into the model (e.g., storage-discharge relationship) effectively bounds prediction uncertainty, even as the number of parameters increases. The conclusion is that overparameterization and equifinality do not lead to a continued increase in prediction uncertainty, as long as models are constrained by such physical principles. Complexity control of hydrologic models reduces parameter equifinality and identifies the simplest model that adequately explains the data, thereby providing a means of hydrologic generalization and classification. SRM is a promising technique for this purpose, as it (1) provides analytic upper bounds on prediction uncertainty, hence avoiding the computational burden of CV, and (2) extends the applicability of classic methods such as AIC to finite data. The main hurdle in applying SRM is the need for an a priori estimation of the complexity of the hydrologic model, as measured by its Vapnik-Chernovenkis (VC) dimension. Further research is needed in this area.

...read moreread less

Journal Article•DOI•

Cross-validated estimations in the single-functional index model

[...]

Ahmed Ait-Saïdi, Frédéric Ferraty¹, Rabah Kassa, Philippe Vieu¹•Institutions (1)

Paul Sabatier University¹

20 Oct 2008-Statistics

TL;DR: In this paper, a cross-validated method for estimating the unknown link function and the unknown functional index appearing in a single-functional index model is proposed. But the model is not suitable for nonparametric functional regression.

...read moreread less

Abstract: The functional index model consists in assuming that a functional explanatory variable acts on a scalar response only through its projection on one functional direction. The main aim of this work consists in estimating the unknown link function and the unknown functional index appearing in such a model. To this end, one focuses on an original cross-validation procedure allowing both estimations to be carried out simultaneously. This cross-validated method has several advantages: optimality property with respect to quadratic distances, ease of implementation and large flexibility for fitting and predicting purposes. One emphasizes the good behaviour in practice of the method. Finally, one discusses how such a single-functional index model can also be seen as a way of computing adaptative semi-metrics in the general frame of nonparametric functional regression.

...read moreread less

Journal Article•DOI•

Predicting landslides for risk analysis — Spatial models tested by a cross-validation technique

[...]

Chang-Jo Chung¹, Andrea G. Fabbri², Andrea G. Fabbri³•Institutions (3)

Geological Survey of Canada¹, University of Milano-Bicocca², VU University Amsterdam³

15 Feb 2008-Geomorphology

TL;DR: In this article, the authors propose a systematic procedure comprising two analytical steps: relative hazard mapping and empirically probability estimation for estimating the probabilities of future landslides, by applying a cross-validation technique.

...read moreread less

Journal Article•DOI•

An adaptive orthogonal search algorithm for model subset selection and non-linear system identification

[...]

Stephen A. Billings, Hua-Liang Wei

08 Apr 2008-International Journal of Control

TL;DR: A modified generalized cross-validation criterion, called the adjustable prediction error sum of squares (APRESS), is introduced and incorporated into a forward orthogonal search procedure and can produce efficient model subsets for most non-linear identification problems.

...read moreread less

Abstract: A new adaptive orthogonal search (AOS) algorithm is proposed for model subset selection and non-linear system identification. Model structure detection is a key step in any system identification problem. This consists of selecting significant model terms from a redundant dictionary of candidate model terms, and determining the model complexity (model length or model size). The final objective is to produce a parsimonious model that can well capture the inherent dynamics of the underlying system. In the new AOS algorithm, a modified generalized cross-validation criterion, called the adjustable prediction error sum of squares (APRESS), is introduced and incorporated into a forward orthogonal search procedure. The main advantage of the new AOS algorithm is that the mechanism is simple and the implementation is direct and easy, and more importantly it can produce efficient model subsets for most non-linear identification problems.

...read moreread less

Journal Article•DOI•

New Predictive Models for Blood—Brain Barrier Permeability of Drug-like Molecules

[...]

Sandhya Kortagere¹, Dmitriy Chekmarev¹, William J. Welsh¹, Sean Ekins², Sean Ekins¹ - Show less +1 more•Institutions (2)

University of Medicine and Dentistry of New Jersey¹, University of Maryland, Baltimore²

16 Apr 2008-Pharmaceutical Research

TL;DR: The data indicate that Shape Signatures descriptors can be used with SVM and these models may have utility for predicting blood–brain barrier permeation in drug discovery.

...read moreread less

Abstract: The goals of the present study were to apply a generalized regression model and support vector machine (SVM) models with Shape Signatures descriptors, to the domain of blood–brain barrier (BBB) modeling. The Shape Signatures method is a novel computational tool that was used to generate molecular descriptors utilized with the SVM classification technique with various BBB datasets. For comparison purposes we have created a generalized linear regression model with eight MOE descriptors and these same descriptors were also used to create SVM models. The generalized regression model was tested on 100 molecules not in the model and resulted in a correlation r 2 = 0.65. SVM models with MOE descriptors were superior to regression models, while Shape Signatures SVM models were comparable or better than those with MOE descriptors. The best 2D shape signature models had 10-fold cross validation prediction accuracy between 80–83% and leave-20%-out testing prediction accuracy between 80–82% as well as correctly predicting 84% of BBB+ compounds (n = 95) in an external database of drugs. Our data indicate that Shape Signatures descriptors can be used with SVM and these models may have utility for predicting blood–brain barrier permeation in drug discovery.

...read moreread less

Journal Article•DOI•

Residual periodograms for choosing regularization parameters for ill-posed problems

[...]

Bert W. Rust¹, Dianne P. O'Leary², Dianne P. O'Leary¹•Institutions (2)

National Institute of Standards and Technology¹, University of Maryland, College Park²

01 Jun 2008-Inverse Problems

TL;DR: In this paper, the authors survey regularization and parameter selection from a linear algebra and statistics viewpoint and compare the statistical distributions of regularized estimates of the solution and the residual, and evaluate a method for choosing the regularization parameter that makes the residuals as close as possible to white noise.

...read moreread less

Abstract: Consider an ill-posed problem transformed if necessary so that the errors in the data are independent identically normally distributed with mean zero and variance 1. We survey regularization and parameter selection from a linear algebra and statistics viewpoint and compare the statistical distributions of regularized estimates of the solution and the residual. We discuss methods for choosing a regularization parameter in order to assure that the residual for the model is statistically plausible. Ideally, as proposed by Rust (1998 Tech. Rep. NISTIR 6131, 2000 Comput. Sci. Stat. 32 333–47 ), the results of candidate parameter choices should be evaluated by plotting the resulting residual along with its periodogram and its cumulative periodogram, but sometimes an automated choice is needed. We evaluate a method for choosing the regularization parameter that makes the residuals as close as possible to white noise, using a diagnostic test based on the periodogram. We compare this method with standard techniques such as the discrepancy principle, the L-curve and generalized cross validation, showing that it performs better on two new test problems as well as a variety of standard problems.

...read moreread less

Journal Article•DOI•

An interactive Bayesian geostatistical inverse protocol for hydraulic tomography

[...]

Michael N. Fienen¹, Tom Clemo², Peter K. Kitanidis³•Institutions (3)

United States Geological Survey¹, Boise State University², Stanford University³

01 Dec 2008-Water Resources Research

TL;DR: In this paper, a Bayesian geostatistical inverse approach is applied to accommodate a flexible model with the level of complexity driven by the data and explicitly considering uncertainty, and prior information is incorporated through the selection of a parameter covariance model characterizing continuity and providing stability.

...read moreread less

Abstract: [1] Hydraulic tomography is a powerful technique for characterizing heterogeneous hydrogeologic parameters. An explicit trade-off between characterization based on measurement misfit and subjective characterization using prior information is presented. We apply a Bayesian geostatistical inverse approach that is well suited to accommodate a flexible model with the level of complexity driven by the data and explicitly considering uncertainty. Prior information is incorporated through the selection of a parameter covariance model characterizing continuity and providing stability. Often, discontinuities in the parameter field, typically caused by geologic contacts between contrasting lithologic units, necessitate subdivision into zones across which there is no correlation among hydraulic parameters. We propose an interactive protocol in which zonation candidates are implied from the data and are evaluated using cross validation and expert knowledge. Uncertainty introduced by limited knowledge of dynamic regional conditions is mitigated by using drawdown rather than native head values. An adjoint state formulation of MODFLOW-2000 is used to calculate sensitivities which are used both for the solution to the inverse problem and to guide protocol decisions. The protocol is tested using synthetic two-dimensional steady state examples in which the wells are located at the edge of the region of interest.

...read moreread less

Journal Article•DOI•

Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles

[...]

Guangtao Ge¹, G. William Wong²•Institutions (2)

Tufts University¹, Johns Hopkins University School of Medicine²

11 Jun 2008-BMC Bioinformatics

TL;DR: In this cross validation framework, classifier ensembles generally have better classification accuracies compared to that of a single decision tree when applied to a pancreatic cancer proteomic dataset, thus suggesting its utility in future proteomics data analysis.

...read moreread less

Abstract: Pancreatic cancer is the fourth leading cause of cancer death in the United States. Consequently, identification of clinically relevant biomarkers for the early detection of this cancer type is urgently needed. In recent years, proteomics profiling techniques combined with various data analysis methods have been successfully used to gain critical insights into processes and mechanisms underlying pathologic conditions, particularly as they relate to cancer. However, the high dimensionality of proteomics data combined with their relatively small sample sizes poses a significant challenge to current data mining methodology where many of the standard methods cannot be applied directly. Here, we propose a novel methodological framework using machine learning method, in which decision tree based classifier ensembles coupled with feature selection methods, is applied to proteomics data generated from premalignant pancreatic cancer. This study explores the utility of three different feature selection schemas (Student t test, Wilcoxon rank sum test and genetic algorithm) to reduce the high dimensionality of a pancreatic cancer proteomic dataset. Using the top features selected from each method, we compared the prediction performances of a single decision tree algorithm C4.5 with six different decision-tree based classifier ensembles (Random forest, Stacked generalization, Bagging, Adaboost, Logitboost and Multiboost). We show that ensemble classifiers always outperform single decision tree classifier in having greater accuracies and smaller prediction errors when applied to a pancreatic cancer proteomics dataset. In our cross validation framework, classifier ensembles generally have better classification accuracies compared to that of a single decision tree when applied to a pancreatic cancer proteomic dataset, thus suggesting its utility in future proteomics data analysis. Additionally, the use of feature selection method allows us to select biomarkers with potentially important roles in cancer development, therefore highlighting the validity of this method.

...read moreread less

Journal Article•DOI•

Fast and efficient optimization of hydrologic model parameters using a priori estimates and stepwise line search

[...]

Vadim Kuzmin¹, Dong Jun Seo², Dong Jun Seo³, Victor Koren²•Institutions (3)

University of Melbourne¹, Silver Spring Networks², University Corporation for Atmospheric Research³

20 May 2008-Journal of Hydrology

TL;DR: It is shown that SLS locates the posterior parameter estimates very efficiently in the vicinity of the a priori estimates that are comparable, in terms of reducing the objective function value, to those from global minimization.

...read moreread less

Book Chapter•DOI•

Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation

[...]

Trevor Hastie

01 Jan 2008

Journal Article•DOI•

Efficient approximate leave-one-out cross-validation for kernel logistic regression

[...]

Gavin C. Cawley¹, Nicola L. C. Talbot¹•Institutions (1)

University of East Anglia¹

01 Jun 2008-Machine Learning

TL;DR: Results obtained are given, demonstrating that the proposed model selection procedure is competitive with a more conventional k-fold cross-validation based approach and also with Gaussian process (GP) classifiers implemented using the Laplace approximation and via the Expectation Propagation (EP) algorithm.

...read moreread less

Abstract: Kernel logistic regression (KLR) is the kernel learning method best suited to binary pattern recognition problems where estimates of a-posteriori probability of class membership are required Such problems occur frequently in practical applications, for instance because the operational prior class probabilities or equivalently the relative misclassification costs are variable or unknown at the time of training the model The model parameters are given by the solution of a convex optimization problem, which may be found via an efficient iteratively re-weighted least squares (IRWLS) procedure The generalization properties of a kernel logistic regression machine are however governed by a small number of hyper-parameters, the values of which must be determined during the process of model selection In this paper, we propose a novel model selection strategy for KLR, based on a computationally efficient closed-form approximation of the leave-one-out cross-validation procedure Results obtained on a variety of synthetic and real-world benchmark datasets are given, demonstrating that the proposed model selection procedure is competitive with a more conventional k-fold cross-validation based approach and also with Gaussian process (GP) classifiers implemented using the Laplace approximation and via the Expectation Propagation (EP) algorithm

...read moreread less

Collapse