
Showing papers on "Overfitting" published in 2009


Book
16 Mar 2009
TL;DR: This book covers the development, validation, and updating of clinical prediction models, including overfitting and optimism, with a case study on survival analysis (prediction of secondary cardiovascular events) and lessons drawn from case studies.
Abstract: Introduction.- Applications of prediction models.- Study design for prediction models.- Statistical models for prediction.- Overfitting and optimism in prediction models.- Choosing between alternative statistical models.- Dealing with missing values.- Case study on dealing with missing values.- Coding of categorical and continuous predictors.- Restrictions on candidate predictors.- Selection of main effects.- Assumptions in regression models: Additivity and linearity.- Modern estimation methods.- Estimation with external methods.- Evaluation of performance.- Clinical usefulness.- Validation of prediction models.- Presentation formats.- Patterns of external validity.- Updating for a new setting.- Updating for multiple settings.- Prediction of a binary outcome: 30-day mortality after acute myocardial infarction.- Case study on survival analysis: Prediction of secondary cardiovascular events.- Lessons from case studies.

2,771 citations


Journal Article
TL;DR: This work considers regularized support vector machines and shows that they are precisely equivalent to a new robust optimization formulation, establishing robustness as the reason regularized SVMs generalize well, and gives a new proof of consistency of (kernelized) SVMs.
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection against noise while at the same time controlling overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.
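As a minimal, illustrative sketch (not the paper's formulation), the snippet below fits a standard regularized soft-margin SVM with scikit-learn; the dataset and the values of C and gamma are placeholders, with C acting as the inverse regularization strength that the robustness interpretation explains.

```python
# Minimal sketch: the regularized soft-margin SVM that the paper equates with a
# robust optimization problem. Smaller C means stronger regularization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF-kernelized SVM; the regularization term controls overfitting, which the
# paper interprets as protection against feature-space perturbations (noise).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```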

419 citations


Proceedings ArticleDOI
15 May 2009
TL;DR: This work investigates Linear Forward Selection, a technique to reduce the number of attribute expansions in each forward selection step, and shows that this approach is faster, finds smaller subsets and can even increase accuracy compared to standard forward selection.
Abstract: Scheme-specific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal cross-validation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they incur a high computational cost. The problems of overfitting and high runtime occur in particular on high-dimensional datasets, like microarray data. We investigate Linear Forward Selection, a technique to reduce the number of attribute expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets and can even increase the accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed “optimal” subset size. We show that this technique reduces subset size while maintaining comparable accuracy.
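A rough sketch of the general idea, under assumed choices (a univariate pre-ranking, a cap of k = 10 candidate expansions per step, naive Bayes as the wrapped scheme) that are illustrative rather than the paper's exact procedure:

```python
# Wrapper-style forward selection with a limited number of attribute
# expansions per step, in the spirit of Linear Forward Selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=50, n_informative=8, random_state=0)

# Pre-rank attributes once (here by a univariate F-score) and only ever expand
# the best k not-yet-selected attributes instead of all remaining ones.
rank = np.argsort(-f_classif(X, y)[0])
k, selected, best_score = 10, [], -np.inf

while True:
    candidates = [f for f in rank if f not in selected][:k]
    if not candidates:
        break
    scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=5).mean()
              for f in candidates}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:      # stop when no candidate improves CV accuracy
        break
    best_score, selected = scores[f_best], selected + [f_best]

print("selected attributes:", selected, "CV accuracy:", round(best_score, 3))
```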

241 citations


Journal ArticleDOI
TL;DR: A new rule-refinement scheme is proposed that is based on the maximization of fuzzy entropy on the training set, which is expected to improve the generalization capability of initial fuzzy IF-THEN rules while avoiding overfitting during refinement.
Abstract: When fuzzy IF-THEN rules initially extracted from data do not perform satisfactorily, we consider that the rules require refinement. Distinct from most existing rule-refinement approaches, which are based on further reduction of training error, this paper proposes a new rule-refinement scheme based on the maximization of fuzzy entropy on the training set. The new scheme, which is realized by solving a quadratic programming problem, is expected to improve the generalization capability of the initial fuzzy IF-THEN rules while avoiding overfitting during refinement. Experimental results on a number of selected databases demonstrate the expected improvement of generalization capability and the prevention of overfitting through a comparison of both training and testing accuracy before and after refinement.

222 citations


Journal ArticleDOI
TL;DR: A methodological approach to guide the choice of SVM parameters, based on a grid search that minimizes the classification error rate while also visualizing the number of support vectors (SVs); visualizing the SVs in principal-component subspaces helps to interpret the trained SVM in more depth.

197 citations


Journal ArticleDOI
01 Mar 2009
TL;DR: A novel approach is proposed to enhance the prediction performance of CBR for corporate bankruptcy prediction through the simultaneous optimization of feature weighting and instance selection for CBR using genetic algorithms (GAs).
Abstract: One of the most important research issues in finance is building effective corporate bankruptcy prediction models because they are essential for the risk management of financial institutions. Researchers have applied various data-driven approaches to enhance prediction performance, including statistical and artificial intelligence techniques, and many of them have proved to be useful. Case-based reasoning (CBR) is one of the most popular data-driven approaches because it is easy to apply, has no possibility of overfitting, and provides a good explanation for the output. However, it has a critical limitation: its prediction performance is generally low. In this study, we propose a novel approach to enhance the prediction performance of CBR for the prediction of corporate bankruptcies. Our suggestion is the simultaneous optimization of feature weighting and instance selection for CBR using genetic algorithms (GAs). Our model can improve prediction performance by referencing more relevant cases and eliminating noise. We apply our model to a real-world case. Experimental results show that the prediction accuracy of conventional CBR may be improved significantly by using our model. Our study suggests ways for financial institutions to build a bankruptcy prediction model that produces accurate results as well as good explanations for these results.
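A toy sketch of the joint encoding under illustrative assumptions (a chromosome holding per-feature weights plus a 0/1 instance mask, fitness defined as validation accuracy of weighted nearest-neighbour retrieval, truncation selection with one-point crossover); the encoding, operators and data are not the paper's:

```python
# Simple GA that jointly evolves feature weights and an instance-selection mask
# for a weighted 1-NN (CBR-style) classifier on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)
X_case, X_val, y_case, y_val = X[:200], X[200:], y[:200], y[200:]
d, n_case = X_case.shape[1], X_case.shape[0]

def fitness(chrom):
    w, mask = chrom[:d], chrom[d:] > 0.5
    if mask.sum() == 0:
        return 0.0
    C, yc = X_case[mask], y_case[mask]
    # weighted nearest-neighbour retrieval: the most similar retained case decides
    dists = ((X_val[:, None, :] - C[None, :, :]) ** 2 * w).sum(axis=2)
    return (yc[dists.argmin(axis=1)] == y_val).mean()

pop = rng.random((30, d + n_case))
for gen in range(40):
    fit = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(-fit)[:10]]            # truncation selection
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, d + n_case)           # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        mut = rng.random(child.shape) < 0.02        # mutation
        child[mut] = rng.random(mut.sum())
        children.append(child)
    pop = np.array(children)

best = pop[np.argmax([fitness(c) for c in pop])]
print("validation accuracy of best chromosome:", round(fitness(best), 3))
```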

182 citations


Journal ArticleDOI
TL;DR: The conclusions are that the performance of the classifiers depends very much on the distribution of data, and it is recommended to look at the data structure prior to model building to determine the optimal type of model.

182 citations


Journal ArticleDOI
TL;DR: Experimental results on benchmark data sets give evidence that the proposed approach to particle swarm optimization (PSO) for classification tasks is very effective, despite its simplicity, and results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge.
Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection (FMS) for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extent. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge.
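A hedged sketch of PSO over a deliberately tiny model-selection space (whether to standardize, plus C and gamma for an RBF SVM); the particle encoding, bounds, swarm size and inertia/acceleration constants are illustrative choices, not the system described in the paper:

```python
# Each particle encodes (use_scaling, log10 C, log10 gamma); fitness is CV accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def score(p):
    scale, logC, logG = p
    steps = ([StandardScaler()] if scale > 0.5 else []) + [SVC(C=10**logC, gamma=10**logG)]
    return cross_val_score(make_pipeline(*steps), X, y, cv=3).mean()

lo, hi = np.array([0.0, -2.0, -5.0]), np.array([1.0, 3.0, 1.0])
pos = rng.uniform(lo, hi, size=(12, 3))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([score(p) for p in pos])
gbest = pbest[pbest_val.argmax()]

for it in range(15):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)  # PSO update
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([score(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()]

print("best CV accuracy:", round(pbest_val.max(), 3), "settings:", gbest)
```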

179 citations


Journal ArticleDOI
TL;DR: Comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.
Abstract: In this paper, a new density-based clustering framework is proposed by adopting the assumption that the cluster centers in data space can be regarded as target objects in image space. First, the level set evolution is adopted to find an approximation of cluster centers by using a new initial boundary formation scheme. Accordingly, three types of initial boundaries are defined so that each of them can evolve to approach the cluster centers in different ways. To avoid the long iteration time of level set evolution in data space, an efficient termination criterion is presented to stop the evolution process in the circumstance that no more cluster centers can be found. Then, a new effective density representation called level set density (LSD) is constructed from the evolution results. Finally, the valley seeking clustering is used to group data points into corresponding clusters based on the LSD. The experiments on some synthetic and real data sets have demonstrated the efficiency and effectiveness of the proposed clustering framework. The comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.

175 citations


Journal ArticleDOI
TL;DR: It is indicated that training ANNs with noise injection can reduce overfitting to a greater degree than early stopping and to a similar degree as weight decay.
Abstract: The purpose of this study was to investigate the effect of a noise injection method on the “overfitting” problem of artificial neural networks (ANNs) in two-class classification tasks. The authors compared ANNs trained with noise injection to ANNs trained with two other methods for avoiding overfitting: weight decay and early stopping. They also evaluated an automatic algorithm for selecting the magnitude of the noise injection. They performed simulation studies of an exclusive-or classification task with training datasets of 50, 100, and 200 cases (half normal and half abnormal) and an independent testing dataset of 2000 cases. They also compared the methods using a breast ultrasound dataset of 1126 cases. For simulated training datasets of 50 cases, the area under the receiver operating characteristic curve (AUC) was greater (by 0.03) when training with noise injection than when training without any regularization, and the improvement was greater than those from weight decay and early stopping (both of 0.02). For training datasets of 100 cases, noise injection and weight decay yielded similar increases in the AUC (0.02), whereas early stopping produced a smaller increase (0.01). For training datasets of 200 cases, the increases in the AUC were negligibly small for all methods (0.005). For the ultrasound dataset, noise injection had a greater average AUC than ANNs trained without regularization and a slightly greater average AUC than ANNs trained with weight decay. These results indicate that training ANNs with noise injection can reduce overfitting to a greater degree than early stopping and to a similar degree as weight decay.
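A minimal sketch of noise-injection training, assuming a scikit-learn MLP trained pass-by-pass with fresh Gaussian input noise; the network size, sigma and number of passes are placeholders rather than the study's settings (the paper also describes an automatic choice of the noise magnitude):

```python
# Fresh Gaussian noise is added to the training inputs at every pass, which
# discourages the network from fitting individual cases (a regularizer).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

net = MLPClassifier(hidden_layer_sizes=(8,), learning_rate_init=0.01, random_state=0)
sigma = 0.3                      # noise magnitude (illustrative)
classes = np.unique(y)
for epoch in range(200):
    X_noisy = X_tr + rng.normal(scale=sigma, size=X_tr.shape)
    net.partial_fit(X_noisy, y_tr, classes=classes)

print("test accuracy with noise injection:", net.score(X_te, y_te))
```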

162 citations


Journal ArticleDOI
TL;DR: This paper analyzes NCL and reveals that the training of NCL corresponds to training the entire ensemble as a single learning machine that only minimizes the MSE without regularization, which explains the reason why NCL is prone to overfitting the noise in the training set.
Abstract: Negative correlation learning (NCL) is a neural network ensemble learning algorithm that introduces a correlation penalty term into the cost function of each individual network so that each neural network minimizes its mean square error (MSE) together with the correlation of the ensemble. This paper analyzes NCL and reveals that the training of NCL (when λ = 1) corresponds to training the entire ensemble as a single learning machine that only minimizes the MSE without regularization. This analysis explains why NCL is prone to overfitting the noise in the training set. This paper also demonstrates that tuning the correlation parameter λ in NCL by cross validation cannot overcome the overfitting problem. The paper analyzes this problem and proposes the regularized negative correlation learning (RNCL) algorithm, which incorporates an additional regularization term for the whole ensemble. RNCL decomposes the ensemble's training objectives, including MSE and regularization, into a set of sub-objectives, and each sub-objective is implemented by an individual neural network. In this paper, we also provide a Bayesian interpretation for RNCL and an automatic algorithm to optimize regularization parameters based on Bayesian inference. The RNCL formulation is applicable to any nonlinear estimator minimizing the MSE. The experiments on synthetic as well as real-world data sets demonstrate that RNCL achieves better performance than NCL, especially when the noise level is nontrivial in the data set.
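For concreteness, a minimal sketch of the cost structure being discussed, written in one common NCL parameterization (per-network squared error plus a negative-correlation penalty, with a separate weight-decay term for RNCL); the variable names and 1/2 factors are illustrative rather than copied from the paper:

```python
import numpy as np

def ncl_cost(F, y, lam):
    """F: (M, n) outputs of M ensemble members, y: (n,) targets."""
    fbar = F.mean(axis=0)                       # ensemble output
    err = 0.5 * ((F - y) ** 2).sum()            # sum of individual squared errors
    pen = -0.5 * ((F - fbar) ** 2).sum()        # negative-correlation penalty
    # With lam = 1 this equals 0.5 * M * ((fbar - y) ** 2).sum(): the ensemble is
    # effectively trained as a single machine on the plain MSE with no
    # regularization, which is the overfitting case the paper analyses.
    return err + lam * pen

def rncl_cost(F, y, lam, alpha, weights):
    """RNCL adds a weight-decay style regularizer over all networks' weights."""
    return ncl_cost(F, y, lam) + alpha * sum(float((w ** 2).sum()) for w in weights)
```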

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A novel Multi-Task Learning (MTL) framework is presented, called Boosted MTL, for face verification with limited training data that jointly learns classifiers for multiple people by sharing a few boosting classifiers in order to avoid overfitting.
Abstract: Face verification has many potential applications including filtering and ranking image/video search results on celebrities. Since these images/videos are taken under uncontrolled environments, the problem is very challenging due to dramatic lighting and pose variations, low resolutions, compression artifacts, etc. In addition, the available number of training images for each celebrity may be limited, hence learning individual classifiers for each person may cause overfitting. In this paper, we propose two ideas to meet the above challenges. First, we propose to use individual bins, instead of whole histograms, of Local Binary Patterns (LBP) as features for learning, which yields significant performance improvements and computation reduction in our experiments. Second, we present a novel Multi-Task Learning (MTL) framework, called Boosted MTL, for face verification with limited training data. It jointly learns classifiers for multiple people by sharing a few boosting classifiers in order to avoid overfitting. The effectiveness of Boosted MTL and LBP bin features is verified with a large number of celebrity images/videos from the web.

01 Jan 2009
TL;DR: A method is developed for automatically incorporating variable selection in Fisher's linear discriminant analysis (LDA) by exploiting its connection to a generalized eigenvalue problem; the resulting sparse LDA overcomes the data piling problem.
Abstract: This paper develops a method for automatically incorporating variable selection in Fisher's linear discriminant analysis (LDA). Utilizing the connection of Fisher's LDA and a generalized eigenvalue problem, our approach applies the method of regularization to obtain sparse linear discriminant vectors, where "sparse" means that the discriminant vectors have only a small number of nonzero components. Our sparse LDA procedure is especially effective in the so-called high dimensional, low sample size (HDLSS) settings, where LDA possesses the "data piling" property, that is, it maps all points from the same class in the training data to a common point, and so when viewed along the LDA projection directions, the data are piled up. Data piling indicates overfitting and usually results in poor out-of-sample classification. By incorporating variable selection, the sparse LDA overcomes the data piling problem. The underlying assumption is that, among the large number of variables, there are many irrelevant or redundant variables for the purpose of classification. By using only important or significant variables we essentially deal with a lower dimensional problem. Both synthetic and real data sets are used to illustrate the proposed method.

Journal ArticleDOI
TL;DR: Two genetic-algorithm based neural network (NN) models are proposed to predict customer churn in subscription wireless services; results indicate that medium-sized NNs perform best and that the cross-entropy based criterion may be more resistant to overfitting outliers in the training dataset.
Abstract: Marketing research suggests that it is more expensive to recruit a new customer than to retain an existing customer. In order to retain existing customers, academics and practitioners have developed churn prediction models to effectively manage customer churn. In this paper, we propose two genetic-algorithm (GA) based neural network (NN) models to predict customer churn in subscription wireless services. Our first GA based NN model uses a cross-entropy based criterion to predict customer churn, and our second GA based NN model attempts to directly maximize the prediction accuracy of customer churn. Using a real-world cellular wireless services dataset and three different sizes of NNs, we compare the two GA based NN models with a statistical z-score model using several model evaluation criteria, which include prediction accuracy, top 10% decile lift and area under the receiver operating characteristics (ROC) curve. The results of our experiments indicate that both GA based NN models outperform the statistical z-score model on all performance criteria. Further, we observe that medium-sized NNs perform best and that the cross-entropy based criterion may be more resistant to overfitting outliers in the training dataset.

Journal ArticleDOI
01 Jun 2009
TL;DR: This work presents a new approach to studying structure in inferred connections based on a Bayesian clustering algorithm that can successfully infer connectivity patterns in simulated data and apply the algorithm to spike data recorded from primary motor and premotor cortices of a monkey.
Abstract: Current multielectrode techniques enable the simultaneous recording of spikes from hundreds of neurons. To study neural plasticity and network structure it is desirable to infer the underlying functional connectivity between the recorded neurons. Functional connectivity is defined by a large number of parameters, which characterize how each neuron influences the other neurons. A Bayesian approach that combines information from the recorded spikes (likelihood) with prior beliefs about functional connectivity (prior) can improve inference of these parameters and reduce overfitting. Recent studies have used likelihood functions based on the statistics of point-processes and a prior that captures the sparseness of neural connections. Here we include a prior that captures the empirical finding that interactions tend to vary smoothly in time. We show that this method can successfully infer connectivity patterns in simulated data and apply the algorithm to spike data recorded from primary motor (M1) and premotor (PMd) cortices of a monkey. Finally, we present a new approach to studying structure in inferred connections based on a Bayesian clustering algorithm. Groups of neurons in M1 and PMd show common patterns of input and output that may correspond to functional assemblies.

Journal ArticleDOI
TL;DR: A multiple-input, multiple-output (MIMO) nonlinear dynamic model for the input-output transformation of spike trains is formulated and it is shown that this model is equivalent to a generalized linear model with a probit link function.

Journal ArticleDOI
TL;DR: The proposed unusual video event detection method is based on unsupervised clustering of object trajectories, which are modeled by hidden Markov models (HMM), and includes a dynamic hierarchical process incorporated in the trajectory clustering algorithm.
Abstract: The proposed unusual video event detection method is based on unsupervised clustering of object trajectories, which are modeled by hidden Markov models (HMM). The novelty of the method includes a dynamic hierarchical process incorporated in the trajectory clustering algorithm to prevent model overfitting and a 2-depth greedy search strategy for efficient clustering.

Journal ArticleDOI
TL;DR: An improved neural network approach is presented for the simultaneous development of accurate potential-energy hypersurfaces and corresponding force fields that can be utilized to conduct ab initio molecular dynamics and Monte Carlo studies on gas-phase chemical reactions.
Abstract: An improved neural network (NN) approach is presented for the simultaneous development of accurate potential-energy hypersurfaces and corresponding force fields that can be utilized to conduct ab initio molecular dynamics and Monte Carlo studies on gas-phase chemical reactions. The method is termed combined function derivative approximation (CFDA). The novelty of the CFDA method lies in the fact that although the NN has only a single output neuron that represents potential energy, the network is trained in such a way that the derivatives of the NN output match the gradient of the potential-energy hypersurface. Accurate force fields can therefore be computed simply by differentiating the network. Both the computed energies and the gradients are then accurately interpolated using the NN. This approach is superior to having the gradients appear in the output layer of the NN because it greatly simplifies the required architecture of the network. The CFDA permits weighting of function fitting relative to gradient fitting. In every test that we have run on six different systems, CFDA training (without a validation set) has produced smaller out-of-sample testing error than early stopping (with a validation set) or Bayesian regularization (without a validation set). This indicates that CFDA training does a better job of preventing overfitting than the standard methods currently in use. The training data can be obtained using an empirical potential surface or any ab initio method. The accuracy and interpolation power of the method have been tested for the reaction dynamics of H+HBr using an analytical potential. The results show that the present NN training technique produces more accurate fits to both the potential-energy surface and the corresponding force fields than previous methods. The fitting and interpolation accuracy is so high (rms error = 1.2 cm^-1) that trajectories computed on the NN potential exhibit point-by-point agreement with corresponding trajectories on the analytic surface.
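A hedged PyTorch sketch of the combined function/derivative fitting idea: a single-output energy network whose input gradients are also fitted to reference gradients, so that forces come from differentiating the trained network. The toy surface, network size and loss weights w_E and w_F are assumptions for illustration, not the paper's setup:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# toy reference data: energies and gradients of E(x) = |x|^2 (stand-in for an ab initio surface)
x_ref = torch.randn(256, 3)
E_ref = (x_ref ** 2).sum(dim=1, keepdim=True)
g_ref = 2 * x_ref

w_E, w_F = 1.0, 0.1          # relative weighting of function vs. gradient fitting
for step in range(2000):
    x = x_ref.clone().requires_grad_(True)
    E = net(x)
    # dE/dx via autograd; create_graph=True so the gradient-matching loss is trainable
    g = torch.autograd.grad(E.sum(), x, create_graph=True)[0]
    loss = w_E * ((E - E_ref) ** 2).mean() + w_F * ((g - g_ref) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```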

Journal ArticleDOI
TL;DR: A new machine learning approach for predicting DNA-binding residues from amino acid sequence data based on the use of evolutionary information was found to significantly improve classifier performance.
Abstract: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundred protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data. A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labeled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures. The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF http://bioinfo.ggc.org/bindn-rf/ has thus been developed to make the RF classifier accessible to the biological research community.
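An assumed-shape sketch of the classification stage: sliding-window PSSM features feeding a random forest. The window size, the toy stand-in data and the forest settings are illustrative; the paper's additional descriptors and the BindN-RF training set are not reproduced:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def window_features(pssm, labels, w=5):
    """pssm: (L, 20) position-specific scores for one protein; labels: (L,) 0/1."""
    pad = np.zeros((w, 20))
    P = np.vstack([pad, pssm, pad])
    X = np.array([P[i:i + 2 * w + 1].ravel() for i in range(len(pssm))])
    return X, labels

# toy stand-in for real PSSM data and binding labels
rng = np.random.default_rng(0)
pssm = rng.normal(size=(400, 20))
labels = (pssm[:, 0] > 1).astype(int)
X, y = window_features(pssm, labels)

rf = RandomForestClassifier(n_estimators=300, random_state=0)   # many unpruned trees
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))
```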

Journal ArticleDOI
TL;DR: Experimental results showed that the prediction accuracy of conventional CBR may be improved significantly by using the proposed model, and it was found that the model outperformed all the other optimized models for CBR using GA.
Abstract: Case-based reasoning (CBR) is one of the most popular prediction techniques in medical domains because it is easy to apply, has no possibility of overfitting, and provides a good explanation for the output. However, it has a critical limitation: its prediction performance is generally lower than that of other AI techniques like artificial neural networks (ANN). In order to obtain accurate results from CBR, effective retrieval and matching of useful prior cases for the problem is essential, but designing a good matching and retrieval mechanism for CBR systems remains a controversial issue. In this study, we propose a novel approach to enhance the prediction performance of CBR. Our suggestion is the simultaneous optimization of feature weights, instance selection, and the number of neighbors to combine, using genetic algorithms (GA). Our model improves the prediction performance in three ways: (1) measuring similarity between cases more accurately by considering the relative importance of each feature, (2) eliminating useless or erroneous reference cases, and (3) combining several similar cases that represent significant patterns. To validate the usefulness of our model, this study applied it to a real-world case for evaluating cytological features derived directly from a digital scan of breast fine needle aspirate (FNA) slides. Experimental results showed that the prediction accuracy of conventional CBR may be improved significantly by using our model. We also found that our proposed model outperformed all the other optimized models for CBR using GA.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper proposes a machinery called the Heterogeneous Feature Machine (HFM), which builds a kernel logistic regression model based on similarities that combine different features and distance metrics to effectively solve visual recognition tasks in need of multiple types of features.
Abstract: With the recent efforts made by computer vision researchers, more and more types of features have been designed to describe various aspects of visual characteristics. Modeling such heterogeneous features has become an increasingly critical issue. In this paper, we propose a method called the Heterogeneous Feature Machine (HFM) to effectively solve visual recognition tasks in need of multiple types of features. Our HFM builds a kernel logistic regression model based on similarities that combine different features and distance metrics. Different from existing approaches that use a linear weighting scheme to combine different features, HFM does not require the weights to remain the same across different samples, and can therefore effectively handle features of different types with different metrics. To prevent the model from overfitting, we employ the so-called group LASSO constraints to reduce model complexity. In addition, we propose a fast algorithm based on coordinate gradient descent to efficiently train an HFM. The power of the proposed scheme is demonstrated across a wide variety of visual recognition tasks including scene, event and action recognition.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A novel algorithm based on flow velocity field estimation is presented to count the number of pedestrians across a detection line or inside a specified region; tilt-angle-specific learning ensures direct deployment and avoids overfitting, whereas the commonly used scene-specific learning scheme needs on-site annotation and tends to overfit.
Abstract: In this paper, we present a novel algorithm based on flow velocity field estimation to count the number of pedestrians across a detection line or inside a specified region. We regard pedestrians crossing the line as fluid flow, and design a novel model to estimate the flow velocity field. By integrating over time, dynamic mosaics are constructed to count the number of pixels and edges passing through the line. Consequently, the number of pedestrians can be estimated by quadratic regression, with the number of weighted pixels and edges as input. The regressors are learned offline from several camera tilt angles and take the calibration information into account. We use tilt-angle-specific learning to ensure direct deployment and avoid overfitting, whereas the commonly used scene-specific learning scheme needs on-site annotation and tends to overfit. Experiments on a variety of videos verified that the proposed method gives accurate estimates under different camera setups in real time.
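A sketch of the final regression stage only, on synthetic stand-in data: the pedestrian count is predicted from accumulated weighted pixel and edge totals with a degree-2 polynomial regressor. The flow-velocity field estimation and mosaic construction are not shown, and the numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
counts = rng.integers(0, 30, size=200)                   # ground-truth pedestrians per interval
pixels = counts * 950 + rng.normal(0, 300, size=200)     # accumulated weighted pixels crossing the line
edges = counts * 120 + rng.normal(0, 60, size=200)       # accumulated weighted edges crossing the line
X = np.column_stack([pixels, edges])

# quadratic regression from (pixels, edges) to pedestrian count
reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
reg.fit(X, counts)
print("predicted count for a new interval:", reg.predict([[9500, 1200]]).round(1))
```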

Journal ArticleDOI
TL;DR: In this paper, a hybrid classifier for polarimetric SAR images is proposed whose feature set consists of the span image, the H/A/α decomposition, and gray-level co-occurrence matrix (GLCM) based texture features.
Abstract: This paper proposes a hybrid classifier for polarimetric SAR images. The feature set consists of the span image, the H/A/α decomposition, and gray-level co-occurrence matrix (GLCM) based texture features. The features are then reduced by principal component analysis (PCA). A 3-layer neural network (NN) is constructed and trained with the resilient back-propagation (RPROP) method to speed up training and with early stopping (ES) to prevent overfitting. The results for the San Francisco and Flevoland sites, compared to Wishart maximum likelihood and a wavelet-based method, demonstrate the validity of our method in terms of confusion matrix and overall accuracy. In addition, NNs with and without PCA are compared; results show the NN with PCA is more accurate and faster.
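A schematic sketch of the classification stage only (not the SAR feature extraction): PCA-reduced features feeding a small neural network with hold-out early stopping. scikit-learn's MLP is used here as a stand-in for the RPROP-trained 3-layer network, and the synthetic data merely mimics a stacked feature vector:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in for stacked span / H-A-alpha / GLCM texture features
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),                                          # feature reduction
    MLPClassifier(hidden_layer_sizes=(20,), early_stopping=True,   # hold-out based early stopping
                  validation_fraction=0.2, max_iter=500, random_state=0),
)
clf.fit(X_tr, y_tr)
print("overall accuracy:", clf.score(X_te, y_te))
```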

Journal ArticleDOI
TL;DR: EigenMS is an adaptation of the surrogate variable analysis (SVA) algorithm of Leek and Storey, with the adaptations including a novel approach to preventing overfitting that facilitates the incorporation of EigenMS into an existing proteomics analysis pipeline.
Abstract: Motivation: LC-MS allows for the identification and quantification of proteins from biological samples. As with any high-throughput technology, systematic biases are often observed in LC-MS data, making normalization an important preprocessing step. Normalization models need to be flexible enough to capture biases of arbitrary complexity, while avoiding overfitting that would invalidate downstream statistical inference. Careful normalization of MS peak intensities would enable greater accuracy and precision in quantitative comparisons of protein abundance levels. Results: We propose an algorithm, called EigenMS, that uses singular value decomposition to capture and remove biases from LC-MS peak intensity measurements. EigenMS is an adaptation of the surrogate variable analysis (SVA) algorithm of Leek and Storey, with the adaptations including (i) the handling of the widespread missing measurements that are typical in LC-MS, and (ii) a novel approach to preventing overfitting that facilitates the incorporation of EigenMS into an existing proteomics analysis pipeline. Using both large-scale calibration measurements and simulations, EigenMS is shown to perform well relative to existing alternatives. Availability: The software has been made available in the open source proteomics platform DAnTE (Polpitiya et al., 2008) (http://omics.pnl.gov/software/), as well as in standalone software available at SourceForge (http://sourceforge.net). Contact: yuliya@stat.tamu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
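A deliberately simplified sketch of the underlying idea, ignoring missing values and the paper's overfitting safeguards: fit the known group structure, estimate systematic bias from the leading singular vectors of the residual matrix, and subtract it. The function name, variable names and toy data are invented for illustration:

```python
import numpy as np

def eigenms_like_normalize(intensities, groups, n_bias=1):
    """intensities: (peptides, samples) log intensities; groups: (samples,) labels."""
    X = intensities.copy()
    fitted = np.zeros_like(X)
    for g in np.unique(groups):
        cols = groups == g
        fitted[:, cols] = X[:, cols].mean(axis=1, keepdims=True)   # group means (signal to keep)
    resid = X - fitted
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    bias = (U[:, :n_bias] * s[:n_bias]) @ Vt[:n_bias]              # rank-n_bias systematic trend
    return X - bias

# toy example: 100 peptides, 8 samples in two groups, plus an injected sample-wise drift
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8)) + np.linspace(0, 2, 8)
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
normalized = eigenms_like_normalize(data, groups, n_bias=1)
```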

Journal ArticleDOI
TL;DR: It is shown that overfitting can be detected at the selection stage, and the global validation strategy can be used to measure the relationship between diversity and classification performance when diversity measures are employed as single-objective functions.

Journal ArticleDOI
TL;DR: The main results obtained in this paper not only extend the previously known results for i.i.d. observations to the case of exponentially strongly mixing observations, but also improve the previous results for strongly mixing samples.
Abstract: The generalization performance is the main concern of machine learning theoretical research. The previous main bounds describing the generalization ability of the Empirical Risk Minimization (ERM) algorithm are based on independent and identically distributed (i.i.d.) samples. In order to study the generalization performance of the ERM algorithm with dependent observations, we first establish the exponential bound on the rate of relative uniform convergence of the ERM algorithm with exponentially strongly mixing observations, and then we obtain the generalization bounds and prove that the ERM algorithm with exponentially strongly mixing observations is consistent. The main results obtained in this paper not only extend the previously known results for i.i.d. observations to the case of exponentially strongly mixing observations, but also improve the previous results for strongly mixing samples. Because the ERM algorithm is usually very time-consuming and overfitting may happen when the complexity of the hypothesis space is high, as an application of our main results we also explore a new strategy to implement the ERM algorithm in high complexity hypothesis space.

Journal ArticleDOI
TL;DR: Fuzzy c-means (FCM) clustering, when paired with PLS, provides a powerful research tool for metabolomics with improved visualization, accurate classification, and outlier estimation.
Abstract: Fuzzy c-means (FCM) clustering is an unsupervised method derived from fuzzy logic that is suitable for solving multiclass and ambiguous clustering problems. In this study, FCM clustering is applied to cluster metabolomics data. FCM is performed directly on the data matrix to generate a membership matrix which represents the degree of association the samples have with each cluster. The method is parametrized by the number of clusters (C) and the fuzziness coefficient (m), which denotes the degree of fuzziness in the algorithm. Both have been optimized by combining FCM with partial least squares (PLS), using the membership matrix as the Y matrix in the PLS model. The quality parameters R²Y and Q² of the PLS model have been used to monitor and optimize C and m. Data of metabolic profiles from three gene types of Escherichia coli were used to demonstrate the method. Different multivariable analysis methods have been compared. Principal component analysis failed to model the metabolite data, while partial least-squares discriminant analysis yielded results with overfitting. On the basis of the optimized parameters, FCM was able to reveal the main phenotype changes and individual characters of the three gene types of E. coli. Coupled with PLS, FCM provides a powerful research tool for metabolomics with improved visualization, accurate classification, and outlier estimation.
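A plain-NumPy sketch of the core FCM iteration (centre and membership updates) with C clusters and fuzziness coefficient m; the PLS-based tuning of C and m described in the paper is not reproduced, and the toy data are invented:

```python
import numpy as np

def fuzzy_c_means(X, C=3, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((C, len(X)))
    U /= U.sum(axis=0)                                    # memberships sum to 1 per sample
    for _ in range(n_iter):
        W = U ** m
        centers = (W @ X) / W.sum(axis=1, keepdims=True)  # fuzzy-weighted cluster centres
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        # standard membership update: u_ci = d_ci^(-2/(m-1)) / sum_k d_ki^(-2/(m-1))
        U = 1.0 / (d ** (2 / (m - 1)) * (1.0 / d ** (2 / (m - 1))).sum(axis=0))
    return U, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 4)) for c in (0, 2, 4)])
U, centers = fuzzy_c_means(X)
print("hard labels of first five samples:", U.argmax(axis=0)[:5])
```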

Proceedings ArticleDOI
28 Jun 2009
TL;DR: Interestingly, it is shown that the proposed approximate formulation can be transformed into an instance of the minimum s-t cut problem, which can be solved efficiently by finding maximum flows, and the efficiency of the proposed algorithm based on the maximum flow is shown.
Abstract: Mining discrete patterns in binary data is important for subsampling, compression, and clustering. We consider rank-one binary matrix approximations that identify the dominant patterns of the data, while preserving its discrete property. A best approximation on such data has a minimum set of inconsistent entries, i.e., mismatches between the given binary data and the approximate matrix. Due to the hardness of the problem, previous accounts of such problems employ heuristics and the resulting approximation may be far away from the optimal one. In this paper, we show that the rank-one binary matrix approximation can be reformulated as a 0-1 integer linear program (ILP). However, the ILP formulation is computationally expensive even for small-size matrices. We propose a linear program (LP) relaxation, which is shown to achieve a guaranteed approximation error bound. We further extend the proposed formulations using the regularization technique, which is commonly employed to address overfitting. The LP formulation is restricted to medium-size matrices, due to the large number of variables involved for large matrices. Interestingly, we show that the proposed approximate formulation can be transformed into an instance of the minimum s-t cut problem, which can be solved efficiently by finding maximum flows. Our empirical study shows the efficiency of the proposed algorithm based on the maximum flow. Results also confirm the established theoretical bounds.
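For intuition, a toy sketch of the objective only: a rank-one binary approximation u vᵀ of a binary matrix M and its count of inconsistent entries. The simple alternating update below is a heuristic baseline for illustration, not the paper's ILP/LP relaxation or max-flow algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
u_true, v_true = rng.integers(0, 2, 40), rng.integers(0, 2, 30)
M = (np.outer(u_true, v_true) ^ (rng.random((40, 30)) < 0.05)).astype(int)  # noisy rank-one data

u = rng.integers(0, 2, 40)
for _ in range(20):
    # choose each entry of v (then u) to minimize mismatches given the other factor
    v = (M[u == 1].sum(axis=0) * 2 > (u == 1).sum()).astype(int)
    u = (M[:, v == 1].sum(axis=1) * 2 > (v == 1).sum()).astype(int)

mismatches = int(np.abs(M - np.outer(u, v)).sum())   # the quantity the paper minimizes
print("inconsistent entries:", mismatches, "of", M.size)
```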

Journal ArticleDOI
15 Jul 2009-Geoderma
TL;DR: Support Vector Regression is a set of techniques in which model complexity is limited by the learning algorithm itself, which prevents overfitting; the resulting projection in feature space is especially well suited to sparse datasets.
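A minimal scikit-learn sketch of support vector regression on a small synthetic dataset; the kernel, C and epsilon values are placeholders, with C and the epsilon-insensitive tube acting as the complexity controls mentioned above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))                 # small, sparse training set
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

# epsilon-insensitive loss plus the C penalty bound the model complexity
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)
print("training R^2:", round(svr.score(X, y), 3))
```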

Journal ArticleDOI
18 Sep 2009-PLOS ONE
TL;DR: Three approaches to dealing with cluster-correlated data are compared: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject- level bootstrapping (SLB); RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
Abstract: Many mass spectrometry-based studies, as well as other biological experiments produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size which led to poorer classification and variable selection accuracy. Perhaps most importantly our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux as well as a user manual (Supplementary File S2) are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.