
Showing papers on "Overfitting published in 2003"


Proceedings Article
01 Jan 2003
TL;DR: In this article, a new classification approach, CPAR (Classification based on Predictive Association Rules), which combines the advantages of both associative classification and traditional rule-based classification, is proposed.
Abstract: Recent studies in data mining have proposed a new classification approach, called associative classification, which, according to several reports, such as [7, 6], achieves higher classification accuracy than traditional classification approaches such as C4.5. However, the approach also suffers from two major deficiencies: (1) it generates a very large number of association rules, which leads to high processing overhead; and (2) its confidence-based rule evaluation measure may lead to overfitting. In comparison with associative classification, traditional rule-based classifiers, such as C4.5, FOIL and RIPPER, are substantially faster but their accuracy, in most cases, may not be as high. In this paper, we propose a new classification approach, CPAR (Classification based on Predictive Association Rules), which combines the advantages of both associative classification and traditional rule-based classification. Instead of generating a large number of candidate rules as in associative classification, CPAR adopts a greedy algorithm to generate rules directly from training data. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid overfitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction.
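As a rough illustration of the two ideas singled out above (expected-accuracy scoring and best-k prediction), here is a hedged Python sketch. It assumes the expected accuracy is the Laplace-corrected accuracy of a rule; the rule representation is invented for illustration and is not the CPAR implementation.

```python
# Minimal sketch (not the authors' code): score a rule by Laplace-expected accuracy
# and predict with the best-k rules per class. Rule mining is assumed done already.

def laplace_accuracy(n_covered_correct, n_covered_total, n_classes):
    """Expected accuracy of a rule under the Laplace correction (assumed form)."""
    return (n_covered_correct + 1) / (n_covered_total + n_classes)

def predict(example, rules, k=5):
    """rules: list of (condition_fn, class_label, laplace_acc) tuples.
    Average the best-k rule accuracies per class and return the argmax."""
    scores = {}
    for cond, label, acc in rules:
        if cond(example):
            scores.setdefault(label, []).append(acc)
    if not scores:
        return None  # no rule fires; a real system would fall back to a default class
    return max(scores, key=lambda c: sum(sorted(scores[c], reverse=True)[:k])
                                     / min(k, len(scores[c])))
```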

892 citations


Journal Article
TL;DR: This paper addresses a common methodological flaw in the comparison of variable selection methods: cross-validation performance estimates of the different variable subsets, when used with computationally intensive search algorithms, may overfit and cannot be relied on to compare selection methods.
Abstract: This paper addresses a common methodological flaw in the comparison of variable selection methods. A practical approach to guide the search or the selection process is to compute cross-validation performance estimates of the different variable subsets. Used with computationally intensive search algorithms, these estimates may overfit and yield biased predictions. Therefore, they cannot be used reliably to compare two selection methods, as is shown by the empirical results of this paper. Instead, as in other instances of the model selection problem, independent test sets should be used for determining the final performance. The claims made in the literature about the superiority of more exhaustive search algorithms over simpler ones are also revisited, and some of them are shown not to hold.
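The methodological point lends itself to a small worked example. The sketch below, using synthetic data and a deliberately crude greedy search, shows the gap between the cross-validation score that guided the search and the score on an independent test set; it is illustrative only, not the paper's experimental setup.

```python
# Hedged sketch: CV scores that guide an intensive subset search become optimistically
# biased, so the finally selected subset should be re-scored on data the search never saw.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
X_search, X_test, y_search, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_subset, best_cv = [], 0.0
for _ in range(10):                      # crude greedy forward selection
    candidates = [f for f in range(X.shape[1]) if f not in best_subset]
    scored = [(np.mean(cross_val_score(LogisticRegression(max_iter=1000),
                                       X_search[:, best_subset + [f]], y_search, cv=5)), f)
              for f in candidates]
    cv, f = max(scored)
    if cv <= best_cv:
        break
    best_cv, best_subset = cv, best_subset + [f]

final = LogisticRegression(max_iter=1000).fit(X_search[:, best_subset], y_search)
print("CV score used by the search:", round(best_cv, 3))          # typically optimistic
print("Independent test score:", round(final.score(X_test[:, best_subset], y_test), 3))
```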

512 citations


Journal ArticleDOI
TL;DR: This work develops a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework.
Abstract: Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that that choice be justified. To date, justification has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel. (Bayesian model selection; decision theory; incorrect models; likelihood ratio test; maximum likelihood; nucleotide-substitution model; phylogeny.)

421 citations


Journal ArticleDOI
TL;DR: It is shown that the commonly applied leave-one-out cross-validation has a strong tendency to overfitting, underestimates the true prediction error, and should not be used without further constraints or further validation.
Abstract: Different methods of cross-validation are studied for their suitability to guide variable-selection algorithms to yield highly predictive models. It is shown that the commonly applied leave-one-out cross-validation has a strong tendency to overfitting, underestimates the true prediction error, and should not be used without further constraints or further validation. Alternatives to leave-one-out cross-validation and other validation methods are presented.
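A hedged illustration of the underestimation effect: in the synthetic example below the response is pure noise, yet leave-one-out cross-validation applied after variable selection on the full data set reports an error far below the irreducible one. The data and selection rule are assumptions for illustration, not taken from the paper.

```python
# Illustrative sketch: cross-validation applied after variable selection on the full
# data set underestimates the true prediction error. The response is pure noise, so
# the honest mean squared error is about var(y).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 200)), rng.normal(size=30)      # p >> n, no real signal

# Variables chosen on all 30 samples -- the selection already "saw" every response value.
X_sel = SelectKBest(f_regression, k=5).fit_transform(X, y)

loo_mse = -cross_val_score(LinearRegression(), X_sel, y,
                           cv=LeaveOneOut(), scoring="neg_mean_squared_error").mean()
print("LOO-CV estimate of MSE :", round(loo_mse, 2))   # typically well below var(y)
print("Irreducible error      :", round(y.var(), 2))   # what honest validation should approach
```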

223 citations


Journal ArticleDOI
TL;DR: A novel approach named Decision Forest is suggested that combines multiple Decision Tree models of similar predictive quality; the quality of the combined model is consistently and significantly better than that of the individual models in both training and testing steps.
Abstract: The techniques of combining the results of multiple classification models to produce a single prediction have been investigated for many years. In earlier applications, the multiple models to be combined were developed by altering the training set. The use of these so-called resampling techniques, however, poses the risk of reducing the predictivity of the individual models to be combined and/or overfitting the noise in the data, which might result in poorer prediction by the composite model than by the individual models. In this paper, we suggest a novel approach, named Decision Forest, that combines multiple Decision Tree models. Each Decision Tree model is developed using a unique set of descriptors. When models of similar predictive quality are combined using the Decision Forest method, the quality compared to the individual models is consistently and significantly improved in both training and testing steps. An example is presented for prediction of the binding affinity of 232 chemicals to the estrogen receptor.
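A minimal sketch of the Decision Forest idea as described above: trees grown on distinct descriptor subsets, combined by averaging their predicted class probabilities. The data set and the four-way split of descriptors are placeholders, not the estrogen-receptor data.

```python
# Hedged sketch: each tree sees its own disjoint descriptor subset; the forest
# averages the trees' predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=232, n_features=40, n_informative=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

subsets = np.array_split(np.arange(X.shape[1]), 4)          # unique descriptors per tree
trees = [DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr[:, s], y_tr)
         for s in subsets]

proba = np.mean([t.predict_proba(X_te[:, s]) for t, s in zip(trees, subsets)], axis=0)
print("Forest accuracy:", (proba.argmax(axis=1) == y_te).mean())
print("Single-tree accuracies:",
      [round(t.score(X_te[:, s], y_te), 2) for t, s in zip(trees, subsets)])
```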

202 citations


Journal ArticleDOI
TL;DR: In this article, several types of supervised feed-forward neural networks were investigated in an attempt to identify methods able to relate soil properties and grain yields on a point-by-point basis within ten individual site-years.
Abstract: Understanding the relationships between yield and soil properties and topographic characteristics is of critical importance in precision agriculture. A necessary first step is to identify techniques to reliably quantify the relationships between soil and topographic characteristics and crop yield. Stepwise multiple linear regression (SMLR), projection pursuit regression (PPR), and several types of supervised feed-forward neural networks were investigated in an attempt to identify methods able to relate soil properties and grain yields on a point-by-point basis within ten individual site-years. To avoid overfitting, evaluations were based on predictive ability using a 5-fold cross-validation technique. The neural techniques consistently outperformed both SMLR and PPR and provided minimal prediction errors in every site-year. However, in site-years with relatively fewer observations and in site-years where a single, overriding factor was not apparent, the improvements achieved by neural networks over both SMLR and PPR were small. A second phase of the experiment involved estimation of crop yield across multiple site-years by including climatological data. The ten site-years of data were appended with climatological variables, and prediction errors were computed. The results showed that significant overfitting had occurred and indicated that a much larger number of climatologically unique site-years would be required in this type of analysis.
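The evaluation protocol described above can be sketched in a few lines: compare a linear model and a small feed-forward network purely on 5-fold cross-validated prediction error. The synthetic soil/yield data below are a stand-in, not the site-year data.

```python
# Hedged sketch of the protocol: judge models on cross-validated predictive ability
# rather than fit to the training data.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 6))                                   # stand-in soil/topographic variables
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=300)   # nonlinear stand-in "yield"

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("linear regression", LinearRegression()),
                    ("feed-forward net ", MLPRegressor(hidden_layer_sizes=(10,),
                                                       max_iter=5000, random_state=0))]:
    rmse = np.sqrt(-cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error"))
    print(name, "CV RMSE: %.3f" % rmse.mean())
```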

168 citations


Journal Article
TL;DR: This paper considers the class of Bayesian mixture algorithms, where an estimator is formed by constructing a data-dependent mixture over some hypothesis space, and demonstrates that mixture approaches are particularly robust, and allow for the construction of highly complex estimators, while avoiding undesirable overfitting effects.
Abstract: Bayesian approaches to learning and estimation have played a significant role in the Statistics literature over many years. While they are often provably optimal in a frequentist setting, and lead to excellent performance in practical applications, there have not been many precise characterizations of their performance for finite sample sizes under general conditions. In this paper we consider the class of Bayesian mixture algorithms, where an estimator is formed by constructing a data-dependent mixture over some hypothesis space. Similarly to what is observed in practice, our results demonstrate that mixture approaches are particularly robust, and allow for the construction of highly complex estimators, while avoiding undesirable overfitting effects. Our results, while being data-dependent in nature, are insensitive to the underlying model assumptions, and apply whether or not these hold. At a technical level, the approach applies to unbounded functions, constrained only by certain moment conditions. Finally, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Boosting and Bagging.

157 citations


Journal ArticleDOI
TL;DR: It is shown under general regularity conditions that during the process of AdaBoost a consistent prediction is generated, which has the prediction error approximating the optimal Bayes error as the sample size increases.
Abstract: Recent experiments and theoretical studies show that AdaBoost can overfit in the limit of large time. If running the algorithm forever is suboptimal, a natural question is how low can the prediction error be during the process of AdaBoost? We show under general regularity conditions that during the process of AdaBoost a consistent prediction is generated, which has the prediction error approximating the optimal Bayes error as the sample size increases. This result suggests that, while running the algorithm forever can be suboptimal, it is reasonable to expect that some regularization method via truncation of the process may lead to a near-optimal performance for sufficiently large sample size.
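The truncation idea suggested above can be illustrated with a hedged sketch: run boosting for many rounds, track the staged error on held-out data, and stop where that error is lowest. The data set and stump base learner are placeholder choices, not the paper's setting.

```python
# Hedged sketch of regularization by truncation: pick the boosting round that
# minimizes held-out error instead of running the algorithm "forever".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=500, random_state=0).fit(X_tr, y_tr)
val_err = [np.mean(pred != y_val) for pred in boost.staged_predict(X_val)]
best_t = int(np.argmin(val_err)) + 1
print("Best truncation point: %d rounds (val. error %.3f); error after 500 rounds: %.3f"
      % (best_t, val_err[best_t - 1], val_err[-1]))
```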

148 citations


Journal ArticleDOI
Abstract: This paper describes our application of conditional random fields with feature induction to a Hindi named entity recognition task. With only five days development time and little knowledge of this language, we automatically discover relevant features by providing a large array of lexical tests and using feature induction to automatically construct the features that most increase conditional likelihood. In an effort to reduce overfitting, we use a combination of a Gaussian prior and early stopping based on the results of 10-fold cross validation.

145 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an application of the measure of total uncertainty on convex sets of probability distributions, also called credal sets, to the construction of classification trees, where the probabilities of the classes in each one of its leaves are estimated by using the imprecise Dirichlet model.
Abstract: We present an application of the measure of total uncertainty on convex sets of probability distributions, also called credal sets, to the construction of classification trees. In these classification trees the probabilities of the classes in each one of the leaves are estimated by using the imprecise Dirichlet model. In this way, smaller samples give rise to wider probability intervals. Branching a classification tree can decrease the entropy associated with the classes but, at the same time, as the sample is divided among the branches the nonspecificity increases. We use a total uncertainty measure (entropy plus nonspecificity) as the branching criterion. The stopping rule is not to increase the total uncertainty. The good behavior of this procedure for standard classification problems is shown. It is important to remark that it does not suffer from overfitting, giving similar results in the training and test samples. © 2003 Wiley Periodicals, Inc.
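For concreteness, a small sketch of the imprecise Dirichlet model intervals mentioned above (with made-up leaf counts) shows how smaller samples give wider probability intervals.

```python
# Sketch of the imprecise Dirichlet model (IDM): with hyperparameter s, the probability
# of class i given counts n_i out of N lies in [n_i / (N + s), (n_i + s) / (N + s)],
# so smaller leaf samples give wider intervals. Counts below are invented.
def idm_intervals(counts, s=1.0):
    N = sum(counts)
    return [(n / (N + s), (n + s) / (N + s)) for n in counts]

print(idm_intervals([40, 10]))   # large leaf: narrow intervals
print(idm_intervals([4, 1]))     # small leaf: same proportions, much wider intervals
```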

118 citations


Journal ArticleDOI
TL;DR: Despite the considerable flexibility of the family of fractional polynomials and the consequent risk of overfitting when several variables are considered, the multivariable selection algorithm can find stable models.
Abstract: Sauerbrei and Royston have recently described an algorithm, based on fractional polynomials, for the simultaneous selection of variables and of suitable transformations for continuous predictors in a multivariable regression setting. They illustrated the approach by analyses of two breast cancer data sets. Here we extend their work by considering how to assess possible instability in such multivariable fractional polynomial models. We first apply the algorithm repeatedly in many bootstrap replicates. We then use log-linear models to investigate dependencies among the inclusion fractions for each predictor and among the simplified classes of fractional polynomial function chosen in the bootstrap samples. To further evaluate the results, we define measures of instability based on a decomposition of the variability of the bootstrap-selected functions in relation to a reference function from the original model. For each data set we are able to identify large, reasonably stable subsets of the bootstrap replications in which the functional forms of the predictors appear fairly stable. Despite the considerable flexibility of the family of fractional polynomials and the consequent risk of overfitting when several variables are considered, we conclude that the multivariable selection algorithm can find stable models.
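A hedged sketch of the bootstrap-stability idea follows, using a much simpler selection rule (p-value-based inclusion in an ordinary linear model) in place of the multivariable fractional polynomial algorithm, and synthetic data.

```python
# Illustrative sketch: refit a crude selection rule in many bootstrap replicates and
# report per-variable inclusion fractions as a stability measure.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)    # only the first two matter

inclusion, B = np.zeros(p), 200
for _ in range(B):
    idx = rng.integers(0, n, n)                           # bootstrap resample
    fit = sm.OLS(y[idx], sm.add_constant(X[idx])).fit()
    inclusion += (fit.pvalues[1:] < 0.05)                 # crude inclusion rule
print("Bootstrap inclusion fractions:", np.round(inclusion / B, 2))
```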

Journal ArticleDOI
TL;DR: This work introduces a CART-based approach to discover EPs in microarray data that combines pattern search with a statistical procedure based on Fisher's exact test to assess the significance of each EP, assigning statistical significance to the inferred EPs.
Abstract: Motivation: Cancer diagnosis using gene expression profiles requires supervised learning and gene selection methods. Of the many suggested approaches, the method of emerging patterns (EPs) has the particular advantage of explicitly modeling interactions among genes, which improves classification accuracy. However, finding useful (i.e. short and statistically significant) EPs is typically very hard. Methods: Here we introduce a CART-based approach to discover EPs in microarray data. The method is based on growing decision trees from which the EPs are extracted. This approach combines pattern search with a statistical procedure based on Fisher's exact test to assess the significance of each EP. Subsequently, sample classification based on the inferred EPs is performed using maximum-likelihood linear discriminant analysis. Results: Using simulated data as well as gene expression data from colon and leukemia cancer experiments, we assessed the performance of our pattern search algorithm and classification procedure. In the simulations, our method recovers a large proportion of known EPs, while for real data it is comparable in classification accuracy with three top-performing alternative classification algorithms. In addition, it assigns statistical significance to the inferred EPs and allows the patterns to be ranked while simultaneously avoiding overfitting of the data. The new approach therefore provides a versatile and computationally fast tool for elucidating local gene interactions as well as for classification. Availability: A computer program written in the statistical language R implementing the new approach is freely available from the web page http://www.stat.uni-muenchen.de/~socher/
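The significance step can be illustrated with a short, hypothetical sketch: a candidate pattern is kept only if Fisher's exact test on its 2x2 table (pattern present/absent versus class) is significant. The counts below are invented, not from the colon or leukemia data.

```python
# Hedged sketch of the pattern-significance test with invented counts.
from scipy.stats import fisher_exact

# rows: pattern present / absent; columns: tumour class / normal class
table = [[18, 2],
         [7, 20]]
_, p_value = fisher_exact(table, alternative="greater")
print("p = %.4f -> %s" % (p_value, "keep pattern" if p_value < 0.01 else "discard"))
```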

Proceedings Article
01 Jan 2003
TL;DR: Empirical studies on eight different UCI data sets and one text categorization data set show that WeightBoost almost always achieves a considerably better classification accuracy than AdaBoost, and experiments on data with artificially controlled noise indicate that WeightBoost is more robust to noise than AdaBoost.
Abstract: AdaBoost has proved to be an effective method to improve the performance of base classifiers both theoretically and empirically. However, previous studies have shown that AdaBoost might suffer from the overfitting problem, especially for noisy data. In addition, most current work on boosting assumes that the combination weights are fixed constants and therefore does not take particular input patterns into consideration. In this paper, we present a new boosting algorithm, "WeightBoost", which tries to solve these two problems by introducing an input-dependent regularization factor to the combination weight. Similarly to AdaBoost, we derive a learning procedure for WeightBoost, which is guaranteed to minimize training errors. Empirical studies on eight different UCI data sets and one text categorization data set show that WeightBoost almost always achieves a considerably better classification accuracy than AdaBoost. Furthermore, experiments on data with artificially controlled noise indicate that WeightBoost is more robust to noise than AdaBoost.

Journal ArticleDOI
TL;DR: Order-selection criteria for vector autoregressive (AR) modeling are discussed; the proposed combined information criterion (CIC) for vector signals is robust to finite sample effects and has the optimal asymptotic penalty factor.
Abstract: Order-selection criteria for vector autoregressive (AR) modeling are discussed. The performance of an order-selection criterion is optimal if the model of the selected order is the most accurate model in the considered set of estimated models: here vector AR models. Suboptimal performance can be a result of underfit or overfit. The Akaike (1969) information criterion (AIC) is an asymptotically unbiased estimator of the Kullback-Leibler discrepancy (KLD) that can be used as an order-selection criterion. AIC is known to suffer from overfit: The selected model order can be greater than the optimal model order. Two causes of overfit are finite sample effects and asymptotic effects. As a consequence of finite sample effects, AIC underestimates the KLD for higher model orders, leading to overfit. Asymptotically, overfit is the result of statistical variations in the order-selection criterion. To derive an accurate order-selection criterion, both causes of overfit have to be addressed. Moreover, the cost of underfit has to be taken into account. The combined information criterion (CIC) for vector signals is robust to finite sample effects and has the optimal asymptotic penalty factor. This penalty factor is the result of a tradeoff of underfit and overfit. The optimal penalty factor depends on the number of estimated parameters per model order. The CIC is compared to other criteria such as the AIC, the corrected Akaike information criterion (AICc), and the consistent minimum description length (MDL).
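As a hedged illustration of the overfit tendency discussed above, the sketch below applies AIC to a scalar (not vector) AR process simulated with a known order; it does not implement the CIC itself.

```python
# Illustrative sketch: fit AR(p) by least squares for a range of orders and compute
# AIC(p) = N * ln(sigma_p^2) + 2 * p. With short samples the criterion is prone to
# selecting orders above the true one, i.e. overfit.
import numpy as np

rng = np.random.default_rng(0)
N, true_order, pmax = 100, 2, 10
x = np.zeros(N + pmax + 50)
for t in range(2, len(x)):                      # simulate an AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
x = x[-(N + pmax):]                             # drop the transient

def aic(p):
    y = x[pmax:]                                # same N target samples for every order
    if p == 0:
        resid = y
    else:
        Z = np.column_stack([x[pmax - k:-k] for k in range(1, p + 1)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
    return len(y) * np.log(np.mean(resid ** 2)) + 2 * p

aics = [aic(p) for p in range(pmax + 1)]
print("AIC-selected order:", int(np.argmin(aics)), "(true order = %d)" % true_order)
```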

Journal ArticleDOI
TL;DR: The author compares different methods of estimating the term structure of interest rates on daily UK treasury bill and gilt data spanning January 1995 to January 1999; in-sample and out-of-sample statistics reveal the superior pricing ability of certain methods characterised by an exponential functional form.
Abstract: I compare different methods of estimating the term structure of interest rates on daily UK treasury bill and gilt data that span the period from January 1995 to January 1999. In-sample and out-of-sample statistics reveal the superior pricing ability of certain methods characterised by an exponential functional form. In addition to these standard goodness-of-fit statistics, model performance is judged in terms of two trading strategies based on model residuals. Both strategies reveal that parsimonious representations of the term structure perform better than their spline counterparts characterised by a linear functional form. This is valid even when abnormal returns are adjusted for market movements. Linear splines overfit the data and are likely to give misleading results.

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This paper explores an efficient extension of the standard Support Vector Machine approach, called SVMC (Support Vector Mapping Convergence), for the TC-WON tasks, and shows that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods.
Abstract: Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems can be more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study such a problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for the TC-WON tasks. Our analyses show that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods because SVMC basically exploits the natural "gap" between positive and negative documents in the feature space, which eventually corresponds to improving the generalization performance. In the text domain there are likely to exist many gaps in the feature space because a document is usually mapped to a sparse and high-dimensional feature space. However, as the number of positive training data decreases, the boundary of SVMC starts overfitting at some point and ends up generating very poor results. This is because when the positive training data are too few, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space and thus ends up fitting tightly around the few positive training data.

Journal ArticleDOI
TL;DR: In this paper, a new procedure is presented for wavelength interval selection with a genetic algorithm in order to improve the predictive ability of partial least squares multivariate calibration, which involves separately labelling each of the selected sensor ranges with an appropriate inclusion ranking.
Abstract: A new procedure is presented for wavelength interval selection with a genetic algorithm in order to improve the predictive ability of partial least squares multivariate calibration. It involves separately labelling each of the selected sensor ranges with an appropriate inclusion ranking. The new approach intends to alleviate overfitting without the need of preparing an independent monitoring sample set. A theoretical example is worked out in order to compare the performance of the new approach with previous implementations of genetic algorithms. Two experimental data sets are also studied: target parameters are the concentration of glucuronic acid in complex mixtures studied by Fourier transform mid-infrared spectroscopy and the octane number in gasolines monitored by near-infrared spectroscopy. Copyright © 2003 John Wiley & Sons, Ltd.

Book ChapterDOI
11 Jun 2003
TL;DR: This paper considers three of the best-known boosting algorithms: Adaboost, Logitboost and Brownboost, which are adaptive, and work by maintaining a set of example and class weights which focus the attention of a base learner on the examples that are hardest to classify.
Abstract: Boosting algorithms are a means of building a strong ensemble classifier by aggregating a sequence of weak hypotheses. In this paper we consider three of the best-known boosting algorithms: Adaboost [9], Logitboost [11] and Brownboost [8]. These algorithms are adaptive, and work by maintaining a set of example and class weights which focus the attention of a base learner on the examples that are hardest to classify. We conduct an empirical study to compare the performance of these algorithms, measured in terms of overall test error rate, on five real data sets. The tests consist of a series of cross-validatory samples. At each validation, we set aside one third of the data chosen at random as a test set, and fit the boosting algorithm to the remaining two thirds, using binary stumps as a base learner. At each stage we record the final training and test error rates, and report the average errors within a 95% confidence interval. We then add artificial class noise to our data sets by randomly reassigning 20% of class labels, and repeat our experiment. We find that Brownboost and Logitboost prove less likely than Adaboost to overfit in this circumstance.

Journal ArticleDOI
TL;DR: In this paper, two nonlinear models, nonlinear prediction (NLP) and artificial neural networks (ANN), are compared for multivariate flood forecasting, and very good results are obtained with the two methods: NLP performs slightly better at short forecast times while the situation is reversed for longer times.
Abstract: Two nonlinear models, nonlinear prediction (NLP) and artificial neural networks (ANN), are compared for multivariate flood forecasting. For NLP the calibration of the locally linear model is quite simple, while for ANN the validation and identification of the model can be cumbersome, mainly because of overfitting. Very good results are obtained with the two methods: NLP performs slightly better at short forecast times, while the situation is reversed for longer times.

Journal ArticleDOI
TL;DR: A variational Bayesian method is developed to perform independent component analysis (ICA) on high-dimensional data containing missing entries and yields an accurate density model for the observed data without overfitting problems.
Abstract: Missing data are common in real-world data sets and are a problem for many estimation techniques. We have developed a variational Bayesian method to perform independent component analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of gaussians allows sources to be estimated with different kurtosis and skewness. Unlike the maximum likelihood approach, the variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. The technique is also extended to the clusters of ICA and supervised classification framework.

Book ChapterDOI
12 Jul 2003
TL;DR: Page-based Linear Genetic Programming is proposed and implemented with two-layer Subset Selection with careful adjustment of the relationship between subset layers to address a two-class intrusion detection classification problem as defined by the KDD-99 benchmark dataset.
Abstract: Page-based Linear Genetic Programming (GP) is proposed and implemented with two-layer Subset Selection to address a two-class intrusion detection classification problem as defined by the KDD-99 benchmark dataset. By careful adjustment of the relationship between subset layers, overfitting by individuals to specific subsets is avoided. Moreover, efficient training on a dataset of 500,000 patterns is demonstrated. Unlike the current approaches to this benchmark, the learning algorithm is also responsible for deriving useful temporal features. Following evolution, decoding of a GP individual demonstrates that the solution is unique and comparable to hand-coded solutions found by experts.

Book ChapterDOI
11 Jun 2003
TL;DR: The generalization error of the BKS method is analysed, a simple analytical model that relates error to sample size is proposed, and a strategy for improving performance by using linear classifiers in "ambiguous" cells of the BKS table is described.
Abstract: In the pattern recognition literature, Huang and Suen introduced the "multinomial" rule for fusion of multiple classifiers under the name of Behavior Knowledge Space (BKS) method [1]. This classifier fusion method can provide very good performance if large and representative data sets are available. Otherwise overfitting is likely to occur, and the generalization error quickly increases. In spite of this crucial small sample size problem, analytical models of BKS generalization error are currently not available. In this paper, the generalization error of the BKS method is analysed, and a simple analytical model that relates error to sample size is proposed. In addition, a strategy for improving performance by using linear classifiers in "ambiguous" cells of the BKS table is described. Preliminary experiments on synthetic and real data sets are reported.
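The BKS rule itself is compact enough to sketch: index a table by the joint decisions of the individual classifiers and return the most frequent true class observed in that cell. The labels below are invented, and the empty-cell fallback hints at the small-sample problem the paper analyses.

```python
# Hedged sketch of a Behavior Knowledge Space (BKS) lookup table with invented labels.
from collections import Counter, defaultdict

train_decisions = [(0, 0, 1), (0, 0, 1), (0, 1, 1), (1, 1, 1), (1, 1, 0)]
train_truth     = [0, 0, 1, 1, 1]

bks = defaultdict(Counter)
for cell, y in zip(train_decisions, train_truth):
    bks[cell][y] += 1

def bks_predict(cell, default=0):
    # Empty (unseen) cells are where small samples hurt; a real system needs a fallback.
    return bks[cell].most_common(1)[0][0] if bks[cell] else default

print(bks_predict((0, 0, 1)))   # cell seen in training -> majority class 0
print(bks_predict((1, 0, 0)))   # unseen cell -> falls back to the default
```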

Journal ArticleDOI
TL;DR: A hybrid back-propagation algorithm has been developed that utilizes the genetic algorithms search technique and the Bayesian neural network methodology to overcome the shortcomings of the conventional back-propagation neural network.
Abstract: There is growing interest in the use of back-propagation neural networks to model non-linear multivariate problems in geotechnical engineering. To overcome the shortcomings of the conventional back-propagation neural network, such as overfitting, where the neural network learns the spurious details and noise in the training examples, a hybrid back-propagation algorithm has been developed. The method utilizes the genetic algorithms search technique and the Bayesian neural network methodology. The genetic algorithms enhance the stochastic search to locate the global minima for the neural network model. The Bayesian inference procedures essentially provide better generalization and a statistical approach to deal with data uncertainty in comparison with the conventional back-propagation. The uncertainty of data can be indicated using error bars. Two examples are presented to demonstrate the convergence and generalization capabilities of this hybrid algorithm. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A variational method to automatically determine the number of mixtures of independent components in high-dimensional datasets, in which the sources may be nonsymmetrically distributed, was successfully applied to a difficult real-world medical dataset for diagnosing glaucoma.
Abstract: We apply a variational method to automatically determine the number of mixtures of independent components in high-dimensional datasets, in which the sources may be nonsymmetrically distributed. The data are modeled by clusters where each cluster is described as a linear mixture of independent factors. The variational Bayesian method yields an accurate density model for the observed data without overfitting problems. This allows the dimensionality of the data to be identified for each cluster. The new method was successfully applied to a difficult real-world medical dataset for diagnosing glaucoma.

Proceedings ArticleDOI
02 Nov 2003
TL;DR: This paper discusses an application of Random Fields to the problem of creating accurate yet flexible statistical models of polyphonic music, and shows that random fields not only outperform Markov chains, but are much more robust in terms of overfitting.
Abstract: Recent interest in the area of music information retrieval and related technologies is exploding. However, very few of the existing techniques take advantage of recent developments in statistical modeling. In this paper we discuss an application of Random Fields to the problem of creating accurate yet flexible statistical models of polyphonic music. With such models in hand, the challenges of developing effective searching, browsing and organization techniques for the growing bodies of music collections may be successfully met. We offer an evaluation of these models in terms of perplexity and prediction accuracy, and show that random fields not only outperform Markov chains, but are much more robust in terms of overfitting.

Journal ArticleDOI
TL;DR: A powerful method for mobility spectrum analysis is presented, based on Bryan's maximum entropy algorithm, which is fast, and allows the analysis of large quantities of data, removing the bias of data selection inherent in all previous techniques.
Abstract: A powerful method for mobility spectrum analysis is presented, based on Bryan’s maximum entropy algorithm. The Bayesian analysis central to Bryan’s algorithm ensures that we avoid overfitting of data, resulting in a physically reasonable solution. The algorithm is fast, and allows the analysis of large quantities of data, removing the bias of data selection inherent in all previous techniques. Existing mobility spectrum analysis systems are reviewed, and the performance of the Bryan’s algorithm mobility spectrum (BAMS) approach is demonstrated using synthetic data sets. Analysis of experimental data is briefly discussed. We find that BAMS performs well compared to existing mobility spectrum methods.

01 Jan 2003
TL;DR: Empirical evidence is shown that, in spite of the high danger of overfitting, non-linear methods can outperform linear methods, both in performance and number of features selected.
Abstract: We address problems of classification in which the number of input components (variables, features) is very large compared to the number of training samples. In this setting, it is often desirable to perform a feature selection to reduce the number of inputs, either for efficiency, performance, or to gain understanding of the data and the classifiers. We compare a number of methods on mass-spectrometric data of human protein sera from asymptomatic patients and prostate cancer patients. We show empirical evidence that, in spite of the high danger of overfitting, non-linear methods can outperform linear methods, both in performance and number of features selected.

Journal ArticleDOI
TL;DR: The Monte Carlo Cross-Validation (MCCV) and the PoLiSh smoothed regression are used and compared with the better known adjusted Wold's R criterion.

Book ChapterDOI
08 Apr 2003
TL;DR: The effect of the three-objective formulation of genetic rule selection on the generalization ability of obtained rule sets is examined through computer simulations where many non-dominated rule sets are generated using an EMO algorithm for a number of high-dimensional pattern classification problems.
Abstract: One advantage of evolutionary multiobjective optimization (EMO) algorithms over classical approaches is that many non-dominated solutions can be simultaneously obtained by their single run. This paper shows how this advantage can be utilized in genetic rule selection for the design of fuzzy rule-based classification systems. Our genetic rule selection is a two-stage approach. In the first stage, a pre-specified number of candidate rules are extracted from numerical data using a data mining technique. In the second stage, an EMO algorithm is used for finding non-dominated rule sets with respect to three objectives: to maximize the number of correctly classified training patterns, to minimize the number of rules, and to minimize the total rule length. Since the first objective is measured on training patterns, the evolution of rule sets tends to overfit to training patterns. The question is whether the other two objectives work as a safeguard against the overfitting. In this paper, we examine the effect of the three-objective formulation on the generalization ability (i.e., classification rates on test patterns) of obtained rule sets through computer simulations where many non-dominated rule sets are generated using an EMO algorithm for a number of high-dimensional pattern classification problems.

Journal ArticleDOI
TL;DR: In this article, the authors present an approach to estimate the reservoir rock properties from seismic data through the use of regularized back propagation networks that have inherent smoothness characteristics, which alleviates the nonmonotonous generalization problem associated with traditional networks and helps to avoid overfitting the data.
Abstract: The performance of traditional back-propagation networks for reservoir characterization in production settings has been inconsistent due to their nonmonotonous generalization, which necessitates extensive tweaking of their parameters in order to achieve satisfactory results and avoid overfitting the data. This makes the accuracy of these networks sensitive to the selection of the network parameters. We present an approach to estimate the reservoir rock properties from seismic data through the use of regularized back propagation networks that have inherent smoothness characteristics. This approach alleviates the nonmonotonous generalization problem associated with traditional networks and helps to avoid overfitting the data. We apply the approach to a 3D seismic survey in the Shedgum area of Ghawar field, Saudi Arabia, to estimate the reservoir porosity distribution of the Arab-D zone, and we contrast the accuracy of our approach with that of traditional back-propagation networks through cross-validation tests. The results of these tests indicate that the accuracy of our approach remains consistent as the network parameters are varied, whereas that of the traditional network deteriorates as soon as deviations from the optimal parameters occur. The approach we present thus leads to more robust estimates of the reservoir properties and requires little or no tweaking of the network parameters to achieve optimal results.
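A generic, hedged sketch of the regularization point above, using scikit-learn's MLPRegressor as a stand-in for the authors' regularized back-propagation network and invented data rather than the Ghawar survey: increasing the weight-decay term smooths the fitted mapping and changes the cross-validated error.

```python
# Hedged, generic sketch: a weight-decay (regularization) term in back-propagation
# training reduces sensitivity to the exact network settings.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(120, 3))                            # stand-in seismic attributes
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.2 * rng.normal(size=120)    # stand-in porosity

for alpha in (1e-6, 1e-2, 1.0):                                  # weight-decay strength
    net = MLPRegressor(hidden_layer_sizes=(30,), alpha=alpha,
                       max_iter=5000, random_state=0)
    score = cross_val_score(net, X, y, cv=5, scoring="neg_mean_squared_error")
    print("alpha=%-6g  CV MSE: %.4f" % (alpha, -score.mean()))
```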