The problem of overfitting.

TLDR
The focus is on regression problems: those in which one of the measures, the dependent variable, is of special interest, and the authors wish to explore its relationship with the other variables.
Abstract
Model fitting is an important part of all sciences that use quantitative measurements. Experimenters often explore the relationships between measures. Two subclasses of relationship problems are as follows:

• Correlation problems: those in which we have a collection of measures, all of interest in their own right, and wish to see how and how strongly they are related.
• Regression problems: those in which one of the measures, the dependent variable, is of special interest, and we wish to explore its relationship with the other variables. These other variables may be called the independent variables, the predictor variables, or the covariates. The dependent variable may be a continuous numeric measure such as a boiling point or a categorical measure such as a classification into mutagenic and nonmutagenic.

We should emphasize that using the words ‘correlation problem’ and ‘regression problem’ is not meant to tie these problems to any particular statistical methodology. Having a ‘correlation problem’ does not limit us to conventional Pearson correlation coefficients. Log-linear models, for example, measure the relationship between categorical variables in multiway contingency tables. Similarly, multiple linear regression is a methodology useful for regression problems, but so also are nonlinear regression, neural nets, recursive partitioning and k-nearest neighbors, logistic regression, support vector machines and discriminant analysis, to mention a few. All of these methods aim to quantify the relationship between the predictors and the dependent variable. We will use the term ‘regression problem’ in this conceptual form and, when we want to specialize to multiple linear regression using ordinary least squares, will describe it as ‘OLS regression’. Our focus is on regression problems. We will use y as shorthand for the dependent variable and x for the collection of predictors available.
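As a concrete illustration of the ‘OLS regression’ case named above, the following is a minimal sketch, assuming NumPy is available. The helper names `fit_ols` and `predict_ols` and the synthetic data are hypothetical and are not taken from the article; they only show y being regressed on a small collection of predictors x.

```python
import numpy as np

def fit_ols(x, y):
    """Return OLS coefficients (intercept first) for predictors x and response y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)  # treat a 1-D input as a single predictor
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, x):
    """Predict y for new cases where x is known but y is not."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    return coef[0] + x @ coef[1:]

# Synthetic data (an assumption for illustration): two predictors, linear truth.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=50)

coef = fit_ols(x, y)
print("intercept and slopes:", coef)
```

The same `fit_ols` / `predict_ols` split mirrors the conceptual form of a regression problem: any of the other methods listed (neural nets, k-nearest neighbors, and so on) could be swapped in behind the same fit-then-predict interface.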
There are two distinct primary settings in which we might want to do a regression study:

• Prediction problems: We may want to make predictions of y for future cases where we know x but do not know y. This for example is the problem faced with the Toxic Substances Control Act (TSCA) list. This list contains many tens of thousands of compounds, and there is a need to identify those on the list that are potentially harmful. Only a small fraction of the list however has any measured biological properties, but all of them can be characterized by chemical descriptors with relative ease. Using quantitative structure-activity relationships (QSARs) fitted to this small fraction to predict the toxicities of the much larger collection is a potentially cost-effective way to try to sort the TSCA compounds by their potential for harm. Later, we will use a data set for predicting the boiling point of a set of compounds on the TSCA list from some molecular descriptors.
• Effect quantification: We may want to gain an understanding of how the predictors enter into the relationship that predicts y. We do not necessarily have candidate future unknowns that we want to predict; we simply want to know how each predictor drives the distribution of y. This is the setting seen in drug discovery, where the biological activity y of each in a collection of compounds is measured, along with molecular descriptors x. Finding out which descriptors x are associated with high and which with low biological activity leads to a recipe for new compounds which are high in the features associated positively with activity and low in those associated with inactivity or with adverse side effects.

These two objectives are not always best served by the same approaches.
‘Feature selection’ (keeping those features associated with y and ignoring those not associated with y) is very commonly a part of an analysis meant for effect quantification but is not necessarily helpful if the objective is prediction of future unknowns. For prediction, methods such as partial least squares (PLS) and ridge regression (RR) that retain all features but rein in their contributions are often found to be more effective than those relying on feature selection.

What Is Overfitting?

Occam’s Razor, or the principle of parsimony, calls for using models and procedures that contain all that is necessary for the modeling but nothing more. For example, if a regression model with 2 predictors is enough to explain y, then no more than these two predictors should be used. Going further, if the relationship can be captured by a linear function in these two predictors (which is described by 3 numbers: the intercept and two slopes), then using a quadratic violates parsimony. Overfitting is the use of models or procedures that violate parsimony, that is, that include more terms than are necessary or use more complicated approaches than are necessary. It is helpful to distinguish two types of overfitting:

• Using a model that is more flexible than it needs to be. For example, a neural net is able to accommodate some curvilinear relationships and so is more flexible than a simple linear regression. But if it is used on a data set that conforms to the linear model, it will add a level of complexity without

* Corresponding author e-mail: doug@stat.umn.edu.
J. Chem. Inf. Comput. Sci. 2004, 44, 1-12
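The linear-versus-quadratic example in the overfitting discussion can be sketched numerically. The following is a minimal illustration, assuming NumPy and synthetic data generated from a truly linear relationship; the function name `fit_and_score`, the noise level, the sample sizes, and the choice of polynomial degrees are all assumptions made for this sketch, not values from the article. It fits polynomials of increasing degree to the same training set and reports the error on the training cases and on held-out cases.

```python
import numpy as np

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree by least squares.

    Returns (training RMSE, held-out RMSE).
    """
    coeffs = np.polyfit(x_train, y_train, degree)

    def rmse(x, y):
        return float(np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2)))

    return rmse(x_train, y_train), rmse(x_test, y_test)

# Synthetic data (an assumption): the true relationship is linear in x.
rng = np.random.default_rng(1)
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 1.0 + 2.0 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = 1.0 + 2.0 * x_test + rng.normal(scale=0.3, size=x_test.size)

# Degree 1 matches the truth; degrees 2 and 6 carry unnecessary terms.
results = {}
for degree in (1, 2, 6):
    results[degree] = fit_and_score(x_train, y_train, x_test, y_test, degree)
    train_err, test_err = results[degree]
    print(f"degree {degree}: train RMSE {train_err:.3f}, held-out RMSE {test_err:.3f}")
```

Because the higher-degree polynomials contain the linear model as a special case, their training error can never be larger than that of the linear fit. Any gain on the training cases comes from fitting noise, so the held-out error is the figure to compare when judging whether the extra terms were necessary.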

