Author

Davide Ballabio

Other affiliations: University of Milan
Bio: Davide Ballabio is an academic researcher from University of Milano-Bicocca. The author has contributed to research in topics: Artificial neural network & Applicability domain. The author has an h-index of 31 and has co-authored 110 publications receiving 4453 citations. Previous affiliations of Davide Ballabio include University of Milan.


Papers
Journal ArticleDOI
TL;DR: The common steps to calibrate and validate classification models based on partial least squares discriminant analysis are discussed in the present tutorial, and issues to be evaluated during model training and validation are introduced and explained using a chemical dataset.
Abstract: The common steps to calibrate and validate classification models based on partial least squares discriminant analysis are discussed in the present tutorial. All issues to be evaluated during model training and validation are introduced and explained using a chemical dataset, composed of toxic and non-toxic sediment samples. The analysis was carried out with MATLAB routines, which are available in the ESI of this tutorial, together with the dataset and a detailed list of all MATLAB instructions used for the analysis.

847 citations

Journal ArticleDOI
TL;DR: The problem of evaluating the predictive ability of QSAR models is addressed, and formulas for calculating the predictive squared correlation coefficient Q2 are evaluated, including one based on SS referring to mean deviations of observed values from the training set mean over the training set instead of the external evaluation set.
Abstract: This paper deals with the problem of evaluating the predictive ability of QSAR models and continues the discussion about proper estimates of the predictive ability from an external evaluation set reported in Schuurmann G., Ebert R.-U., et al. External Validation and Prediction Employing the Predictive Squared Correlation Coefficient - Test Set Activity Mean vs Training Set Activity Mean. J. Chem. Inf. Model. 2008, 48, 2140−2145. The two formulas for calculating the predictive squared correlation coefficient Q2 previously discussed by Schuurmann et al. are the one adopted by the current OECD guidelines on QSAR validation, based on the SS (sum of squares) of the external test set referring to the training set response mean, and the other based on the SS of the external test set referring to the test set response mean. In addition to these two formulas, another formula is evaluated here, based on SS referring to mean deviations of observed values from the training set mean over the training set instead of the ...
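The three Q2 variants described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `q2_variants` and the labels `q2_f1`, `q2_f2`, `q2_f3` are chosen here for convenience, and because the abstract is truncated, the per-sample averaging used in the third variant is an interpretation of the phrase "mean deviations".

```python
from statistics import fmean


def q2_variants(y_train, y_ext, y_ext_pred):
    """Return three Q2 estimates for an external evaluation set.

    Hypothetical helper: names and the exact form of the third
    variant are assumptions based on the abstract's description.
    """
    mean_tr = fmean(y_train)
    mean_ext = fmean(y_ext)

    # Prediction error sum of squares over the external set.
    press = sum((y - p) ** 2 for y, p in zip(y_ext, y_ext_pred))

    # SS of the external set about the training set mean.
    ss_ext_about_tr = sum((y - mean_tr) ** 2 for y in y_ext)
    # SS of the external set about its own mean.
    ss_ext_about_ext = sum((y - mean_ext) ** 2 for y in y_ext)
    # SS of the training set about the training set mean.
    ss_tr_about_tr = sum((y - mean_tr) ** 2 for y in y_train)

    q2_f1 = 1 - press / ss_ext_about_tr
    q2_f2 = 1 - press / ss_ext_about_ext
    # Both terms are averaged per sample because the two sets differ in size.
    q2_f3 = 1 - (press / len(y_ext)) / (ss_tr_about_tr / len(y_train))
    return q2_f1, q2_f2, q2_f3


# Made-up numbers, only to show that the variants can disagree.
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_ext = [4.0, 5.0, 6.0]
y_ext_pred = [4.5, 5.0, 5.5]
q2_f1, q2_f2, q2_f3 = q2_variants(y_train, y_ext, y_ext_pred)
print(q2_f1, q2_f2, q2_f3)
```

With the same predictions, the three values differ because each variant compares the prediction error against a different reference sum of squares.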

457 citations

Journal ArticleDOI
TL;DR: Some existing descriptor-based approaches for characterizing the interpolation space of QSAR models are discussed and compared by implementing them on existing validated datasets from the literature.
Abstract: One of the OECD principles for model validation requires defining the Applicability Domain (AD) for the QSAR models. This is important since the reliable predictions are generally limited to query chemicals structurally similar to the training compounds used to build the model. Therefore, characterization of interpolation space is significant in defining the AD and in this study some existing descriptor-based approaches performing this task are discussed and compared by implementing them on existing validated datasets from the literature. Algorithms adopted by different approaches allow defining the interpolation space in several ways, while defined thresholds contribute significantly to the extrapolations. For each dataset and approach implemented for this study, the comparison analysis was carried out by considering the model statistics and relative position of test set with respect to the training space.
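As a rough illustration of what a descriptor-based interpolation space can look like, the sketch below uses a simple range-based bounding box. The abstract does not name the approaches it compares, so this technique and the helper names `fit_bounding_box` and `in_domain` are assumptions made here for illustration only.

```python
def fit_bounding_box(train_descriptors):
    """Record the (min, max) range of each descriptor over the training set."""
    columns = list(zip(*train_descriptors))
    return [(min(col), max(col)) for col in columns]


def in_domain(box, query):
    """A query is inside the domain if every descriptor lies within its range."""
    return all(low <= value <= high for (low, high), value in zip(box, query))


# Made-up descriptor values for three training compounds.
train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
box = fit_bounding_box(train)

inside = in_domain(box, [2.5, 11.5])   # within both ranges
outside = in_domain(box, [3.5, 11.0])  # first descriptor exceeds the training maximum
print(inside, outside)
```

The threshold here is the observed training range itself; a different threshold choice would change which query compounds count as extrapolations.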

369 citations

Journal ArticleDOI
TL;DR: Different functions for calculating the predictive squared correlation coefficient Q2 from an external set were proposed, which lead to occasionally different estimates of the model predictive ability and therefore to contrasting decisions about model adequacy.
Abstract: This paper deals with the problem of evaluating the predictive ability of regression models. In some cases, model validation by internal cross-validation technique is not enough and validation by an external test set has been suggested as an effective way of evaluating the model predictive ability. Different functions for calculating the predictive squared correlation coefficient Q2 from an external set were proposed, which lead to occasionally different estimates of the model predictive ability and therefore to contrasting decisions about model adequacy. In this paper, advantages and drawbacks of these functions in estimating model predictive ability from some simulated datasets are discussed by comparison. Copyright © 2010 John Wiley & Sons, Ltd.

302 citations

Journal ArticleDOI
TL;DR: QSAR models to predict ready biodegradation of chemicals were built using different modeling methods and types of molecular descriptors, with particular attention given to data screening and validation procedures.
Abstract: The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing which includes predictions from quantitative structure–activity relationship (QSAR) models. The aim of this study was to build QSAR models to predict ready biodegradation of chemicals by using different modeling methods and types of molecular descriptors. Particular attention was given to data screening and validation procedures in order to build predictive models. Experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE): 837 and 218 molecules were used for calibration and testing purposes, respectively. In addition, models were further evaluated using an external validation set consisting of 670 molecules. Classification models were produced in order to discriminate biodegradable and nonbiodegradable ch...

195 citations


Cited by
Christopher M. Bishop
01 Jan 2006
TL;DR: Covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, continuous latent variables, sequential data, and combining models.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

01 Jan 2014
TL;DR: These standards of care are intended to provide clinicians, patients, researchers, payors, and other interested individuals with the components of diabetes care, treatment goals, and tools to evaluate the quality of care.
Abstract: XI. STRATEGIES FOR IMPROVING DIABETES CARE. Diabetes is a chronic illness that requires continuing medical care and patient self-management education to prevent acute complications and to reduce the risk of long-term complications. Diabetes care is complex and requires that many issues, beyond glycemic control, be addressed. A large body of evidence exists that supports a range of interventions to improve diabetes outcomes. These standards of care are intended to provide clinicians, patients, researchers, payors, and other interested individuals with the components of diabetes care, treatment goals, and tools to evaluate the quality of care. While individual preferences, comorbidities, and other patient factors may require modification of goals, targets that are desirable for most patients with diabetes are provided. These standards are not intended to preclude more extensive evaluation and management of the patient by other specialists as needed. For more detailed information, refer to Bode (Ed.): Medical Management of Type 1 Diabetes (1), Burant (Ed): Medical Management of Type 2 Diabetes (2), and Klingensmith (Ed): Intensive Diabetes Management (3). The recommendations included are diagnostic and therapeutic actions that are known or believed to favorably affect health outcomes of patients with diabetes. A grading system (Table 1), developed by the American Diabetes Association (ADA) and modeled after existing methods, was utilized to clarify and codify the evidence that forms the basis for the recommendations. The level of evidence that supports each recommendation is listed after each recommendation using the letters A, B, C, or E.

9,618 citations

Journal Article
TL;DR: A case study explores the background of the EPA's library digitization project, the practices implemented, and the critiques of the project.
Abstract: The Environmental Protection Agency (EPA) provides access to information on a variety of topics related to the environment and strives to inform citizens of health risks. The EPA also has an extensive library network that consists of 26 libraries throughout the United States, which provide access to a plethora of information to EPA employees, scientists, and researchers. The EPA implemented a reorganization project to digitize their materials so they would be more accessible to a wider range of users, but this plan was drastically accelerated when the EPA was threatened with a budget cut. It chose to close and reduce the hours and services of some of their libraries. As a result, the agency was accused of denying users the “right to know” by making information unavailable, not providing an adequate strategic plan, and discarding vital materials. This case study explores the background of the digitization project, the practices implemented, and the critiques of the project.

2,588 citations

Journal ArticleDOI
TL;DR: This article shows how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario.
Abstract: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified elective chosen measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular adopted metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic inflated results, especially on imbalanced datasets. The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
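The contrast drawn in this abstract can be reproduced with a small Python sketch that computes accuracy, F1 score, and MCC from the four confusion matrix counts. The function name `binary_scores`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made here, not taken from the article.

```python
from math import sqrt


def binary_scores(tp, fn, tn, fp):
    """Return (accuracy, F1, MCC) from the four confusion matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total

    f1_denominator = 2 * tp + fp + fn
    # Assumed convention: report 0.0 when F1 is undefined.
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when MCC is undefined.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return accuracy, f1, mcc


# Made-up imbalanced case: 95 positives, 5 negatives,
# and a classifier that labels almost everything positive.
accuracy, f1, mcc = binary_scores(tp=90, fn=5, tn=1, fp=4)
print(f"accuracy={accuracy:.3f}  F1={f1:.3f}  MCC={mcc:.3f}")
```

On these counts accuracy is 0.91 and F1 is about 0.95, while MCC is only about 0.14, because four of the five negative samples are misclassified.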

2,358 citations

Journal ArticleDOI
TL;DR: The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.
Abstract: Principal component analysis is one of the most important and powerful methods in chemometrics as well as in a wealth of other areas. This paper provides a description of how to understand, use, and interpret principal component analysis. The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.

1,622 citations