Posted Content

Is the new model better? One metric says yes, but the other says no. Which metric do I use?

TL;DR: This article examines the analytical connections and differences between two IncV metrics: IncV in AUC (IncV-AUC) and IncV in AP (IncV-AP), and compares them, via a numerical study, with a strictly proper scoring rule: the IncV of the scaled Brier score (IncV-sBrS).
Abstract: Incremental value (IncV) evaluates the performance change from an existing risk model to a new model. It is one of the key considerations in deciding whether a new risk model performs better than the existing one. Problems arise when different IncV metrics contradict each other. For example, compared with a prescribed-dose model, an ovarian-dose model for predicting acute ovarian failure has a slightly lower area under the receiver operating characteristic curve (AUC) but increases the area under the precision-recall curve (AP) by 48%. This phenomenon of conflicting conclusions is not uncommon, and it creates a dilemma in medical decision making. In this article, we examine the analytical connections and differences between two IncV metrics: IncV in AUC (IncV-AUC) and IncV in AP (IncV-AP). Additionally, since they are both semi-proper scoring rules, we compare them with a strictly proper scoring rule, the IncV of the scaled Brier score (IncV-sBrS), via a numerical study. We demonstrate that both IncV-AUC and IncV-AP are weighted averages of the changes (from the existing model to the new one) in separating the risk score distributions between events and non-events. However, IncV-AP assigns heavier weights to the changes in the high-risk group, whereas IncV-AUC weights the changes equally. In the numerical study, we find that IncV-AP varies over a wide range, from negative to positive, whereas IncV-AUC is much smaller in magnitude. In addition, IncV-AP and IncV-sBrS are highly consistent, but IncV-AUC is negatively correlated with IncV-sBrS and IncV-AP at a low event rate. IncV-AUC and IncV-AP are the least consistent among the three pairs, and their differences become more pronounced as the event rate decreases.
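A minimal sketch of the two IncV metrics compared above, using scikit-learn. The labels and risk scores below are simulated purely for illustration; only the IncV definitions (the new model's metric minus the existing model's) come from the paper.

```python
# IncV-AUC and IncV-AP for a pair of risk models (sketch; simulated data).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, event_rate = 5000, 0.05                  # low event rate, as in the numerical study
y = rng.binomial(1, event_rate, size=n)     # 1 = event, 0 = non-event

# Hypothetical risk scores: the "new" model separates events somewhat better.
existing = np.clip(0.05 + 0.10 * y + rng.normal(0, 0.05, n), 0, 1)
new      = np.clip(0.05 + 0.15 * y + rng.normal(0, 0.05, n), 0, 1)

# IncV = metric(new model) - metric(existing model)
incv_auc = roc_auc_score(y, new) - roc_auc_score(y, existing)
incv_ap  = average_precision_score(y, new) - average_precision_score(y, existing)
print(f"IncV-AUC = {incv_auc:+.4f}, IncV-AP = {incv_ap:+.4f}")
```

Because AP weights improvements in the high-risk group more heavily than AUC does, the two increments can differ sharply in size, and even in sign, at low event rates; that disagreement is precisely the conflict the paper analyzes.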
References
Book Chapter
TL;DR: The analysis of censored failure times is considered in this paper, where the hazard function is taken to be a function of the explanatory variables and unknown regression coefficients multiplied by an arbitrary and unknown function of time.
Abstract: The analysis of censored failure times is considered. It is assumed that on each individual are available values of one or more explanatory variables. The hazard function (age-specific failure rate) is taken to be a function of the explanatory variables and unknown regression coefficients multiplied by an arbitrary and unknown function of time. A conditional likelihood is obtained, leading to inferences about the unknown regression coefficients. Some generalizations are outlined.
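In standard notation (a reconstruction, not text from the chapter), the model and conditional likelihood the abstract describes are:

```latex
% Hazard for individual i with covariates x_i: an arbitrary, unknown
% baseline hazard \lambda_0(t) multiplied by a regression term.
\lambda(t \mid x_i) = \lambda_0(t)\,\exp(\beta^\top x_i)

% Cox's conditional (partial) likelihood, which is free of \lambda_0:
L(\beta) = \prod_{i:\,\delta_i = 1}
    \frac{\exp(\beta^\top x_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top x_j)}
```

Here \delta_i indicates an observed (uncensored) failure and R(t_i) is the set of individuals still at risk just before time t_i; inference about \beta proceeds from L(\beta) without estimating the baseline hazard.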

28,264 citations

Book
28 May 1999
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Abstract: Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

9,295 citations


"Is the new model better? One metric..." refers methods in this paper

  • ...Originated from the information retrieval community in the 1980s (Raghavan et al., 1989; Manning and Schütze, 1999), it is a relatively new tool in medical research....


Journal Article
TL;DR: Receiver-operating characteristic (ROC) plots provide a pure index of accuracy by demonstrating the limits of a test's ability to discriminate between alternative states of health over the complete spectrum of operating conditions.
Abstract: The clinical performance of a laboratory test can be described in terms of diagnostic accuracy, or the ability to correctly classify subjects into clinically relevant subgroups. Diagnostic accuracy refers to the quality of the information provided by the classification device and should be distinguished from the usefulness, or actual practical value, of the information. Receiver-operating characteristic (ROC) plots provide a pure index of accuracy by demonstrating the limits of a test's ability to discriminate between alternative states of health over the complete spectrum of operating conditions. Furthermore, ROC plots occupy a central or unifying position in the process of assessing and using diagnostic tools. Once the plot is generated, a user can readily go on to many other activities such as performing quantitative ROC analysis and comparisons of tests, using likelihood ratio to revise the probability of disease in individual subjects, selecting decision thresholds, using logistic-regression analysis, using discriminant-function analysis, or incorporating the tool into a clinical strategy by using decision analysis.
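A minimal sketch of the ROC construction described above, with simulated test scores: sweeping the decision threshold traces the test's true- and false-positive rates over the complete spectrum of operating conditions.

```python
# ROC curve of a diagnostic score (sketch; simulated labels and scores).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=1000)                    # disease status
score = np.where(y == 1, rng.normal(1.0, 1.0, 1000),   # diseased tend to score higher
                 rng.normal(0.0, 1.0, 1000))

fpr, tpr, thresholds = roc_curve(y, score)     # one (FPR, TPR) point per threshold
print(f"AUC = {roc_auc_score(y, score):.3f}")  # area summarizes discrimination
```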

6,339 citations


"Is the new model better? One metric..." refers methods in this paper

  • ...In medical research, the receiver operating characteristic (ROC) curve has been the most popular tool for model evaluation, dating back to the 1960s when it was applied in diagnostic radiology and imaging systems (Zweig and Campbell, 1993; Pepe, 2003)....


Journal Article
TL;DR: It is suggested that reporting discrimination and calibration will always be important for a prediction model, and that decision-analytic measures should be reported if the model is to be used for clinical decisions.
Abstract: The performance of prediction models can be assessed using a variety of methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic [ROC] curve), and goodness-of-fit statistics for calibration. Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision-analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions. We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n = 544 for model development, n = 273 for external validation). We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
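Of the decision-analytic measures mentioned above, the net benefit plotted by a decision curve is conventionally computed at a threshold probability p_t as TP/n − (FP/n) · p_t/(1 − p_t). A minimal sketch under that standard formulation (the data and function name are illustrative, not from the article):

```python
# Net benefit of acting on a risk model at one threshold probability (sketch).
import numpy as np

def net_benefit(y, risk, p_t):
    """Net benefit of treating everyone whose predicted risk is at least p_t."""
    treat = risk >= p_t
    tp = np.sum(treat & (y == 1))   # treated events (true positives)
    fp = np.sum(treat & (y == 0))   # treated non-events (false positives)
    n = len(y)
    return tp / n - (fp / n) * p_t / (1 - p_t)

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.3, size=1000)
risk = np.clip(0.3 + 0.2 * (y - 0.3) + rng.normal(0, 0.15, 1000), 0.01, 0.99)
print(f"Net benefit at p_t = 0.2: {net_benefit(y, risk, 0.2):.4f}")
```

Evaluating net_benefit over a grid of thresholds, against the "treat all" and "treat none" strategies, yields the decision curve.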

3,473 citations


"Is the new model better? One metric..." refers background in this paper

  • ...A scaled Brier score (sBrS) is defined as sBrS = 1 − BrS/[π(1 − π)], ranging from 0 to 1, with larger values indicating better performance (Steyerberg et al., 2010)....

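A worked check of the formula quoted above, with simulated labels and predicted probabilities (the data are illustrative; the formula is from Steyerberg et al., 2010):

```python
# Scaled Brier score: sBrS = 1 - BrS / [pi * (1 - pi)] (sketch; simulated data).
import numpy as np

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.1, size=2000)    # 1 = event
p = np.clip(0.1 + 0.3 * (y - 0.1) + rng.normal(0, 0.05, 2000), 0.001, 0.999)

brs = np.mean((p - y) ** 2)       # Brier score: mean squared prediction error
pi = y.mean()                     # event rate
sbrs = 1 - brs / (pi * (1 - pi))  # scaled Brier score; larger is better
print(f"BrS = {brs:.4f}, sBrS = {sbrs:.4f}")
```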

Book
13 Mar 2003
TL;DR: This book provides a comprehensive treatment of statistical methods for evaluating medical tests and classifiers, covering measures of accuracy for binary tests, estimation and comparison of ROC curves, covariate effects on continuous and ordinal tests, incomplete data and imperfect reference tests, and study design.
Abstract: Contents: 1. Introduction; 2. Measures of Accuracy for Binary Tests; 3. Comparing Binary Tests and Regression Analysis; 4. The Receiver Operating Characteristic Curve; 5. Estimating the ROC Curve; 6. Covariate Effects on Continuous and Ordinal Tests; 7. Incomplete Data and Imperfect Reference Tests; 8. Study Design and Hypothesis Testing; 9. More Topics and Conclusions; References/Bibliography; Index.

2,289 citations


"Is the new model better? One metric..." refers methods in this paper

  • ...In medical research, the receiver operating characteristic (ROC) curve has been the most popular tool for model evaluation, dating back to the 1960s when it was applied in diagnostic radiology and imaging systems (Zweig and Campbell, 1993; Pepe, 2003)....

    [...]