Journal ArticleDOI

Assessing the performance of prediction models: a framework for traditional and novel measures.

TL;DR: It is suggested that reporting discrimination and calibration will always be important for a prediction model, and that decision-analytic measures should be reported if the predictive model is to be used for clinical decisions.
Abstract: The performance of prediction models can be assessed using a variety of methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic [ROC] curve), and goodness-of-fit statistics for calibration. Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision-analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions. We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n = 544 for model development, n = 273 for external validation). We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
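The two traditional measures named in the abstract, the Brier score and the c statistic, can be computed in a few lines. The following is a minimal illustrative sketch (not code from the paper), using synthetic labels and predicted probabilities; the function names are ours:

```python
import numpy as np

def brier_score(y, p):
    """Mean squared difference between the 0/1 outcome and the predicted probability."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return np.mean((p - y) ** 2)

def c_statistic(y, p):
    """Concordance: the probability that a randomly chosen event receives a
    higher predicted risk than a randomly chosen non-event (ties count 1/2).
    For binary outcomes this equals the area under the ROC curve."""
    y, p = np.asarray(y), np.asarray(p, float)
    events, nonevents = p[y == 1], p[y == 0]
    diff = events[:, None] - nonevents[None, :]   # all event/non-event pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
print(brier_score(y, p))   # ≈ 0.1581 (lower is better)
print(c_statistic(y, p))   # 0.75    (3 of 4 pairs concordant)
```

The Brier score mixes calibration and discrimination into one number, whereas the c statistic depends only on the ranking of predictions, which is why the abstract recommends reporting both aspects.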


Citations
Journal ArticleDOI
TL;DR: In virtually all medical domains, diagnostic and prognostic multivariable prediction models are being developed, validated, updated, and implemented with the aim to assist doctors and individuals in estimating probabilities and potentially influence their decision making.
Abstract: The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement includes a 22-item checklist, which aims to improve the reporting of studies developing, validating, or updating a prediction model, whether for diagnostic or prognostic purposes. The TRIPOD Statement aims to improve the transparency of the reporting of a prediction model study regardless of the study methods used. This explanation and elaboration document describes the rationale; clarifies the meaning of each item; and discusses why transparent reporting is important, with a view to assessing risk of bias and clinical usefulness of the prediction model. Each checklist item of the TRIPOD Statement is explained in detail and accompanied by published examples of good reporting. The document also provides a valuable reference of issues to consider when designing, conducting, and analyzing prediction model studies. To aid the editorial process and help peer reviewers and, ultimately, readers and systematic reviewers of prediction model studies, it is recommended that authors include a completed checklist in their submission. The TRIPOD checklist can also be downloaded from www.tripod-statement.org.

2,982 citations

Journal ArticleDOI
TL;DR: Radiomics, the high-throughput mining of quantitative image features from standard-of-care medical imaging that enables data to be extracted and applied within clinical-decision support systems to improve diagnostic, prognostic, and predictive accuracy, is gaining importance in cancer research.
Abstract: Radiomics, the high-throughput mining of quantitative image features from standard-of-care medical imaging that enables data to be extracted and applied within clinical-decision support systems to improve diagnostic, prognostic, and predictive accuracy, is gaining importance in cancer research. Radiomic analysis exploits sophisticated image analysis tools and the rapid development and validation of medical imaging data that uses image-based signatures for precision diagnosis and treatment, providing a powerful tool in modern medicine. Herein, we describe the process of radiomics, its pitfalls, challenges, opportunities, and its capacity to improve clinical decision making, emphasizing the utility for patients with cancer. Currently, the field of radiomics lacks standardized evaluation of both the scientific integrity and the clinical relevance of the numerous published radiomics investigations resulting from the rapid growth of this area. Rigorous evaluation criteria and reporting guidelines need to be established in order for radiomics to mature as a discipline. Herein, we provide guidance for investigations to meet this urgent need in the field of radiomics.

2,730 citations

Journal ArticleDOI
TL;DR: Net reclassification improvement offers a simple intuitive way of quantifying improvement offered by new markers and has been gaining popularity among researchers, however, several aspects of the NRI have not been studied in sufficient detail.
Abstract: Appropriate quantification of added usefulness offered by new markers included in risk prediction algorithms is a problem of active research and debate. Standard methods, including statistical significance and c statistic are useful but not sufficient. Net reclassification improvement (NRI) offers a simple intuitive way of quantifying improvement offered by new markers and has been gaining popularity among researchers. However, several aspects of the NRI have not been studied in sufficient detail. In this paper we propose a prospective formulation for the NRI which offers immediate application to survival and competing risk data as well as allows for easy weighting with observed or perceived costs. We address the issue of the number and choice of categories and their impact on NRI. We contrast category-based NRI with one which is category-free and conclude that NRIs cannot be compared across studies unless they are defined in the same manner. We discuss the impact of differing event rates when models are applied to different samples or definitions of events and durations of follow-up vary between studies. We also show how NRI can be applied to case‐control data. The concepts presented in the paper are illustrated in a Framingham Heart Study example. In conclusion, NRI can be readily calculated for survival, competing risk, and case‐control data, is more objective and comparable across studies using the category-free version, and can include relative costs for classifications. We recommend that researchers clearly define and justify the choices they make when choosing NRI for their application. Copyright © 2010 John Wiley & Sons, Ltd.
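The category-free NRI contrasted in this abstract reduces to counting, separately among events and non-events, who is moved up or down by the new model. A minimal sketch under assumed binary outcomes and paired predictions from an old and a new model (function name ours):

```python
import numpy as np

def continuous_nri(y, p_old, p_new):
    """Category-free NRI: net proportion of events assigned a higher risk by
    the new model, plus net proportion of non-events assigned a lower risk."""
    y = np.asarray(y)
    up   = np.asarray(p_new) > np.asarray(p_old)
    down = np.asarray(p_new) < np.asarray(p_old)
    ev, ne = y == 1, y == 0
    nri_events    = up[ev].mean() - down[ev].mean()
    nri_nonevents = down[ne].mean() - up[ne].mean()
    return nri_events + nri_nonevents

y     = np.array([1, 1, 1, 0, 0, 0])
p_old = np.array([0.6, 0.5, 0.4, 0.5, 0.4, 0.3])
p_new = np.array([0.7, 0.4, 0.6, 0.4, 0.5, 0.2])
print(continuous_nri(y, p_old, p_new))   # ≈ 0.67
```

A category-based NRI replaces the `>`/`<` comparisons with crossings of fixed risk thresholds, which is why, as the abstract notes, NRIs are not comparable across studies unless defined in the same manner.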

2,059 citations



Journal ArticleDOI
Andrew I R Maas1, David K. Menon2, P. David Adelson3, Nada Andelic4  +339 moreInstitutions (110)
TL;DR: The InTBIR Participants and Investigators have provided informed consent for the study to take place in Poland.
Abstract: Additional co-authors: Endre Czeiter, Marek Czosnyka, Ramon Diaz-Arrastia, Jens P Dreier, Ann-Christine Duhaime, Ari Ercole, Thomas A van Essen, Valery L Feigin, Guoyi Gao, Joseph Giacino, Laura E Gonzalez-Lara, Russell L Gruen, Deepak Gupta, Jed A Hartings, Sean Hill, Ji-yao Jiang, Naomi Ketharanathan, Erwin J O Kompanje, Linda Lanyon, Steven Laureys, Fiona Lecky, Harvey Levin, Hester F Lingsma, Marc Maegele, Marek Majdan, Geoffrey Manley, Jill Marsteller, Luciana Mascia, Charles McFadyen, Stefania Mondello, Virginia Newcombe, Aarno Palotie, Paul M Parizel, Wilco Peul, James Piercy, Suzanne Polinder, Louis Puybasset, Todd E Rasmussen, Rolf Rossaint, Peter Smielewski, Jeannette Soderberg, Simon J Stanworth, Murray B Stein, Nicole von Steinbuchel, William Stewart, Ewout W Steyerberg, Nino Stocchetti, Anneliese Synnot, Braden Te Ao, Olli Tenovuo, Alice Theadom, Dick Tibboel, Walter Videtta, Kevin K W Wang, W Huw Williams, Kristine Yaffe for the InTBIR Participants and Investigators

1,354 citations

Journal ArticleDOI
TL;DR: The ACS NSQIP surgical risk calculator is a decision-support tool based on reliable multi-institutional clinical data, which can be used to estimate the risks of most operations.
Abstract: Background Accurately estimating surgical risks is critical for shared decision making and informed consent. The Centers for Medicare and Medicaid Services may soon put forth a measure requiring surgeons to provide patients with patient-specific, empirically derived estimates of postoperative complications. Our objectives were to develop a universal surgical risk estimation tool, to compare performance of the universal vs previous procedure-specific surgical risk calculators, and to allow surgeons to empirically adjust the estimates of risk. Study Design Using standardized clinical data from 393 ACS NSQIP hospitals, a web-based tool was developed to allow surgeons to easily enter 21 preoperative factors (demographics, comorbidities, procedure). Regression models were developed to predict 8 outcomes based on the preoperative risk factors. The universal model was compared with procedure-specific models. To incorporate surgeon input, a subjective surgeon adjustment score, allowing risk estimates to vary within the estimate's confidence interval, was introduced and tested with 80 surgeons using 10 case scenarios. Results Based on 1,414,006 patients encompassing 1,557 unique CPT codes, a universal surgical risk calculator model was developed that had excellent performance for mortality (c-statistic = 0.944; Brier score = 0.011 [where scores approaching 0 are better]), morbidity (c-statistic = 0.816, Brier score = 0.069), and 6 additional complications (c-statistics > 0.8). Predictions were similarly robust for the universal calculator vs procedure-specific calculators (eg, colorectal). Surgeons demonstrated considerable agreement on the case scenario scoring (80% to 100% agreement), suggesting reliable score assignment between surgeons. Conclusions The ACS NSQIP surgical risk calculator is a decision-support tool based on reliable multi-institutional clinical data, which can be used to estimate the risks of most operations. 
The ACS NSQIP surgical risk calculator will allow clinicians and patients to make decisions using empirically derived, patient-specific postoperative risks.

1,327 citations

References
Journal ArticleDOI
TL;DR: Previous methods combine estimates of the cause-specific hazard functions under the proportional hazards formulation, but they do not allow the analyst to directly assess the effect of a covariate on the marginal probability function.
Abstract: With explanatory covariates, the standard analysis for competing risks data involves modeling the cause-specific hazard functions via a proportional hazards assumption. Unfortunately, the cause-specific hazard function does not have a direct interpretation in terms of survival probabilities for the particular failure type. In recent years many clinicians have begun using the cumulative incidence function, the marginal failure probabilities for a particular cause, which is intuitively appealing and more easily explained to the nonstatistician. The cumulative incidence is especially relevant in cost-effectiveness analyses in which the survival probabilities are needed to determine treatment utility. Previously, authors have considered methods for combining estimates of the cause-specific hazard functions under the proportional hazards formulation. However, these methods do not allow the analyst to directly assess the effect of a covariate on the marginal probability function. In this article we pro…
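The overestimation that the citing paper warns about (treating competing events as censoring inflates the estimated absolute risk) can be demonstrated directly. The sketch below assumes complete follow-up (no censoring) so the cumulative incidence is a simple proportion; function names and data are ours, purely for illustration:

```python
import numpy as np

def cif(times, causes, cause, t):
    """Cumulative incidence of `cause` by time t, assuming no censoring:
    the proportion of subjects failing from that cause by t."""
    times, causes = np.asarray(times), np.asarray(causes)
    return np.mean((causes == cause) & (times <= t))

def naive_one_minus_km(times, causes, cause, t):
    """1 - Kaplan-Meier, treating competing events as censoring: the common
    but biased approach that overestimates absolute risk."""
    order = np.argsort(times)
    times, causes = np.asarray(times)[order], np.asarray(causes)[order]
    n_at_risk, surv = len(times), 1.0
    for time, c in zip(times, causes):
        if time > t:
            break
        if c == cause:                 # event of interest
            surv *= 1 - 1 / n_at_risk
        n_at_risk -= 1                 # competing events just leave the risk set
    return 1 - surv

times  = [1, 2, 3, 4]
causes = [1, 2, 1, 1]                  # cause 2 is the competing risk
print(cif(times, causes, 1, 4))                  # 0.75: true proportion
print(naive_one_minus_km(times, causes, 1, 4))   # 1.0: overestimated
```

Only 3 of 4 subjects can ever fail from cause 1 (one died of the competing cause first), yet the naive estimator reports a risk of 100%.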

11,109 citations


"Assessing the performance of predic..." refers background in this paper

  • ...competing risks in survival analyses of nonfatal outcomes, such as failure of heart valves,(61) or mortality due to various causes.(62) Disregarding competing risks often leads to overestimation of absolute risk....


Journal ArticleDOI
01 Jan 1950-Cancer

8,687 citations

BookDOI
01 Jan 2001
TL;DR: In this article, the authors present a case study in least squares fitting and interpretation of a linear model, where they use nonparametric transformations of X and Y to fit a linear regression model.
Abstract: Introduction * General Aspects of Fitting Regression Models * Missing Data * Multivariable Modeling Strategies * Resampling, Validating, Describing, and Simplifying the Model * S-PLUS Software * Case Study in Least Squares Fitting and Interpretation of a Linear Model * Case Study in Imputation and Data Reduction * Overview of Maximum Likelihood Estimation * Binary Logistic Regression * Logistic Model Case Study 1: Predicting Cause of Death * Logistic Model Case Study 2: Survival of Titanic Passengers * Ordinal Logistic Regression * Case Study in Ordinal Regression, Data Reduction, and Penalization * Models Using Nonparametric Transformations of X and Y * Introduction to Survival Analysis * Parametric Survival Models * Case Study in Parametric Survival Modeling and Model Approximation * Cox Proportional Hazards Regression Model * Case Study in Cox Regression

7,264 citations

Journal ArticleDOI
TL;DR: Two new measures, one based on integrated sensitivity and specificity and the other on reclassification tables, are introduced that offer incremental information over the AUC and are proposed to be considered in addition to the A UC when assessing the performance of newer biomarkers.
Abstract: Identification of key factors associated with the risk of developing cardiovascular disease and quantification of this risk using multivariable prediction algorithms are among the major advances made in preventive cardiology and cardiovascular epidemiology in the 20th century. The ongoing discovery of new risk markers by scientists presents opportunities and challenges for statisticians and clinicians to evaluate these biomarkers and to develop new risk formulations that incorporate them. One of the key questions is how best to assess and quantify the improvement in risk prediction offered by these new models. Demonstration of a statistically significant association of a new biomarker with cardiovascular risk is not enough. Some researchers have advanced that the improvement in the area under the receiver-operating-characteristic curve (AUC) should be the main criterion, whereas others argue that better measures of performance of prediction models are needed. In this paper, we address this question by introducing two new measures, one based on integrated sensitivity and specificity and the other on reclassification tables. These new measures offer incremental information over the AUC. We discuss the properties of these new measures and contrast them with the AUC. We also develop simple asymptotic tests of significance. We illustrate the use of these measures with an example from the Framingham Heart Study. We propose that scientists consider these types of measures in addition to the AUC when assessing the performance of newer biomarkers.
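The measure based on integrated sensitivity and specificity (the IDI) has a particularly simple computational form: the difference in discrimination slopes between the new and old models, as the excerpt from the citing paper notes. A minimal sketch with hypothetical data (function names ours):

```python
import numpy as np

def discrimination_slope(y, p):
    """Mean predicted risk among events minus mean predicted risk among non-events."""
    y, p = np.asarray(y), np.asarray(p, float)
    return p[y == 1].mean() - p[y == 0].mean()

def idi(y, p_old, p_new):
    """Integrated discrimination improvement: the gain in discrimination slope
    when moving from the old model to the new model."""
    return discrimination_slope(y, p_new) - discrimination_slope(y, p_old)

y     = np.array([1, 1, 0, 0])
p_old = np.array([0.6, 0.4, 0.5, 0.3])
p_new = np.array([0.8, 0.5, 0.4, 0.3])
print(idi(y, p_old, p_new))   # ≈ 0.2: slope improves from 0.1 to 0.3
```

Unlike the NRI, the IDI needs no risk categories: it integrates the improvement in sensitivity minus the loss in specificity over all possible cut-offs.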

5,651 citations


"Assessing the performance of predic..." refers methods in this paper

  • ...model with additional predictive information from a biomarker or other sources.(8,9,45) Many measures provide numerical...


  • ...Also, a measure that integrates net reclassification over all possible cut-offs for the probability of the outcome was proposed (integrated discrimination improvement, IDI).(9) The IDI is equivalent to the difference in discrimination slopes of 2 models, and to the difference in Pearson R(2) measures,(45) or the difference in scaled Brier scores....


Journal ArticleDOI
TL;DR: In this article, a generalization of the coefficient of determination R2 to general regression models is discussed, and a modification of an earlier definition to allow for discrete models is proposed.
Abstract: SUMMARY A generalization of the coefficient of determination R2 to general regression models is discussed. A modification of an earlier definition to allow for discrete models is proposed.
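The generalization summarized here (Nagelkerke's R²) rescales the Cox-Snell R² by its maximum attainable value, so that discrete-outcome models can reach 1. A sketch of the formula, with a hypothetical null and fitted log-likelihood (function name ours):

```python
import numpy as np

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's R^2: Cox-Snell R^2 divided by its maximum attainable
    value, so a perfectly fitting discrete model can reach 1."""
    cox_snell = 1 - np.exp(2 / n * (loglik_null - loglik_model))
    max_cs    = 1 - np.exp(2 / n * loglik_null)
    return cox_snell / max_cs

ll_null  = 100 * np.log(0.5)   # null model: 100 subjects, 50% event rate
ll_model = -50.0               # hypothetical fitted-model log-likelihood
print(nagelkerke_r2(ll_null, ll_model, 100))   # ≈ 0.43
```

At the extremes the rescaling behaves as intended: a model no better than the null gives 0, and a model with log-likelihood 0 (perfect predictions) gives exactly 1.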

5,085 citations


"Assessing the performance of predic..." refers methods in this paper

  • ...For generalized linear models, Nagelkerke’s R(2) is often used.(1,33) This is a logarithmic scoring rule....
