
Showing papers on "Brier score published in 2014"


Journal ArticleDOI
TL;DR: A new approach to competing risks using random forests is introduced and it is shown that the method is highly effective for both prediction and variable selection in high-dimensional problems and in settings such as HIV/AIDS that involve many competing risks.
Abstract: We introduce a new approach to competing risks using random forests. Our method is fully non-parametric and can be used for selecting event-specific variables and for estimating the cumulative incidence function. We show that the method is highly effective for both prediction and variable selection in high-dimensional problems and in settings such as HIV/AIDS that involve many competing risks.

180 citations


Journal ArticleDOI
TL;DR: If IDI and NRI are used to measure gain in prediction performance, then poorly calibrated models may appear advantageous, and in a simulation study, even the model that actually generates the data can be improved on without adding measured information.
Abstract: The 'integrated discrimination improvement' (IDI) and the 'net reclassification index' (NRI) are statistics proposed as measures of the incremental prognostic impact that a new biomarker will have when added to an existing prediction model for a binary outcome. By design, both measures were meant to be intuitively appropriate, and the IDI and NRI formulae do look intuitively plausible. Both have become increasingly popular. We shall argue, however, that their use is not always safe. If IDI and NRI are used to measure gain in prediction performance, then poorly calibrated models may appear advantageous, and in a simulation study, even the model that actually generates the data (and hence is the best possible model) can be improved on without adding measured information. We illustrate these shortcomings in actual cancer data as well as by Monte Carlo simulations. In these examples, we contrast IDI and NRI with the area under ROC and the Brier score. Unlike IDI and NRI, these traditional measures have the characteristic that prognostic performance cannot be accidentally or deliberately inflated.

164 citations
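As a concrete illustration of the two traditional measures the authors contrast with IDI and NRI, the following minimal Python sketch scores a baseline model and an expanded model by the area under the ROC curve and the Brier score; the synthetic data-generating choices are illustrative assumptions, not the paper's cancer data.

```python
# Minimal sketch: compare a baseline model and a model "with marker" by
# AUC and Brier score on synthetic data; lower Brier is better.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                        # binary outcomes
p_base = np.clip(0.5 * y + rng.normal(0.3, 0.2, 1000), 0.01, 0.99)
p_new = np.clip(0.6 * y + rng.normal(0.25, 0.2, 1000), 0.01, 0.99)

for name, p in [("baseline", p_base), ("with marker", p_new)]:
    print(f"{name}: AUC={roc_auc_score(y, p):.3f}, Brier={brier_score_loss(y, p):.3f}")
```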


Journal ArticleDOI
TL;DR: It is argued that the QS is ready to become as popular as the Brier score in forecast verification and its decomposition is illustrated on precipitation forecasts derived from the mesoscale weather prediction ensemble COSMO-DE-EPS of the German Meteorological Service.
Abstract: This study expands the pool of verification methods for probabilistic weather and climate predictions by a decomposition of the quantile score (QS). The QS is a proper score function and evaluates predictive quantiles on a set of forecast–observation pairs. We introduce a decomposition of the QS in reliability, resolution and uncertainty and discuss the biases of the decomposition. Further, a reliability diagram for quantile forecasts is presented. Verification with the QS and its decomposition is illustrated on precipitation forecasts derived from the mesoscale weather prediction ensemble COSMO-DE-EPS of the German Meteorological Service. We argue that the QS is ready to become as popular as the Brier score in forecast verification.

79 citations
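For readers unfamiliar with the quantile score, here is a minimal sketch of its standard pinball-loss form, scored against a climatological quantile forecast; the gamma-distributed pseudo-observations are an illustrative assumption.

```python
# Minimal sketch of the quantile (pinball) score at level tau; it is a
# proper score for quantile forecasts, and lower values are better.
import numpy as np

def quantile_score(obs, q_forecast, tau):
    """Mean pinball loss of quantile forecasts q_forecast at level tau."""
    u = obs - q_forecast
    return np.mean(u * (tau - (u < 0)))

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 3.0, size=500)              # e.g. precipitation amounts
q90 = np.quantile(obs, 0.9) * np.ones(500)       # climatological 90% quantile
print(quantile_score(obs, q90, 0.9))
```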


Journal ArticleDOI
28 Feb 2014-PLOS ONE
TL;DR: Multivariate NTCP models with LASSO can be used to predict patient-rated xerostomia after IMRT, and the overall performance for both time points was satisfactory and corresponded well with the expected values.
Abstract: Purpose The aim of this study was to develop a multivariate logistic regression model with least absolute shrinkage and selection operator (LASSO) to make valid predictions about the incidence of moderate-to-severe patient-rated xerostomia among head and neck cancer (HNC) patients treated with IMRT. Methods and Materials Quality of life questionnaire datasets from 206 patients with HNC were analyzed. The European Organization for Research and Treatment of Cancer QLQ-H&N35 and QLQ-C30 questionnaires were used as the endpoint evaluation. The primary endpoint (grade 3+ xerostomia) was defined as moderate-to-severe xerostomia at 3 (XER3m) and 12 months (XER12m) after the completion of IMRT. Normal tissue complication probability (NTCP) models were developed. The optimal and suboptimal numbers of prognostic factors for a multivariate logistic regression model were determined using the LASSO with bootstrapping technique. Statistical analysis was performed using the scaled Brier score, Nagelkerke R2, chi-squared test, Omnibus, Hosmer-Lemeshow test, and the AUC. Results Eight prognostic factors were selected by LASSO for the 3-month time point: Dmean-c, Dmean-i, age, financial status, T stage, AJCC stage, smoking, and education. Nine prognostic factors were selected for the 12-month time point: Dmean-i, education, Dmean-c, smoking, T stage, baseline xerostomia, alcohol abuse, family history, and node classification. In the selection of the suboptimal number of prognostic factors by LASSO, three suboptimal prognostic factors were fine-tuned by Hosmer-Lemeshow test and AUC, i.e., Dmean-c, Dmean-i, and age for the 3-month time point. Five suboptimal prognostic factors were also selected for the 12-month time point, i.e., Dmean-i, education, Dmean-c, smoking, and T stage. The overall performance for both time points of the NTCP model in terms of scaled Brier score, Omnibus, and Nagelkerke R2 was satisfactory and corresponded well with the expected values. Conclusions Multivariate NTCP models with LASSO can be used to predict patient-rated xerostomia after IMRT.

79 citations
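A minimal sketch of the core modelling step described above, an L1-penalised (LASSO) logistic regression evaluated with a scaled Brier score, using scikit-learn on synthetic data; the penalty strength, feature count, and the prevalence-based reference Brier score are illustrative assumptions rather than the paper's setup.

```python
# Minimal sketch: L1-penalised logistic regression (LASSO-style variable
# selection) plus a scaled Brier score; all data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(2)
X = rng.normal(size=(206, 12))                   # 12 candidate predictors
logit = 1.5 * X[:, 0] + 1.0 * X[:, 1] - 0.8 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
p = model.predict_proba(X)[:, 1]

bs = brier_score_loss(y, p)
bs_ref = np.mean(y) * (1 - np.mean(y))           # Brier of forecasting the prevalence
print("selected:", np.flatnonzero(model.coef_), "scaled Brier:", 1 - bs / bs_ref)
```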


Journal ArticleDOI
TL;DR: In this article, a characterization is given of a general class of fair scores for ensembles that are interpreted as random samples. A definition of fairness is also proposed for ensembles with members interpreted as being dependent, and it is shown that fair scores exist only for some forms of dependence.
Abstract: The notion of fair scores for ensemble forecasts was introduced recently to reward ensembles with members that behave as though they and the verifying observation are sampled from the same distribution. In the case of forecasting binary outcomes, a characterization is given of a general class of fair scores for ensembles that are interpreted as random samples. This is also used to construct classes of fair scores for ensembles that forecast multicategory and continuous outcomes. The usual Brier, ranked probability and continuous ranked probability scores for ensemble forecasts are shown to be unfair, while adjusted versions of these scores are shown to be fair. A definition of fairness is also proposed for ensembles with members that are interpreted as being dependent and it is shown that fair scores exist only for some forms of dependence.

77 citations
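A minimal sketch of the fair adjustment to the ensemble Brier score for an m-member ensemble interpreted as a random sample; the correction term below follows the form reported in this literature (an assumption here, not code from the paper), and the synthetic forecast setup is illustrative.

```python
# Minimal sketch of a "fair" ensemble Brier score: the usual squared error
# has a term subtracted that removes the penalty for finite ensemble size.
import numpy as np

def fair_brier(i, m, y):
    """i: members forecasting the event, m: ensemble size, y: outcomes (0/1)."""
    i, y = np.asarray(i, float), np.asarray(y, float)
    return np.mean((i / m - y) ** 2 - i * (m - i) / (m ** 2 * (m - 1)))

rng = np.random.default_rng(3)
m, n = 20, 1000
p_true = rng.uniform(0.1, 0.9, size=n)
i = rng.binomial(m, p_true)                      # members forecasting the event
y = rng.binomial(1, p_true)                      # verifying observations
print("unadjusted:", np.mean((i / m - y) ** 2), "fair:", fair_brier(i, m, y))
```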


Journal ArticleDOI
TL;DR: This paper develops and compares a number of models for calibrating and aggregating forecasts that exploit the well-known fact that individuals exhibit systematic biases during judgment and elicitation.
Abstract: It is known that the average of many forecasts about a future event tends to outperform the individual assessments. With the goal of further improving forecast performance, this paper develops and compares a number of models for calibrating and aggregating forecasts that exploit the well-known fact that individuals exhibit systematic biases during judgment and elicitation. All of the models recalibrate judgments or mean judgments via a two-parameter calibration function, and differ in terms of whether (1) the calibration function is applied before or after the averaging, (2) averaging is done in probability or log-odds space, and (3) individual differences are captured via hierarchical modeling. Of the non-hierarchical models, the one that first recalibrates the individual judgments and then averages them in log-odds is the best relative to simple averaging, with 26.7 % improvement in Brier score and better performance on 86 % of the individual problems. The hierarchical version of this model does slightly better in terms of mean Brier score (28.2 %) and slightly worse in terms of individual problems (85 %).

59 citations
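A minimal sketch of the best-performing non-hierarchical scheme as described: recalibrate each judge with a two-parameter function and then average in log-odds space. The judgments and the parameter values (a, b) are illustrative assumptions.

```python
# Minimal sketch: two-parameter recalibration in log-odds space, followed
# by averaging of the recalibrated judgments in log-odds space.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def recalibrate(p, a, b):
    """Two-parameter calibration: shift (a) and scale (b) in log-odds space."""
    return 1 / (1 + np.exp(-(a + b * logit(p))))

judgments = np.array([[0.6, 0.7, 0.55],          # rows: judges, columns: events
                      [0.8, 0.65, 0.5],
                      [0.7, 0.75, 0.6]])
a, b = 0.0, 1.5                                  # e.g. b > 1 extremizes forecasts
recal = recalibrate(np.clip(judgments, 0.01, 0.99), a, b)
aggregate = 1 / (1 + np.exp(-logit(recal).mean(axis=0)))
print(aggregate)
```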


Journal ArticleDOI
TL;DR: K-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with the kernels of Bessel, linear, Laplacian, and radial basis type are investigated, showing promising performance over all constructed models.
Abstract: Machine learning methods are applied to three different large datasets, all dealing with probability estimation problems for dichotomous or multicategory data. Specifically, we investigate k-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with the kernels of Bessel, linear, Laplacian, and radial basis type. Comparisons are made with logistic regression. The dataset from the German Stroke Study Collaboration, with dichotomous and three-category outcome variables, allows, in particular, for temporal and external validation. The other two datasets are freely available from the UCI learning repository and provide dichotomous outcome variables. One of them, the Cleveland Clinic Foundation Heart Disease dataset, uses data from one clinic for training and from three clinics for external validation, while the other, the thyroid disease dataset, allows for temporal validation by separating data into training and test data by date of recruitment into the study. For dichotomous outcome variables, we use receiver operating characteristic (ROC) curves, area under the curve values with bootstrapped 95% confidence intervals, and Hosmer-Lemeshow-type figures as comparison criteria. For dichotomous and multicategory outcomes, we calculate bootstrap Brier scores with 95% confidence intervals and also compare them through bootstrapping. In a supplement, we provide R code for performing the analyses and for random forest analyses in Random Jungle, version 2.1.0. The learning machines show promising performance over all constructed models. They are simple to apply and serve as an alternative approach to logistic or multinomial logistic regression analysis.

53 citations
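A minimal sketch of the style of comparison criterion described above: a bootstrapped Brier score with a percentile 95% confidence interval. The synthetic outcome and probability vectors are illustrative assumptions.

```python
# Minimal sketch: Brier score with a bootstrap percentile 95% CI.
import numpy as np
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=400)
p = np.clip(0.55 * y + rng.normal(0.25, 0.2, 400), 0.01, 0.99)

boots = []
for _ in range(2000):
    idx = rng.integers(0, len(y), size=len(y))   # resample cases with replacement
    boots.append(brier_score_loss(y[idx], p[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"Brier = {brier_score_loss(y, p):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```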


Journal ArticleDOI
Marion Mittermaier1
TL;DR: The use of conventional metrics and precise matching of the forecast to conventional synoptic observations in space and time is replaced with the use of inherently probabilistic metrics such as the Brier score, ranked probability, and continuous ranked probability scores applied to neighborhoods of forecast grid points.
Abstract: Routine verification of deterministic numerical weather prediction (NWP) forecasts from the convection-permitting 4-km (UK4) and near-convection-resolving 1.5-km (UKV) configurations of the Met Office Unified Model (MetUM) has shown that it is hard to consistently demonstrate an improvement in skill from the higher-resolution model, even though subjective comparison suggests that it performs better. In this paper the use of conventional metrics and precise matching (through extracting the nearest grid point to an observing site) of the forecast to conventional synoptic observations in space and time is replaced with the use of inherently probabilistic metrics such as the Brier score, ranked probability, and continuous ranked probability scores applied to neighborhoods of forecast grid points. Three neighborhood sizes were used: ~4, ~12, and ~25 km, which match the sizes of the grid elements currently used operationally. Six surface variables were considered: 2-m temperature, 10-m wind speed, total...

53 citations
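A minimal sketch of the neighbourhood idea described above: a deterministic gridded forecast is turned into event probabilities by taking the fraction of grid boxes in a surrounding neighbourhood that exceed a threshold, and those probabilities are then scored with the Brier score. The field, the threshold, and the 9x9 neighbourhood are illustrative assumptions.

```python
# Minimal sketch: neighbourhood event probabilities from a deterministic
# gridded forecast, verified with the Brier score against gridded "truth".
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(5)
forecast = rng.gamma(2.0, 1.5, size=(100, 100))  # e.g. a precipitation field
obs = rng.gamma(2.0, 1.5, size=(100, 100))
threshold = 4.0

event_fc = (forecast > threshold).astype(float)
p_neigh = uniform_filter(event_fc, size=9)       # fraction of exceedances in 9x9 boxes
y = (obs > threshold).astype(float)
print("neighbourhood Brier score:", np.mean((p_neigh - y) ** 2))
```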


Journal ArticleDOI
TL;DR: A new metric, the stratified Brier score, is proposed to capture class-specific calibration, analogous to the per-class metrics widely used to assess the discriminative performance of classifiers in imbalanced scenarios; earlier work is extended with ample additional empirical evidence for the utility of this strategy.
Abstract: Obtaining good probability estimates is imperative for many applications. The increased uncertainty and typically asymmetric costs surrounding rare events increase this need. Experts (and classification systems) often rely on probabilities to inform decisions. However, we demonstrate that class probability estimates obtained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration. To our knowledge, this problem has not previously been explored. We propose a new metric, the stratified Brier score, to capture class-specific calibration, analogous to the per-class metrics widely used to assess the discriminative performance of classifiers in imbalanced scenarios. We propose a simple, effective method to mitigate the bias of probability estimates for imbalanced data that bags estimators independently calibrated over balanced bootstrap samples. This approach drastically improves performance on the minority instances without greatly affecting overall calibration. We extend our previous work in this direction by providing ample additional empirical evidence for the utility of this strategy, using both support vector machines and boosted decision trees as base learners. Finally, we show that additional uncertainty can be exploited via a Bayesian approach by considering posterior distributions over bagged probability estimates.

46 citations
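A minimal sketch of a stratified Brier score in the sense described: scoring positives and negatives separately, so that poor calibration on a rare minority class is not masked by good overall calibration. The simulated bias toward underestimating positives is an illustrative assumption.

```python
# Minimal sketch: Brier score computed per class ("stratified") and overall.
import numpy as np

def stratified_brier(y, p):
    y, p = np.asarray(y, float), np.asarray(p, float)
    return {"positives": np.mean((p[y == 1] - 1) ** 2),
            "negatives": np.mean((p[y == 0] - 0) ** 2),
            "overall": np.mean((p - y) ** 2)}

rng = np.random.default_rng(6)
y = rng.binomial(1, 0.05, size=5000)                           # 5% minority class
p = np.clip(0.05 + 0.3 * y + rng.normal(0, 0.05, 5000), 0, 1)  # underestimates positives
print(stratified_brier(y, p))
```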


Journal ArticleDOI
07 Mar 2014-PLOS ONE
TL;DR: The Pietra and the scaled Brier indices are recommended for prediction model performance measurement, in light of their ease of interpretation, clinical relevance and sensitivity to gray-zone resolving markers.
Abstract: As a performance measure for a prediction model, the area under the receiver operating characteristic curve (AUC) is insensitive to the addition of strong markers. A number of measures sensitive to performance change have recently been proposed; however, these relative-performance measures may lead to self-contradictory conclusions. This paper examines alternative performance measures for prediction models: the Lorenz curve-based Gini and Pietra indices, and a standardized version of the Brier score, the scaled Brier. Computer simulations are performed in order to study the sensitivity of these measures to performance change when a new marker is added to a baseline model. When the discrimination power of the added marker is concentrated in the gray zone of the baseline model, the AUC and the Gini show minimal performance improvements. The Pietra and the scaled Brier show more significant improvements in the same situation, comparatively. The Pietra and the scaled Brier indices are therefore recommended for prediction model performance measurement, in light of their ease of interpretation, clinical relevance and sensitivity to gray-zone resolving markers.

45 citations
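A minimal sketch of the Lorenz-curve-based Gini and Pietra indices and the scaled Brier score for a vector of predicted risks; the formulas follow standard Lorenz-curve definitions and a common prevalence-based standardisation of the Brier score, which are assumptions here rather than the paper's exact estimators.

```python
# Minimal sketch: Lorenz-curve Gini and Pietra indices for predicted risks,
# plus a scaled Brier score (1 - BS / BS of forecasting the prevalence).
import numpy as np

def lorenz_indices(p):
    p = np.sort(np.asarray(p, float))            # ascending predicted risks
    x = np.arange(1, len(p) + 1) / len(p)        # cumulative population share
    L = np.cumsum(p) / p.sum()                   # cumulative share of total risk
    gini = 2 * np.mean(x - L)                    # twice the area to the diagonal
    pietra = np.max(x - L)                       # maximum vertical deviation
    return gini, pietra

def scaled_brier(y, p):
    bs = np.mean((p - y) ** 2)
    prev = np.mean(y)
    return 1 - bs / (prev * (1 - prev))

rng = np.random.default_rng(7)
p = rng.beta(2, 5, size=2000)
y = rng.binomial(1, p)                           # calibrated synthetic outcomes
print(lorenz_indices(p), scaled_brier(y, p))
```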


Journal ArticleDOI
TL;DR: Sociodemographic and health characteristics as well as perceptions of the environment are strong predictors of participation in community-based physical activity programs in selected cities of Brazil.
Abstract: Purpose: This study aimed to develop and validate a risk prediction model to examine the characteristics that are associated with participation in community-based physical activity programs in Brazil. Methods: We used pooled data from three surveys conducted from 2007 to 2009 in state capitals of Brazil with 6166 adults. A risk prediction model was built considering program participation as an outcome. The predictive accuracy of the model was quantified through discrimination (C statistic) and calibration (Brier score) properties. Bootstrapping methods were used to validate the predictive accuracy of the final model. Results: The final model showed sex (women: odds ratio [OR] = 3.18, 95% confidence interval [CI] = 2.14-4.71), having less than a high school degree (OR = 1.71, 95% CI = 1.16-2.53), reporting good health (OR = 1.58, 95% CI = 1.02-2.24) or very good/excellent health (OR = 1.62, 95% CI = 1.05-2.51), having any comorbidity (OR = 1.74, 95% CI = 1.26-2.39), and perceiving the environment as safe to walk at night (OR = 1.59, 95% CI = 1.18-2.15) as predictors of participation in physical activity programs. Accuracy indices were adequate (C index = 0.778, Brier score = 0.031) and similar to those obtained from bootstrapping (C index = 0.792, Brier score = 0.030). Conclusions: Sociodemographic and health characteristics as well as perceptions of the environment are strong predictors of participation in community-based programs in selected cities of Brazil.

Journal ArticleDOI
TL;DR: The scheme's ability to predict the oceanic flow generated by the coupled system is investigated, and different phases in the error dynamics are found: for short lead times, an initial overdispersion of the ensemble forecast is present while the ensemble mean follows dynamics reminiscent of the combined amplification of initial-condition and model errors in deterministic systems; for longer lead times, a reliable diffusive ensemble spread is observed.
Abstract: There is a growing interest in developing stochastic schemes for the description of processes that are poorly represented in atmospheric and climate models, in order to increase their variability and reduce the impact of model errors. The use of such noise could, however, have adverse effects by modifying in undesired ways a certain number of moments of their probability distributions. In this work, the impact of developing a stochastic scheme (based on stochastic averaging) for the ocean is explored in the context of a low-order coupled (deterministic) ocean–atmosphere system. After a brief analysis of its variability, the scheme's ability to predict the oceanic flow generated by the coupled system is investigated. Different phases in the error dynamics are found: for short lead times, an initial overdispersion of the ensemble forecast is present while the ensemble mean follows dynamics reminiscent of the combined amplification of initial-condition and model errors in deterministic systems; for longer lead times, a reliable diffusive ensemble spread is observed. These different phases are also found for ensemble-oriented skill measures such as the Brier score and the rank histogram. The implications of these features for building stochastic models are then briefly discussed.
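A minimal sketch of the rank histogram mentioned above: for each forecast case, the rank of the verifying observation within the sorted m-member ensemble; the histogram of those ranks is approximately flat for a reliable spread. The Gaussian toy ensemble is an illustrative assumption.

```python
# Minimal sketch: rank histogram for an m-member ensemble; a flat histogram
# (all ranks roughly equally likely) indicates a reliable ensemble spread.
import numpy as np

rng = np.random.default_rng(8)
n, m = 2000, 15
ensemble = rng.normal(0, 1.0, size=(n, m))       # well-dispersed toy ensemble
obs = rng.normal(0, 1.0, size=n)

ranks = np.sum(ensemble < obs[:, None], axis=1)  # rank of obs, in 0..m
counts = np.bincount(ranks, minlength=m + 1)
print(counts / n)                                # ~uniform (~1/(m+1) per bin)
```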

Proceedings ArticleDOI
15 Sep 2014
TL;DR: The main conclusion is that the appropriate choice of approach to handling sparsity is highly dependent on the performance metric, or task, of interest; for accurately assigning an ADE to a patient record, a sampling-based approach is recommended.
Abstract: When using electronic health record (EHR) data to build models for predicting adverse drug effects (ADEs), one typically faces the problem of data sparsity, i.e., drugs and diagnosis codes that could be used for predicting a certain ADE are absent for most observations. For such tasks, the ability of the employed machine learning technique to effectively handle sparsity is crucial. The state-of-the-art random forest algorithm is frequently employed to handle this type of data. It has, however, recently been demonstrated that the algorithm is biased towards the majority class, which may result in low predictive performance on EHR data with large numbers of sparse features. In this study, approaches to handle this problem are empirically evaluated using 14 ADE datasets and three performance metrics: F1-score, AUC, and Brier score. Two resampling-based techniques are investigated and compared to two baseline approaches. The experimental results indicate that, for larger forests, the resampling methods outperform the baseline approaches when considering F1-score, which is consistent with the metric being affected by class bias. The approaches perform on a similar level with respect to AUC, which can be explained by the metric not being sensitive to class bias. Finally, when considering the squared error (Brier score) of individual predictions, one of the baseline approaches turns out to be ahead of the others. A bias-variance analysis shows that this is an effect of the individual trees being more correct on average for the baseline approach, and that this outweighs the expected loss from a lower variance. The main conclusion is that the appropriate choice of approach to handle sparsity is highly dependent on the performance metric, or the task, of interest. If the task is to accurately assign an ADE to a patient record, a sampling-based approach is recommended. If the task is to rank patients according to risk of a certain ADE, the choice of approach is of minor importance. Finally, if the task is to accurately assign probabilities for a certain ADE, then one of the baseline approaches is recommended.
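A minimal sketch of one resampling approach in the spirit of those evaluated: each tree is grown on a class-balanced bootstrap sample and the forest's averaged probabilities are scored with the study's three metrics. The sparse synthetic features and the in-sample evaluation are simplifying assumptions for brevity.

```python
# Minimal sketch: a forest of trees, each trained on a class-balanced
# bootstrap sample, scored by F1, AUC and Brier (in-sample, for brevity).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, roc_auc_score, brier_score_loss

rng = np.random.default_rng(9)
X = (rng.random((3000, 200)) < 0.02).astype(float)   # sparse binary features
y = rng.binomial(1, np.clip(0.02 + 0.5 * X[:, 0], 0, 1))

pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
probs = np.zeros(len(y))
n_trees = 100
for _ in range(n_trees):
    # balanced bootstrap: equal numbers sampled from each class
    idx = np.concatenate([rng.choice(pos, size=len(pos), replace=True),
                          rng.choice(neg, size=len(pos), replace=True)])
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx])
    probs += tree.predict_proba(X)[:, 1]
probs /= n_trees

print("F1:", f1_score(y, (probs > 0.5).astype(int)),
      "AUC:", roc_auc_score(y, probs),
      "Brier:", brier_score_loss(y, probs))
```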

Book ChapterDOI
15 Sep 2014
TL;DR: In this article, the LS-ECOC method is modified to take the reliability of two-class probabilities into account, and the concept of a reliability map is introduced to accompany the more conventional notion of calibration map.
Abstract: We propose a general method to assess the reliability of two-class probabilities in an instance-wise manner. This is relevant, for instance, for obtaining calibrated multi-class probabilities from two-class probability scores. The LS-ECOC method approaches this by performing least-squares fitting over a suitable error-correcting output code matrix, where the optimisation resolves potential conflicts in the input probabilities. While this gives all input probabilities equal weight, we would like to spend less effort fitting unreliable probability estimates. We introduce the concept of a reliability map to accompany the more conventional notion of calibration map; and LS-ECOC-R which modifies LS-ECOC to take reliability into account. We demonstrate on synthetic data that this gets us closer to the Bayes-optimal classifier, even if the base classifiers are linear and hence have high bias. Results on UCI data sets demonstrate that multi-class accuracy also improves.
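A minimal sketch of the least-squares decoding step that LS-ECOC builds on (without the reliability weighting that LS-ECOC-R adds): given two-class probability estimates for each column of a code matrix, class probabilities are recovered by a non-negative least-squares fit and projected to the simplex. The code matrix and input probabilities are illustrative assumptions.

```python
# Minimal sketch: least-squares decoding of two-class probabilities over an
# error-correcting output code (ECOC) matrix into multi-class probabilities.
import numpy as np
from scipy.optimize import nnls

# Rows are classes, columns are binary tasks (1 = class is in the positive set).
M = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # k=3 classes, b=4 binary tasks

p = np.array([0.55, 0.3, 0.2, 0.8])              # two-class probability estimates
q, _ = nnls(M.T, p)                              # least squares with q >= 0;
q /= q.sum()                                     # then project to the simplex
print(q)                                         # resolved class probabilities
```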

Journal ArticleDOI
TL;DR: Performance analysis and comparison demonstrate the predictive power of these models for FX rates at high frequencies and show that the proposed CTBNC is more effective and more efficient than a dynamic Bayesian network classifier.
Abstract: Prediction of foreign exchange (FX) rates is addressed as a binary classification problem, for which a continuous time Bayesian network classifier (CTBNC) is developed. An exact algorithm for inference on CTBNCs is introduced. The performance of an instance of these classifiers is analysed and compared to that of a dynamic Bayesian network using real tick-by-tick FX rates. Performance analysis and comparison, based on different metrics such as accuracy, precision, recall, and Brier score, demonstrate the predictive power of these models for FX rates at high frequencies. The results also show that the proposed CTBNC is more effective and more efficient than a dynamic Bayesian network classifier. In particular, it makes it possible to perform high-frequency prediction of FX rates in cases where dynamic Bayesian network-based models are computationally intractable.

Journal ArticleDOI
TL;DR: Only moderate prediction accuracy could be achieved using the selected information from the Danish register RSS and other variables need to be included in order to establish a prediction method which provides more accurate risk profiles for long-term sick-listed persons.
Abstract: Targeted interventions for the long-term sick-listed may prevent permanent exclusion from the labour force. We aimed to develop a prediction method for identifying high risk groups for continued or recurrent long-term sickness absence, unemployment, or disability among persons on long-term sick leave. We obtained individual characteristics and follow-up data from the Danish Register of Sickness Absence Compensation Benefits and Social Transfer Payments (RSS) during 2004 to 2010 for 189,279 Danes who experienced a period of long-term sickness absence (4+ weeks). In a learning data set, statistical prediction methods were built using logistic regression and a discrete event simulation approach for a one year prediction horizon. Personalized risk profiles were obtained for five outcomes: employment, unemployment, recurrent sickness absence, continuous long-term sickness absence, and early retirement from the labour market. Predictor variables included gender, age, socio-economic position, job type, chronic disease status, history of sickness absence, and prior history of unemployment. Separate models were built for times of economic growth (2005–2007) and times of recession (2008–2010). The accuracy of the prediction models was assessed with analyses of Receiver Operating Characteristic (ROC) curves and the Brier score in an independent validation data set. In comparison with a null model which ignored the predictor variables, logistic regression achieved only moderate prediction accuracy for the five outcome states. Results obtained with discrete event simulation were comparable with logistic regression. Only moderate prediction accuracy could be achieved using the selected information from the Danish register RSS. Other variables need to be included in order to establish a prediction method which provides more accurate risk profiles for long-term sick-listed persons.

Journal ArticleDOI
TL;DR: In this article, the Brier score is decomposed into three components called reliability, resolution and uncertainty, which characterize different forecast attributes given a dataset of forecast probabilities and corresponding binary verifications.
Abstract: The Brier Score is a widely used criterion to assess the quality of probabilistic predictions of binary events. The expectation value of the Brier Score can be decomposed into the sum of three components called reliability, resolution and uncertainty, which characterize different forecast attributes. Given a dataset of forecast probabilities and corresponding binary verifications, these three components can be estimated empirically. Here, propagation of uncertainty is used to derive expressions that approximate the sampling variances of the estimated components. Variance estimates are provided for both the traditional estimators, as well as for refined estimators that include a bias correction. Applications of the derived variance estimates to artificial data illustrate their validity and application to a meteorological prediction problem illustrates a possible usage case. The observed increase of variance of the bias-corrected estimators is discussed.
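A minimal sketch of the empirical three-component estimation described above, using the classical binning estimators of reliability, resolution, and uncertainty (without the bias correction or the variance formulas the paper derives); the ten equal-width bins are an illustrative assumption.

```python
# Minimal sketch: empirical Brier score decomposition by binning forecasts.
# With binned forecasts, BS ~= reliability - resolution + uncertainty.
import numpy as np

def brier_decomposition(y, p, n_bins=10):
    y, p = np.asarray(y, float), np.asarray(p, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    obar = y.mean()                              # overall base rate
    rel = res = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            fk, ok, nk = p[mask].mean(), y[mask].mean(), mask.sum()
            rel += nk * (fk - ok) ** 2           # reliability contribution
            res += nk * (ok - obar) ** 2         # resolution contribution
    rel, res = rel / len(y), res / len(y)
    unc = obar * (1 - obar)                      # uncertainty
    return rel, res, unc

rng = np.random.default_rng(10)
p = rng.uniform(size=10000)
y = rng.binomial(1, p)                           # perfectly calibrated forecasts
rel, res, unc = brier_decomposition(y, p)
print(rel, res, unc, rel - res + unc, np.mean((p - y) ** 2))
```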

OtherDOI
29 Sep 2014
TL;DR: In this paper, a number of measures of discrimination and calibration, along with graphical representations of calibration and discrimination assessment, are presented, including the c-index and the Hosmer-Lemeshow χ2 statistic.
Abstract: This article presents a number of measures of discrimination and calibration, along with graphical representations of calibration and discrimination assessment. It emphasizes multivariate classification rules for models where the classification is into one of two possible states, and also discusses extensions to multistate classifications. The c-index and the Hosmer–Lemeshow χ2 statistic are the most widely used measures of discrimination and calibration. Keywords: discrimination analysis; calibration; receiver operating characteristic (ROC) curve; c-index; Brier score; Sanders decomposition; Murphy decomposition; Yates decomposition; rank order statistic; Hosmer–Lemeshow chi-square tests; goodness of fit; likelihood ratio test; graphical displays; multistate outcome
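A minimal sketch of the two headline measures: the c-index (equivalent to the ROC AUC for binary outcomes) and the Hosmer–Lemeshow χ2 statistic with decile groups. The degrees-of-freedom convention (groups minus two, appropriate for development data) and the synthetic data are illustrative assumptions.

```python
# Minimal sketch: Hosmer-Lemeshow chi-square over deciles of predicted risk,
# plus the c-index via the ROC AUC.
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow(y, p, groups=10):
    y, p = np.asarray(y, float), np.asarray(p, float)
    order = np.argsort(p)
    stat = 0.0
    for g in np.array_split(order, groups):      # decile groups of predicted risk
        obs, exp, n = y[g].sum(), p[g].sum(), len(g)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return stat, chi2.sf(stat, groups - 2)       # df = groups - 2 (development data)

rng = np.random.default_rng(11)
p = rng.uniform(0.05, 0.95, size=2000)
y = rng.binomial(1, p)                           # well-calibrated synthetic data
print("HL:", hosmer_lemeshow(y, p), "c-index:", roc_auc_score(y, p))
```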

Journal ArticleDOI
TL;DR: The parameters of the Lyman–Kutcher–Burman (LKB), Kallman, and Logit+EUD models are optimized by minimizing the Brier score for a group of 302 prostate cancer patients.
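A minimal sketch of fitting the LKB model by minimising the Brier score, as the TL;DR describes; the generalised equivalent uniform dose (gEUD) values are assumed precomputed per patient, and all numbers (including the simulated complication rates) are illustrative assumptions rather than the paper's data.

```python
# Minimal sketch: fit LKB NTCP parameters (TD50, m) by minimising the Brier
# score of predicted complication probabilities against observed outcomes.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def lkb_ntcp(geud, td50, m):
    """LKB model: NTCP = Phi((gEUD - TD50) / (m * TD50))."""
    return norm.cdf((geud - td50) / (m * td50))

def brier(params, geud, y):
    td50, m = params
    return np.mean((lkb_ntcp(geud, td50, m) - y) ** 2)

rng = np.random.default_rng(12)
geud = rng.normal(65, 8, size=302)               # Gy, one value per patient
y = rng.binomial(1, lkb_ntcp(geud, 70.0, 0.15))  # simulated complications

fit = minimize(brier, x0=[65.0, 0.2], args=(geud, y),
               bounds=[(40, 100), (0.05, 1.0)], method="L-BFGS-B")
print("TD50, m:", fit.x, "Brier:", fit.fun)
```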

Book ChapterDOI
01 Jun 2014
TL;DR: A comparison of ensemble-based methods applied to censored survival data is conducted, with prediction ability evaluated by the integrated Brier score, a prediction measure developed for survival data.
Abstract: In this paper, a comparison of ensemble-based methods applied to censored survival data is conducted. Bagged survival trees, a dipolar survival tree ensemble, and random forests are taken into consideration. Prediction ability is evaluated by the integrated Brier score, a prediction measure developed for survival data. Two real datasets with different percentages of censored observations are examined.
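A minimal sketch of the integrated Brier score: the time-dependent Brier score integrated over an evaluation window. Censoring is ignored here for brevity (the full estimator reweights cases by the inverse probability of censoring), and the exponential survival model is an illustrative assumption.

```python
# Minimal sketch: integrated Brier score for survival predictions, ignoring
# censoring; S_hat holds predicted survival probabilities per subject/time.
import numpy as np

def integrated_brier(times, event_times, S_hat):
    """times: evaluation grid; event_times: observed event times (no censoring);
    S_hat: (n_subjects, n_times) predicted survival probabilities."""
    bs_t = [np.mean(((event_times > t) - S_hat[:, j]) ** 2)
            for j, t in enumerate(times)]
    # integrate BS(t) and normalise by the length of the evaluation window
    return np.trapz(bs_t, times) / (times[-1] - times[0])

rng = np.random.default_rng(13)
T = rng.exponential(10.0, size=300)              # true event times
grid = np.linspace(0.1, 20.0, 50)
S_hat = np.exp(-grid[None, :] / 10.0).repeat(300, axis=0)  # exponential model
print(integrated_brier(grid, T, S_hat))
```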


Journal Article
TL;DR: It turns out that it is difficult to devise ex ante tests to screen informed experts from uninformed experts, because a test that is designed to pass a genuine expert with high probability can also be passed by a strategic charlatan with high probability.
Abstract: Corporate lawyers and their clients routinely hire experts to deliver probabilistic forecasts. For instance, they hire credit rating agencies to deliver credit ratings, which effectively are probabilistic forecasts of credit default events. They also hire experts to deliver probabilistic forecasts of economic, legal, and political events, and even weather events. In hiring an expert, however, they face two distinct problems. The first is a moral hazard problem: how to evaluate, or "score," an expert's forecasts in a way that incentivizes the expert to honestly report her opinions (and, importantly, does not perversely incentivize the expert to dishonestly report her opinions to game the system). The second is an adverse selection problem: how to distinguish informed experts (genuine experts) from uninformed experts (charlatans). The scoring problem was famously solved by Glenn Brier, who proposed a scoring rule that gives the proper incentives. The Brier score is essentially the mean squared error of the expert's forecasts over the evaluation sample. Solutions to the "charlatans" problem, however, have proven harder to come by. When it comes to probabilistic forecasts, it turns out that it is difficult to devise ex ante tests to screen informed experts from uninformed experts. Basically, the difficulty is that a test which is designed to pass a genuine expert with high probability can also be passed by a strategic charlatan with high probability. And ex post warranties are generally not effective.
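A minimal sketch of the Brier score as characterised above: the mean squared error of probability forecasts against realised binary outcomes, under which honest reporting minimises the expected score (the propriety that solves the moral hazard problem).

```python
# Minimal sketch: the Brier score as the mean squared error of an expert's
# probability forecasts over the evaluation sample; lower is better.
import numpy as np

def brier_score(forecasts, outcomes):
    forecasts = np.asarray(forecasts, float)
    outcomes = np.asarray(outcomes, float)
    return np.mean((forecasts - outcomes) ** 2)

# e.g. an expert's default-probability forecasts vs. realised defaults
print(brier_score([0.1, 0.8, 0.3, 0.6], [0, 1, 0, 1]))
```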