
Showing papers on "Brier score published in 2010"


Journal ArticleDOI
TL;DR: It is suggested that reporting discrimination and calibration will always be important for a prediction model and decision-analytic measures should be reported if the predictive model is to be used for clinical decisions.
Abstract: The performance of prediction models can be assessed using a variety of methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic [ROC] curve), and goodness-of-fit statistics for calibration.Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision-analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions.We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n = 544 for model development, n = 273 for external validation).We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
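The two traditional measures named above can be sketched in a few lines; the outcomes and predicted probabilities below are toy values, not data from the testicular cancer case study.

```python
# Brier score and concordance (c) statistic for a binary outcome,
# computed from scratch on illustrative data.

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def c_statistic(y_true, p_pred):
    """Probability that a randomly chosen event case is ranked above a
    randomly chosen non-event case (equals the area under the ROC curve);
    ties count one half."""
    events = [p for y, p in zip(y_true, p_pred) if y == 1]
    nonevents = [p for y, p in zip(y_true, p_pred) if y == 0]
    concordant = sum(
        1.0 if pe > pn else 0.5 if pe == pn else 0.0
        for pe in events for pn in nonevents
    )
    return concordant / (len(events) * len(nonevents))

y = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.4, 0.8, 0.6, 0.2, 0.9]
print(round(brier_score(y, p), 4))  # 0.07
print(c_statistic(y, p))            # 1.0
```

A lower Brier score is better (0 is perfect), while a higher c statistic is better (0.5 is chance, 1 is perfect discrimination); here the model discriminates perfectly but its probabilities are still imperfectly calibrated, which is why both measures are reported.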

3,473 citations



Journal ArticleDOI
TL;DR: A reinterpretation of the logarithmic score or ignorance score, now formulated as the relative entropy or Kullback–Leibler divergence of the forecast distribution from the observation distribution is presented, analogous to the classic decomposition of the Brier score.
Abstract: This paper presents a score that can be used for evaluating probabilistic forecasts of multicategory events. The score is a reinterpretation of the logarithmic score or ignorance score, now formulated as the relative entropy or Kullback–Leibler divergence of the forecast distribution from the observation distribution. Using the information-theoretical concepts of entropy and relative entropy, a decomposition into three components is presented, analogous to the classic decomposition of the Brier score. The information-theoretical twins of the components uncertainty, resolution, and reliability provide diagnostic information about the quality of forecasts. The overall score measures the information conveyed by the forecast. As was shown recently, information theory provides a sound framework for forecast verification. The new decomposition, which has proven to be very useful for the Brier score and is widely used, can help acceptance of the logarithmic score in meteorology.
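A minimal sketch of the reinterpretation for a single forecast: when the observation distribution is one-hot, the Kullback–Leibler divergence from the forecast reduces to the ignorance score, i.e. minus the log of the probability assigned to the category that occurred. The three-category forecast below is illustrative, not from the paper.

```python
import math

# KL divergence KL(o || f) of a forecast f from a one-hot observation o
# equals the ignorance (logarithmic) score, -log2 of the probability
# given to the observed category.

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 vanish."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ignorance(forecast, observed_index):
    """Logarithmic (ignorance) score in bits for one categorical forecast."""
    return -math.log2(forecast[observed_index])

forecast = [0.5, 0.3, 0.2]   # probabilities for three categories
one_hot = [0.0, 1.0, 0.0]    # category 1 occurred

assert abs(kl_divergence(one_hot, forecast) - ignorance(forecast, 1)) < 1e-12
print(ignorance(forecast, 1))  # ≈ 1.737 bits
```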

83 citations


Journal ArticleDOI
TL;DR: A univocal measure of forecast goodness is demonstrated to exist, based on the relative entropy between the observed occurrence frequencies and the predicted probabilities for the forecast events, which is the logarithmic score.
Abstract: The problem of probabilistic forecast verification is approached from a theoretical point of view, starting from three basic desiderata: additivity, exclusive dependence on physical observations (“locality”), and strictly proper behavior. By imposing these requirements and using only elementary mathematics, a univocal measure of forecast goodness is demonstrated to exist. This measure is the logarithmic score, based on the relative entropy between the observed occurrence frequencies and the predicted probabilities for the forecast events. Information theory is then used as a guide to choose the scoring-scale offset for obtaining meaningful and fair skill scores. Finally the Brier score is assessed and, for single-event forecasts, its equivalence to the second-order approximation of the logarithmic score is shown. A large part of the presented results is far from being new or original; nevertheless, their use still meets with some resistance in the weather forecast community. This paper aims at pr...
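The second-order relationship can be illustrated numerically (a sketch, not the paper's derivation): for a single binary event with forecast probability p and outcome y, the logarithmic score is -ln(1 - e), where e = |y - p| is the forecast error, and its second-order Taylor expansion e + e²/2 contains the squared error that underlies the Brier score.

```python
import math

# The log score -ln(1 - e) agrees closely with its second-order
# expansion e + e^2/2 when the forecast error e is small, and
# diverges from it as e grows.

def log_score(p, y):
    """Logarithmic score (in nats) of forecast probability p for outcome y."""
    return -math.log(p if y == 1 else 1.0 - p)

for p, y in [(0.9, 1), (0.8, 1), (0.95, 0)]:
    e = abs(y - p)
    approx = e + e ** 2 / 2          # second-order Taylor expansion
    print(p, y, round(log_score(p, y), 4), round(approx, 4))
```

For the first two cases (small errors) the two columns nearly coincide; for the badly wrong forecast (e = 0.95) the log score is much larger than the quadratic approximation, reflecting the heavier penalty the logarithmic score assigns to confident misses.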

76 citations


Journal ArticleDOI
TL;DR: The effects of serial correlation of forecasts and observations on the sampling properties of forecast verification statistics are examined for probability forecasts of dichotomous events. The sampling variance of the Brier Skill Score (BSS) is shown to be more robust to serial correlation than that of the Brier Score (BS), and hypothesis tests based on BSS are more powerful than those based on BS, substantially so for lower-accuracy forecasts of lower-probability events, for both serially correlated and temporally independent forecasts.
Abstract: Relatively little attention has been given to the effects of serial correlation of forecasts and observations on the sampling properties of forecast verification statistics. An assumption of serial independence for low-quality forecasts may be reasonable. However, forecasts of sufficient quality for autocorrelated events must themselves be autocorrelated: as quality approaches the limit of perfect forecasts, the forecasts become increasingly similar to the corresponding observations. The effects of forecast serial correlation on the sampling properties of the Brier Score (BS) and Brier Skill Score (BSS), for probability forecasts of dichotomous events, are examined here. As in other settings, the effect of serial correlation is to inflate the variances of the sampling distributions of the two statistics, so that uncorrected confidence intervals are too narrow, and uncorrected hypothesis tests yield p-values that are too small. Expressions are given for ‘effective sample size’ corrections for the sampling variances of both BS and BSS, in which it can be seen that the effects of serial correlation on the sampling variances increase with increasing forecast accuracy, and with decreasing climatological event probability. The sampling variance of BSS is more robust to serial correlation than that of BS. Hypothesis tests based on BSS are seen to be more powerful (i.e. more sensitive) than those based on BS, and substantially so for lower-accuracy forecasts of lower-probability events, for both serially correlated and temporally independent forecasts. Copyright © 2010 Royal Meteorological Society
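The flavor of an effective-sample-size correction can be sketched with the common first-order (AR(1)) form; this is the standard textbook expression for the mean of a serially correlated series, not necessarily the exact corrections the paper derives for BS and BSS.

```python
# AR(1) effective sample size: with lag-1 autocorrelation rho1, a series
# of n correlated values carries as much information about its mean as
# n' = n * (1 - rho1) / (1 + rho1) independent values would.

def effective_sample_size(n, rho1):
    return n * (1.0 - rho1) / (1.0 + rho1)

n = 100
for rho in (0.0, 0.3, 0.6):
    print(rho, round(effective_sample_size(n, rho), 1))
# rho1 = 0.6 leaves only 25 effective samples out of 100, so an
# uncorrected standard error would be too small by a factor of two,
# and the resulting confidence intervals too narrow.
```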

62 citations


Journal ArticleDOI
TL;DR: This paper describes the development of a tree-based decision model to predict the severity of pediatric asthma exacerbations in the emergency department (ED) at 2 h following triage, constructed from retrospective patient data abstracted from the ED charts.
Abstract: This paper describes the development of a tree-based decision model to predict the severity of pediatric asthma exacerbations in the emergency department (ED) at 2 h following triage. The model was constructed from retrospective patient data abstracted from ED charts. The original data were preprocessed to eliminate questionable patient records and to normalize values of age-dependent clinical attributes. The model uses attributes routinely collected in the ED and provides predictions even for incomplete observations. Its performance was verified on independent validation data (split-sample validation), where it demonstrated an AUC (area under the ROC curve) of 0.83, sensitivity of 84%, specificity of 71% and a Brier score of 0.18. The model is intended to supplement an asthma clinical practice guideline; however, it can also be used as a stand-alone decision tool.

35 citations


Journal ArticleDOI
TL;DR: It is shown how infinite sequences of densities with defined properties can be used to evaluate the expected performance of mathematical aggregation rules for elicited densities.
Abstract: It is shown how infinite sequences of densities with defined properties can be used to evaluate the expected performance of mathematical aggregation rules for elicited densities. The performance of these rules is measured through the average variance, calibration, and average Brier score of the aggregates. A general result for the calibration of the arithmetic average of densities from well-calibrated independent experts is given. Arithmetic and geometric aggregation rules are compared using sequences of normal densities. Sequences are developed that exhibit dependence among experts and lack of calibration. The impact of correlation, number of experts, and degree of calibration on the performance of the aggregation is demonstrated.
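The two aggregation rules being compared can be sketched in the simplest possible setting, a binary event with point probabilities from two experts; the paper itself works with full densities, so this toy reduction is only illustrative.

```python
import math

# Arithmetic (linear) versus geometric (log) pooling of expert
# probability forecasts for a binary event.

def linear_pool(ps):
    """Arithmetic average of the experts' event probabilities."""
    return sum(ps) / len(ps)

def geometric_pool(ps):
    """Normalized geometric mean over the two outcomes {event, no event}."""
    g_event = math.prod(ps) ** (1 / len(ps))
    g_none = math.prod(1 - p for p in ps) ** (1 / len(ps))
    return g_event / (g_event + g_none)

experts = [0.9, 0.6]
print(round(linear_pool(experts), 3))     # 0.75
print(round(geometric_pool(experts), 3))  # 0.786
```

Note the geometric pool (0.786) is sharper than the arithmetic pool (0.75) when the experts lean the same way; which rule is better calibrated depends on the dependence among experts, which is exactly what the paper's sequence construction probes.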

17 citations


Journal ArticleDOI
TL;DR: In this article, the discriminatory powers of Logit, KMV, and zero-price probability (ZPP) models that represent respectively the regressive fitting model, the option-based pricing model, and the GARCH time series simulation model were analyzed.
Abstract: This paper applies Taiwan electronics industry data to compare the discriminatory power of the Logit, KMV, and zero-price probability (ZPP) models, which represent, respectively, a regressive fitting model, an option-based pricing model, and a GARCH time-series simulation model. In our setting, according to the cumulative accuracy profile, the receiver operating characteristic, and even the Brier score, KMV performs the worst. The disadvantages of KMV are that the equity market exhibits some nonlinear characteristics, that the unknown market value of assets affected by changes in capital structure is not exogenous, and that the failure point is difficult to estimate correctly. Moreover, KMV is too simple to model the fluctuation of equity value as GARCH does. The Logit model, on the other hand, performs above average. To preclude over-fitting and keep the model parsimonious, two significant factors are extracted from as many as forty financial variables for the logistic regression on binary failure data. The Logit training result shows perfect discrimination; however, for the post-sample data, fitting to categorical rather than ordinal data gives Logit divergent predicted failure probabilities and the highest Brier score. In practice, ZPP GARCHNorm uses only the equity value to predict firm failure, yet it performs remarkably well provided that the downward price trend or volatility persistence in stock price changes is appropriately captured. This implies that distorted signals such as trader overreaction and insider trading would impair ZPP GARCHNorm. Nevertheless, the larger type I error than type II error in all models indicates that the prediction of non-failed firms warrants further examination more than that of failed firms.

16 citations


Journal ArticleDOI
TL;DR: It is shown that boosting of simple base classifiers gives classification rules with improved predictive ability; however, the performance of the boosting classifiers was not generally superior to that of logistic regression.
Abstract: Objectives: In clinical medicine, the accuracy achieved by classification rules is often not sufficient to justify their use in daily practice. To improve classifiers, it has become popular to combine single classification rules into a classification ensemble. Two popular boosting methods are compared with classical statistical approaches. Methods: Using data from a clinical study on the diagnosis of breast tumors and by simulation, we compare AdaBoost with gradient boosting ensembles of regression trees. We also consider a tree approach and logistic regression as traditional competitors. In logistic regression we allow selection of non-linear effects by the fractional polynomial approach. Performance of the classifiers is assessed by estimated misclassification rates and the Brier score. Results: We show that boosting of simple base classifiers gives classification rules with improved predictive ability. However, the performance of the boosting classifiers was not generally superior to that of logistic regression. In contrast to the computer-intensive methods, the latter is based on classifiers that are much easier to interpret and to use. Conclusions: In medical applications, the logistic regression model remains a method of choice or, at least, a serious competitor of more sophisticated techniques. Refining boosting methods by optimizing the number of boosting steps may lead to further improvement.
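The two performance measures used in such comparisons answer different questions, which a short sketch makes concrete (toy values, not the study's data): the misclassification rate only checks which side of the threshold each prediction falls on, while the Brier score also penalizes poorly calibrated probabilities.

```python
# Misclassification rate versus Brier score on the same predictions.

def misclassification_rate(y_true, p_pred, threshold=0.5):
    """Fraction of cases on the wrong side of the probability threshold."""
    return sum((p >= threshold) != bool(y)
               for y, p in zip(y_true, p_pred)) / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

y = [1, 0, 1, 0]
p = [0.7, 0.4, 0.3, 0.1]
print(misclassification_rate(y, p))   # 0.25 (one case misclassified)
print(round(brier_score(y, p), 4))    # 0.1875
```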

15 citations


Journal ArticleDOI
TL;DR: In this article, an expression for the decomposed Brier score that accounts for weighted forecast-verification pairs is derived, and a comparison of the unweighted and weighted cases using seasonal forecasts from the ENSEMBLES project is presented.
Abstract: The Brier score is widely used in meteorology for quantifying probability forecast quality. The score can be decomposed into terms representing different aspects of forecast quality, but this implicitly requires each forecast-verification pair to be allocated equal weight. In this note an expression is derived for the decomposed Brier score that accounts for weighted forecast-verification pairs. A comparison of the unweighted and weighted cases using seasonal forecasts from the ENSEMBLES project shows that when weights are assigned proportional to the area represented by each grid point (weighting by cosine of latitude), the weighted forecasts give improved Brier and reliability scores compared with the unweighted case. This result is consistent with what is expected, given that tropical predictability is generally better than extratropical predictability.
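The weighting scheme described above can be sketched directly (illustrative grid values, not ENSEMBLES data): each grid point's squared forecast error is weighted by the cosine of its latitude, so points near the equator, which represent more area, count for more.

```python
import math

# Area-weighted Brier score over grid points, with weights proportional
# to cos(latitude).

def weighted_brier(p_pred, y_obs, lats_deg):
    """Weighted mean squared error of probability forecasts over a grid."""
    w = [math.cos(math.radians(lat)) for lat in lats_deg]
    num = sum(wi * (p - y) ** 2 for wi, p, y in zip(w, p_pred, y_obs))
    return num / sum(w)

p = [0.8, 0.2, 0.6]
y = [1, 0, 1]
lats = [0.0, 45.0, 60.0]   # the tropical point gets the largest weight
print(round(weighted_brier(p, y, lats), 4))  # 0.0672
```

With uniform weights the same forecasts score (0.04 + 0.04 + 0.16)/3 = 0.08; down-weighting the high-latitude point, where the error is largest here, improves the score, the same direction of effect the paper reports for seasonal forecasts.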

15 citations


Journal ArticleDOI
TL;DR: This work studies the impact of using dynamic information as features in a machine learning algorithm for the prediction task of classifying critically ill patients according to the time they need to reach a stable state after coronary bypass surgery: less or more than 9 h.
Abstract: This work studies the impact of using dynamic information as features in a machine learning algorithm for the task of classifying critically ill patients into two classes according to the time they need to reach a stable state after coronary bypass surgery: less or more than 9 h. On the basis of five physiological variables (heart rate, systolic arterial blood pressure, systolic pulmonary pressure, blood temperature and oxygen saturation), different dynamic features were extracted, namely the means and standard deviations at different moments in time, coefficients of multivariate autoregressive models and cepstral coefficients. These sets of features subsequently served as inputs for a Gaussian process, and the prediction results were compared with the case where only admission data were used for the classification. The dynamic features, especially the cepstral coefficients (aROC: 0.749, Brier score: 0.206), resulted in higher performance than static admission data (aROC: 0.547, Brier score: 0.247). The differences in performance are shown to be significant. In all cases, the Gaussian process classifier outperformed logistic regression.

Proceedings ArticleDOI
07 Jul 2010
TL;DR: This paper suggests and evaluates a rule extraction algorithm utilizing a more informed fidelity criterion; the novel algorithm, which is based on genetic programming, minimizes the difference in probability estimates between the extracted and the opaque models by using the generalized Brier score as fitness function.
Abstract: Most highly accurate predictive modeling techniques produce opaque models. When comprehensible models are required, rule extraction is sometimes used to generate a transparent model based on the opaque one. Naturally, the extracted model should be as similar as possible to the opaque model. This criterion, called fidelity, is therefore a key part of the optimization function in most rule extraction algorithms. To the best of our knowledge, all existing rule extraction algorithms targeting fidelity use 0/1 fidelity, i.e., they maximize the number of identical classifications. In this paper, we suggest and evaluate a rule extraction algorithm utilizing a more informed fidelity criterion. More specifically, the novel algorithm, which is based on genetic programming, minimizes the difference in probability estimates between the extracted and the opaque models by using the generalized Brier score as fitness function. Experimental results from 26 UCI data sets show that the suggested algorithm obtained considerably higher accuracy and significantly better AUC than both the exact same rule extraction algorithm maximizing 0/1 fidelity and the standard tree inducer J48. Somewhat surprisingly, rule extraction using the more informed fidelity metric normally resulted in less complex models, showing that the improved predictive performance was not achieved at the expense of comprehensibility.

Journal ArticleDOI
TL;DR: In this paper, a variable length Markov model is used to compare the usefulness of three alternatives to the hit and miss score: the Mean Absolute Error, the Ignorance Score, and the Brier score.
Abstract: The problem of predicting the next request during a user's navigation session has been extensively studied. In this context, higher-order Markov models have been widely used to model navigation sessions and to predict the next navigation step, while prediction accuracy has been mainly evaluated with the hit and miss score. We claim that this score, although useful, is not sufficient for evaluating next link prediction models with the aim of finding a sufficient order of the model, the size of a recommendation set, and assessing the impact of unexpected events on the prediction accuracy. Herein, we make use of a variable length Markov model to compare the usefulness of three alternatives to the hit and miss score: the Mean Absolute Error, the Ignorance Score, and the Brier score. We present an extensive evaluation of the methods on real data sets and a comprehensive comparison of the scoring methods.
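The four scores being compared can be sketched for a single probabilistic next-link prediction over K candidate links; the definitions below are the standard forms of each score, which may differ in normalization from the paper's, and the distribution is a toy example.

```python
import math

# Scoring one next-link forecast over three candidate links, where the
# link at index 0 was actually requested.

forecast = [0.5, 0.3, 0.2]
observed = 0

# Hit and miss: 1 if the top-ranked link is the one requested, else 0.
hit_and_miss = 1 if max(range(3), key=lambda k: forecast[k]) == observed else 0
# Mean Absolute Error against the one-hot outcome vector.
mae = sum(abs((k == observed) - forecast[k]) for k in range(3)) / 3
# Ignorance score: -log2 probability given to the observed link.
ignorance = -math.log2(forecast[observed])
# Multicategory Brier score: squared error summed over all links.
brier = sum(((k == observed) - forecast[k]) ** 2 for k in range(3))

print(hit_and_miss)        # 1
print(round(mae, 4))       # 0.3333
print(round(ignorance, 4)) # 1.0
print(round(brier, 4))     # 0.38
```

The hit-and-miss score collapses the whole distribution to a single top-1 check, whereas the other three reward the probability mass placed on the observed link, which is what makes them useful for choosing a model order or a recommendation-set size.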

Journal Article
TL;DR: In this article, three ensemble prediction systems (ECMWF, NCEP and CMA) from the TIGGE-CMA archiving center (TIGGE, THORPEX Interactive Grand Global Ensemble) were assessed against observations from 19 stations located in the Dapoling-Wangjiaba sub-catchment of the Huaihe Basin.
Abstract: The precipitation forecasts of three ensemble prediction systems (ECMWF, NCEP and CMA) from the TIGGE-CMA archiving center (TIGGE, THORPEX Interactive Grand Global Ensemble) were assessed against observations from 19 stations located in the Dapoling-Wangjiaba sub-catchment of the Huaihe Basin, covering a 37-day period beginning on 1 July 2008. The Threat Score (TS), the Brier score and a percentile method were employed to assess the performance of the three ensemble prediction systems (EPSs) and their grand ensemble. The skill of probabilistic prediction of the heavy rain events occurring during 22–23 July 2008 was also investigated. The TS and Brier score verifications showed that the grand ensemble usually gives the best scores among the three EPSs. The Brier score verification showed that some members of each of the three EPSs captured the extreme events even at a lead time of 10 days. However, the probability skill was usually decreased by a simple ensemble mean, whereas the grand ensemble increased the skill of probabilistic precipitation prediction, even though the simulations tend to underestimate relative to the observations as the lead time ranges from 1 to 10 days. This means the probability forecasts are more skillful with a grand ensemble than with a single EPS; the skill of probabilistic prediction with the grand ensemble improved not only the spatial distribution of precipitation but also its intensity.


Journal ArticleDOI
TL;DR: Although the predictive and discriminative ability of the models increased with each step, even the simplest model containing only data from questions or blood samples alone yielded valid estimates of cardiovascular risk.
Abstract: Objective. To develop a cardiovascular risk model simulating different clinical settings using a staged approach. Design. Using data from 27 477 men and women from the Norwegian Tromsø Study in 1986–1987 and 1994–1995, Cox regression models for either myocardial infarction (MI) or stroke, combined with a similar model for the competing event, were used to develop a risk model that assesses ten-year risk of MI and stroke. Explanatory variables (questions, simple examinations and blood samples) were added gradually. The model was validated using the Hosmer-Lemeshow test, the Brier score, the c-index, integrated discrimination improvement (IDI) and Net Reclassification Improvement (NRI). Results. In total, 1298 events of MI and 769 events of stroke were registered. For MI the model showed excellent discrimination in each step, with c-index from 0.833 to 0.946. For stroke the c-index ranged between 0.817 and 0.898. IDI showed significant increases in discrimination. The Brier scores and goodness-of-fit tests showed well-calibrated models in all steps for all sex- and end-point-specific models (p > 0.05). Conclusions. Although the predictive and discriminative ability of the models increased with each step, even the simplest model, containing only data from questions or blood samples alone, yielded valid estimates of cardiovascular risk.


DissertationDOI
01 Jan 2010
TL;DR: A novel method is proposed that is capable of measuring similarities as well as differences in the performance of different learning models, and is more sensitive to them than the standard ROC curve.
Abstract: This thesis addresses evaluation methods used to measure the performance of machine learning algorithms. In supervised learning, algorithms are designed to perform common learning tasks including classification, ranking, scoring, and probability estimation. This work investigates how information, produced by these various learning tasks, can be utilized by the performance evaluation measure. In the literature, researchers recommend evaluating classification and ranking tasks using the Receiver Operating Characteristics (ROC) curve. In a scoring task, the learning model estimates scores, from the training data, and assigns them to the testing data. These scores are used to express class memberships. Sometimes, these scores represent probabilities in which case the Mean Squared Error (Brier Score) is used to measure their quality. However, if these scores are not probabilities, the task is reduced to a ranking or a classification task by ignoring them. The standard ROC curve also eliminates such scores from its analysis. We claim that using non-probabilistic scores as probabilities is often incorrect, and doing it properly would mean imposing additional assumptions on the algorithm or on the data. Ignoring these scores fully, however, is also problematic since, in practice, although they may provide a poor estimate of probabilities, their magnitudes, nonetheless, provide information that can be valuable for performance analysis. The purpose of this dissertation is to propose a novel method that extends the ROC curve to include such scores. We, therefore, call it the scored ROC curve. In particular, we develop a method to construct a scored ROC curve, demonstrate how to reduce it to a standard ROC curve, and illustrate how it can be used to compare learning models. 
Our experiments demonstrate that the scored ROC curve is capable of measuring similarities as well as differences in the performance of different learning models, and is more sensitive to them than the standard ROC curve. In addition, we illustrate our method's ability to detect changes in data distribution between training and testing.

01 Jan 2010
TL;DR: It is argued that the alternative to Jeffrey-Conditionalization required in certain diachronic cases is an irrational updating procedure, and that the Brier score, and quadratic scoring rules generally, should be rejected as legitimate measures of inaccuracy.
Abstract: Leitgeb and Pettigrew (2010a,b) argue that (1) agents should minimize the expected inaccuracy of their beliefs, and (2) inaccuracy should be measured via the Brier score. They show that in certain diachronic cases, these claims require an alternative to Jeffrey-Conditionalization. I claim that this alternative is an irrational updating procedure and that the Brier score, and quadratic scoring rules generally, should be rejected as legitimate measures of inaccuracy.


Journal ArticleDOI
TL;DR: In this article, the authors develop stratified Cox hazard models with time-varying covariates to investigate issuer heterogeneity in the rating dynamics of U.S. seasoned issuers, and estimate the probability that a rating survives in its current grade at a certain forecast horizon.
Abstract: This study develops stratified Cox’s hazard models with time-varying covariates to investigate issuer-heterogeneity in the rating dynamics of U.S. seasoned issuers, and estimate the probability that a rating survives in its current grade at a certain forecast horizon. The estimation process controls for the sequence of repeated migration events and accounts for the changes in macro-economic conditions over the period 1984-2000. The study overcomes the challenges in forming time-varying probability estimates when the proportionality assumption of the Cox’s hazard model (Cox, 1972) does not hold and the data sample includes multiple strata. To evaluate the predictive performance of rating history, the Brier score (Brier, 1950) and its covariance decomposition (Yates, 1982) were employed. It is found that the probability of rating migrations is a function of rating history and that rating history is more important than the current rating in determining the probability of a rating change. Tests of forecast accuracy over the period 2005-2010 suggest that the default model exhibits superior forecast performance whereas the downgrade and upgrade hazard models have some predictive accuracy. The findings suggest that an accurate migration forecast framework is more likely to be constructed if rating history variables are incorporated into credit risk models.