Showing papers on "Brier score" published in 2016


Journal ArticleDOI
TL;DR: The experimental results, analysis and statistical tests demonstrate the ability of the proposed combination method to improve prediction performance against all base classifiers, namely, LR, MARS and seven traditional combination methods, in terms of average accuracy, area under the curve, the H-measure and Brier score.
Abstract: Banks take great care when dealing with customer loans to avoid any improper decisions that can lead to loss of opportunity or financial losses. Regarding this, researchers have developed complex credit scoring models using statistical and artificial intelligence (AI) techniques to help banks and financial institutions to support their financial decisions. Various models, from easy to advanced approaches, have been developed in this domain. However, during the last few years there has been marked attention towards development of ensemble or multiple classifier systems, which have proved their ability to be more accurate than single classifier models. However, among the multiple classifier systems models developed in the literature, there has been little consideration given to: 1) combining classifiers of different algorithms (as most have focused on building classifiers of the same algorithm); or 2) exploring different classifier output combination techniques other than the traditional ones, such as majority voting and weighted average. In this paper, the aim is to present a new combination approach based on classifier consensus to combine multiple classifier systems (MCS) of different classification algorithms. Specifically, six of the main well-known base classifiers in this domain are used, namely, logistic regression (LR), neural networks (NN), support vector machines (SVM), random forests (RF), decision trees (DT) and naive Bayes (NB). Two benchmark classifiers are considered as a reference point for comparison with the proposed method and the other classifiers. These are used in combination with LR, which is still considered the industry-standard model for credit scoring models, and multivariate adaptive regression splines (MARS), a widely adopted technique in credit scoring studies. The experimental results, analysis and statistical tests demonstrate the ability of the proposed combination method to improve prediction performance against all base classifiers, namely, LR, MARS and seven traditional combination methods, in terms of average accuracy, area under the curve (AUC), the H-measure and Brier score (BS). The model was validated over five real-world credit scoring datasets.
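For readers unfamiliar with the evaluation criteria listed above, the sketch below shows how accuracy, AUC and the Brier score are computed from a set of combined probability estimates. The labels and probabilities are invented for illustration; this is not the paper's implementation and assumes only NumPy and scikit-learn.

```python
# Minimal sketch: scoring combined classifier probabilities with accuracy,
# AUC and the Brier score (illustrative data, not the study's datasets).
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 1, 1, 0, 1, 0])                    # 1 = default, 0 = good loan
probs = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1])    # combined probability estimates

accuracy = np.mean((probs >= 0.5) == y)             # hard-label accuracy at a 0.5 cut-off
auc = roc_auc_score(y, probs)                       # ranking quality (area under the ROC curve)
brier = np.mean((probs - y) ** 2)                   # mean squared error of the probabilities
print(accuracy, auc, brier)
```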

156 citations


Journal ArticleDOI
TL;DR: The findings suggest that additional considerations are needed to better estimate complications after open VHR, and the ACS Surgical Risk Calculator accurately predicted medical complications, reoperation, and 30-day mortality.
Abstract: Background Preoperative surgical risk assessment continues to be a critical component of clinical decision-making. The ACS Universal Risk Calculator estimates risk for several outcomes based on individual risk profiles. Although this represents a tremendous step toward improving outcomes, studies have reported inaccuracies among certain patient populations. This study aimed to assess the predictive accuracy of the American College of Surgeons' (ACS) Risk Calculator in patients undergoing open ventral hernia repair (VHR). Methods A review of patients undergoing open, isolated VHR between 7/1/2007 and 7/1/2014 by a single surgeon was performed. Risk factors and outcomes were collected as defined by the National Surgical Quality Improvement Program. Thirty-day outcomes included serious complication, venous thromboembolism, medical morbidity, surgical site infection (SSI), unplanned reoperation, mortality, and length of stay (LOS). Patient profiles were entered into the ACS Surgical Risk Calculator and outcome-specific risk predictions recorded. Prediction accuracy was assessed using the Brier score and the area under the receiver operating characteristic curve (AUC). Results One hundred forty-two patients undergoing open VHR were included. ACS predictions were accurate for cardiac complications (Brier = .02), venous thromboembolism (Brier = .08), reoperation (Brier = .10), and mortality (Brier = .01). Significantly underestimated outcomes included SSI (Brier = .14), serious complication (Brier = .30), and any complication (Brier = .34). Discrimination ranged from highly accurate (mortality, AUC = .99) to indiscriminate (SSI, AUC = .57). Predicted LOS was 3-fold shorter than observed (2.4 vs 7.4 days, P …). Conclusions The ACS Surgical Risk Calculator accurately predicted medical complications, reoperation, and 30-day mortality. However, SSIs, serious complications, and LOS were significantly underestimated. These findings suggest that additional considerations are needed to better estimate complications after open VHR.

33 citations


Book ChapterDOI
01 Jan 2016
TL;DR: This article is an introduction to some of the most commonly used performance measures for the evaluation of binary classifiers, and explains how to assess the statistical significance of an obtained performance value, how to calculate approximate and exact parametric confidence intervals, and how to derive percentile bootstrap confidence intervals for a performance measure.
Abstract: This article is an introduction to some of the most commonly used performance measures for the evaluation of binary classifiers. These measures are categorized into three broad families: measures based on a single classification threshold, measures based on a probabilistic interpretation of error, and ranking measures. Graphical methods, such as ROC curves, precision-recall curves, TPR-FPR plots, gain charts, and lift charts, are also discussed. Using a simple example, we illustrate how to calculate the various performance measures and show how they are related. The article also explains how to assess the statistical significance of an obtained performance value, how to calculate approximate and exact parametric confidence intervals, and how to derive percentile bootstrap confidence intervals for a performance measure.
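As a companion to the discussion of percentile bootstrap confidence intervals, here is a hedged sketch of the procedure for a single performance measure (the Brier score is used, but any measure could be substituted); the data and the 2,000-replicate choice are illustrative assumptions, not values from the article.

```python
# Percentile bootstrap confidence interval for a performance measure
# (here the Brier score); data and number of replicates are illustrative.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                              # binary labels
p = np.clip(0.6 * y + rng.normal(0.2, 0.2, size=200), 0, 1)   # toy predicted probabilities

def brier(y, p):
    return np.mean((p - y) ** 2)

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y), size=len(y))                # resample cases with replacement
    boot.append(brier(y[idx], p[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])                     # 95% percentile interval
print(brier(y, p), (lo, hi))
```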

29 citations


Journal ArticleDOI
TL;DR: The HOSPITAL score prospectively identified patients at high risk of 30-day unplanned readmission or death with good performance in medical patients in Switzerland and makes it an easy-to-use tool to target patients who might most benefit from intensive transitional care interventions.
Abstract: PRINCIPLES The HOSPITAL score is a simple prediction model that accurately identifies patients at high risk of readmission and showed good performance in an international multicentre retrospective study. We aimed to demonstrate prospectively its accuracy to predict 30-day unplanned readmission and death. METHODS We prospectively screened all consecutive patients aged ≥50 years admitted to the department of general internal medicine of a large community hospital in Switzerland. We excluded patients who refused to give consent, who died during hospitalisation, or who were transferred to another acute care, rehabilitation or palliative care facility. The primary outcome was the first unplanned readmission or death within 30 days after discharge. Some of the predictors of the original score (discharge from an oncology service and length of stay) were adapted according to the setting for practical reasons, before the start of patient inclusion. We also assessed a simplified version of the score, without the variable "any procedure performed during hospitalisation". The performance of the score was evaluated according to its overall accuracy (Brier score), its discriminatory power (C-statistic), and its calibration (Hosmer-Lemeshow goodness-of-fit test). RESULTS Among the 346 included patients, 40 (11.6%) had a 30-day unplanned readmission or death. The HOSPITAL score showed very good accuracy (Brier score 0.10), good discriminatory power (C-statistic 0.70, 95% confidence interval [CI] 0.62-0.79), and an excellent calibration (p = 0.77). Patients were classified into three risk categories for the primary outcome: low (59%), intermediate (20.8%) and high risk (20.2%). The estimated risks of unplanned readmission/death for each category were 8.2%, 11.3% and 21.6%, respectively. The simplified score showed the same performance, with a Brier score of 0.10, a C-statistic of 0.70 (95% CI 0.61-0.79), and a goodness-of-fit statistic of 0.40. CONCLUSIONS The HOSPITAL score prospectively identified patients at high risk of 30-day unplanned readmission or death with good performance in medical patients in Switzerland. Its simplicity and good performance make it an easy-to-use tool to target patients who might most benefit from intensive transitional care interventions.
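The three aspects of performance reported above (overall accuracy, discrimination and calibration) can be sketched in a few lines; the snippet below uses simulated risks rather than the Swiss cohort data and a textbook decile-based Hosmer-Lemeshow statistic, so it illustrates the metrics rather than reproducing the study.

```python
# Rough sketch of the HOSPITAL-score evaluation metrics: Brier score,
# C-statistic and a decile-based Hosmer-Lemeshow test. Simulated data only.
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
p = rng.uniform(0.02, 0.40, size=346)      # predicted 30-day readmission/death risks
y = rng.binomial(1, p)                     # simulated outcomes consistent with those risks

brier = np.mean((p - y) ** 2)
c_stat = roc_auc_score(y, p)               # C-statistic equals the AUC for a binary outcome

g = 10                                     # Hosmer-Lemeshow with decile groups
cuts = np.quantile(p, np.linspace(0, 1, g + 1)[1:-1])
groups = np.digitize(p, cuts)
H = 0.0
for k in range(g):
    m = groups == k
    n_k, obs, exp = m.sum(), y[m].sum(), p[m].sum()
    H += (obs - exp) ** 2 / (exp * (1 - exp / n_k))
p_value = chi2.sf(H, df=g - 2)             # a large p-value suggests adequate calibration
print(brier, c_stat, p_value)
```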

29 citations


Journal ArticleDOI
TL;DR: In this article, a new mass transportation distance rank histogram is developed for assessing the reliability of unequally likely scenarios and energy scores, rank histograms and Brier scores are applied to alternative sets of scenarios that are generated by two very different methods.
Abstract: In power systems with high penetration of wind generation, probabilistic scenarios are generated for use in stochastic formulations of day-ahead unit commitment problems. To minimize the expected cost, the wind power scenarios should accurately represent the stochastic process for available wind power. We employ some statistical evaluation metrics to assess whether the scenario set possesses desirable properties that are expected to lead to a lower cost in stochastic unit commitment. A new mass transportation distance rank histogram is developed for assessing the reliability of unequally likely scenarios. Energy scores, rank histograms and Brier scores are applied to alternative sets of scenarios that are generated by two very different methods. The mass transportation distance rank histogram is best able to distinguish between sets of scenarios that are more or less calibrated according to their bias, variability and autocorrelation. Copyright © 2015 John Wiley & Sons, Ltd.
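A plain rank histogram for equally likely scenarios, the starting point that the paper's mass transportation distance variant generalises, can be sketched as follows; the scenario and observation series are synthetic, and the snippet does not implement the unequal-probability extension.

```python
# Rank (Talagrand) histogram for equally likely scenarios: a flat histogram
# indicates a calibrated scenario set. Synthetic data; the paper's
# mass-transportation variant for unequally likely scenarios is not shown.
import numpy as np

rng = np.random.default_rng(2)
T, M = 500, 20                                   # time steps, scenarios per step
scenarios = rng.normal(0.0, 1.0, size=(T, M))
obs = rng.normal(0.0, 1.3, size=T)               # more variable than the scenarios

ranks = (scenarios < obs[:, None]).sum(axis=1)   # rank of the observation among members (0..M)
hist = np.bincount(ranks, minlength=M + 1) / T
print(np.round(hist, 3))                         # U-shape here reveals under-dispersed scenarios
```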

25 citations


Journal ArticleDOI
TL;DR: The authors showed that Brier scores are correlated to the accuracy of a climate model ensemble's calculation of the fraction of attributable risk (FAR), although only weakly, by constructing a modeling framework where the true FAR is already known.
Abstract: Although it is critical to assess the accuracy of attribution studies, the fraction of attributable risk (FAR) cannot be directly assessed from observations since it involves the probability of an event in a world that did not happen, the “natural” world where there was no human influence on climate. Instead, reliability diagrams (usually used to compare probabilistic forecasts to the observed frequencies of events) have been used to assess climate simulations employed for attribution and by inference to evaluate the attribution study itself. The Brier score summarizes this assessment of a model by the reliability diagram. By constructing a modeling framework where the true FAR is already known, this paper shows that Brier scores are correlated to the accuracy of a climate model ensemble’s calculation of FAR, although only weakly. This weakness exists because the diagram does not account for accuracy of simulations of the natural world. This is better represented by two reliability diagrams from e...
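For context, the quantity being estimated can be written down directly; the counts below are invented, and the snippet only illustrates the standard event-attribution definition FAR = 1 - P_nat / P_all, not the paper's modeling framework.

```python
# Standard event-attribution definition, with made-up exceedance counts:
# FAR = 1 - P_nat / P_all, where P_nat is the event probability in the
# counterfactual "natural" ensemble and P_all in the all-forcings ensemble.
n_nat, hits_nat = 1000, 20        # natural-world members and event exceedances
n_all, hits_all = 1000, 80        # all-forcings members and event exceedances

p_nat = hits_nat / n_nat
p_all = hits_all / n_all
far = 1.0 - p_nat / p_all         # 0.75 here: three quarters of the risk attributable
print(far)
```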

24 citations


Book ChapterDOI
04 Aug 2016
TL;DR: In this paper, an ensemble of optimal trees in terms of their predictive performance is proposed, which is formed by selecting the best trees from a large initial set of trees grown by random forest.
Abstract: Machine learning methods can be used for estimating the class membership probability of an observation. We propose an ensemble of optimal trees in terms of their predictive performance. This ensemble is formed by selecting the best trees from a large initial set of trees grown by random forest. A proportion of trees is selected on the basis of their individual predictive performance on out-of-bag observations. The selected trees are further assessed for their collective performance on an independent training data set. This is done by adding the trees one by one, starting with the best-performing tree. A tree is selected for the final ensemble if it increases the predictive performance of the previously combined trees. The proposed method is compared with probability estimation trees, random forest and node harvest on a number of benchmark problems, using the Brier score as a performance measure. In addition to reducing the number of trees in the ensemble, our method gives better results in most of the cases. The results are supported by a simulation study.
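The selection procedure described above can be sketched as follows. This is a hedged approximation, not the authors' code: it ranks the trees of a scikit-learn random forest by their individual Brier score and then adds them greedily, but it uses a plain validation split where the paper uses out-of-bag observations and an independent data set.

```python
# Hedged sketch of the "optimal trees" idea: rank individual trees by their
# Brier score on held-out data, then add them one by one, keeping a tree only
# if it lowers the ensemble's Brier score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def brier(trees, X, y):
    p = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
    return np.mean((p - y) ** 2)

ranked = sorted(forest.estimators_, key=lambda t: brier([t], X_val, y_val))
selected = [ranked[0]]
for t in ranked[1:]:
    if brier(selected + [t], X_val, y_val) < brier(selected, X_val, y_val):
        selected.append(t)

print(len(selected), brier(selected, X_val, y_val), brier(forest.estimators_, X_val, y_val))
```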

19 citations


Journal ArticleDOI
TL;DR: By dealing with multi-modal data, the proposed learning methods show effectiveness in predicting prediabetics at risk for rapid atherosclerosis progression and demonstrated utility in outcome prediction in a typical multidimensional clinical dataset with a relatively small number of subjects.
Abstract: Prediabetes is a major epidemic and is associated with adverse cardio-cerebrovascular outcomes. Early identification of patients who will develop rapid progression of atherosclerosis could be beneficial for improved risk stratification. In this paper, we investigate important factors impacting the prediction, using several machine learning methods, of rapid progression of carotid intima-media thickness in impaired glucose tolerance (IGT) participants. In the Actos Now for Prevention of Diabetes (ACT NOW) study, 382 participants with IGT underwent carotid intima-media thickness (CIMT) ultrasound evaluation at baseline and at 15–18 months, and were divided into rapid progressors (RP, n = 39, 58 ± 17.5 μm change) and non-rapid progressors (NRP, n = 343, 5.8 ± 20 μm change, p < 0.001 versus RP). To deal with complex multi-modal data consisting of demographic, clinical, and laboratory variables, we propose a general data-driven framework to investigate the ACT NOW dataset. In particular, we first employed a Fisher score-based feature selection method to identify the most effective variables and then proposed a probabilistic Bayes-based learning method for the prediction. Comparison of the methods and factors was conducted using area under the receiver operating characteristic curve (AUC) analyses and the Brier score. The experimental results show that the proposed learning methods performed well in identifying or predicting RP. Among the methods, the performance of naive Bayes was the best (AUC 0.797, Brier score 0.085) compared to multilayer perceptron (0.729, 0.086) and random forest (0.642, 0.10). The results also show that feature selection has a significant positive impact on the data prediction performance. By dealing with multi-modal data, the proposed learning methods show effectiveness in predicting prediabetics at risk for rapid atherosclerosis progression. The proposed framework demonstrated utility in outcome prediction in a typical multidimensional clinical dataset with a relatively small number of subjects, extending the potential utility of machine learning approaches beyond extremely large-scale datasets.
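The two methodological ingredients named above, Fisher score-based feature selection followed by a naive Bayes classifier scored with the Brier score, can be sketched as below. The data are synthetic stand-ins for the ACT NOW variables, and keeping the ten top-ranked features is an arbitrary illustration rather than the study's choice.

```python
# Sketch of Fisher-score feature ranking for a two-class problem (RP vs NRP),
# followed by a Gaussian naive Bayes fit; synthetic data only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
X = rng.normal(size=(382, 30))
y = rng.binomial(1, 0.1, size=382)                 # roughly 10% rapid progressors

def fisher_scores(X, y):
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        num += len(Xk) * (Xk.mean(axis=0) - overall_mean) ** 2
        den += len(Xk) * Xk.var(axis=0)
    return num / den                               # between-class over within-class spread

top = np.argsort(fisher_scores(X, y))[::-1][:10]   # keep the 10 highest-scoring features
model = GaussianNB().fit(X[:, top], y)
p = model.predict_proba(X[:, top])[:, 1]
print(np.mean((p - y) ** 2))                       # Brier score on the training data
```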

14 citations


Journal ArticleDOI
TL;DR: It is shown that test-based Bayes factors can be applied to the Cox proportional hazards model and if the goal is to select a single model, then both the maximum a posteriori and the median probability model can be calculated.
Abstract: There is now a large literature on objective Bayesian model selection in the linear model based on the g-prior. The methodology has been recently extended to generalized linear models using test-based Bayes factors. In this paper, we show that test-based Bayes factors can also be applied to the Cox proportional hazards model. If the goal is to select a single model, then both the maximum a posteriori and the median probability model can be calculated. For clinical prediction of survival, we shrink the model-specific log hazard ratio estimates with subsequent calculation of the Breslow estimate of the cumulative baseline hazard function. A Bayesian model average can also be employed. We illustrate the proposed methodology with the analysis of survival data on primary biliary cirrhosis patients and the development of a clinical prediction model for future cardiovascular events based on data from the Second Manifestations of ARTerial disease (SMART) cohort study. Cross-validation is applied to compare the predictive performance with alternative model selection approaches based on Harrell's c-Index, the calibration slope and the integrated Brier score. Finally, a novel application of Bayesian variable selection to optimal conditional prediction via landmarking is described. Copyright © 2016 John Wiley & Sons, Ltd.
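The integrated Brier score used for the cross-validated comparison can be illustrated with a toy survival curve; the sketch below ignores censoring (a proper estimator reweights by the inverse probability of censoring) and uses invented times, so it shows the idea rather than the paper's procedure.

```python
# Toy integrated Brier score for a survival model, ignoring censoring for
# brevity (the usual estimator applies inverse-probability-of-censoring
# weights). Times and the predicted survival curve are made up.
import numpy as np

event_times = np.array([2.0, 5.0, 7.5, 9.0, 12.0])   # observed event times
grid = np.linspace(0.5, 10.0, 50)                     # evaluation time grid

def surv_hat(t):
    return np.exp(-0.1 * t)                           # toy predicted survival probability S(t)

bs_t = [np.mean((surv_hat(t) - (event_times > t)) ** 2) for t in grid]
ibs = np.mean(bs_t)                                   # average BS(t) over an evenly spaced grid
print(ibs)
```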

14 citations


Journal ArticleDOI
TL;DR: This paper showed that the Brier rule is sometimes seriously wrong about whether one cognitive state is epistemically better than another, and identified several useful monotonicity principles for epistemic betterness.
Abstract: Measures of epistemic utility are used by formal epistemologists to make determinations of epistemic betterness among cognitive states. The Brier rule is the most popular choice (by far) among formal epistemologists for such a measure. In this paper, however, we show that the Brier rule is sometimes seriously wrong about whether one cognitive state is epistemically better than another. In particular, there are cases where an agent gets evidence that definitively eliminates a false hypothesis (and the probabilities assigned to the other hypotheses stay in the same ratios), but where the Brier rule says that things have become epistemically worse. Along the way to this ‘elimination experiment’ counter-example to the Brier rule as a measure of epistemic utility, we identify several useful monotonicity principles for epistemic betterness. We also reply to several potential objections to this counter-example.
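The authors' own example is not reproduced here, but the kind of case described can be checked numerically: the figures below are a hypothetical instance in which eliminating a false hypothesis, with the surviving credences kept in the same ratio, increases Brier inaccuracy.

```python
# Hypothetical numbers (not the paper's example): eliminating a false
# hypothesis while preserving the ratios of the remaining credences can
# increase Brier inaccuracy when most credence sits on another false hypothesis.
import numpy as np

truth = np.array([1.0, 0.0, 0.0])        # H1 is true, H2 and H3 are false
before = np.array([0.1, 0.1, 0.8])       # credences before the evidence
after = np.array([0.1, 0.0, 0.8])        # evidence eliminates the false H2 ...
after = after / after.sum()              # ... survivors renormalised, ratios unchanged

def brier_inaccuracy(c):
    return np.sum((c - truth) ** 2)

print(brier_inaccuracy(before))          # about 1.46
print(brier_inaccuracy(after))           # about 1.58, i.e. "worse" by the Brier rule
```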

14 citations


Journal ArticleDOI
TL;DR: In this article, an approach to derive probabilistic predictions of local winter storm damage occurrences from a global medium-range ensemble prediction system (EPS) is described. But the approach is subject to large uncertainty due to meteorological forecast uncertainty and uncertainties in modelling weather impacts.
Abstract: This paper describes an approach to derive probabilistic predictions of local winter storm damage occurrences from a global medium-range ensemble prediction system (EPS). Predictions of storm damage occurrences are subject to large uncertainty due to meteorological forecast uncertainty (typically addressed by means of ensemble predictions) and uncertainties in modelling weather impacts. The latter uncertainty arises from the fact that local vulnerabilities are not known in sufficient detail to allow for a deterministic prediction of damages, even if the forecasted gust wind speed contains no uncertainty. Thus, to estimate the damage model uncertainty, a statistical model based on logistic regression analysis is employed, relating meteorological analyses to historical damage records. A quantification of the two individual contributions (meteorological and damage model uncertainty) to the total forecast uncertainty is achieved by neglecting individual uncertainty sources and analysing resulting predictions. Results show an increase in forecast skill measured by means of a reduced Brier score if both meteorological and damage model uncertainties are taken into account. It is demonstrated that skilful predictions on district level (dividing the area of Germany into 439 administrative districts) are possible on lead times of several days. Skill is increased through the application of a proper ensemble calibration method, extending the range of lead times for which skilful damage predictions can be made.
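The notion of skill "measured by means of a reduced Brier score" is conventionally expressed as a Brier skill score against a reference forecast; a minimal sketch with an assumed climatological reference and invented damage occurrences follows.

```python
# Minimal Brier skill score sketch against a climatological reference;
# the forecasts, outcomes and base rate are invented numbers.
import numpy as np

y = np.array([0, 0, 1, 0, 1, 0, 0, 0])                           # damage occurrence per district/day
p = np.array([0.10, 0.05, 0.70, 0.20, 0.50, 0.10, 0.02, 0.15])   # forecast probabilities

bs = np.mean((p - y) ** 2)
p_clim = y.mean()                         # climatological base rate as the reference forecast
bs_ref = np.mean((p_clim - y) ** 2)
bss = 1.0 - bs / bs_ref                   # positive values indicate skill over climatology
print(bs, bs_ref, bss)
```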

Journal ArticleDOI
TL;DR: An extension of an approach for comparing modelling strategies in linear regression to the setting of logistic regression is presented and its application in clinical prediction research is demonstrated.
Abstract: It is often unclear which approach to fit, assess and adjust a model will yield the most accurate prediction model. We present an extension of an approach for comparing modelling strategies in linear regression to the setting of logistic regression and demonstrate its application in clinical prediction research. A framework for comparing logistic regression modelling strategies by their likelihoods was formulated using a wrapper approach. Five different strategies for modelling, including simple shrinkage methods, were compared in four empirical data sets to illustrate the concept of a priori strategy comparison. Simulations were performed in both randomly generated data and empirical data to investigate the influence of data characteristics on strategy performance. We applied the comparison framework in a case study setting. Optimal strategies were selected based on the results of a priori comparisons in a clinical data set and the performance of models built according to each strategy was assessed using the Brier score and calibration plots. The performance of modelling strategies was highly dependent on the characteristics of the development data in both linear and logistic regression settings. A priori comparisons in four empirical data sets found that no strategy consistently outperformed the others. The percentage of times that a model adjustment strategy outperformed a logistic model ranged from 3.9 to 94.9 %, depending on the strategy and data set. However, in our case study setting the a priori selection of optimal methods did not result in detectable improvement in model performance when assessed in an external data set. The performance of prediction modelling strategies is a data-dependent process and can be highly variable between data sets within the same clinical domain. A priori strategy comparison can be used to determine an optimal logistic regression modelling strategy for a given data set before selecting a final modelling approach.

Journal ArticleDOI
TL;DR: The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma.
Abstract: Background: Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance or even with increased performance of the sub-ensemble. The application to the problem of an early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background faces specific challenges. Objectives: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. Methods: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC), and the Brier score on the total data set, in the majority class, and in the minority class of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. Results: In glaucoma classification all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees compared to the classification results obtained with the full ensemble consisting of 1000 trees. In the simulation study, we were able to show that the prevalence of glaucoma is a critical factor and lower prevalence decreases the performance of our pruning strategies. Conclusions: The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma.

Journal ArticleDOI
TL;DR: In this article, the authors investigate the improvements attained through postprocessing the discharge forecasts, using the archived ECMWF reforecasts for precipitation and other necessary meteorological variables, and conclude that it is valuable to apply the postprocessing method during hydrological summer.
Abstract: A hydrological ensemble prediction system is running operationally at the Royal Meteorological Institute of Belgium (RMI) for ten catchments in the Meuse basin. It makes use of the conceptual semi-distributed hydrological model SCHEME and the European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble prediction system (ENS). An ensemble of 51 discharge forecasts is generated daily. We investigate the improvements attained through postprocessing the discharge forecasts, using the archived ECMWF reforecasts for precipitation and other necessary meteorological variables. We use the 5-member reforecasts that have been produced since 2012, when the horizontal resolution of ENS was increased to the N320 resolution (≈30 km over Belgium). The reforecasts were issued weekly, going back 20 years, and we use a calibration window of five weeks. We use these as input to create a set of hydrological reforecasts. The implemented calibration method is an adaptation of the variance inflation method. The parameters of the calibration are estimated based on the hydrological reforecasts and the observed discharge. The postprocessed forecasts are verified based on a two-and-a-half year period of data, using archived 51-member ENS forecasts. The skill is evaluated using summary scores of the ensemble mean and probabilistic scores: the Brier score and the Continuous Ranked Probability Score (CRPS). We find that the variance inflation method gives a significant improvement in probabilistic discharge forecasts. The Brier score, which measures probabilistic skill for forecasts of discharge threshold exceedance, is improved for the entire forecast range during the hydrological summer period, and the first three days during hydrological winter. The CRPS is also significantly improved during summer, but not during winter. We conclude that it is valuable to apply the postprocessing method during hydrological summer. During winter, the method is also useful for forecasting exceedance probabilities of higher thresholds, but not for lead times beyond five days. Finally, we also note the presence of some large outliers in the postprocessed discharge forecasts, arising from the fact that the postprocessing is performed on the logarithmically transformed discharges. We suggest some ways to deal with this in the future for our operational setting.
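To make the threshold-exceedance Brier score concrete: the forecast probability for each day is simply the fraction of ensemble members above the discharge threshold. The sketch below uses randomly generated discharges and an arbitrary threshold, not the Meuse data.

```python
# Brier score for discharge threshold exceedance from an ensemble forecast:
# the probability is the fraction of members above the threshold.
# Randomly generated discharges and an arbitrary threshold, for illustration.
import numpy as np

rng = np.random.default_rng(4)
ens = rng.lognormal(mean=3.0, sigma=0.4, size=(100, 51))   # 100 days x 51 members (m3/s)
obs = rng.lognormal(mean=3.0, sigma=0.5, size=100)         # observed discharge
threshold = 30.0

p_exceed = (ens > threshold).mean(axis=1)     # forecast exceedance probability per day
o_exceed = (obs > threshold).astype(float)    # observed exceedance (0/1)
print(np.mean((p_exceed - o_exceed) ** 2))    # Brier score
```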

Book ChapterDOI
19 Sep 2016
TL;DR: The authors propose using proper scoring rules, a well-known family of evaluation measures for assessing the goodness of probability estimators, to obtain theoretically well-founded evaluation measures for subgroup discovery.
Abstract: Subgroup Discovery is the process of finding and describing sufficiently large subsets of a given population that have unusual distributional characteristics with regard to some target attribute. Such subgroups can be used as a statistical summary which improves on the default summary of stating the overall distribution in the population. A natural way to evaluate such summaries is to quantify the difference between predicted and empirical distribution of the target. In this paper we propose to use proper scoring rules, a well-known family of evaluation measures for assessing the goodness of probability estimators, to obtain theoretically well-founded evaluation measures for subgroup discovery. From this perspective, one subgroup is better than another if it has lower divergence of target probability estimates from the actual labels on average. We demonstrate empirically on both synthetic and real-world data that this leads to higher quality statistical summaries than the existing methods based on measures such as Weighted Relative Accuracy.

Journal Article
TL;DR: The elastic net had higher capability than the other methods for predicting survival time in patients with bladder cancer in the presence of competing risks, based on an additive hazards model.
Abstract: Background: One substantial part of microarray studies is to predict patients’ survival based on their gene expression profile. Variable selection techniques are powerful tools to handle high dimensionality in the analysis of microarray data. However, these techniques have not been investigated in the competing risks setting. This study aimed to investigate the performance of four sparse variable selection methods in estimating the survival time. Methods: The data included 1381 gene expression measurements and clinical information from 301 patients with bladder cancer operated in the years 1987 to 2000 in hospitals in Denmark, Sweden, Spain, France, and England. Four methods, namely the least absolute shrinkage and selection operator, smoothly clipped absolute deviation, the smooth integration of counting and absolute deviation, and the elastic net, were utilized for simultaneous variable selection and estimation under an additive hazards model. The area under the ROC curve, the Brier score and the c-index were used to compare the methods. Results: The median follow-up time for all patients was 47 months. The elastic net approach was indicated to outperform the other methods. The elastic net had the lowest integrated Brier score (0.137±0.07) and the greatest median of the over-time AUC and c-index (0.803±0.06 and 0.779±0.13, respectively). Five of the 19 genes selected by the elastic net were significant (P<0.05) under an additive hazards model. It was indicated that the expression of RTN4, SON, IGF1R and CDC20 decreases the survival time, while the expression of SMARCAD1 increases it. Conclusion: The elastic net had higher capability than the other methods for the prediction of survival time in patients with bladder cancer in the presence of competing risks, based on an additive hazards model.

Book ChapterDOI
04 Aug 2016
TL;DR: An ensemble of k-Nearest Neighbours (kNN) classifiers for class membership probability estimation in the presence of non-informative features in the data is proposed and shows high predictive performance in terms of minimum Brier score on most of the data sets.
Abstract: Combining multiple classifiers can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. This technique can also be used for estimating class membership probabilities. We propose an ensemble of k-Nearest Neighbours (kNN) classifiers for class membership probability estimation in the presence of non-informative features in the data. This is done in two steps. Firstly, we select classifiers based upon their individual performance from a set of base kNN models, each generated on a bootstrap sample using a random feature set from the feature space of the training data. Secondly, a stepwise selection is applied to the selected learners, and those models are added to the ensemble that maximize its predictive performance. We use benchmark data sets with some added non-informative features for the evaluation of our method. Experimental comparison of the proposed method with usual kNN, bagged kNN, random kNN and random forest shows that it leads to high predictive performance in terms of minimum Brier score on most of the data sets. The results are also verified by simulation studies.

17 Sep 2016
TL;DR: This work investigates the idea of integrating trees that are accurate and diverse and utilizes out-of-bag observation as validation sample from the training bootstrap samples to choose the best trees based on their individual performance and then assess these trees for diversity using Brier score.
Abstract: The predictive performance of a random forest ensemble is highly associated with the strength of individual trees and their diversity. An ensemble of a small number of accurate and diverse trees, if prediction accuracy is not compromised, will also reduce computational burden. We investigate the idea of integrating trees that are accurate and diverse. For this purpose, we utilize out-of-bag observations as a validation sample from the training bootstrap samples to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score. Starting from the first best tree, a tree is selected for the final ensemble if its addition to the forest reduces the error of the trees that have already been added. A total of 35 benchmark problems on classification and regression are used to assess the performance of the proposed method and compare it with kNN, tree, random forest, node harvest and support vector machine. We compute unexplained variances and classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly and better results are obtained in most of the cases. For further verification, a simulation study is also given where four tree-style scenarios are considered to generate data sets with several structures.

Journal ArticleDOI
TL;DR: The model has good predictive ability and can be used to support a proactive medicine and stratify the population, plan clinical and preventive activities or identify the potential beneficiaries of specific health promotion projects.
Abstract: OBJECTIVES to develop and validate a predictive model of mortality or emergency hospitalization in all subjects aged 65 years and over. DESIGN cohort study based on 9 different databases linked with each other. SETTING AND PARTICIPANTS the model was developed on the population aged 65 years and over resident at 01.01.2011 for at least two years in the city of Bologna (Emilia-Romagna Region, Northern Italy); 96,000 persons were included. MAIN OUTCOME MEASURES the outcome was defined in case of emergency hospitalization or death during the one-year follow-up and studied with a logistic regression model. The predictive ability of the model was evaluated by using the area under the ROC curve, the Hosmer-Lemeshow test, and the Brier score in the derivation sample (2/3 of the population). These tests were repeated in the validation sample (1/3 of the population) and in the population of Bologna aged 65 years and over on 01.01.2012, after applying the coefficients of the variables obtained in the derivation model. By using the regression coefficients, a frailty index (risk score) was calculated for each subject later categorized in risk classes. RESULTS the model is composed of 28 variables and has good predictive abilities. The area under the ROC curve of the derivation sample is 0.77, the Hosmer-Lemeshow test is not significant, and the Brier score is 0.11. Similar performances are obtained in the other two samples. With increasing risk class, the mean age, number of hospitalizations, emergency room service consultations, and multiple drug prescriptions increase, while the average income decreases. CONCLUSION the model has good predictive ability. The frailty index can be used to support a proactive medicine and stratify the population, plan clinical and preventive activities or identify the potential beneficiaries of specific health promotion projects.

Journal ArticleDOI
TL;DR: Two versions of the Brier score are constructed to investigate the importance of clustering for the frailty survival model and show how the clustering effects and the covariate effects affect the predictive ability of the Frailty model separately.
Abstract: In this article, the Brier score is used to investigate the importance of clustering for the frailty survival model. For this purpose, two versions of the Brier score are constructed, i.e., a “conditional Brier score” and a “marginal Brier score.” Both versions of the Brier score show how the clustering effects and the covariate effects affect the predictive ability of the frailty model separately. Using a Bayesian and a likelihood approach, point estimates and 95% credible/confidence intervals are computed. The estimation properties of both procedures are evaluated in an extensive simulation study for both versions of the Brier score. Further, a validation strategy is developed to calculate an internally validated point estimate and credible/confidence interval. The ensemble of the developments is applied to a dental dataset.

Journal ArticleDOI
TL;DR: In this paper, the probabilistic short-range temperature forecasts over synoptic meteorological stations across Iran using non-homogeneous Gaussian regression (NGR) are dealt with.
Abstract: This paper deals with probabilistic short-range temperature forecasts over synoptic meteorological stations across Iran using non-homogeneous Gaussian regression (NGR). NGR creates a Gaussian forecast probability density function (PDF) from the ensemble output. The mean of the normal predictive PDF is a bias-corrected weighted average of the ensemble members and its variance is a linear function of the raw ensemble variance. The coefficients for the mean and variance are estimated by minimizing the continuous ranked probability score (CRPS) during a training period. The CRPS is a scoring rule for distributional forecasts. In Gneiting et al. (Mon Weather Rev 133:1098–1118, 2005), the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method is used to minimize the CRPS. Since BFGS is a conventional optimization method with its own limitations, we suggest using particle swarm optimization (PSO), a robust meta-heuristic method, to minimize the CRPS. The ensemble prediction system used in this study consists of nine different configurations of the Weather Research and Forecasting model for 48-h forecasts of temperature during autumn and winter 2011 and 2012. The probabilistic forecasts were evaluated using several common verification scores, including the Brier score, the attributes diagram and the rank histogram. Results show that both BFGS and PSO find the optimal solution and show the same evaluation scores, but PSO can do this with a feasible random first guess and much less computational complexity.
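The closed-form CRPS of a Gaussian predictive distribution, the objective minimised in NGR (Gneiting et al. 2005), is easy to state; the sketch below only evaluates that score on invented values and does not reproduce the BFGS or particle swarm minimisation.

```python
# Closed-form CRPS of a Gaussian predictive distribution as used in NGR;
# this evaluates the score only (no BFGS/PSO minimisation), on invented data.
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

y_obs = np.array([12.1, 9.8, 15.3])      # observed temperatures
mu = np.array([11.0, 10.5, 14.0])        # bias-corrected weighted ensemble means
sigma = np.array([1.5, 1.2, 2.0])        # spreads, linear in the raw ensemble variance
print(np.mean(crps_gaussian(y_obs, mu, sigma)))   # mean CRPS over the training sample
```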

Journal ArticleDOI
TL;DR: In this article, the authors explored the validity of verbal probability assessments in a sequential and highly ambiguous task, that is, one in which it is virtually impossible to know or learn about the true probabilities of possible outcomes.
Abstract: The current study explores the validity of verbal probability assessments in a sequential and highly ambiguous task, that is, one in which it is virtually impossible to know or learn about the true probabilities of possible outcomes. Participants observed the pre-defined motion of an unmanned aerial vehicle (UAV), such that the participant’s success depended on the UAV reaching a target sector without being spotted by an opponent UAV. At several points in each trajectory, participants’ task was to evaluate the likelihood of reaching the target successfully. The study utilized a 2 × 2 independent-groups factorial design to examine the effect of probability incentivization (Brier vs none), in which participants receive payment based on the nearness of their predictions to actual outcomes, and informational reviews (present vs absent), in which participants engage in detailed discussion with the experimenter, regarding their assessments in seven previous trials before continuing, on probability assessment. A statistically significant main effect of Brier scoring was found, such that Brier based incentivization improved assessment accuracy. The effect of informational review and the interaction effect were not significant. All groups performed significantly better than random and uninformed performance. Outcomes from this study improve our understanding of the validity of online judgments made by operators of unmanned vehicles in strategic settings. It is concluded that non-expert probability assessments carry important information value even in ambiguous settings and even without incentives, and importantly, are further amenable to incentives and training.

17 May 2016
TL;DR: This work reviews the conceptual formulations and interpretations of the available graphical methods and summary measures for evaluating risk predictor models to provide guidance in the evaluation process that from the model development brings the risk predictor to be used in clinical practice for binary decision rules.
Abstract: The availability of novel biomarkers in several branches of medicine opens room for refining prognosis by adding factors on top of those with an established role. It is accepted that the impact of novel factors should not rely solely on regression coefficients and their significance. This has motivated a fruitful literature over the last decades proposing predictive power measures, such as the Brier score, ROC-based quantities, net benefit and related inference. This work reviews the conceptual formulations and interpretations of the available graphical methods and summary measures for evaluating risk prediction models. The aim is to provide guidance in the evaluation process that, starting from model development, brings the risk predictor into use in practice.

Posted ContentDOI
TL;DR: In this article, the authors developed binary choice models to focus on the decision made by a sample of U.S. households to purchase various non-alcoholic beverages, and evaluated the probabilities generated through those qualitative choice models using an array of techniques: expectation-prediction success tables; the receiver operating characteristic (ROC) curve; the Kullback-Leibler information criterion; calibration; resolution (sorting); the Brier score; and the Yates partition of the Brier score.
Abstract: Using data from the Nielsen HomeScan scanner panel for calendar year 2003, we develop binary choice models to focus on the decision made by a sample of U.S. households to purchase various non-alcoholic beverages. We evaluate the probabilities generated through those qualitative choice models using an array of techniques: expectation-prediction success tables; the receiver operating characteristic (ROC) curve; the Kullback-Leibler information criterion; calibration; resolution (sorting); the Brier score; and the Yates partition of the Brier score. In using expectation-prediction success tables, we paid attention to sensitivity and specificity. Use of a naive 0.50 cut-off to classify probabilities resulted in over- or underestimation of sensitivity and specificity values compared to the use of the market penetration value. The area under the ROC curve is suggested as an alternative to the 0.50 cut-off and to a cut-off at the market penetration level, because this method considers a wide range of cut-off probabilities to arrive at a coherent measure for classifying probabilities. The area under the ROC curve was highest for coffee for within-sample probabilities, while it was highest for the fruit juice model for out-of-sample probabilities. The Kullback-Leibler information criterion, which selects the model with the highest log-likelihood function value observed at out-of-sample observations (OSLLF), shows the "closeness" or deviation of model-generated probabilities from the true data-generating probabilities overall, although this method does not classify probabilities for events that occurred versus those that did not. Again, with respect to the OSLLF value, probabilities associated with the fruit juice model outperform all other beverages. Forecast probabilities for most of the beverage purchases were well calibrated. All resolution graphs were almost flat against a 45-degree perfect resolution line, indicative of the poor sorting power of the choice models. The Brier score was lowest for fruit juices and highest for low-fat milk. According to the calculated Brier score, probability forecasts for fruit juices outperformed other non-alcoholic beverages. Although the Brier score gave an overall indication of the ability of a model to forecast accurately, the components of the Yates decomposition of the Brier score provided a clearer and broader indication of the ability of the model to forecast. Within-sample probabilities generated through the logit model for coffee outperform probabilities generated for other beverages based on the area under the ROC curve, the covariance between probabilities and the outcome index, and the slope of the covariance. Out-of-sample probabilities generated through the logit model for fruit juice perform better than any other beverage category based on the area under the ROC curve, the Brier score, and the OSLLF value. When researchers are confronted with alternative models that issue probability forecasts, the accuracy of the probability forecasts in determining the best model can be measured through a myriad of metrics. Even though traditional measures such as expectation-prediction success tables, calibration and log-likelihood approaches are still used, ROC charts, resolution, the Brier score and the Yates partition of the Brier score are highly recommended for evaluating probabilities generated through alternative models.
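The Yates partition rests on a covariance decomposition that can be verified directly: with population (ddof = 0) moments, the Brier score equals Var(f) + Var(d) + (mean f - mean d)^2 - 2 Cov(f, d). The numbers below are illustrative, and Yates' full partition further splits Var(f); this sketch only checks the covariance identity.

```python
# Covariance decomposition underlying the Yates partition of the Brier score:
# BS = Var(f) + Var(d) + (mean f - mean d)^2 - 2 Cov(f, d), with ddof = 0.
# Illustrative purchase probabilities and outcomes only.
import numpy as np

f = np.array([0.8, 0.3, 0.6, 0.1, 0.9, 0.4])          # forecast purchase probabilities
d = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])          # observed purchases (0/1)

bs = np.mean((f - d) ** 2)
decomp = (np.var(f) + np.var(d) + (f.mean() - d.mean()) ** 2
          - 2 * np.cov(f, d, ddof=0)[0, 1])
print(bs, decomp)                                     # the two values coincide
```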

Posted ContentDOI
01 Feb 2016
TL;DR: Results show that the higher the substantive knowledge, the higher the model's ability to offer a high probability for events that occurred versus a low probability for events that did not occur, and better sorting of probabilities was demonstrated in the model with more substantive knowledge.
Abstract: A clear understanding of "goodness", and of how substantive knowledge contributes to such goodness, is generally absent from the economist's use of probability. Although probability forecasts can be generated either from subjective experts or from data based on prior theory and models, it is more problematic to generate a "good probability forecast" with a crisp understanding of what constitutes "good". Further, it is generally not clear to economists how different conditioning information affects this measure of "good". Heretofore, probability forecasts have been evaluated using the Brier score and its Yates partition. Our work contributes by exploring how different sets of substantive information affect the Brier score and each component of the Yates partition. We explore partitions associated with a set of observational data on beverages and the associated consumer decision to purchase. Probabilities are modeled using discrete choice models. Results show that the higher the substantive knowledge, the higher the model's ability to offer a high probability for events that occurred versus a low probability for events that did not occur. This model also gave rise to a lower Brier score (lower is better) and a higher covariance between the probabilities offered and the events observed. Better sorting of probabilities was demonstrated in the model with more substantive knowledge.

Journal ArticleDOI
TL;DR: In this paper, the verification of the probabilistic rainfall forecast obtained from the National Centre for Medium Range Weather Forecasting (NCMRWF) Global Ensemble Forecast system (NGEFS) for three monsoon seasons, i.e., JJAS 2012, 2013 and 2014, was done based on the Brier Score (BS), reliability diagram, relative operating characteristic (ROC) curve and area under the ROC (AROC).
Abstract: Forecasting rainfall in the tropics is a challenging task further hampered by the uncertainty in the numerical weather prediction models. Ensemble prediction systems (EPSs) provide an efficient way of handling the inherent uncertainty of these models. Verification of forecasts obtained from an EPS is a necessity, to build confidence in using these forecasts. This study deals with the verification of the probabilistic rainfall forecast obtained from the National Centre for Medium Range Weather Forecasting (NCMRWF) Global Ensemble Forecast system (NGEFS) for three monsoon seasons, i.e., JJAS 2012, 2013 and 2014. Verification is done based on the Brier Score (BS) and its components (reliability, resolution and uncertainty), Brier Skill Score (BSS), reliability diagram, relative operating characteristic (ROC) curve and area under the ROC (AROC) curve. Three observation data sets are used (namely, NMSG, CPC-RFE2.0 and TRMM) for verification of forecasts and the statistics are compared. BS values for verification of NGEFS forecasts using NMSG data are the lowest, indicating that the forecasts have a better match with these observations as compared to both TRMM and CPC-RFE2.0. This is further strengthened by lower reliability, higher resolution and BSS values for verification against this data set. The ROC curve shows that lower rainfall amounts have a higher hit rate, which implies that the model has better skill in predicting these rainfall amounts. The reliability plots show that the events with lower probabilities were underforecast and those with higher probabilities were overforecast. From the current study it can be concluded that even though NGEFS is a coarse-resolution EPS, the probabilistic forecast has good skill. This in turn leads to an increased confidence in issuing operational probabilistic forecasts based on NGEFS.
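The reliability, resolution and uncertainty components referred to here come from the standard three-term (Murphy) decomposition of the Brier score, sketched below on synthetic probabilities binned at their distinct values; none of the NGEFS or observational data are used.

```python
# Three-component (Murphy) decomposition: BS = reliability - resolution
# + uncertainty, computed by binning forecasts at their distinct values.
# Synthetic forecasts and observations only.
import numpy as np

rng = np.random.default_rng(5)
p = rng.choice(np.round(np.arange(0, 1.01, 0.1), 1), size=2000)   # ensemble-derived probabilities
y = rng.binomial(1, np.clip(p + rng.normal(0, 0.1, p.size), 0, 1))

obar = y.mean()
rel = res = 0.0
for pk in np.unique(p):                    # one bin per distinct forecast value
    m = p == pk
    ok = y[m].mean()
    rel += m.mean() * (pk - ok) ** 2       # reliability (smaller is better)
    res += m.mean() * (ok - obar) ** 2     # resolution (larger is better)
unc = obar * (1 - obar)
print(np.mean((p - y) ** 2), rel - res + unc)   # identical, since bins match forecast values
```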

Journal ArticleDOI
TL;DR: In this article, the authors propose autocorrelation-robust asymptotic variances of the Brier score and Brier skill score, which are generally applicable in circumstances with weak serial correlation.
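Since only the summary is available here, the sketch below shows a generic Newey-West (Bartlett-kernel) HAC variance for a mean Brier score computed from a serially correlated forecast series; it is an assumption-laden stand-in, not necessarily the estimator proposed by the authors.

```python
# Generic Newey-West (Bartlett-kernel) HAC variance for the mean Brier score
# of a serially correlated series; not necessarily the authors' estimator.
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(300)
p = np.clip(0.3 + 0.1 * np.sin(t / 10) + rng.normal(0, 0.05, t.size), 0, 1)
y = rng.binomial(1, p)
d = (p - y) ** 2                            # per-period Brier contributions

def hac_var_of_mean(d, lags=10):
    x = d - d.mean()
    n = len(x)
    acov = lambda k: np.sum(x[k:] * x[:-k]) / n if k else np.sum(x * x) / n
    s = acov(0) + 2 * sum((1 - k / (lags + 1)) * acov(k) for k in range(1, lags + 1))
    return s / n                            # variance of the sample mean

print(d.mean(), np.sqrt(hac_var_of_mean(d)))   # mean Brier score and its HAC standard error
```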

Proceedings ArticleDOI
01 Jul 2016
TL;DR: An expert system based on production rules to define NPTs is presented, with the purpose of enabling the definition of NPTs by experts with no ranked-nodes-specific knowledge; with it, a practitioner can accurately define NPTs without understanding the concept of ranked nodes.
Abstract: One of the key challenges in constructing a Bayesian network (BN) is defining the node probability tables (NPTs). For large-scale BNs, learning NPTs through elicitation of domain experts' knowledge is infeasible. Previous works proposed solutions to this problem using the concept of ranked nodes; however, they have limited modeling capabilities or rely on BN experts to apply them, reducing their applicability. In this paper, we present an expert system based on production rules to define NPTs, with the purpose of enabling the definition of NPTs by experts with no ranked-nodes-specific knowledge. To create the rules, we elicited data from an expert in ranked nodes. To validate our approach, we executed an experiment with a BN already published in the literature to verify whether, with our approach, a practitioner can achieve the same or a better configuration for the NPTs. We used the Brier score to assess the accuracy of the NPTs and evaluated the results with the Wilcoxon test. All the Wilcoxon tests executed rejected the null hypotheses that the Brier scores for the original NPTs were the same as those for the new NPTs. By using our solution, a practitioner can accurately define NPTs without understanding the concept of ranked nodes.

Journal ArticleDOI
TL;DR: It is observed that the method of minimizing inverse relative entropy seems to work better than (or at least equally well as) its competitors in many situations.
Abstract: We tackle two open questions from Leitgeb and Pettigrew (2010b) regarding what the belief update framework described in that paper mandates as correct responses to two problems. One of them concerns credences in overlapping propositions and is known in the literature as the “simultaneous update problem”. The other is the well known “Judy Benjamin” problem concerning conditional credences. We argue that our results concerning the problems point to deficiencies of the framework. More generally, we observe that the method of minimizing inverse relative entropy seems to work better than (or at least equally well as) its competitors in many situations.