
Showing papers on "Brier score published in 2009"


Journal ArticleDOI
TL;DR: It is demonstrated that resolution and reliability are directly related to forecast attributes that are desirable on grounds independent of the notion of scores; this can be considered an epistemological justification of measuring forecast quality by proper scoring rules.
Abstract: Scoring rules are an important tool for evaluating the performance of probabilistic forecasting schemes. A scoring rule is called strictly proper if its expectation is optimal if and only if the forecast probability represents the true distribution of the target. In the binary case, strictly proper scoring rules allow for a decomposition into terms related to the resolution and the reliability of a forecast. This fact is particularly well known for the Brier Score. In this article, this result is extended to forecasts for finite-valued targets. Both resolution and reliability are shown to have a positive effect on the score. It is demonstrated that resolution and reliability are directly related to forecast attributes that are desirable on grounds independent of the notion of scores. This finding can be considered an epistemological justification of measuring forecast quality by proper scoring rules. A link is provided to the original work of DeGroot and Fienberg, extending their concepts of sufficiency and refinement. The relation to the conjectured sharpness principle of Gneiting et al. is elucidated.
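
As a rough illustration of the decomposition discussed above, the following sketch computes the binary Brier score together with its reliability, resolution and uncertainty components (the classic Murphy decomposition) on synthetic forecasts; the bin count and the arrays `p` and `y` are illustrative choices, not taken from the paper.

```python
import numpy as np

def brier_decomposition(p, y, n_bins=10):
    """Murphy decomposition of the binary Brier score.

    Returns (brier, reliability, resolution, uncertainty) so that
    brier == reliability - resolution + uncertainty (up to binning error).
    p: forecast probabilities in [0, 1]; y: observed outcomes in {0, 1}.
    """
    p, y = np.asarray(p, float), np.asarray(y, float)
    brier = np.mean((p - y) ** 2)
    ybar = y.mean()                      # climatological base rate
    uncertainty = ybar * (1.0 - ybar)

    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                  # fraction of cases in this bin
        p_k = p[mask].mean()             # mean forecast in the bin
        o_k = y[mask].mean()             # observed frequency in the bin
        reliability += w * (p_k - o_k) ** 2
        resolution += w * (o_k - ybar) ** 2
    return brier, reliability, resolution, uncertainty

# Example with synthetic, slightly noisy forecasts.
rng = np.random.default_rng(0)
truth = rng.uniform(size=5000)
y = rng.binomial(1, truth)
p = np.clip(truth + rng.normal(0, 0.1, truth.size), 0, 1)
print(brier_decomposition(p, y))
```

The identity holds exactly only when all forecasts within a bin are equal; otherwise a small within-bin variance term remains.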

91 citations


Journal ArticleDOI
TL;DR: An algorithm for learning optimal nondeterministic hypotheses is derived; the quality of the posterior probabilities (measured by the Brier score) determines the goodness of the nondeterministic predictions.
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in the predictions and the number of classes predicted should be as small as possible, this kind of classifier can be considered an Information Retrieval (IR) procedure. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naive Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we show how the quality of the posterior probabilities (measured by the Brier score) determines the goodness of the nondeterministic predictions.
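
To make the decision step concrete, here is a minimal sketch of how a set-valued prediction can be selected from posterior probabilities, assuming a 1 − F_beta style loss for set predictions; the paper defines its own family of IR loss functions, so this particular loss and the function name `nondeterministic_predict` are illustrative assumptions.

```python
import numpy as np

def nondeterministic_predict(posteriors, beta=1.0):
    """Return the subset of classes with the lowest expected loss.

    Assumed loss: if the true class lies in a predicted set of size k,
    F_beta = (1 + beta**2) / (k + beta**2), otherwise 0; loss = 1 - F_beta.
    Under this loss the optimal set is a prefix of the classes sorted by
    decreasing posterior probability.
    """
    order = np.argsort(posteriors)[::-1]
    p_sorted = np.asarray(posteriors)[order]
    best_k, best_loss = 1, np.inf
    for k in range(1, len(p_sorted) + 1):
        gain = (1 + beta**2) / (k + beta**2)
        expected_loss = 1.0 - gain * p_sorted[:k].sum()
        if expected_loss < best_loss:
            best_k, best_loss = k, expected_loss
    return sorted(order[:best_k].tolist()), best_loss

# A confident posterior yields one class; an ambiguous one yields several.
print(nondeterministic_predict([0.80, 0.15, 0.05]))   # -> single class
print(nondeterministic_predict([0.40, 0.38, 0.22]))   # -> two classes
```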

74 citations


Book ChapterDOI
03 Oct 2009
TL;DR: The converse is established: a strategy that approaches a convex B-set can be derived from the construction of a calibrated strategy. These tools are developed in the framework of a game with partial monitoring to define a notion of internal regret and to construct strategies that have no such regret.
Abstract: A calibrated strategy can be obtained by performing a strategy that has no internal regret in some auxiliary game. Such a strategy can be constructed explicitly with the use of Blackwell's approachability theorem, in another auxiliary game. We establish the converse: a strategy that approaches a convex B-set can be derived from the construction of a calibrated strategy. We develop these tools in the framework of a game with partial monitoring, where players do not observe the actions of their opponents but receive random signals, to define a notion of internal regret and to construct strategies that have no such regret.

35 citations


Posted Content
TL;DR: In this paper, the authors describe several performance measures for risk models and show how they are related; they also propose a new way to identify the individuals who gain most from model expansion and show how to quantify how much they gain by measuring the additional covariates.
Abstract: Interest in targeted disease prevention has stimulated development of models that assign risks to individuals, using their personal covariates. We need to evaluate these models, and to quantify the gains achieved by expanding a model with additional covariates. We describe several performance measures for risk models, and show how they are related. Application of the measures to risk models for hypothetical populations and for postmenopausal US women illustrates several points. First, model performance is constrained by the distribution of true risks in the population. This complicates the comparison of two models if they are applied to populations with different covariate distributions. Second, the Brier Score and the Integrated Discrimination Improvement (IDI) are more useful than the concordance statistic for quantifying the precision gains obtained from model expansion. Finally, these precision gains are apt to be small, although they may be large for some individuals. We propose a new way to identify these individuals, and show how to quantify how much they gain by measuring the additional covariates. Those with the largest gains could be targeted for cost-efficient covariate assessment.
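
A minimal sketch of the comparison described above, on synthetic data: a baseline and an expanded logistic risk model are scored with the Brier score, the concordance statistic (AUC) and the IDI. The data-generating model, the in-sample evaluation and the simple IDI formula are illustrative simplifications, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(1)
n = 4000
x1 = rng.normal(size=n)                 # covariate in the baseline model
x2 = rng.normal(size=n)                 # additional covariate
risk = 1 / (1 + np.exp(-(-1.0 + 0.8 * x1 + 0.8 * x2)))
y = rng.binomial(1, risk)

base = LogisticRegression().fit(x1[:, None], y)
full = LogisticRegression().fit(np.c_[x1, x2], y)
p_base = base.predict_proba(x1[:, None])[:, 1]
p_full = full.predict_proba(np.c_[x1, x2])[:, 1]

def idi(p_new, p_old, y):
    """Integrated Discrimination Improvement: gain in mean risk separation."""
    sep_new = p_new[y == 1].mean() - p_new[y == 0].mean()
    sep_old = p_old[y == 1].mean() - p_old[y == 0].mean()
    return sep_new - sep_old

print("Brier:", brier_score_loss(y, p_base), "->", brier_score_loss(y, p_full))
print("AUC:  ", roc_auc_score(y, p_base), "->", roc_auc_score(y, p_full))
print("IDI:  ", idi(p_full, p_base, y))
```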

22 citations


Journal ArticleDOI
TL;DR: In this article, sea level probability forecasts are generated with the WAQUA/DCSM98 storm surge model driven by ECMWF's Ensemble Prediction System; after calibration, useful probability forecasts can be made for at least 2–5 days ahead.
Abstract: Sea level probability forecasts are generated by means of the WAQUA/DCSM98 storm surge model with input from ECMWF's Ensemble Prediction System. For optimum performance with ECMWF input, the model needs to be recalibrated to overcome a systematic underprediction of higher wind speeds. Moreover, an additional calibration of the ensemble with the aid of Rank Histograms is performed. With the calibration, Brier skill scores show that useful probability forecasts can be made for at least 2–5 days ahead. The system runs in experimental real-time mode and results are available on the Internet for forecasters, who used it for the first time during a storm in March 2007.
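
A rough sketch of the two diagnostics mentioned, on synthetic ensemble data: a rank histogram for ensemble calibration and a Brier skill score (relative to climatology) for a threshold-exceedance event. The ensemble model, threshold and spread are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_members = 2000, 50
truth = rng.normal(size=n_cases)
# Synthetic ensemble: slightly under-dispersive around the truth.
ens = truth[:, None] + rng.normal(0, 0.8, (n_cases, n_members))

# Rank histogram: position of the observation within the sorted ensemble.
ranks = (ens < truth[:, None]).sum(axis=1)
hist = np.bincount(ranks, minlength=n_members + 1)

# Brier (skill) score for the event "level exceeds a threshold".
threshold = 1.0
obs_event = (truth > threshold).astype(float)
prob_fc = (ens > threshold).mean(axis=1)          # ensemble-derived probability
bs = np.mean((prob_fc - obs_event) ** 2)
clim = obs_event.mean()
bs_clim = np.mean((clim - obs_event) ** 2)        # reference: climatology
bss = 1.0 - bs / bs_clim
print("rank histogram:", hist)
print("Brier score:", bs, "Brier skill score:", bss)
```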

20 citations


Journal ArticleDOI
TL;DR: In this paper, the authors applied longitudinal quadratic discriminant analysis and examined various measures, mainly derived from the Brier Score, to assess the biomarker performance in terms of discrimination and calibration.
Abstract: To classify patients as either resistant or non-resistant to HIV therapy based on longitudinal viral load profiles, we applied longitudinal quadratic discriminant analysis and examined various measures, mainly derived from the Brier Score, to assess the biomarker performance in terms of discrimination and calibration. The analysis of the application data revealed an increase in performance by using longer profiles instead of single biomarker measurements. Simulations showed that the selection of mixed models for the estimation of the group-specific discriminant rule parameters should be based on BIC, rather than on the best performance measure. An incorrect model selection can lead to spuriously better or worse performance with regard to misclassification and classification certainty, especially with increasing length of the profiles and for more complex models with random slopes.
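
A simplified sketch of the general idea, assuming each longitudinal profile is summarized by a fitted intercept and slope before quadratic discriminant analysis; the paper's actual discriminant rule uses mixed-model estimates, and all data below are synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, t = 300, np.arange(6)                       # 6 visits per patient
resistant = rng.binomial(1, 0.4, n)
# Hypothetical profiles: resistant patients rebound (positive slope).
slopes = np.where(resistant == 1, 0.3, -0.3) + rng.normal(0, 0.1, n)
profiles = 3.0 + slopes[:, None] * t + rng.normal(0, 0.3, (n, len(t)))

# Per-patient summary features: fitted intercept and slope of the profile.
X = np.array([np.polyfit(t, prof, 1)[::-1] for prof in profiles])
X_tr, X_te, y_tr, y_te = train_test_split(X, resistant, random_state=0)

qda = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)
p_te = qda.predict_proba(X_te)[:, 1]
print("Brier score:", brier_score_loss(y_te, p_te))
```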

15 citations


Journal ArticleDOI
TL;DR: In this article, four alternative approximate bases for the ensemble transform (ET) are obtained by extending the cycling interval to 24, 48, 72 and 96 h, and another alternative basis is obtained by foregoing cycling and instead drawing randomly generated perturbations from an archive.
Abstract: Four alternative approximate bases for the ensemble transform (ET) are obtained by extending the cycling interval to 24, 48, 72 and 96 h. Another alternative basis is obtained by foregoing cycling and instead drawing randomly generated perturbations from an archive. Experiments based upon 16-member global ensembles and a diagonal estimate of analysis-error covariance indicate that the alternative bases are effective at reducing the discrepancy between the ET analysis-perturbation variance and the estimated analysis-error variance. Forecast ensembles associated with the alternative bases maintain considerably more energy in the tropics and subtropics than the forecast ensemble associated with the original basis. Forecast ensembles associated with the alternative bases also outperform the original forecast ensemble in terms of the ensemble forecast-error covariance-matrix eigenvalue spectrum, the relationship between ensemble variance and observed squared error and the Brier score. The performance gains facilitated by the alternative approximate bases are substantial in some instances, especially in the tropics. The randomly sampled basis is superior to the original basis in most respects. Published in 2009 by John Wiley & Sons, Ltd.

12 citations


Book ChapterDOI
01 Jan 2009
TL;DR: A probabilistic epileptic seizure predictor, based on a combination of feature channels derived from the intracranial electroencephalogram by a logistic regression map, and a corresponding method for its statistical evaluation are presented.
Abstract: We present a probabilistic epileptic seizure predictor and a corresponding method for its statistical evaluation. The probabilistic predictor is based on a combination of feature channels, derived from the intracranial electroencephalogram (EEG), by a logistic regression map. The evaluation is done with the Brier score, an established assessment method in meteorology, which quantifies the prediction error. The weights of the logistic regression are learned from the prediction features in a training phase, and the Brier score is assessed in a test phase. A test for significance of the probabilistic predictor, based on seizure time surrogates, is computed. For 3 of 5 patients we obtained significant predictive power with the mean phase coherence, and with the dynamical similarity index we obtained significant results for 2 of the 5 patients. The concept of probabilistic prediction can be a valuable tool for the development of future seizure intervention systems.
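
A schematic sketch of the training/test evaluation loop, assuming hypothetical feature channels and using label-shuffled surrogates as a crude stand-in for the seizure time surrogates used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(4)
n = 1200
# Hypothetical feature channels (e.g. a coherence-like and a similarity-like index).
X = rng.normal(size=(n, 2))
pre_ictal = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 0.5))))

X_tr, X_te = X[:800], X[800:]
y_tr, y_te = pre_ictal[:800], pre_ictal[800:]

model = LogisticRegression().fit(X_tr, y_tr)          # training phase
bs = brier_score_loss(y_te, model.predict_proba(X_te)[:, 1])  # test phase

# Crude significance check: Brier scores under label-shuffled surrogates.
surrogate_bs = []
for _ in range(200):
    y_surr = rng.permutation(y_tr)
    m = LogisticRegression().fit(X_tr, y_surr)
    surrogate_bs.append(brier_score_loss(y_te, m.predict_proba(X_te)[:, 1]))
p_value = np.mean(np.array(surrogate_bs) <= bs)       # fraction at least as good
print("Brier score:", bs, "surrogate p-value:", p_value)
```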

11 citations


Book ChapterDOI
23 Sep 2009
TL;DR: A new calibration method, inspired by binning-based methods, in which the calibrated probabilities are obtained from the k most similar instances in a dataset, is proposed and shown to outperform the most commonly used calibration methods.
Abstract: In this paper we revisit the problem of classifier calibration, motivated by the issue that existing calibration methods ignore the problem attributes (i.e., they are univariate). We propose a new calibration method inspired by binning-based methods, in which the calibrated probabilities are obtained from k instances of a dataset. Bins are constructed by including the k most similar instances, considering not only estimated probabilities but also the original attributes. This method has been tested with respect to two calibration measures, including a comparison with other traditional calibration methods. The results show that the new method outperforms the most commonly used calibration methods.
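
A minimal sketch of a similarity-based binning calibrator in this spirit: the calibrated probability of a test instance is the label frequency among its k most similar calibration instances, where similarity is measured over the original attributes together with the estimated probability. The function name and the unweighted Euclidean neighbourhood are assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def similarity_bin_calibrate(X_cal, p_cal, y_cal, X_new, p_new, k=30):
    """Calibrated probability = empirical label frequency among the k nearest
    calibration instances in the joint (attributes + estimated probability)
    space.  y_cal is assumed to be a NumPy array of 0/1 labels."""
    Z_cal = np.c_[X_cal, p_cal]          # append the estimated probability
    Z_new = np.c_[X_new, p_new]          # as one more feature dimension
    nn = NearestNeighbors(n_neighbors=k).fit(Z_cal)
    _, idx = nn.kneighbors(Z_new)
    return y_cal[idx].mean(axis=1)       # mean label of the neighbours

# Usage (with hypothetical arrays and a fitted classifier clf):
# p_test_cal = similarity_bin_calibrate(
#     X_cal, clf.predict_proba(X_cal)[:, 1], y_cal,
#     X_test, clf.predict_proba(X_test)[:, 1])
```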

10 citations


Journal ArticleDOI
TL;DR: This work investigates different statistical measures and recommends a strategy based on the Brier Score, a measure of prediction inaccuracy on individual survival, which is flexible and easily applied with common statistical software.
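
For concreteness, a sketch of the Brier score for individual survival probabilities at a fixed horizon, with censoring handled by inverse-probability-of-censoring weights in the style of Graf et al.; `surv_prob_t` denotes a model's predicted survival probability at `t_star` for each subject, and the helper names are illustrative.

```python
import numpy as np

def km_censoring(time, event):
    """Kaplan-Meier estimate of the censoring survival function G(t)."""
    time, cens = np.asarray(time, float), 1 - np.asarray(event, int)
    steps, s = [], 1.0
    for u in np.unique(time[cens == 1]):
        at_risk = np.sum(time >= u)
        d = np.sum((time == u) & (cens == 1))
        s *= 1.0 - d / at_risk
        steps.append((u, s))
    def G(t):
        vals = [sv for (u, sv) in steps if u <= t]
        return vals[-1] if vals else 1.0
    return G

def survival_brier(time, event, surv_prob_t, t_star):
    """IPCW Brier score at horizon t_star (Graf et al. style).
    event: 1 = event observed, 0 = censored; surv_prob_t: predicted P(T > t_star)."""
    G = km_censoring(time, event)
    total = 0.0
    for Ti, di, Si in zip(time, event, surv_prob_t):
        if Ti <= t_star and di == 1:        # event occurred before the horizon
            total += (0.0 - Si) ** 2 / max(G(Ti), 1e-12)
        elif Ti > t_star:                   # known to survive past the horizon
            total += (1.0 - Si) ** 2 / max(G(t_star), 1e-12)
        # subjects censored before t_star get weight zero
    return total / len(time)

# Example with hypothetical data: predicted survival of 0.7 at t_star = 5.
print(survival_brier(time=[2, 4, 6, 8, 10], event=[1, 0, 1, 0, 1],
                     surv_prob_t=[0.7] * 5, t_star=5.0))
```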

4 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: This work studies the impact of using dynamic information as features in a machine learning algorithm for the prediction task of classifying critically ill patients in two classes according to the time they need to reach a stable state after coronary bypass surgery: less or more than nine hours.
Abstract: This work studies the impact of using dynamic information as features in a machine learning algorithm for the prediction task of classifying critically ill patients in two classes according to the time they need to reach a stable state after coronary bypass surgery: less or more than nine hours. On the basis of five physiological variables different dynamic features were extracted. These sets of features served subsequently as inputs for a Gaussian process and the prediction results were compared with the case where only admission data was used for the classification. The dynamic features, especially the cepstral coefficients (aROC: 0.749, Brier score: 0.206), resulted in higher performances when compared to static admission data (aROC: 0.547, Brier score: 0.247). In all cases, the Gaussian process classifier outperformed logistic regression.
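
A rough sketch of the comparison reported above, on synthetic data: a Gaussian process classifier is trained on a static feature set and on an augmented static-plus-dynamic set, and both are scored with the area under the ROC curve and the Brier score. Feature dimensions and the data-generating model are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 400
static = rng.normal(size=(n, 3))            # admission data (weakly informative)
dynamic = rng.normal(size=(n, 5))           # e.g. cepstral-type features
y = rng.binomial(1, 1 / (1 + np.exp(-(dynamic[:, 0] + 0.2 * static[:, 0]))))

for name, X in [("static", static), ("static+dynamic", np.c_[static, dynamic])]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    gp = GaussianProcessClassifier().fit(X_tr, y_tr)
    p = gp.predict_proba(X_te)[:, 1]
    print(name, "aROC:", round(roc_auc_score(y_te, p), 3),
          "Brier:", round(brier_score_loss(y_te, p), 3))
```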

Journal Article
TL;DR: In this article, experiments with 15 ensemble members using different model physical process parameterization schemes and identical initial values are performed for the July 2003 rainy season, and multi-model short-range ensemble precipitation probability forecasts are made by means of "Average", "Correlation" and "Rank" methods. Results indicate that all three methods give an accurate estimation of the centre and region of the precipitation, and that "Rank" is superior to "Average" and "Correlation" in forecasting the areas, intensity and boundary of precipitation, whereas the other two expand improper areas.
Abstract: Experiments with 15 ensemble members are performed using the AREM, MM5 and WRF models with different model physical process parameterization schemes and identical initial values for the rainy season in July 2003, and multi-model short-range ensemble precipitation probability forecasts are made by means of the "Average", "Correlation" and "Rank" methods. Results indicate that the ensemble precipitation probability forecasts made by the three methods can all give an accurate estimation of the centre and region of the precipitation, and that "Rank" is superior to "Average" and "Correlation", performing better in forecasting the areas, intensity and boundary of precipitation, whereas the other two expand improper areas. Evaluation results of the ranked probability score (RPS), Brier score (BS) and relative operating characteristic (ROC) show that there is little difference between the "Rank" results and those of the other two methods for a given critical grade of precipitation, but for the synthetic effect of a given day in the rainy season the "Rank" method clearly surpasses the "Average" and "Correlation" methods. The higher RPS and per-grade BS for the heavier and more widespread precipitation cases show that multi-model short-range ensemble precipitation probability forecasts remain a challenge.
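
A minimal sketch of the ranked probability score mentioned above, for ordered precipitation categories; the category probabilities would in practice come from the multi-model ensemble, but the arrays below are illustrative.

```python
import numpy as np

def ranked_probability_score(prob_forecasts, obs_category):
    """RPS for ordered categories: mean squared difference between the
    cumulative forecast distribution and the cumulative observation.
    prob_forecasts: (n_cases, n_categories) probabilities summing to 1.
    obs_category: (n_cases,) index of the observed category."""
    prob_forecasts = np.asarray(prob_forecasts, float)
    n_cases, n_cat = prob_forecasts.shape
    obs = np.zeros_like(prob_forecasts)
    obs[np.arange(n_cases), obs_category] = 1.0
    cum_f = np.cumsum(prob_forecasts, axis=1)
    cum_o = np.cumsum(obs, axis=1)
    return np.mean(np.sum((cum_f - cum_o) ** 2, axis=1))

# Example: three precipitation grades (none / moderate / heavy).
probs = [[0.6, 0.3, 0.1],
         [0.1, 0.5, 0.4]]
print(ranked_probability_score(probs, [0, 2]))
```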

01 Jan 2009
TL;DR: In the paper, a comparative analysis of ensembles of dipolar neural networks and regression trees, both based on the dipolar criterion function, is conducted; the methods return an aggregated Kaplan-Meier survival function.
Abstract: In the paper, a comparative analysis of ensembles of dipolar neural networks and regression trees is conducted. The techniques are based on the dipolar criterion function. Appropriate formation of dipoles (pairs of feature vectors) allows using them for the analysis of censored survival data. As a result, the methods return an aggregated Kaplan-Meier survival function. The results obtained by the neural network and regression tree based ensembles are compared using the Brier score and direct and indirect measures of predictive accuracy.
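
A simplified sketch of the aggregation idea: an ensemble is built on bootstrap samples of censored survival data and the members' Kaplan-Meier curves are averaged on a common time grid. Plain Kaplan-Meier estimates stand in for the dipolar neural networks and regression trees of the paper; `time` and `event` are assumed to be NumPy arrays.

```python
import numpy as np

def km_curve(time, event, grid):
    """Kaplan-Meier survival probabilities evaluated on a time grid."""
    s, steps = 1.0, []
    for u in np.unique(time[event == 1]):
        at_risk = np.sum(time >= u)
        d = np.sum((time == u) & (event == 1))
        s *= 1.0 - d / at_risk
        steps.append((u, s))
    out = np.ones_like(grid, dtype=float)
    for u, s in steps:                    # step down at each event time
        out[grid >= u] = s
    return out

def ensemble_km(time, event, grid, n_members=50, seed=0):
    """Aggregated survival function: pointwise mean of Kaplan-Meier curves
    fitted on bootstrap samples of the data."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_members):
        idx = rng.integers(0, len(time), len(time))   # bootstrap resample
        curves.append(km_curve(time[idx], event[idx], grid))
    return np.mean(curves, axis=0)

# Example with hypothetical censored data.
time = np.array([3.0, 5.0, 6.0, 6.0, 8.0, 9.0, 12.0])
event = np.array([1, 0, 1, 1, 0, 1, 0])
grid = np.linspace(0, 12, 25)
print(ensemble_km(time, event, grid))
```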

Journal Article
TL;DR: The experimental results show that the optimal feature subset derived enhances the predictive power of a classifier and reduces false positive and false negative rates as measured by the sensitivity and specificity of the classifier.
Abstract: Advancement in data mining and machine learning has promoted computer-based approaches such as computer-aided diagnosis, expert systems and prognostic studies in medical applications. Medical data are processed and analyzed using data mining techniques to derive useful knowledge. These data are multidimensional and represented by a large number of features. The irrelevant and redundant features among them may negatively impact the performance of data mining algorithms. Feature selection identifies the features that improve the predictive accuracy of the classifiers. The proposed work focuses on identifying the significant features that influence the predictive accuracy of the Naive Bayes classifier using the visualization tool Nomogram. The effect of each feature on the performance of the classifier is analyzed using the nomogram, and an optimal feature subset that enhances the predictive accuracy is derived. The proposed method, Nomogram-RFE, is experimented with on the Pima Indians Diabetes Dataset, and the performance of the classifier is evaluated on five criteria: classification accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve and the Brier score. The experimental results show that the optimal feature subset derived enhances the predictive power of the classifier and reduces false positive and false negative rates as measured by the sensitivity and specificity of the classifier. A low Brier score for the optimal feature subset indicates a lower deviation between the predicted probability and the actual outcome.
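
A rough sketch of the evaluation protocol: Gaussian Naive Bayes is fitted on a full and on a reduced feature set and scored on accuracy, sensitivity, specificity, AUC and Brier score. The nomogram-based ranking of the paper is replaced here by a simple univariate filter, and a scikit-learn dataset stands in for the Pima Indians Diabetes Dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer        # stand-in dataset
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import (accuracy_score, brier_score_loss,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

def evaluate(X_tr_sub, X_te_sub):
    """Fit Naive Bayes on a feature subset and report the five criteria."""
    nb = GaussianNB().fit(X_tr_sub, y_tr)
    p = nb.predict_proba(X_te_sub)[:, 1]
    pred = (p >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    return {"accuracy": accuracy_score(y_te, pred),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "auc": roc_auc_score(y_te, p),
            "brier": brier_score_loss(y_te, p)}

selector = SelectKBest(f_classif, k=8).fit(X_tr, y_tr)   # simple univariate filter
print("all features:   ", evaluate(X_tr, X_te))
print("selected subset:", evaluate(selector.transform(X_tr),
                                   selector.transform(X_te)))
```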