
Showing papers on "Brier score published in 2008"


Journal ArticleDOI
TL;DR: A systematic review of the modern way of assessing risk prediction models using methods derived from ROC methodology and from probability forecasting theory to compare measures of predictive performance.
Abstract: For medical decision making and patient information, predictions of future status variables play an important role. Risk prediction models can be derived with many different statistical approaches. To compare them, measures of predictive performance are derived from ROC methodology and from probability forecasting theory. These tools can be applied to assess single markers, multivariable regression models and complex model selection algorithms. This article provides a systematic review of the modern way of assessing risk prediction models. Particular attention is put on proper benchmarks and resampling techniques that are important for the interpretation of measured performance. All methods are illustrated with data from a clinical study in head and neck cancer patients.

249 citations
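To make the benchmarking idea concrete, here is a minimal Python sketch (not from the paper) that compares the cross-validated Brier score of a logistic risk model against the trivial prevalence forecast used as a benchmark; the synthetic data, scikit-learn model, and fold settings are all illustrative assumptions.

```python
# Minimal sketch: cross-validated Brier score of a risk model versus the
# "null" benchmark that always forecasts the event prevalence.
# Synthetic data stands in for the clinical study used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
p_model = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                            cv=cv, method="predict_proba")[:, 1]

brier_model = np.mean((p_model - y) ** 2)
brier_null = np.mean((y.mean() - y) ** 2)        # prevalence benchmark
print(f"model Brier score: {brier_model:.3f}")
print(f"null  Brier score: {brier_null:.3f}")
print(f"Brier skill score: {1 - brier_model / brier_null:.3f}")
```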


Journal ArticleDOI
TL;DR: Four recent papers have investigated the effects of ensemble size on the Brier score (BS) and discrete ranked probability score (RPS) attained by ensemble-based probabilistic forecasts; expressions, explanations and estimators for these effects are obtained.
Abstract: Four recent papers have investigated the effects of ensemble size on the Brier score (BS) and discrete ranked probability score (RPS) attained by ensemble-based probabilistic forecasts. The connections between these papers are described and their results are generalized. In particular, expressions, explanations and estimators for the expected effect of ensemble size on the RPS and continuous ranked probability score (CRPS) are obtained. Copyright © 2008 Royal Meteorological Society

120 citations
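The effect analysed in this paper can be reproduced numerically. The sketch below, under the assumption of a reliable Gaussian ensemble and a fixed event threshold (both invented for illustration), shows that the Brier score of an m-member ensemble exceeds the infinite-ensemble value by roughly the mean of p(1−p)/m.

```python
# Monte Carlo sketch of how finite ensemble size inflates the Brier score.
# Each forecast is a reliable m-member sample from the true distribution;
# the event is "value exceeds a fixed threshold".
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_cases, threshold = 20000, 0.5

mu = rng.normal(0.0, 1.0, n_cases)                 # case-to-case predictability
obs = (rng.normal(mu, 1.0) > threshold).astype(float)
p_true = norm.sf(threshold, loc=mu, scale=1.0)     # infinite-ensemble probability

bs_inf = np.mean((p_true - obs) ** 2)
print(f"infinite-ensemble BS: {bs_inf:.4f}")
for m in (5, 10, 20, 50, 100):
    members = rng.normal(mu[:, None], 1.0, (n_cases, m))
    p_hat = (members > threshold).mean(axis=1)     # fraction of members above threshold
    bs_m = np.mean((p_hat - obs) ** 2)
    # Expected inflation for a reliable ensemble: mean(p * (1 - p)) / m
    theory = np.mean(p_true * (1 - p_true)) / m
    print(f"m={m:3d}  BS={bs_m:.4f}  excess={bs_m - bs_inf:.4f}  ~theory={theory:.4f}")
```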


Journal ArticleDOI
TL;DR: In this paper, it is shown that two within-bin components are needed in addition to the three traditional components of the Brier score, and that these can be combined with the resolution component to make a generalized resolution component that is less sensitive to choice of bin width than is the traditional resolution component.
Abstract: The Brier score is widely used for the verification of probability forecasts. It also forms the basis of other frequently used probability scores such as the rank probability score. By conditioning (stratifying) on the issued forecast probabilities, the Brier score can be decomposed into the sum of three components: uncertainty, reliability, and resolution. This Brier score decomposition can provide useful information to the forecast provider about how the forecasts can be improved. Rather than stratify on all values of issued probability, it is common practice to calculate the Brier score components by first partitioning the issued probabilities into a small set of bins. This note shows that for such a procedure, two within-bin components are needed in addition to the three traditional components of the Brier score. The two new components can be combined with the resolution component to make a generalized resolution component that is less sensitive to choice of bin width than is the traditional resolution component.

78 citations
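For reference, a short sketch of the standard binned three-component decomposition that the note extends; the two within-bin terms introduced in the paper are not reproduced, and the synthetic forecasts and bin count are arbitrary choices.

```python
# Sketch of the standard binned Brier score decomposition:
#   BS ≈ reliability - resolution + uncertainty
# (exact only when all forecasts within a bin are identical; the paper's
#  extra within-bin terms account for the remaining difference).
import numpy as np

def brier_decomposition(p, o, n_bins=10):
    p, o = np.asarray(p, float), np.asarray(o, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    obar = o.mean()
    rel = res = 0.0
    for k in range(n_bins):
        mask = idx == k
        if not mask.any():
            continue
        w = mask.mean()                 # fraction of forecasts in bin k
        pk, ok = p[mask].mean(), o[mask].mean()
        rel += w * (pk - ok) ** 2       # reliability (want small)
        res += w * (ok - obar) ** 2     # resolution (want large)
    unc = obar * (1.0 - obar)
    return rel, res, unc

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5000)
o = rng.binomial(1, p)                  # perfectly reliable toy forecasts
rel, res, unc = brier_decomposition(p, o)
bs = np.mean((p - o) ** 2)
print(f"BS={bs:.4f}  rel={rel:.4f}  res={res:.4f}  unc={unc:.4f}")
print(f"rel - res + unc = {rel - res + unc:.4f}  (gap = within-bin terms)")
```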


Journal ArticleDOI
TL;DR: Adverse events derived from bedside-monitored data are important intermediate outcomes that contribute to a timely recognition of organ dysfunction and failure during the ICU length of stay, thus making room for the development of intelligent clinical alarm monitoring.

76 citations


Journal ArticleDOI
TL;DR: In this paper, the Brier score and Brier skill score are used for verification of forecast accuracy and skill using sampling theory, and analytical expressions are derived to estimate their sampling uncertainties.
Abstract: For probability forecasts, the Brier score and Brier skill score are commonly used verification measures of forecast accuracy and skill. Using sampling theory, analytical expressions are derived to estimate their sampling uncertainties. The Brier score is an unbiased estimator of the accuracy, and an exact expression defines its sampling variance. The Brier skill score (with climatology as a reference forecast) is a biased estimator, and approximations are needed to estimate its bias and sampling variance. The uncertainty estimators depend only on the moments of the forecasts and observations, so it is easy to routinely compute them at the same time as the Brier score and skill score. The resulting uncertainty estimates can be used to construct error bars or confidence intervals for the verification measures, or to perform hypothesis testing. Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information.

73 citations
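A rough illustration of the idea: because the Brier score is a sample mean of squared errors, its sampling variance can be estimated from the sample variance of those squared errors. The sketch below uses this simple moment-based interval alongside a bootstrap interval; it is not the paper's exact estimator.

```python
# Approximate sampling uncertainty of the Brier score: the score is the mean
# of d_i = (p_i - o_i)^2, so var(BS) ≈ var(d_i) / n. A bootstrap interval is
# shown alongside as a cross-check.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
p = rng.uniform(0, 1, n)
o = rng.binomial(1, p)

d = (p - o) ** 2
bs = d.mean()
se = d.std(ddof=1) / np.sqrt(n)
print(f"Brier score = {bs:.4f} +/- {1.96 * se:.4f} (normal approx, 95%)")

boot = np.array([rng.choice(d, n, replace=True).mean() for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"bootstrap 95% CI: [{lo:.4f}, {hi:.4f}]")
```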


Journal ArticleDOI
TL;DR: It is demonstrated in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios, which translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data.
Abstract: The bootstrap is a tool that allows for efficient evaluation of prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because there the number of observations is often limited. For avoiding overoptimism the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner it would be used on new data. This includes a selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as alternatively using a fixed level of complexity or of using sampling without replacement are investigated and it is shown that the latter works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and censored time-to-event data with .632+ prediction error curve estimates. The latter, with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.

70 citations
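A schematic sketch of the remedy found to work well: draw resamples without replacement (here of size 0.632·n), redo the complexity selection inside every resample, and score on the left-out cases. The gradient-boosting model, the grid of boosting steps, and the dataset are stand-ins, not the paper's componentwise boosting setup or its .632+ estimator.

```python
# Sketch: complexity (number of boosting iterations) is re-selected inside
# every resample, and performance is scored on the left-out cases.
# Subsamples without replacement (size 0.632*n) replace the usual bootstrap.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
n = len(y)
scores = []
for _ in range(20):                                   # resampling replications
    train = rng.choice(n, size=int(round(0.632 * n)), replace=False)
    test = np.setdiff1d(np.arange(n), train)
    # complexity selection happens inside the training subsample only
    gs = GridSearchCV(GradientBoostingClassifier(learning_rate=0.1, random_state=0),
                      {"n_estimators": [25, 50, 100, 200]},
                      cv=3, scoring="neg_brier_score")
    gs.fit(X[train], y[train])
    p = gs.predict_proba(X[test])[:, 1]
    scores.append(np.mean((p - y[test]) ** 2))        # out-of-sample Brier score
print(f"subsampling estimate of the Brier score: {np.mean(scores):.3f}")
```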


Journal ArticleDOI
TL;DR: Contingency analyses on categorical forecasts show that the proposed multimodel combination technique reduces the average Brier score and the total number of false alarms, resulting in improved reliability of forecasts, and adding climatological ensembles improves the multimodel performance, resulting in reduced average RPS.
Abstract: A new approach for developing multimodel streamflow forecasts is presented. The methodology combines streamflow forecasts from individual models by evaluating their skill, represented by rank probability score (RPS), contingent on the predictor state. Using average RPS estimated over the chosen neighbors in the predictor state space, the methodology assigns higher weights for a model that has better predictability under similar predictor conditions. We assess the performance of the proposed algorithm by developing multimodel streamflow forecasts for Falls Lake Reservoir in the Neuse River Basin, North Carolina (NC), by combining streamflow forecasts developed from two low-dimensional statistical models that use sea-surface temperature conditions as underlying predictors. To evaluate the proposed scheme thoroughly, we consider a total of seven multimodels that include existing multimodel combination techniques such as combining based on long-term predictability of individual models and by simple pooling of ensembles. Detailed nonparametric hypothesis tests comparing the performance of seven multimodels with two individual models show that the reduction in RPS from multimodel forecasts developed using the proposed algorithm is statistically significant relative to the RPSs of individual models and to the RPSs of existing multimodel techniques. The study also shows that adding climatological ensembles improves the multimodel performance, resulting in reduced average RPS. Contingency analyses on categorical (tercile) forecasts show that the proposed multimodel combination technique reduces average Brier score and total number of false alarms, resulting in improved reliability of forecasts. However, adding multiple models with climatology also increases the number of missed targets (in comparison to individual models' forecasts) which primarily results from the reduction of increased resolution that is exhibited in the individual models' forecasts under various forecast probabilities.

56 citations
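The weighting idea can be sketched in a few lines: for each forecast, average each model's RPS over the K nearest historical predictor states and weight the models inversely to that local RPS. Everything below (data, neighbour metric, K, the inverse-RPS weighting) is an illustrative assumption, not the paper's exact scheme.

```python
# Sketch of predictor-state-conditioned multimodel combination:
# weights ∝ 1 / (average RPS of each model over the K nearest predictor states).
import numpy as np

def rps(cum_fcst, cum_obs):
    """Ranked probability score from cumulative category probabilities."""
    return np.sum((cum_fcst - cum_obs) ** 2, axis=-1)

rng = np.random.default_rng(3)
n_hist, n_cat, K = 300, 3, 30

# Hypothetical historical archive: predictor state, tercile forecasts from
# two models, and the observed category (one-hot, then cumulative).
x_hist = rng.normal(size=n_hist)
fcst = {m: rng.dirichlet(np.ones(n_cat), n_hist) for m in ("modelA", "modelB")}
obs_cat = rng.integers(0, n_cat, n_hist)
cum_obs = np.cumsum(np.eye(n_cat)[obs_cat], axis=1)

def combine(x_now, fcst_now):
    """Weight current forecasts by local skill around predictor state x_now."""
    nbrs = np.argsort(np.abs(x_hist - x_now))[:K]
    weights = {}
    for m, f in fcst.items():
        local_rps = rps(np.cumsum(f[nbrs], axis=1), cum_obs[nbrs]).mean()
        weights[m] = 1.0 / max(local_rps, 1e-9)
    total = sum(weights.values())
    return sum(weights[m] / total * fcst_now[m] for m in fcst_now)

p_combined = combine(0.4, {"modelA": np.array([0.5, 0.3, 0.2]),
                           "modelB": np.array([0.2, 0.5, 0.3])})
print("combined tercile probabilities:", np.round(p_combined, 3))
```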


Journal ArticleDOI
TL;DR: In this article, two methods are considered for taking observation errors into account: a perturbed-ensemble method, in which the predicted ensemble elements are randomly perturbed in a way that is consistent with the assumed observation error, and a new observational-probability method.
Abstract: Ensemble prediction systems (EPSs) are usually validated under the assumption that the verifying observations are exact. In this paper, two methods are considered for taking observation errors into account. In the ‘perturbed-ensemble’ method, which has already been studied by other authors, the predicted ensemble elements are randomly perturbed in a way that is consistent with the assumed observation error. In the ‘observational-probability’ method, which is new, a verifying observation is considered as defining, together with the assumed associated error, a probability distribution. All standard scores for evaluation of EPSs (reliability diagram, Brier score, ranked probability score (RPS), continuous RPS (CRPS), relative-operating-characteristics (ROC) curve area), with the exception of the rank histogram, remain defined in this second method. In particular, the classical reliability–resolution decomposition of the Brier score, and of its extension to the RPS and CRPS, remain defined. Numerical simulations, partially supported by theoretical considerations, show that, with respect to the case when observation errors are ignored, the perturbed-ensemble method improves reliability, as well as the ROC score, while it has no significant impact on resolution, as measured by the Brier score. The observational-probability method, on the other hand, degrades reliability and the ROC score, but improves resolution. With respect to the ‘real’ performance of the system (i.e. the one that would be diagnosed if no error were present), reliability is unchanged in the perturbed-ensemble method, while resolution and the ROC score are degraded. The observational-probability method degrades reliability and the ROC score. As for resolution, an optimum value of the observational error is found, below which resolution is improved. Diagnostics performed on the operational EPS of the Canadian Meteorological Centre confirm the results of the simulations as to the consequences of ignoring observation errors, or on the contrary of taking them into account through either of the two methods. The significance of those various results is discussed. This article replaces a previously published version (Q. J. R. Meteorol. Soc.134(631): 509–521, DOI: 10.1002/qj.221). Copyright © 2008 Royal Meteorological Society

48 citations
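A toy version of the perturbed-ensemble method: ensemble members are perturbed with noise drawn from the assumed observation-error distribution before event probabilities are computed. The Gaussian error model, ensemble size, and threshold are invented for illustration.

```python
# Sketch: verification with and without accounting for observation error
# via the perturbed-ensemble method (perturb members with the assumed
# observation-error distribution before computing event probabilities).
import numpy as np

rng = np.random.default_rng(4)
n_cases, n_members, obs_sigma, threshold = 5000, 20, 0.3, 1.0

truth = rng.normal(0.0, 1.0, n_cases)
ensemble = truth[:, None] + rng.normal(0.0, 1.0, (n_cases, n_members))
observed = truth + rng.normal(0.0, obs_sigma, n_cases)   # imperfect observations
event = (observed > threshold).astype(float)

p_raw = (ensemble > threshold).mean(axis=1)
perturbed = ensemble + rng.normal(0.0, obs_sigma, ensemble.shape)
p_pert = (perturbed > threshold).mean(axis=1)

for name, p in (("raw ensemble", p_raw), ("perturbed ensemble", p_pert)):
    print(f"{name:19s} Brier score: {np.mean((p - event) ** 2):.4f}")
```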


Proceedings ArticleDOI
11 Dec 2008
TL;DR: The experiment shows that random forests of PETs calibrated by the novel method significantly outperform uncalibrated random forests of both PETs and classification trees, as well as random forests calibrated with the two standard methods, with respect to the squared error of predicted class probabilities.
Abstract: When using the output of classifiers to calculate the expected utility of different alternatives in decision situations, the correctness of predicted class probabilities may be of crucial importance. However, even very accurate classifiers may output class probabilities of rather poor quality. One way of overcoming this problem is by means of calibration, i.e., mapping the original class probabilities to more accurate ones. Previous studies have however indicated that random forests are difficult to calibrate by standard calibration methods. In this work, a novel calibration method is introduced, which is based on a recent finding that probabilities predicted by forests of classification trees have a lower squared error compared to those predicted by forests of probability estimation trees (PETs). The novel calibration method is compared to the two standard methods, Platt scaling and isotonic regression, on 34 datasets from the UCI repository. The experiment shows that random forests of PETs calibrated by the novel method significantly outperform uncalibrated random forests of both PETs and classification trees, as well as random forests calibrated with the two standard methods, with respect to the squared error of predicted class probabilities.

48 citations
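For context, a sketch of the two standard baselines mentioned in the abstract, Platt scaling and isotonic regression, applied to random forest probabilities via scikit-learn; the paper's own PET-based calibration method is not reproduced here, and the dataset is synthetic.

```python
# Sketch: calibrating random forest class probabilities with Platt scaling
# (sigmoid) and isotonic regression, scored by the squared (Brier) error.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {"uncalibrated": RandomForestClassifier(n_estimators=200,
                                                  random_state=0).fit(X_tr, y_tr)}
for method in ("sigmoid", "isotonic"):                 # Platt scaling / isotonic
    cal = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200,
                                                        random_state=0),
                                 method=method, cv=5)
    results[method] = cal.fit(X_tr, y_tr)

for name, model in results.items():
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name:13s} Brier score: {brier_score_loss(y_te, p):.4f}")
```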


Journal ArticleDOI
TL;DR: Predictive models that also incorporate univariate patterns of the six individual organ systems underlying the sequential organ-failure assessment score improve the quality of predictions in terms of both discrimination and calibration and enhance the interpretability of models.

25 citations


Journal ArticleDOI
TL;DR: In this article, it is argued that these errors can become quite substantial if individual sample points have too large influence on the estimate, which can be avoided by using regularization techniques.
Abstract: Studies on forecast evaluation often rely on estimating limiting observed frequencies conditioned on specific forecast probabilities (the reliability diagram or calibration function). Obviously, statistical estimates of the calibration function are based on only limited amounts of data and therefore contain residual errors. Although errors and variations of calibration function estimates have been studied previously, either they are often assumed to be small or unimportant, or they are ignored altogether. It is demonstrated how these errors can be described in terms of bias and variance, two concepts well known in the statistics literature. Bias and variance adversely affect estimates of the reliability and sharpness terms of the Brier score, recalibration of forecasts, and the assessment of forecast reliability through reliability diagram plots. Ways to communicate and appreciate these errors are presented. It is argued that these errors can become quite substantial if individual sample points have too large influence on the estimate, which can be avoided by using regularization techniques. As an illustration, it is discussed how to choose an appropriate bin size in the binning and counting method, and an appropriate bandwidth parameter for kernel estimates.
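A minimal sketch of the two estimators discussed at the end of the abstract: the binning-and-counting estimate of the calibration function and a kernel-smoothed alternative whose bandwidth plays the role of the regularization parameter. Bin width, bandwidth, and the miscalibrated toy forecasts are arbitrary choices.

```python
# Sketch: estimating the calibration (reliability) function P(o=1 | forecast p)
# by binning-and-counting and by Nadaraya-Watson kernel smoothing.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
p = rng.uniform(0, 1, n)
o = rng.binomial(1, p ** 1.3)            # slightly miscalibrated toy forecasts

grid = np.linspace(0.05, 0.95, 10)

# Binning and counting (bin width 0.1)
edges = np.linspace(0, 1, 11)
idx = np.clip(np.digitize(p, edges[1:-1]), 0, 9)
binned = np.array([o[idx == k].mean() if np.any(idx == k) else np.nan
                   for k in range(10)])

# Gaussian-kernel (Nadaraya-Watson) estimate; the bandwidth h regularizes.
h = 0.08
w = np.exp(-0.5 * ((grid[:, None] - p[None, :]) / h) ** 2)
kernel = (w * o).sum(axis=1) / w.sum(axis=1)

for g, b, k in zip(grid, binned, kernel):
    print(f"forecast ~{g:.2f}: binned {b:.3f}   kernel {k:.3f}")
```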

Journal ArticleDOI
TL;DR: Investigating how simplification of the forecast probabilities can affect the forecast quality of probabilistic predictions as measured by the Brier score suggests that forecast quality should be made available for the set of probabilities that the forecast user has access to as well as for the complete set of probabilities issued by the ensemble forecasting system.
Abstract: Probability forecasts from an ensemble are often discretized into a small set of categories before being distributed to the users. This study investigates how such simplification can affect the forecast quality of probabilistic predictions as measured by the Brier score (BS). An example from the European Centre for Medium-Range Weather Forecasts (ECMWF) operational seasonal ensemble forecast system is used to show that the simplification of the forecast probabilities reduces the Brier skill score (BSS) by as much as 57% with respect to the skill score obtained with the full set of probabilities issued from the ensemble. This is more obvious for a small number of probability categories and is mainly due to a decrease in forecast resolution of up to 36%. The impact of the simplification as a function of the ensemble size is also discussed. The results suggest that forecast quality should be made available for the set of probabilities that the forecast user has access to as well as for the complete set of probabilities issued by the ensemble forecasting system. Copyright © 2008 Royal Meteorological Society
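The effect can be demonstrated with synthetic data: rounding ensemble-derived probabilities to a few categories before verification inflates the Brier score and lowers the skill score. The toy "ensemble" below is an assumption, not ECMWF data, so the magnitudes differ from the paper's 57%.

```python
# Sketch: Brier score of full ensemble probabilities versus the same
# probabilities discretized into a few categories before verification.
import numpy as np

rng = np.random.default_rng(6)
n_cases, n_members = 20000, 40

p_true = rng.beta(2, 2, n_cases)
obs = rng.binomial(1, p_true)
p_full = rng.binomial(n_members, p_true) / n_members   # member fractions

def discretize(p, n_categories):
    """Round probabilities to the nearest of n_categories equally spaced values."""
    cats = np.linspace(0, 1, n_categories)
    return cats[np.argmin(np.abs(p[:, None] - cats[None, :]), axis=1)]

bs_clim = np.mean((obs.mean() - obs) ** 2)
bs_full = np.mean((p_full - obs) ** 2)
print(f"full ensemble: BSS = {1 - bs_full / bs_clim:.3f}")
for n_cat in (3, 5, 11):
    bs_cat = np.mean((discretize(p_full, n_cat) - obs) ** 2)
    print(f"{n_cat:2d} categories: BSS = {1 - bs_cat / bs_clim:.3f}")
```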

Proceedings ArticleDOI
26 Sep 2008
TL;DR: The empirical evaluation shows that the choice of combination rule can have a significant impact on the performance for a single dataset, but in general the evidential combination rules do not perform better than the voting rules for this particular ensemble design.
Abstract: Ensemble classifiers are known to generally perform better than each individual classifier of which they consist. One approach to classifier fusion is to apply Shafer's theory of evidence. While most approaches have adopted Dempster's rule of combination, a multitude of combination rules have been proposed. A number of combination rules as well as two voting rules are compared when used in conjunction with a specific kind of ensemble classifier, known as random forests, w.r.t. accuracy, area under ROC curve and Brier score on 27 datasets. The empirical evaluation shows that the choice of combination rule can have a significant impact on the performance for a single dataset, but in general the evidential combination rules do not perform better than the voting rules for this particular ensemble design. Furthermore, among the evidential rules, the associative ones appear to have better performance than the non-associative.

Posted Content
TL;DR: In this paper, the authors argue that likelihood-based measures provide a simple and natural general framework for the evaluation of all kinds of probabilistic forecast, and they describe a number of different scores based on the likelihood and investigate the relationships between the likelihood, the mean square error and the ignorance.
Abstract: We define the likelihood and give a number of justifications for its use as a skill measure for probabilistic forecasts. We describe a number of different scores based on the likelihood, and briefly investigate the relationships between the likelihood, the mean square error and the ignorance.

1 Introduction

Users of forecasts need to know:
• whether the forecasts they are receiving have been adequately calibrated
• whether the forecasts they are receiving are any better than an appropriate simple model such as climatology
• which of the forecasts they are receiving is the best

To answer these questions, a single measure of forecast quality is needed. For calibration, the measure serves as a cost or benefit function that must be minimized or maximised in order to find the optimum values for the free parameters in the calibration algorithm. For comparison with climatology or other forecasts, the measure serves as a way of deriving a ranking. There are many standard measures of forecast quality. For example, for calibrating and comparing single-valued temperature forecasts, mean square error (MSE) is common. For binary probabilistic forecasts, the Brier score (Brier, 1950) is often used. For continuous probability forecasts, the continuous rank probability score and the ignorance have been suggested. In this paper we will argue that likelihood-based measures provide a simple and natural general framework for the evaluation of all kinds of probabilistic forecast. For example, likelihood-based measures can be used for binary and continuous probability forecasts, for temperature and precipitation, and for one lead time or many lead times simultaneously. In section 2 we define the likelihood and discuss why we think it is a useful measure of forecast skill. In section 3 we include expressions for the likelihood for the normal distribution and in section 4 we discuss relations between the likelihood and other forecast scoring methods. Finally in section 5 we summarise and describe some areas of future work.
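A small sketch relating two of the scores discussed above for binary forecasts, the Brier score and the ignorance (mean negative log likelihood); the synthetic reliable, overconfident, and climatological forecasts are invented for illustration.

```python
# Sketch: Brier score versus ignorance (negative mean log2-likelihood)
# for binary probability forecasts.
import numpy as np

def brier(p, o):
    return np.mean((p - o) ** 2)

def ignorance(p, o, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    # mean negative log2 likelihood of the observed outcomes
    return -np.mean(o * np.log2(p) + (1 - o) * np.log2(1 - p))

rng = np.random.default_rng(7)
p_true = rng.uniform(0, 1, 10000)
o = rng.binomial(1, p_true)
forecasts = {
    "reliable": p_true,
    "overconfident": np.clip(1.5 * (p_true - 0.5) + 0.5, 0.01, 0.99),
    "climatology": np.full_like(p_true, o.mean()),
}
for name, p in forecasts.items():
    print(f"{name:13s} Brier={brier(p, o):.4f}  ignorance={ignorance(p, o):.4f} bits")
```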

Book ChapterDOI
01 Jan 2008
TL;DR: This chapter reviews the basic probability concepts needed to understand probability forecasting and presents some simple Bayesian approaches for producing well-calibrated probability forecasts.
Abstract: This chapter reviews the basic probability concepts needed to understand probability forecasting and presents some simple Bayesian approaches for producing well-calibrated probability forecasts. Forecasts are inherently uncertain and it is important that this uncertainty is estimated and communicated to forecast users so that they can make optimal decisions. Forecast uncertainty can be quantified by issuing probability statements about future observable outcomes based on current forecasts and past observations and forecasts. Such probabilistic forecasts can be issued in a variety of different forms: as a set of probabilities for a discrete set of events; as probabilities for counts of events; as quantiles of a continuous variable; as interval forecasts (pairs of quantiles); as full probability density functions or cumulative distribution functions; or as forecasts for whole spatial maps. Since models predict the future state of model variables rather than actual real-world observable variables, probability forecasts need to be recalibrated on observations as an inherent part of the forecasting process. Rather than the (marginal) probability distribution of ensemble predictions, what forecasters should issue are estimates of the conditional probability distribution of the future observed quantity given the available sample of ensemble predictions.

Journal ArticleDOI
TL;DR: In this article, it is shown that resolution and reliability are directly related to forecast attributes which are desirable on grounds independent of the notion of scores, which can be considered an epistemological justification of measuring forecast quality by proper scores.
Abstract: Scoring rules are an important tool for evaluating the performance of probabilistic forecasting schemes. In the binary case, scoring rules (which are strictly proper) allow for a decomposition into terms related to the resolution and to the reliability of the forecast. This fact is particularly well known for the Brier Score. In this paper, this result is extended to forecasts for finite-valued targets. Both resolution and reliability are shown to have a positive effect on the score. It is demonstrated that resolution and reliability are directly related to forecast attributes which are desirable on grounds independent of the notion of scores. This finding can be considered an epistemological justification of measuring forecast quality by proper scores. A link is provided to the original work of DeGroot et al. (1982), extending their concepts of sufficiency and refinement. The relation to the conjectured sharpness principle of Gneiting et al. (2005a) is elucidated.
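In the notation commonly used for this framework (which may differ from the paper's), with S a strictly proper score, e its associated entropy, d its divergence, π(f) the calibration function P(Y | forecast f) and π̄ the climatological distribution, the decomposition referred to in the abstract can be summarised as follows:

```latex
% Reliability-resolution decomposition of a (strictly) proper, negatively
% oriented score S, with entropy e, divergence d, calibration function
% \pi(f) = P(Y \mid f), and climatological distribution \bar{\pi}.
\mathbb{E}\left[ S(f, Y) \right]
  \;=\; \underbrace{e(\bar{\pi})}_{\text{uncertainty}}
  \;-\; \underbrace{\mathbb{E}\left[ d\bigl(\pi(f), \bar{\pi}\bigr) \right]}_{\text{resolution}}
  \;+\; \underbrace{\mathbb{E}\left[ d\bigl(\pi(f), f\bigr) \right]}_{\text{reliability}}
```

For the binary Brier score, d is the squared difference and e(p) = p(1 − p), which recovers the familiar uncertainty − resolution + reliability decomposition.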

Proceedings Article
01 Jan 2008
TL;DR: In this paper, a lexicographic ranker, LexRank, is proposed, whose rankings are derived not from scores, but from a simple ranking of attribute values obtained from the training data.
Abstract: Given a binary classification task, a ranker is an algorithm that can sort a set of instances from highest to lowest expectation that the instance is positive. In contrast to a classifier, a ranker does not output class predictions – although it can be turned into a classifier with help of an additional procedure to split the ranked list into two. A straightforward way to compute rankings is to train a scoring classifier to assign numerical scores to instances, for example the predicted odds that an instance is positive. However, rankings can be computed without scores, as we demonstrate in this paper. We propose a lexicographic ranker, LexRank, whose rankings are derived not from scores, but from a simple ranking of attribute values obtained from the training data. Although various metrics can be used, we show that by using the odds ratio to rank the attribute values we obtain a ranker that is conceptually close to the naive Bayes classifier, in the sense that for every instance of LexRank there exists an instance of naive Bayes that achieves the same ranking. However, the reverse is not true, which means that LexRank is more biased than naive Bayes. We systematically develop the relationships and differences between classification, ranking, and probability estimation, which leads to a novel connection between the Brier score and ROC curves. Combining LexRank with isotonic regression, which derives probability estimates from the ROC convex hull, results in the lexicographic probability estimator LexProb.

Journal ArticleDOI
TL;DR: Weigel et al. as mentioned in this paper derived the debiased RPSS, and it is shown here to be an unbiased estimate of the infinite-ensemble RPSS for any reliable forecast.
Abstract: The ranked probability score (RPS) is the sum of the squared differences between cumulative forecast probabilities and cumulative observed probabilities, and measures both forecast reliability and resolution (Murphy 1973). The ranked probability skill score (RPSS) compares the RPS of a forecast with some reference forecast such as “climatology” (using past mean climatic values as the forecast), oriented so that RPSS < 0 (RPSS > 0) corresponds to a forecast that is less (more) skillful than climatology. Categorical forecast probabilities are often estimated from ensembles of numerical model integrations by counting the number of ensemble members in each category. Finite ensemble size introduces sampling error into such probability estimates, and the RPSS of a reliable forecast model with finite ensemble size is an increasing function of ensemble size (Kumar et al. 2001; Tippett et al. 2007). A similar relation exists between correlation and ensemble size (Sardeshmukh et al. 2000). The dependence of RPSS on ensemble size makes it challenging to use RPSS to compare forecast models with different ensemble sizes. For instance, it may be difficult to know whether a forecast system has higher RPSS because it is based on a superior forecast model or because it uses a larger ensemble. This question often arises in the comparison of multimodel and single model forecasts (Hagedorn et al. 2005; Tippett and Barnston 2008). The dependence of RPSS on ensemble size is not a problem when comparing forecast quality. Improved RPSS is associated with improved forecast quality and is desirable whether it results from larger ensemble size or from a better forecast model. Muller et al. (2005) recently introduced a resampling strategy to estimate the infinite-ensemble RPSS from the finite-ensemble RPSS and called this estimate the “debiased RPSS.” Weigel et al. (2007) derived an analytical formula for the debiased RPSS and proved that it is an unbiased estimate of the infinite-ensemble RPSS in the case of uncorrelated ensemble members, that is, forecasts without skill. Here it is proved that the debiased RPSS is an unbiased estimate of the infinite-ensemble RPSS for any reliable forecasts. It is shown that over- or underconfident forecasts introduce a dependence of the debiased RPSS on ensemble size. Simplification of the results of Weigel et al. (2007) shows that the debiased RPSS is a multicategory generalization of the result of Richardson (2001) for the Brier skill score.

Posted Content
01 Jan 2008
TL;DR: This study analysed data obtained in researching the problem of forecasting the decisions people make in conflict situations, using a rule to derive probabilistic forecasts from structured analogies data, and transformed multiple singular forecasts for each combination of forecasting method and conflict into probabilistic forecasts.
Abstract: How useful are probabilistic forecasts of the outcomes of particular situations? Potentially, they contain more information than unequivocal forecasts and, as they allow a more realistic representation of the relative likelihood of different outcomes, they might be more accurate and therefore more useful to decision makers. To test this proposition, I first compared a Squared-Error Skill Score (SESS) based on the Brier score with an Absolute-Error Skill Score (AESS), and found the latter more closely coincided with decision-makers’ interests. I then analysed data obtained in researching the problem of forecasting the decisions people make in conflict situations. In that research, participants were given lists of decisions that might be made and were asked to make a prediction either by choosing one of the decisions or by allocating percentages or relative frequencies to more than one of them. For this study I transformed the percentage and relative frequencies data into probabilistic forecasts. In most cases the participants chose a single decision. To obtain more data, I used a rule to derive probabilistic forecasts from structured analogies data, and transformed multiple singular forecasts for each combination of forecasting method and conflict into probabilistic forecasts. When compared using the AESS, probabilistic forecasts were not more skilful than unequivocal forecasts.
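A minimal sketch of the two skill scores being compared, computed against a uniform reference forecast over the decision options; the data, the reference, and the blending used to create imperfect forecasts are all illustrative assumptions.

```python
# Sketch: Squared-Error Skill Score (SESS, Brier-based) and an
# Absolute-Error Skill Score (AESS) against a uniform reference forecast.
import numpy as np

rng = np.random.default_rng(8)
n_cases, n_options = 500, 3

p_true = rng.dirichlet(np.ones(n_options), n_cases)
outcome = np.array([rng.choice(n_options, p=pi) for pi in p_true])
onehot = np.eye(n_options)[outcome]

forecast = 0.7 * p_true + 0.3 / n_options              # imperfect forecasts
reference = np.full((n_cases, n_options), 1.0 / n_options)

def sess(f, ref, o):
    return 1 - np.mean((f - o) ** 2) / np.mean((ref - o) ** 2)

def aess(f, ref, o):
    return 1 - np.mean(np.abs(f - o)) / np.mean(np.abs(ref - o))

print(f"SESS = {sess(forecast, reference, onehot):.3f}")
print(f"AESS = {aess(forecast, reference, onehot):.3f}")
```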

Journal ArticleDOI
TL;DR: This paper provides a framework for comparing the predictive and diagnostic performance of a parametric, a non-parametric and a combined approach in comparison to the well established proportional hazards model for melanoma patients.
Abstract: Objectives: This paper compares the diagnostic capabilities of flexible ensemble methods modeling the survival time of melanoma patients in comparison to the well established proportional hazards model. Both a random forest type algorithm for censored data as well as a model combination of the proportional hazards model with recursive partitioning are investigated. Methods: Benchmark experiments utilizing the integrated Brier score as a measure for goodness of prediction are the basis of the performance assessment for all competing algorithms. For the purpose of comparing regression relationships represented by the models under test, we describe fitted conditional survival functions by a univariate measure derived from the area under the curve. Based on this measure, we adapt a visualization technique useful for the inspection and comparison of model fits. Results: For the data of malignant melanoma patients the predictive performance of the competing models is on par, allowing for a fair comparison of the fitted relationships. Newly introduced MODplots visualize differences in the fitting structure of the underlying models. Conclusion: The paper provides a framework for comparing the predictive and diagnostic performance of a parametric, a non-parametric and a combined approach.

Journal Article
TL;DR: In this article, the quality of the probabilistic forecasts derived from the Japan Meteorological Agency (JMA) medium-range Ensemble Prediction System (EPS) was evaluated in terms of skill measures including reliability, resolution, and the Brier score.
Abstract: The quality of the Probability of Precipitation (PoP) forecasts derived from the Japan Meteorological Agency (JMA) medium-range Ensemble Prediction System (EPS) was evaluated in terms of skill measures including reliability, resolution, and the Brier score. JMA EPS consists of 24 perturbed forecasts and one control forecast, where the perturbed initial conditions are generated by the breeding method. In terms of a 1 mm day⁻¹ and a 48 mm day⁻¹ precipitation threshold, the 5-day PoP forecasts accumulated for 24 hours at 150 points in Japan were verified for two years, from 1 January 2003 to 31 December 2004. The uncalibrated PoP forecasts were found to produce systematic errors due to the model bias, exhibiting overestimation of light rainfall events and underestimation of heavy rainfall events. To correct the biased PoP forecasts, a method of calibration in which the PoP was adjusted using the climatology of the observation and direct model output was introduced. The climatology of the observation was made from rain-gauge data from 1971 to 1996, while the model climatology was derived from the 25 precipitation forecasts produced by the JMA EPS from July 2002 to June 2004. A variety of evaluation measurements such as reliability, resolution, relative operating characteristic, and the Brier score were employed to assess the quality of PoP forecasts after calibration. The skill of the calibrated PoP forecast for the event of light (heavy) rain greater than or equal to 1 (48) mm day⁻¹ was significantly (slightly) improved. It was found that the slight skill increase for heavy rain resulted from the rarity of the events screened by a threshold of 48 mm day⁻¹. A rank histogram was also used to assess the improvement of the ensemble spread of the calibrated EPS. The rank histogram of the uncalibrated EPS illustrated the L- or U-shaped distributions which imply a biased forecast. However, the biased distributions were remarkably improved for the whole verification period after calibration. Therefore, it is shown that the calibration method of this study is very effective in correcting the biased forecasts.

Posted Content
04 Jun 2008
TL;DR: In this article, it is shown that resolution and reliability are directly related to forecast attributes which are desirable on grounds independent of the notion of scores, which can be considered an epistemological justification of measuring forecast quality by proper scores.
Abstract: Scoring rules are an important tool for evaluating the performance of probabilistic forecasting schemes. In the binary case, scoring rules (which are strictly proper) allow for a decomposition into terms related to the resolution and to the reliability of the forecast. This fact is particularly well known for the Brier Score. In this paper, this result is extended to forecasts for finite-valued targets. Both resolution and reliability are shown to have a positive effect on the score. It is demonstrated that resolution and reliability are directly related to forecast attributes which are desirable on grounds independent of the notion of scores. This finding can be considered an epistemological justification of measuring forecast quality by proper scores. A link is provided to the original work of DeGroot et al. (1982), extending their concepts of sufficiency and refinement. The relation to the conjectured sharpness principle of Gneiting et al. (2005a) is elucidated.

Dissertation
01 Dec 2008
TL;DR: The goal of this study was to develop artificial neural network models for the purpose of predicting both the Probability of Precipitation and quantitative precipitation over a 24-hour period beginning and ending at midnight.
Abstract: Precipitation, in meteorology, is defined as any product, liquid or solid, of atmospheric water vapor that is accumulated onto the earth's surface. Water, and thus precipitation, has a major impact on our daily livelihood. As such, the uncertainty of both the future occurrence and amount of precipitation can have a negative impact on many sectors of our economy, especially agriculture. There is, therefore, a need to use innovative computer technologies such as artificial intelligence to improve the accuracy of precipitation predictions. Artificial neural networks have been shown to be useful as an aid for the prediction of weather variables. The goal of this study was to develop artificial neural network models for the purpose of predicting both the Probability of Precipitation and quantitative precipitation over a 24-hour period beginning and ending at midnight.

Posted Content
TL;DR: In this paper, the authors compared a Squared-Error Skill Score (SESS) based on the Brier score with an Absolute-Error Skills Score (AESS), and found that the latter more closely coincided with decision-makers' interests.
Abstract: How useful are probabilistic forecasts of the outcomes of particular situations? Potentially, they contain more information than unequivocal forecasts and, as they allow a more realistic representation of the relative likelihood of different outcomes, they might be more accurate and therefore more useful to decision makers. To test this proposition, I first compared a Squared-Error Skill Score (SESS) based on the Brier score with an Absolute-Error Skill Score (AESS), and found the latter more closely coincided with decision-makers’ interests. I then analysed data obtained in researching the problem of forecasting the decisions people make in conflict situations. In that research, participants were given lists of decisions that might be made and were asked to make a prediction either by choosing one of the decisions or by allocating percentages or relative frequencies to more than one of them. For this study I transformed the percentage and relative frequencies data into probabilistic forecasts. In most cases the participants chose a single decision. To obtain more data, I used a rule to derive probabilistic forecasts from structured analogies data, and transformed multiple singular forecasts for each combination of forecasting method and conflict into probabilistic forecasts. When compared using the AESS, probabilistic forecasts were not more skilful than unequivocal forecasts.

01 Jan 2008
TL;DR: Weigel et al. as discussed by the authors derived the debiased RPSS, and it is shown here to be an unbiased estimate of the infinite-ensemble RPSS for any reliable forecast.
Abstract: The ranked probability score (RPS) is the sum of the squared differences between cumulative forecast probabilities and cumulative observed probabilities, and measures both forecast reliability and resolution (Murphy 1973). The ranked probability skill score (RPSS) compares the RPS of a forecast with some reference forecast such as “climatology” (using past mean climatic values as the forecast), oriented so that RPSS < 0 (RPSS > 0) corresponds to a forecast that is less (more) skillful than climatology. Categorical forecast probabilities are often estimated from ensembles of numerical model integrations by counting the number of ensemble members in each category. Finite ensemble size introduces sampling error into such probability estimates, and the RPSS of a reliable forecast model with finite ensemble size is an increasing function of ensemble size (Kumar et al. 2001; Tippett et al. 2007). A similar relation exists between correlation and ensemble size (Sardeshmukh et al. 2000). The dependence of RPSS on ensemble size makes it challenging to use RPSS to compare forecast models with different ensemble sizes. For instance, it may be difficult to know whether a forecast system has higher RPSS because it is based on a superior forecast model or because it uses a larger ensemble. This question often arises in the comparison of multimodel and single model forecasts (Hagedorn et al. 2005; Tippett and Barnston 2008). The dependence of RPSS on ensemble size is not a problem when comparing forecast quality. Improved RPSS is associated with improved forecast quality and is desirable whether it results from larger ensemble size or from a better forecast model. Muller et al. (2005) recently introduced a resampling strategy to estimate the infinite-ensemble RPSS from the finite-ensemble RPSS and called this estimate the “debiased RPSS.” Weigel et al. (2007) derived an analytical formula for the debiased RPSS and proved that it is an unbiased estimate of the infinite-ensemble RPSS in the case of uncorrelated ensemble members, that is, forecasts without skill. Here it is proved that the debiased RPSS is an unbiased estimate of the infinite-ensemble RPSS for any reliable forecasts. It is shown that over- or underconfident forecasts introduce a dependence of the debiased RPSS on ensemble size. Simplification of the results of Weigel et al. (2007) shows that the debiased RPSS is a multicategory generalization of the result of Richardson (2001) for the Brier skill score.