The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values. You can either automatically score predictions using eval_forecasts with an appropriately shaped data.frame as input, or you can access the different scoring metrics directly using lower level functions.
library(scoringutils)
library(data.table)
A variety of metrics and scoring rules can be accessed through the scoringutils package.

The function bias determines bias from predictive Monte-Carlo samples, automatically recognising whether forecasts are continuous or integer valued.
For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot P_t (x_t)\]
where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is simply the fraction of predictive samples for time \(t\) that are smaller than \(x_t\).
For integer valued forecasts, Bias is measured as
\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))\]
to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.
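To make the definition concrete, here is the continuous-case formula evaluated by hand for a single made-up forecast (an illustration of the definition only, not the package's internal code):

# illustration only: bias of one continuous forecast, computed from the definition
true_value <- 3
samples <- rnorm(1000, mean = 2.5)     # predictive Monte-Carlo samples
p_t <- mean(samples < true_value)      # empirical CDF evaluated at the true value
1 - 2 * p_t                            # values near 0 indicate an unbiased forecast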
## integer valued forecasts
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
bias(true_values, predictions)
#> [1] 0.670 -0.090 -0.750 0.875 -0.465 -0.260 -0.780 0.890 0.465 -0.305
#> [11] 0.815 0.225 -0.365 0.520 0.860 0.120 -0.930 -0.235 -0.180 -0.710
#> [21] 0.155 0.860 -0.275 -0.395 0.670 -0.015 0.100 -0.425 0.155 -0.250
## continuous forecasts
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
bias(true_values, predictions)
#> [1] -0.97 0.70 0.22 0.30 -0.31 0.18 -0.55 0.60 0.72 -1.00 0.62 -0.94
#> [13] 0.80 0.07 -0.12 0.86 0.46 0.99 -0.41 -0.01 -0.82 -0.55 -0.05 0.76
#> [25] -0.62 -0.27 0.83 0.20 0.05 0.97
Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.
Equivalently, one can inspect the probability integral transform of the predictive distribution at time t,
\[u_t = F_t (x_t)\]
where \(x_t\) is the observed data point at time \(t \in t_1, …, t_n\), \(n\) being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time \(t\). If the true probability distribution of outcomes at time \(t\) is \(G_t\), then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities \(u_t\) are distributed uniformly.
In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:
\[u_t = P_t(k_t - 1) + v \cdot (P_t(k_t) - P_t(k_t - 1))\]
where \(k_t\) is the observed count, \(P_t(x)\) is the predictive cumulative probability of observing incidence \(x\) at time \(t\), \(P_t (-1) = 0\) by definition, and \(v\) is standard uniform and independent of \(k\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.
The pit function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral transform and tests the resulting values \(u_t\) for uniformity using the Anderson-Darling test.
As a rule of thumb, there is no evidence to suggest a forecasting model is miscalibrated if the p-value is \(\ge 0.1\), some evidence that it is miscalibrated if \(0.01 < p < 0.1\), and good evidence that it is miscalibrated if \(p \le 0.01\). In this context it should be noted, though, that uniformity of the PIT is a necessary but not a sufficient condition for calibration. It should also be noted that the test only works given sufficient samples; otherwise the null hypothesis will often be rejected outright.
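To sketch how this works in practice, the randomised PIT and the uniformity test described above can be computed by hand from a matrix of predictive samples. This is an illustration only, not the package's own implementation, and the Anderson-Darling test below assumes the goftest package is installed:

# illustration only: randomised PIT for integer forecasts, tested for uniformity
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))

# empirical predictive CDF evaluated at k_t and k_t - 1 for every time point
p_k  <- sapply(1:30, function(i) mean(predictions[i, ] <= true_values[i]))
p_km <- sapply(1:30, function(i) mean(predictions[i, ] <= true_values[i] - 1))

# randomised PIT values and Anderson-Darling test for uniformity
u <- p_km + runif(30) * (p_k - p_km)
goftest::ad.test(u, null = "punif")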
Wrapper around the crps_sample function from the scoringRules package; see the scoringRules documentation for more information. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
crps(true_values, predictions)
#> [1] 0.249800 1.631050 0.341200 0.983725 0.620950 1.625300 0.609525 0.677250
#> [9] 0.660625 1.184950 1.056475 0.898300 1.611900 2.709650 1.318950 1.463275
#> [17] 1.595375 4.011475 6.708150 2.454250 1.330475 2.420000 3.921700 1.677350
#> [25] 1.235575 1.239350 1.378600 3.112850 3.782525 1.272400
Wrapper around the dss_sample function from the scoringRules package; see the scoringRules documentation for more information. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
dss(true_values, predictions)
#> [1] -0.04082155 2.64989785 1.05952597 1.77418578 2.33096142 2.28880830
#> [7] 2.88043082 2.67615262 2.74735141 2.59435643 4.64113487 3.73392082
#> [13] 2.49111636 2.61373979 7.65334601 3.89771750 5.60824739 3.57218468
#> [19] 2.99533531 3.02917094 3.02816481 3.83531485 3.53064794 3.32757518
#> [25] 6.30109942 3.49630274 3.21467954 6.98026829 3.28316432 3.84305718
Wrapper around the logs_sample function from the scoringRules package; see the scoringRules documentation for more information. The function should not be used for integer valued forecasts: while Log Scores are in principle possible for integer valued forecasts, they require a kernel density estimate, which is not well defined for discrete values. Smaller values are better.
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(n = 30, mean = 1:30))
logs(true_values, predictions)
#> [1] 1.1222207 0.9270440 1.0222345 3.1530725 3.6011929 1.9018255 1.6107046
#> [8] 1.2626969 1.0986045 0.8494772 1.8883361 1.0126492 1.0156812 0.9185447
#> [15] 0.9569473 0.8987192 1.0664823 1.6820619 1.1121091 1.1712868 1.0831759
#> [22] 0.9507887 0.9940377 0.9443842 2.1799323 1.1669121 1.9055078 0.9512410
#> [29] 1.2014414 1.8373335
The Brier score is a proper scoring rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1; the predictions must be a probability that the true outcome will be 1.
The Brier score is then computed as the mean squared error between the probabilistic prediction and the true outcome:
\[\text{Brier Score} = \frac{1}{N} \sum_{t = 1}^{N} (\text{prediction}_t - \text{outcome}_t)^2\]
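As a quick illustration of the definition (not the package code), the score can be computed directly for a few made-up forecasts:

# illustration only: Brier score computed directly from the definition
outcomes <- c(0, 1, 1, 0)
probs <- c(0.2, 0.9, 0.6, 0.3)
mean((probs - outcomes)^2)
#> [1] 0.075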
true_values <- sample(c(0, 1), size = 30, replace = TRUE)
predictions <- runif(n = 30, min = 0, max = 1)
brier_score(true_values, predictions)
#> [1] 0.478928
The Interval Score is a Proper Scoring Rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.
The score is computed as
\[\text{score} = (\text{upper} - \text{lower}) + \frac{2}{\alpha} \cdot (\text{lower} - \text{true\_value}) \cdot 1(\text{true\_value} < \text{lower}) + \frac{2}{\alpha} \cdot (\text{true\_value} - \text{upper}) \cdot 1(\text{true\_value} > \text{upper})\]
where \(1()\) is the indicator function and \(\alpha\) is the decimal value that indicates how much probability mass lies outside the prediction interval. To improve usability, the user is asked to provide an interval range in percentage terms, i.e. interval_range = 90 (percent) for a 90 percent prediction interval. Correspondingly, the user would have to provide the 5% and 95% quantiles (the corresponding alpha would then be 0.1). No specific distribution is assumed, but the range has to be symmetric (i.e. you can't use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper). Setting weigh = TRUE will weigh the score by \(\frac{\alpha}{4}\) such that the Interval Score converges to the CRPS for an increasing number of quantiles.
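To illustrate the formula, the score of a single 90 percent prediction interval can be computed by hand with made-up numbers (the package function is used in the example that follows):

# illustration only: interval score of one 90% interval, computed from the formula
lower <- 1; upper <- 5; true_value <- 7; alpha <- 0.1
(upper - lower) +
  2 / alpha * (lower - true_value) * (true_value < lower) +
  2 / alpha * (true_value - upper) * (true_value > upper)
#> [1] 44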
true_values <- rnorm(30, mean = 1:30)
interval_range <- 90
alpha <- (100 - interval_range) / 100
lower <- qnorm(alpha/2, rnorm(30, mean = 1:30))
upper <- qnorm((1 - alpha/2), rnorm(30, mean = 1:30))
interval_score(true_values = true_values,
               lower = lower,
               upper = upper,
               interval_range = interval_range)
#> [1] 1.850049 2.749020 2.913171 4.639153 3.142988 27.992039 2.612730
#> [8] 13.355411 3.016331 4.243412 4.430815 3.320419 14.135418 16.342170
#> [15] 17.298731 4.927951 4.486411 3.186419 9.378125 17.132747 5.050236
#> [22] 2.930548 33.811266 3.815382 2.893830 5.743579 3.008370 3.413298
#> [29] 31.603861 6.345946
The function eval_forecasts automatically scores forecasts. As input you need a data.frame with your predictions. If you want to have one score per observation, you can call the function with summarised = FALSE. By default, the function returns one aggregate score per model. For more information see ?eval_forecasts.
library(data.table)
# load example data
binary_example <- data.table::setDT(scoringutils::binary_example_data)
print(binary_example, 3, 3)
#> id true_values model predictions
#> 1: 1 0 model1 0.877283166
#> 2: 2 0 model1 0.858535155
#> 3: 3 0 model1 0.639325110
#> ---
#> 58: 28 1 model2 0.259778344
#> 59: 29 0 model2 0.008900052
#> 60: 30 0 model2 0.728497056
# score forecasts
eval <- scoringutils::eval_forecasts(binary_example)
print(eval)
#> model Brier_score
#> 1: model1 0.4246968
#> 2: model2 0.3071147
eval <- scoringutils::eval_forecasts(binary_example, summarised = FALSE)
print(eval, 3, 3)
#> id true_values model predictions Brier_score
#> 1: 1 0 model1 0.877283166 7.696258e-01
#> 2: 2 0 model1 0.858535155 7.370826e-01
#> 3: 3 0 model1 0.639325110 4.087366e-01
#> ---
#> 58: 28 1 model2 0.259778344 5.479281e-01
#> 59: 29 0 model2 0.008900052 7.921093e-05
#> 60: 30 0 model2 0.728497056 5.307080e-01
quantile_example <- data.table::setDT(scoringutils::quantile_example_data)
print(quantile_example, 3, 3)
#> true_values id model lower_90 lower_50 lower_0 upper_0 upper_50
#> 1: 2.659261 1 model1 -0.6448536 0.3255102 1.0 1.0 1.67449
#> 2: 0.822062 2 model1 0.3551464 1.3255102 2.0 2.0 2.67449
#> 3: 2.865579 3 model1 1.3551464 2.3255102 3.0 3.0 3.67449
#> ---
#> 58: 28.726896 28 model2 26.4551464 27.4255102 28.1 28.1 28.77449
#> 59: 29.446186 29 model2 27.4551464 28.4255102 29.1 29.1 29.77449
#> 60: 30.189608 30 model2 28.4551464 29.4255102 30.1 30.1 30.77449
#> upper_90
#> 1: 2.644854
#> 2: 3.644854
#> 3: 4.644854
#> ---
#> 58: 29.744854
#> 59: 30.744854
#> 60: 31.744854
eval <- scoringutils::eval_forecasts(quantile_example)
print(eval)
#> model Interval_Score
#> 1: model1 2.605845
#> 2: model2 2.674873
eval <- scoringutils::eval_forecasts(quantile_example, summarised = FALSE)
print(eval, 3, 3)
#> id true_values range model lower upper Interval_Score
#> 1: 1 2.659261 0 model1 1.0000000 1.00000 3.318523
#> 2: 1 2.659261 0 model2 1.1000000 1.10000 3.118523
#> 3: 1 2.659261 50 model1 0.3255102 1.67449 5.288066
#> ---
#> 178: 30 30.189608 50 model2 29.4255102 30.77449 1.348980
#> 179: 30 30.189608 90 model1 28.3551464 31.64485 3.289707
#> 180: 30 30.189608 90 model2 28.4551464 31.74485 3.289707
integer_example <- data.table::setDT(scoringutils::integer_example_data)
print(integer_example, 3, 3)
#> id model true_values sample predictions
#> 1: 1 model1 2 1 1
#> 2: 1 model1 2 2 2
#> 3: 1 model1 2 3 1
#> ---
#> 29998: 30 model2 31 498 26
#> 29999: 30 model2 31 499 37
#> 30000: 30 model2 31 500 31
eval <- scoringutils::eval_forecasts(integer_example)
print(eval)
#> model sharpness bias DSS CRPS pit_p_val pit_sd
#> 1: model1 3.78063 0.1440667 3.247092 1.800532 0.23576474 0.06418870
#> 2: model2 3.87947 0.2474667 3.404113 1.955676 0.03237753 0.01031457
eval <- scoringutils::eval_forecasts(integer_example, summarised = FALSE)
print(eval, 3, 3)
#> model id sharpness bias DSS CRPS pit_p_val pit_sd
#> 1: model1 1 1.4826 -0.684 1.0826668 0.703120 0.25227497 0.039734167
#> 2: model2 1 1.4826 -0.410 0.5830578 0.445968 0.03217469 0.007278526
#> 3: model1 2 1.4826 -0.532 1.1906665 0.674004 0.25227497 0.039734167
#> ---
#> 58: model2 29 5.9304 0.542 3.9776803 2.406524 0.03217469 0.007278526
#> 59: model1 30 5.9304 -0.106 3.4417915 1.322152 0.25227497 0.039734167
#> 60: model2 30 4.4478 -0.134 3.4220738 1.258856 0.03217469 0.007278526
continuous_example <- data.table::setDT(scoringutils::continuous_example_data)
print(continuous_example, 3, 3)
#> id model true_values sample predictions
#> 1: 1 model1 0.03007379 1 -0.203426069
#> 2: 1 model1 0.03007379 2 0.007621269
#> 3: 1 model1 0.03007379 3 -2.086657003
#> ---
#> 2998: 30 model2 -2.93749990 48 0.333292940
#> 2999: 30 model2 -2.93749990 49 -2.369034642
#> 3000: 30 model2 -2.93749990 50 -0.144544426
eval <- scoringutils::eval_forecasts(continuous_example)
print(eval)
#> model sharpness bias DSS CRPS LogS pit_p_val
#> 1: model1 0.9614030 -0.070666667 1.728001 0.7596405 1.826296 2e-05
#> 2: model2 0.9640941 -0.006666667 1.624224 0.7417875 1.804958 2e-05
eval <- scoringutils::eval_forecasts(continuous_example, summarised = FALSE)
print(eval, 3, 3)
#> model id sharpness bias DSS CRPS LogS pit_p_val
#> 1: model1 1 1.1283071 0.04 -0.08898351 0.2580341 1.093753 2e-05
#> 2: model2 1 1.0705271 -0.04 0.07987778 0.2556484 1.087163 2e-05
#> 3: model1 2 1.0057182 -0.92 3.06985466 1.2193656 2.370296 2e-05
#> ---
#> 58: model2 29 0.9827873 -0.76 1.13727388 0.6572423 1.489301 2e-05
#> 59: model1 30 0.9887154 1.00 8.70563844 2.2980866 4.012829 2e-05
#> 60: model2 30 0.9985101 0.96 6.58413707 2.3375785 3.107910 2e-05