The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values. You can either automatically score predictions using eval_forecasts with an appropriately shaped data.frame as input, or you can access the different scoring metrics directly using lower level functions.
library(scoringutils)
library(data.table)A variety of metrics and scoring rules can be accessed through the scoringutils package.
The function bias determines bias from predictive Monte-Carlo samples, automatically recognising whether forecasts are continuous or integer valued.
For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]
where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is just calculated as the fraction of predictive samples for \(x_t\) that are smaller than \(x_t\).
For integer valued forecasts, Bias is measured as
\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))\]
to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.
## integer valued forecasts
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
bias(true_values, predictions)
#> [1] 0.670 -0.090 -0.750 0.875 -0.465 -0.260 -0.780 0.890 0.465 -0.305
#> [11] 0.815 0.225 -0.365 0.520 0.860 0.120 -0.930 -0.235 -0.180 -0.710
#> [21] 0.155 0.860 -0.275 -0.395 0.670 -0.015 0.100 -0.425 0.155 -0.250
## continuous forecasts
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
bias(true_values, predictions)
#> [1] -0.97 0.70 0.22 0.30 -0.31 0.18 -0.55 0.60 0.72 -1.00 0.62 -0.94
#> [13] 0.80 0.07 -0.12 0.86 0.46 0.99 -0.41 -0.01 -0.82 -0.55 -0.05 0.76
#> [25] -0.62 -0.27 0.83 0.20 0.05 0.97Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.
Equivalently, one can inspect the probability integral transform of the predictive distribution at time t,
\[u_t = F_t (x_t)\]
where \(x_t\) is the observed data point at time \(t \text{ in } t_1, …, t_n\), n being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time t. If the true probability distribution of outcomes at time t is \(G_t\) then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities ut are distributed uniformly.
In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:
\[u_t = P_t(k_t) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]
where \(k_t\) is the observed count, \(P_t(x)\) is the predictive cumulative probability of observing incidence \(k\) at time \(t\), \(P_t (-1) = 0\) by definition and \(v\) is standard uniform and independent of \(k\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.
The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral and tests the values \(u_t\) for uniformity using the Anderson-Darling test.
As a rule of thumb, there is no evidence to suggest a forecasting model is miscalibrated if the p-value found was greater than a threshold of \(p >= 0.1\), some evidence that it was miscalibrated if \(0.01 < p < 0.1\), and good evidence that it was miscalibrated if \(p <= 0.01\). In this context it should be noted, though, that uniformity of the PIT is a necessary but not sufficient condition of calibration. It should als be noted that the test only works given sufficient samples, otherwise the Null hypothesis will often be rejected outright.
Wrapper around the crps_sample function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
crps(true_values, predictions)
#> [1] 0.249800 1.631050 0.341200 0.983725 0.620950 1.625300 0.609525 0.677250
#> [9] 0.660625 1.184950 1.056475 0.898300 1.611900 2.709650 1.318950 1.463275
#> [17] 1.595375 4.011475 6.708150 2.454250 1.330475 2.420000 3.921700 1.677350
#> [25] 1.235575 1.239350 1.378600 3.112850 3.782525 1.272400Wrapper around the dss_sample function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
dss(true_values, predictions)
#> [1] -0.04082155 2.64989785 1.05952597 1.77418578 2.33096142 2.28880830
#> [7] 2.88043082 2.67615262 2.74735141 2.59435643 4.64113487 3.73392082
#> [13] 2.49111636 2.61373979 7.65334601 3.89771750 5.60824739 3.57218468
#> [19] 2.99533531 3.02917094 3.02816481 3.83531485 3.53064794 3.32757518
#> [25] 6.30109942 3.49630274 3.21467954 6.98026829 3.28316432 3.84305718Wrapper around the log_sample function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function should not be used for integer valued forecasts. While Log Scores are in principle possible for integer valued foreasts they require a kernel density estimate which is not well defined for discrete values. Smaller values are better.
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(n = 30, mean = 1:30))
logs(true_values, predictions)
#> [1] 1.1222207 0.9270440 1.0222345 3.1530725 3.6011929 1.9018255 1.6107046
#> [8] 1.2626969 1.0986045 0.8494772 1.8883361 1.0126492 1.0156812 0.9185447
#> [15] 0.9569473 0.8987192 1.0664823 1.6820619 1.1121091 1.1712868 1.0831759
#> [22] 0.9507887 0.9940377 0.9443842 2.1799323 1.1669121 1.9055078 0.9512410
#> [29] 1.2014414 1.8373335The Brier score is a proper score rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1, the predictions must be a probability that the true outcome will be 1.
The Brier Score is then computed as the mean squared error between the probabilistic prediction and the true outcome.
\[\text{Brier_Score} = \frac{1}{N} \sum_{t = 1}^{n} (\text{prediction}_t - \text{outcome}_t)^2\]
true_values <- sample(c(0,1), size = 30, replace = TRUE)
predictions <- runif(n = 30, min = 0, max = 1)
brier_score(true_values, predictions)
#> [1] 0.478928The Interval Score is a Proper Scoring Rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.
The score is computed as
\[ \text{score} = (\text{upper} - \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{lower} - \text{true_value}) \cdot 1(\text{true_values} < \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{true_value} - \text{upper}) \cdot 1(\text{true_value} > \text{upper})\]
where \(1()\) is the indicator function and \(\alpha\) is the decimal value that indicates how much is outside the prediction interval. To improve usability, the user is asked to provide an interval range in percentage terms, i.e. interval_range = 90 (percent) for a 90 percent prediction interval. Correspondingly, the user would have to provide the 5% and 95% quantiles (the corresponding alpha would then be 0.1). No specific distribution is assumed, but the range has to be symmetric (i.e you can’t use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper). Setting weigh = TRUE will weigh the score by \(\frac{\alpha}{4}\) such that the Interval Score converges to the CRPS for increasing number of quantiles.
true_values <- rnorm(30, mean = 1:30)
interval_range <- 90
alpha <- (100 - interval_range) / 100
lower <- qnorm(alpha/2, rnorm(30, mean = 1:30))
upper <- qnorm((1- alpha/2), rnorm(30, mean = 1:30))
interval_score(true_values = true_values,
lower = lower,
upper = upper,
interval_range = interval_range)
#> [1] 1.850049 2.749020 2.913171 4.639153 3.142988 27.992039 2.612730
#> [8] 13.355411 3.016331 4.243412 4.430815 3.320419 14.135418 16.342170
#> [15] 17.298731 4.927951 4.486411 3.186419 9.378125 17.132747 5.050236
#> [22] 2.930548 33.811266 3.815382 2.893830 5.743579 3.008370 3.413298
#> [29] 31.603861 6.345946The function eval_forecasts automatically scores forecasts. As input you need a data.frame with your predictions. If you want to have one score per observation, you can call the function with summarised = FALSE. By default, the function returns one aggregate score per model. For more information see ?eval_forecasts.
library(data.table)
# load example data
binary_example <- data.table::setDT(scoringutils::binary_example_data)
print(binary_example, 3, 3)
#> id true_values model predictions
#> 1: 1 0 model1 0.877283166
#> 2: 2 0 model1 0.858535155
#> 3: 3 0 model1 0.639325110
#> ---
#> 58: 28 1 model2 0.259778344
#> 59: 29 0 model2 0.008900052
#> 60: 30 0 model2 0.728497056# score forecasts
eval <- scoringutils::eval_forecasts(binary_example)
print(eval)
#> model Brier_score
#> 1: model1 0.4246968
#> 2: model2 0.3071147eval <- scoringutils::eval_forecasts(binary_example, summarised = FALSE)
print(eval, 3, 3)
#> id true_values model predictions Brier_score
#> 1: 1 0 model1 0.877283166 7.696258e-01
#> 2: 2 0 model1 0.858535155 7.370826e-01
#> 3: 3 0 model1 0.639325110 4.087366e-01
#> ---
#> 58: 28 1 model2 0.259778344 5.479281e-01
#> 59: 29 0 model2 0.008900052 7.921093e-05
#> 60: 30 0 model2 0.728497056 5.307080e-01quantile_example <- data.table::setDT(scoringutils::quantile_example_data)
print(quantile_example, 3, 3)
#> true_values id model lower_90 lower_50 lower_0 upper_0 upper_50
#> 1: 2.659261 1 model1 -0.6448536 0.3255102 1.0 1.0 1.67449
#> 2: 0.822062 2 model1 0.3551464 1.3255102 2.0 2.0 2.67449
#> 3: 2.865579 3 model1 1.3551464 2.3255102 3.0 3.0 3.67449
#> ---
#> 58: 28.726896 28 model2 26.4551464 27.4255102 28.1 28.1 28.77449
#> 59: 29.446186 29 model2 27.4551464 28.4255102 29.1 29.1 29.77449
#> 60: 30.189608 30 model2 28.4551464 29.4255102 30.1 30.1 30.77449
#> upper_90
#> 1: 2.644854
#> 2: 3.644854
#> 3: 4.644854
#> ---
#> 58: 29.744854
#> 59: 30.744854
#> 60: 31.744854eval <- scoringutils::eval_forecasts(quantile_example)
print(eval)
#> model Interval_Score
#> 1: model1 2.605845
#> 2: model2 2.674873eval <- scoringutils::eval_forecasts(quantile_example, summarised = FALSE)
print(eval, 3, 3)
#> id true_values range model lower upper Interval_Score
#> 1: 1 2.659261 0 model1 1.0000000 1.00000 3.318523
#> 2: 1 2.659261 0 model2 1.1000000 1.10000 3.118523
#> 3: 1 2.659261 50 model1 0.3255102 1.67449 5.288066
#> ---
#> 178: 30 30.189608 50 model2 29.4255102 30.77449 1.348980
#> 179: 30 30.189608 90 model1 28.3551464 31.64485 3.289707
#> 180: 30 30.189608 90 model2 28.4551464 31.74485 3.289707integer_example <- data.table::setDT(scoringutils::integer_example_data)
print(integer_example, 3, 3)
#> id model true_values sample predictions
#> 1: 1 model1 2 1 1
#> 2: 1 model1 2 2 2
#> 3: 1 model1 2 3 1
#> ---
#> 29998: 30 model2 31 498 26
#> 29999: 30 model2 31 499 37
#> 30000: 30 model2 31 500 31eval <- scoringutils::eval_forecasts(integer_example)
print(eval)
#> model sharpness bias DSS CRPS pit_p_val pit_sd
#> 1: model1 3.78063 0.1440667 3.247092 1.800532 0.23576474 0.06418870
#> 2: model2 3.87947 0.2474667 3.404113 1.955676 0.03237753 0.01031457eval <- scoringutils::eval_forecasts(integer_example, summarised = FALSE)
print(eval, 3, 3)
#> model id sharpness bias DSS CRPS pit_p_val pit_sd
#> 1: model1 1 1.4826 -0.684 1.0826668 0.703120 0.25227497 0.039734167
#> 2: model2 1 1.4826 -0.410 0.5830578 0.445968 0.03217469 0.007278526
#> 3: model1 2 1.4826 -0.532 1.1906665 0.674004 0.25227497 0.039734167
#> ---
#> 58: model2 29 5.9304 0.542 3.9776803 2.406524 0.03217469 0.007278526
#> 59: model1 30 5.9304 -0.106 3.4417915 1.322152 0.25227497 0.039734167
#> 60: model2 30 4.4478 -0.134 3.4220738 1.258856 0.03217469 0.007278526continuous_example <- data.table::setDT(scoringutils::continuous_example_data)
print(continuous_example, 3, 3)
#> id model true_values sample predictions
#> 1: 1 model1 0.03007379 1 -0.203426069
#> 2: 1 model1 0.03007379 2 0.007621269
#> 3: 1 model1 0.03007379 3 -2.086657003
#> ---
#> 2998: 30 model2 -2.93749990 48 0.333292940
#> 2999: 30 model2 -2.93749990 49 -2.369034642
#> 3000: 30 model2 -2.93749990 50 -0.144544426eval <- scoringutils::eval_forecasts(continuous_example)
print(eval)
#> model sharpness bias DSS CRPS LogS pit_p_val
#> 1: model1 0.9614030 -0.070666667 1.728001 0.7596405 1.826296 2e-05
#> 2: model2 0.9640941 -0.006666667 1.624224 0.7417875 1.804958 2e-05eval <- scoringutils::eval_forecasts(continuous_example, summarised = FALSE)
print(eval, 3, 3)
#> model id sharpness bias DSS CRPS LogS pit_p_val
#> 1: model1 1 1.1283071 0.04 -0.08898351 0.2580341 1.093753 2e-05
#> 2: model2 1 1.0705271 -0.04 0.07987778 0.2556484 1.087163 2e-05
#> 3: model1 2 1.0057182 -0.92 3.06985466 1.2193656 2.370296 2e-05
#> ---
#> 58: model2 29 0.9827873 -0.76 1.13727388 0.6572423 1.489301 2e-05
#> 59: model1 30 0.9887154 1.00 8.70563844 2.2980866 4.012829 2e-05
#> 60: model2 30 0.9985101 0.96 6.58413707 2.3375785 3.107910 2e-05