scoringutils

Nikos Bosse

2020-06-09

Introduction

The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values. You can either automatically score predictions using eval_forecasts with an appropriately shaped data.frame as input, or you can access the different scoring metrics directly using lower level functions.

library(scoringutils)
library(data.table)

Scoring Forecasts Directly

A variety of metrics and scoring rules can be accessed through the scoringutils package.

Bias

The function bias determines bias from predictive Monte-Carlo samples, automatically recognising whether forecasts are continuous or integer valued.

For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]

where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is just calculated as the fraction of predictive samples for \(x_t\) that are smaller than \(x_t\).

For integer valued forecasts, Bias is measured as

\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))\]

to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.

## integer valued forecasts
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
bias(true_values, predictions)
#>  [1]  0.670 -0.090 -0.750  0.875 -0.465 -0.260 -0.780  0.890  0.465 -0.305
#> [11]  0.815  0.225 -0.365  0.520  0.860  0.120 -0.930 -0.235 -0.180 -0.710
#> [21]  0.155  0.860 -0.275 -0.395  0.670 -0.015  0.100 -0.425  0.155 -0.250

## continuous forecasts
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
bias(true_values, predictions)
#>  [1] -0.97  0.70  0.22  0.30 -0.31  0.18 -0.55  0.60  0.72 -1.00  0.62 -0.94
#> [13]  0.80  0.07 -0.12  0.86  0.46  0.99 -0.41 -0.01 -0.82 -0.55 -0.05  0.76
#> [25] -0.62 -0.27  0.83  0.20  0.05  0.97

Sharpness

Sharpness is the ability of the model to generate predictions within a narrow range. It is a data-independent measure, and is purely a feature of the forecasts themselves.

Shaprness of predictive samples corresponding to one single true value is measured as the normalised median of the absolute deviation from the median of the predictive samples. For details, see ?stats::mad

predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
sharpness(predictions)
#>  [1] 1.4826 1.4826 1.4826 2.9652 2.9652 2.9652 2.9652 2.9652 2.9652 2.9652
#> [11] 2.9652 2.9652 2.9652 4.4478 4.4478 2.9652 4.4478 4.4478 4.4478 4.4478
#> [21] 4.4478 5.9304 4.4478 5.9304 4.4478 4.4478 5.9304 4.4478 5.9304 4.4478

Calibration

Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.

Equivalently, one can inspect the probability integral transform of the predictive distribution at time t,

\[u_t = F_t (x_t)\]

where \(x_t\) is the observed data point at time \(t \text{ in } t_1, …, t_n\), n being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time t. If the true probability distribution of outcomes at time t is \(G_t\) then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities ut are distributed uniformly.

In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:

\[u_t = P_t(k_t) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]

where \(k_t\) is the observed count, \(P_t(x)\) is the predictive cumulative probability of observing incidence \(k\) at time \(t\), \(P_t (-1) = 0\) by definition and \(v\) is standard uniform and independent of \(k\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.

The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral and tests the values \(u_t\) for uniformity using the Anderson-Darling test.

As a rule of thumb, there is no evidence to suggest a forecasting model is miscalibrated if the p-value found was greater than a threshold of \(p >= 0.1\), some evidence that it was miscalibrated if \(0.01 < p < 0.1\), and good evidence that it was miscalibrated if \(p <= 0.01\). In this context it should be noted, though, that uniformity of the PIT is a necessary but not sufficient condition of calibration. It should als be noted that the test only works given sufficient samples, otherwise the Null hypothesis will often be rejected outright.

Continuous Ranked Probability Score (CRPS)

Wrapper around the crps_sample function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.

true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
crps(true_values, predictions)
#>  [1] 0.249800 1.631050 0.341200 0.983725 0.620950 1.625300 0.609525 0.677250
#>  [9] 0.660625 1.184950 1.056475 0.898300 1.611900 2.709650 1.318950 1.463275
#> [17] 1.595375 4.011475 6.708150 2.454250 1.330475 2.420000 3.921700 1.677350
#> [25] 1.235575 1.239350 1.378600 3.112850 3.782525 1.272400

Dawid-Sebastiani Score (DSS)

Wrapper around the dss_sample function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.

true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
dss(true_values, predictions)
#>  [1] -0.04082155  2.64989785  1.05952597  1.77418578  2.33096142  2.28880830
#>  [7]  2.88043082  2.67615262  2.74735141  2.59435643  4.64113487  3.73392082
#> [13]  2.49111636  2.61373979  7.65334601  3.89771750  5.60824739  3.57218468
#> [19]  2.99533531  3.02917094  3.02816481  3.83531485  3.53064794  3.32757518
#> [25]  6.30109942  3.49630274  3.21467954  6.98026829  3.28316432  3.84305718

Log Score

Wrapper around the log_sample function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function should not be used for integer valued forecasts. While Log Scores are in principle possible for integer valued foreasts they require a kernel density estimate which is not well defined for discrete values. Smaller values are better.

true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(n = 30, mean = 1:30))
logs(true_values, predictions)
#>  [1] 1.1222207 0.9270440 1.0222345 3.1530725 3.6011929 1.9018255 1.6107046
#>  [8] 1.2626969 1.0986045 0.8494772 1.8883361 1.0126492 1.0156812 0.9185447
#> [15] 0.9569473 0.8987192 1.0664823 1.6820619 1.1121091 1.1712868 1.0831759
#> [22] 0.9507887 0.9940377 0.9443842 2.1799323 1.1669121 1.9055078 0.9512410
#> [29] 1.2014414 1.8373335

Brier Score

The Brier score is a proper score rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1, the predictions must be a probability that the true outcome will be 1.

The Brier Score is then computed as the mean squared error between the probabilistic prediction and the true outcome.

\[\text{Brier_Score} = \frac{1}{N} \sum_{t = 1}^{n} (\text{prediction}_t - \text{outcome}_t)^2\]

true_values <- sample(c(0,1), size = 30, replace = TRUE)
predictions <- runif(n = 30, min = 0, max = 1)

brier_score(true_values, predictions)
#> [1] 0.478928

Interval Score

The Interval Score is a Proper Scoring Rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.

The score is computed as

\[ \text{score} = (\text{upper} - \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{lower} - \text{true_value}) \cdot 1(\text{true_values} < \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{true_value} - \text{upper}) \cdot 1(\text{true_value} > \text{upper})\]

where \(1()\) is the indicator function and \(\alpha\) is the decimal value that indicates how much is outside the prediction interval. To improve usability, the user is asked to provide an interval range in percentage terms, i.e. interval_range = 90 (percent) for a 90 percent prediction interval. Correspondingly, the user would have to provide the 5% and 95% quantiles (the corresponding alpha would then be 0.1). No specific distribution is assumed, but the range has to be symmetric (i.e you can’t use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper). Setting weigh = TRUE will weigh the score by \(\frac{\alpha}{4}\) such that the Interval Score converges to the CRPS for increasing number of quantiles.

true_values <- rnorm(30, mean = 1:30)
interval_range <- 90
alpha <- (100 - interval_range) / 100
lower <- qnorm(alpha/2, rnorm(30, mean = 1:30))
upper <- qnorm((1- alpha/2), rnorm(30, mean = 1:30))

interval_score(true_values = true_values,
               lower = lower,
               upper = upper,
               interval_range = interval_range)
#>  [1]  1.850049  2.749020  2.913171  4.639153  3.142988 27.992039  2.612730
#>  [8] 13.355411  3.016331  4.243412  4.430815  3.320419 14.135418 16.342170
#> [15] 17.298731  4.927951  4.486411  3.186419  9.378125 17.132747  5.050236
#> [22]  2.930548 33.811266  3.815382  2.893830  5.743579  3.008370  3.413298
#> [29] 31.603861  6.345946

Automatically Scoring Forecasts

The function eval_forecasts automatically scores forecasts. As input you need a data.frame with your predictions. If you want to have one score per observation, you can call the function with summarised = FALSE. By default, the function returns one aggregate score per model. For more information see ?eval_forecasts.

Scoring Probability Binary Forecasts

library(data.table)

# load example data
binary_example <- data.table::setDT(scoringutils::binary_example_data)
print(binary_example, 3, 3)
#>     id true_values  model predictions
#>  1:  1           0 model1 0.877283166
#>  2:  2           0 model1 0.858535155
#>  3:  3           0 model1 0.639325110
#> ---                                  
#> 58: 28           1 model2 0.259778344
#> 59: 29           0 model2 0.008900052
#> 60: 30           0 model2 0.728497056
# score forecasts
eval <- scoringutils::eval_forecasts(binary_example)
print(eval)
#>     model Brier_score
#> 1: model1   0.4246968
#> 2: model2   0.3071147
eval <- scoringutils::eval_forecasts(binary_example, summarised = FALSE)
print(eval, 3, 3)
#>     id true_values  model predictions  Brier_score
#>  1:  1           0 model1 0.877283166 7.696258e-01
#>  2:  2           0 model1 0.858535155 7.370826e-01
#>  3:  3           0 model1 0.639325110 4.087366e-01
#> ---                                               
#> 58: 28           1 model2 0.259778344 5.479281e-01
#> 59: 29           0 model2 0.008900052 7.921093e-05
#> 60: 30           0 model2 0.728497056 5.307080e-01

Scoring Quantile Forecasts

quantile_example <- data.table::setDT(scoringutils::quantile_example_data)
print(quantile_example, 3, 3)
#>     true_values id  model   lower_90   lower_50 lower_0 upper_0 upper_50
#>  1:    2.659261  1 model1 -0.6448536  0.3255102     1.0     1.0  1.67449
#>  2:    0.822062  2 model1  0.3551464  1.3255102     2.0     2.0  2.67449
#>  3:    2.865579  3 model1  1.3551464  2.3255102     3.0     3.0  3.67449
#> ---                                                                     
#> 58:   28.726896 28 model2 26.4551464 27.4255102    28.1    28.1 28.77449
#> 59:   29.446186 29 model2 27.4551464 28.4255102    29.1    29.1 29.77449
#> 60:   30.189608 30 model2 28.4551464 29.4255102    30.1    30.1 30.77449
#>      upper_90
#>  1:  2.644854
#>  2:  3.644854
#>  3:  4.644854
#> ---          
#> 58: 29.744854
#> 59: 30.744854
#> 60: 31.744854
eval <- scoringutils::eval_forecasts(quantile_example)
print(eval)
#>     model Interval_Score
#> 1: model1       2.605845
#> 2: model2       2.674873
eval <- scoringutils::eval_forecasts(quantile_example, summarised = FALSE)
print(eval, 3, 3)
#>      id true_values range  model      lower    upper Interval_Score
#>   1:  1    2.659261     0 model1  1.0000000  1.00000       3.318523
#>   2:  1    2.659261     0 model2  1.1000000  1.10000       3.118523
#>   3:  1    2.659261    50 model1  0.3255102  1.67449       5.288066
#>  ---                                                               
#> 178: 30   30.189608    50 model2 29.4255102 30.77449       1.348980
#> 179: 30   30.189608    90 model1 28.3551464 31.64485       3.289707
#> 180: 30   30.189608    90 model2 28.4551464 31.74485       3.289707

Scoring Integer Forecasts

integer_example <- data.table::setDT(scoringutils::integer_example_data)
print(integer_example, 3, 3)
#>        id  model true_values sample predictions
#>     1:  1 model1           2      1           1
#>     2:  1 model1           2      2           2
#>     3:  1 model1           2      3           1
#>    ---                                         
#> 29998: 30 model2          31    498          26
#> 29999: 30 model2          31    499          37
#> 30000: 30 model2          31    500          31
eval <- scoringutils::eval_forecasts(integer_example)
print(eval)
#>     model sharpness      bias      DSS     CRPS  pit_p_val     pit_sd
#> 1: model1   3.78063 0.1440667 3.247092 1.800532 0.23576474 0.06418870
#> 2: model2   3.87947 0.2474667 3.404113 1.955676 0.03237753 0.01031457
eval <- scoringutils::eval_forecasts(integer_example, summarised = FALSE)
print(eval, 3, 3)
#>      model id sharpness   bias       DSS     CRPS  pit_p_val      pit_sd
#>  1: model1  1    1.4826 -0.684 1.0826668 0.703120 0.25227497 0.039734167
#>  2: model2  1    1.4826 -0.410 0.5830578 0.445968 0.03217469 0.007278526
#>  3: model1  2    1.4826 -0.532 1.1906665 0.674004 0.25227497 0.039734167
#> ---                                                                     
#> 58: model2 29    5.9304  0.542 3.9776803 2.406524 0.03217469 0.007278526
#> 59: model1 30    5.9304 -0.106 3.4417915 1.322152 0.25227497 0.039734167
#> 60: model2 30    4.4478 -0.134 3.4220738 1.258856 0.03217469 0.007278526

Scoring Continuous Forecasts

continuous_example <- data.table::setDT(scoringutils::continuous_example_data)
print(continuous_example, 3, 3)
#>       id  model true_values sample  predictions
#>    1:  1 model1  0.03007379      1 -0.203426069
#>    2:  1 model1  0.03007379      2  0.007621269
#>    3:  1 model1  0.03007379      3 -2.086657003
#>   ---                                          
#> 2998: 30 model2 -2.93749990     48  0.333292940
#> 2999: 30 model2 -2.93749990     49 -2.369034642
#> 3000: 30 model2 -2.93749990     50 -0.144544426
eval <- scoringutils::eval_forecasts(continuous_example)
print(eval)
#>     model sharpness         bias      DSS      CRPS     LogS pit_p_val
#> 1: model1 0.9614030 -0.070666667 1.728001 0.7596405 1.826296     2e-05
#> 2: model2 0.9640941 -0.006666667 1.624224 0.7417875 1.804958     2e-05
eval <- scoringutils::eval_forecasts(continuous_example, summarised = FALSE)
print(eval, 3, 3)
#>      model id sharpness  bias         DSS      CRPS     LogS pit_p_val
#>  1: model1  1 1.1283071  0.04 -0.08898351 0.2580341 1.093753     2e-05
#>  2: model2  1 1.0705271 -0.04  0.07987778 0.2556484 1.087163     2e-05
#>  3: model1  2 1.0057182 -0.92  3.06985466 1.2193656 2.370296     2e-05
#> ---                                                                   
#> 58: model2 29 0.9827873 -0.76  1.13727388 0.6572423 1.489301     2e-05
#> 59: model1 30 0.9887154  1.00  8.70563844 2.2980866 4.012829     2e-05
#> 60: model2 30 0.9985101  0.96  6.58413707 2.3375785 3.107910     2e-05