Getting started

Nikos Bosse

2022-08-16

The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score probabilistic forecasts against the true observed values. The scoringutils package offers convenient automated forecast evaluation in a data.table format (using the function score()), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matriced they can build upon in other applications. In addition it implements a wide range of flexible plots that are able to cover many use cases.

The goal of this package is to provide a tested and reliable collection of metrics that can be used for scoring probabilistic forecasts (forecasts with a full predictive distribution, rather than point forecasts). It has a much stronger focus on convenience than e.g. the scoringRules package, which provides a comprehensive collection of proper scoring rules (also used in scoringutils). In contrast to other packages, scoringutils offers functionality to automatically evaluate forecasts, to visualise scores and to obtain relative scores between models.

Predictions can be handled in various formats: scoringutils can handle probabilistic forecasts in either a sample based or a quantile based format. For more detail on the expected input formats please see below. True values can be integer, continuous or binary.

Input formats

Most of the time, the score() function will be able to do the entire evaluation for you. All you need to do is to pass in a data.frame with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see example_quantile for the expected format), based on predictive samples (see example_continuous and example_integer for the expected format in each case) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below):

Format Required columns
quantile-based ‘true_value’, ‘prediction’, ‘quantile’
sample-based ‘true_value’, ‘prediction’,
‘sample’
binary ‘true_value’, ‘prediction’
pairwise-comparisons additionally a column ‘model’

Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. It is important, that there are only columns present which are relevant in order to group forecasts. A combination of different columns should uniquely define the unit of a single forecast, meaning that a single forecast is defined by the values in the other columns.

Checking the input data

The function check_forecasts() can be used to check the input data. It gives a summary of what scoringutils thinks you are trying to achieve. It infers the type of the prediction target, the prediction format, and the unit of a single forecasts, gives an overview of the number of unique values per column (helpful for spotting missing data) and returns warnings or errors.

head(example_quantile)
#>    location target_end_date target_type true_value location_name forecast_date
#> 1:       DE      2021-01-02       Cases     127300       Germany          <NA>
#> 2:       DE      2021-01-02      Deaths       4534       Germany          <NA>
#> 3:       DE      2021-01-09       Cases     154922       Germany          <NA>
#> 4:       DE      2021-01-09      Deaths       6117       Germany          <NA>
#> 5:       DE      2021-01-16       Cases     110183       Germany          <NA>
#> 6:       DE      2021-01-16      Deaths       5867       Germany          <NA>
#>    quantile prediction model horizon
#> 1:       NA         NA  <NA>      NA
#> 2:       NA         NA  <NA>      NA
#> 3:       NA         NA  <NA>      NA
#> 4:       NA         NA  <NA>      NA
#> 5:       NA         NA  <NA>      NA
#> 6:       NA         NA  <NA>      NA
check_forecasts(example_quantile)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#> Your forecasts seem to be for a target of the following type:
#> $target_type
#> [1] "integer"
#> 
#> and in the following format:
#> $prediction_type
#> [1] "quantile"
#> 
#> The unit of a single forecast is defined by:
#> $forecast_unit
#> [1] "location"        "target_end_date" "target_type"     "location_name"  
#> [5] "forecast_date"   "model"           "horizon"        
#> 
#> Cleaned data, rows with NA values in prediction or true_value removed:
#> $cleaned_data
#>        location target_end_date target_type true_value location_name
#>     1:       DE      2021-05-08       Cases     106987       Germany
#>     2:       DE      2021-05-08       Cases     106987       Germany
#>     3:       DE      2021-05-08       Cases     106987       Germany
#>     4:       DE      2021-05-08       Cases     106987       Germany
#>     5:       DE      2021-05-08       Cases     106987       Germany
#>    ---                                                              
#> 20397:       IT      2021-07-24      Deaths         78         Italy
#> 20398:       IT      2021-07-24      Deaths         78         Italy
#> 20399:       IT      2021-07-24      Deaths         78         Italy
#> 20400:       IT      2021-07-24      Deaths         78         Italy
#> 20401:       IT      2021-07-24      Deaths         78         Italy
#>        forecast_date quantile prediction                 model horizon
#>     1:    2021-05-03    0.010      82466 EuroCOVIDhub-ensemble       1
#>     2:    2021-05-03    0.025      86669 EuroCOVIDhub-ensemble       1
#>     3:    2021-05-03    0.050      90285 EuroCOVIDhub-ensemble       1
#>     4:    2021-05-03    0.100      95341 EuroCOVIDhub-ensemble       1
#>     5:    2021-05-03    0.150      99171 EuroCOVIDhub-ensemble       1
#>    ---                                                                
#> 20397:    2021-07-12    0.850        352  epiforecasts-EpiNow2       2
#> 20398:    2021-07-12    0.900        397  epiforecasts-EpiNow2       2
#> 20399:    2021-07-12    0.950        499  epiforecasts-EpiNow2       2
#> 20400:    2021-07-12    0.975        611  epiforecasts-EpiNow2       2
#> 20401:    2021-07-12    0.990        719  epiforecasts-EpiNow2       2
#> 
#> Number of unique values per column per model:
#> $unique_values
#>                    model location target_end_date target_type true_value
#> 1: EuroCOVIDhub-ensemble        4              12           2         96
#> 2: EuroCOVIDhub-baseline        4              12           2         96
#> 3:  epiforecasts-EpiNow2        4              12           2         95
#> 4:       UMass-MechBayes        4              12           1         48
#>    location_name forecast_date quantile prediction horizon
#> 1:             4            11       23       3969       3
#> 2:             4            11       23       3733       3
#> 3:             4            11       23       3903       3
#> 4:             4            11       23       1058       3
#> 
#> $messages
#> [1] "144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected."

If you are unsure what your input data should look like, have a look at the example_quantile, example_integer, example_continuous and example_binary data sets provided in the package.

Showing available forecasts

The function avail_forecasts() may also be helpful to determine where forecasts are available. Using the by argument you can specify the level of summary. For example, to see how many forecasts there are per model and target_type, we can run

avail_forecasts(example_quantile, by = c("model", "target_type"))
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                    model target_type Number forecasts
#> 1: EuroCOVIDhub-ensemble       Cases              128
#> 2: EuroCOVIDhub-baseline       Cases              128
#> 3:  epiforecasts-EpiNow2       Cases              128
#> 4: EuroCOVIDhub-ensemble      Deaths              128
#> 5: EuroCOVIDhub-baseline      Deaths              128
#> 6:       UMass-MechBayes      Deaths              128
#> 7:  epiforecasts-EpiNow2      Deaths              119

We see that ‘epiforecasts-EpiNow2’ has some missing forecasts for the deaths forecast target and that UMass-MechBayes has no case forecasts.

This information can also be visualised using the plot_avail_forecasts() function:

example_quantile %>%
  avail_forecasts(by = c("model", "forecast_date", "target_type")) %>%
  plot_avail_forecasts() +
  facet_wrap(~ target_type)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

You can also visualise forecasts directly using the plot_predictions() function:

example_quantile %>%
  make_NA(what = "truth", 
          target_end_date >= "2021-07-15", 
          target_end_date < "2021-05-22"
  ) %>%
  make_NA(what = "forecast",
          model != 'EuroCOVIDhub-ensemble', 
          forecast_date != "2021-06-28"
  ) %>%
  plot_predictions(
    x = "target_end_date",
    by = c("target_type", "location")
  ) +
  facet_wrap(target_type ~ location, ncol = 4, scales = "free") 

Scoring and summarising forecasts

Forecasts can easily be scored using the score() function. This function returns unsumarised scores, which in most cases is not what the user wants. A second function, summarise_scores() takes care of summarising these scores to the level specified by the user. If you like, you can also use sumarise_scores() to round your outputs by specifying e.g. signif() as a summary function.

score(example_quantile) %>%
  head()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>    location target_end_date target_type location_name forecast_date
#> 1:       DE      2021-05-08       Cases       Germany    2021-05-03
#> 2:       DE      2021-05-08       Cases       Germany    2021-05-03
#> 3:       DE      2021-05-08       Cases       Germany    2021-05-03
#> 4:       DE      2021-05-08       Cases       Germany    2021-05-03
#> 5:       DE      2021-05-08       Cases       Germany    2021-05-03
#> 6:       DE      2021-05-08       Cases       Germany    2021-05-03
#>                    model horizon range interval_score dispersion
#> 1: EuroCOVIDhub-baseline       1     0        25620.0        0.0
#> 2: EuroCOVIDhub-baseline       1    10        25599.5      184.5
#> 3: EuroCOVIDhub-baseline       1    10        25599.5      184.5
#> 4: EuroCOVIDhub-baseline       1    20        25481.0      556.0
#> 5: EuroCOVIDhub-baseline       1    20        25481.0      556.0
#> 6: EuroCOVIDhub-baseline       1    30        25270.2      816.2
#>    underprediction overprediction coverage coverage_deviation bias quantile
#> 1:               0          25620        0                0.0 0.95     0.50
#> 2:               0          25415        0               -0.1 0.95     0.45
#> 3:               0          25415        0               -0.1 0.95     0.55
#> 4:               0          24925        0               -0.2 0.95     0.40
#> 5:               0          24925        0               -0.2 0.95     0.60
#> 6:               0          24454        0               -0.3 0.95     0.35
#>    ae_median quantile_coverage
#> 1:     25620              TRUE
#> 2:     25620              TRUE
#> 3:     25620              TRUE
#> 4:     25620              TRUE
#> 5:     25620              TRUE
#> 6:     25620              TRUE
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  kable()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
model target_type interval_score dispersion underprediction overprediction coverage_deviation bias ae_median
EuroCOVIDhub-baseline Cases 28000 4100 10000.0 14000.0 -0.110 0.0980 38000
EuroCOVIDhub-ensemble Cases 18000 3700 4200.0 10000.0 -0.098 -0.0560 24000
epiforecasts-EpiNow2 Cases 21000 5700 3300.0 12000.0 -0.067 -0.0790 28000
EuroCOVIDhub-baseline Deaths 160 91 2.1 66.0 0.120 0.3400 230
EuroCOVIDhub-ensemble Deaths 41 30 4.1 7.1 0.200 0.0730 53
UMass-MechBayes Deaths 53 27 17.0 9.0 -0.023 -0.0220 78
epiforecasts-EpiNow2 Deaths 67 32 16.0 19.0 -0.043 -0.0051 100

The by argument can be used to define the level of summary. By default, by = NULL is set to the unit of a single forecast. For quantile-based forecasts, unsummarised scores are returned for every quantile individually. It can therefore make sense to run summarise_scoreseven without any arguments provided.

score(example_quantile) %>%
  summarise_scores()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>      location target_end_date target_type location_name forecast_date
#>   1:       DE      2021-05-08       Cases       Germany    2021-05-03
#>   2:       DE      2021-05-08       Cases       Germany    2021-05-03
#>   3:       DE      2021-05-08       Cases       Germany    2021-05-03
#>   4:       DE      2021-05-08      Deaths       Germany    2021-05-03
#>   5:       DE      2021-05-08      Deaths       Germany    2021-05-03
#>  ---                                                                 
#> 883:       IT      2021-07-24      Deaths         Italy    2021-07-05
#> 884:       IT      2021-07-24      Deaths         Italy    2021-07-12
#> 885:       IT      2021-07-24      Deaths         Italy    2021-07-12
#> 886:       IT      2021-07-24      Deaths         Italy    2021-07-12
#> 887:       IT      2021-07-24      Deaths         Italy    2021-07-12
#>                      model horizon interval_score  dispersion underprediction
#>   1: EuroCOVIDhub-baseline       1    16925.04696 1649.220870       0.0000000
#>   2: EuroCOVIDhub-ensemble       1     7990.85478 5440.985217       0.0000000
#>   3:  epiforecasts-EpiNow2       1    25395.96087 8173.700000       0.0000000
#>   4: EuroCOVIDhub-baseline       1       46.79304   44.662609       0.0000000
#>   5: EuroCOVIDhub-ensemble       1       53.88000   53.271304       0.6086957
#>  ---                                                                         
#> 883:  epiforecasts-EpiNow2       3       19.76261   14.284348       0.0000000
#> 884: EuroCOVIDhub-baseline       2       80.33696   76.728261       0.0000000
#> 885: EuroCOVIDhub-ensemble       2       18.65870   13.354348       0.0000000
#> 886:       UMass-MechBayes       2       25.58174    7.755652       0.0000000
#> 887:  epiforecasts-EpiNow2       2       66.16174   25.553043       0.0000000
#>      overprediction coverage_deviation  bias ae_median
#>   1:   15275.826087        -0.38521739  0.95     25620
#>   2:    2549.869565         0.04956522  0.50     12271
#>   3:   17222.260870        -0.29826087  0.90     44192
#>   4:       2.130435         0.22347826  0.30        15
#>   5:       0.000000         0.39739130 -0.10        14
#>  ---                                                  
#> 883:       5.478261         0.04956522  0.50        26
#> 884:       3.608696         0.31043478  0.20        53
#> 885:       5.304348         0.13652174  0.40        30
#> 886:      17.826087        -0.21130435  0.80        46
#> 887:      40.608696        -0.29826087  0.90       108

Scoring point forecasts

Point forecasts can be scored by making use of the quantile-based format, but with a value of NA or "point" in the quantile column. Point forecasts can be scored alongside other quantile-based forecasts. As point forecasts will have values of NA for metrics designed for probabilistic forecasts, it is important to use na.rm = TRUE when summarising.

score(example_point) %>% 
  summarise_scores(by = "model", na.rm = TRUE)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                    model interval_score dispersion underprediction
#> 1: EuroCOVIDhub-baseline     14092.7647 2192.26967      4996.92898
#> 2: EuroCOVIDhub-ensemble      8852.4196 1930.80064      2005.56357
#> 3:  epiforecasts-EpiNow2     10659.5125 3084.85850      1604.96135
#> 4:       UMass-MechBayes        51.4781   28.09387        15.32315
#>    overprediction coverage_deviation        bias    ae_point     se_point
#> 1:     6903.56605        0.002102273  0.23905983 19353.42969 2.883446e+09
#> 2:     4916.05540        0.050752841  0.01027027 12077.10156 1.945118e+09
#> 3:     5969.69268       -0.057861612 -0.04915929 14521.10526 2.680928e+09
#> 4:        8.06108       -0.024886364 -0.02847458    78.47656 1.170976e+04
#>    ae_median
#> 1:       NaN
#> 2:       NaN
#> 3:       NaN
#> 4:       NaN

Adding empirical coverage

For quantile-based forecasts we are often interested in specific coverage-levels, for example, what percentage of true values fell between all 50% or the 90% prediction intervals. We can add this information using the function add_coverage(). This function also requires a by argument which defines the level of grouping for which the percentage of true values covered by certain prediction intervals is computed.

score(example_quantile) %>%
  add_coverage(ranges = c(50, 90), by = c("model", "target_type")) %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                    model target_type interval_score dispersion underprediction
#> 1: EuroCOVIDhub-baseline       Cases          28000       4100         10000.0
#> 2: EuroCOVIDhub-baseline      Deaths            160         91             2.1
#> 3: EuroCOVIDhub-ensemble       Cases          18000       3700          4200.0
#> 4: EuroCOVIDhub-ensemble      Deaths             41         30             4.1
#> 5:       UMass-MechBayes      Deaths             53         27            17.0
#> 6:  epiforecasts-EpiNow2       Cases          21000       5700          3300.0
#> 7:  epiforecasts-EpiNow2      Deaths             67         32            16.0
#>    overprediction coverage_deviation    bias ae_median coverage_50 coverage_90
#> 1:        14000.0             -0.110  0.0980     38000        0.33        0.82
#> 2:           66.0              0.120  0.3400       230        0.66        1.00
#> 3:        10000.0             -0.098 -0.0560     24000        0.39        0.80
#> 4:            7.1              0.200  0.0730        53        0.88        1.00
#> 5:            9.0             -0.023 -0.0220        78        0.46        0.88
#> 6:        12000.0             -0.067 -0.0790     28000        0.47        0.79
#> 7:           19.0             -0.043 -0.0051       100        0.42        0.91

Adding relative scores

In order to better compare models against each other we can use relative scores which are computed based on pairwise comparisons (see details below). Relative scores can be added to the evaluation using the function summarise_scores(). This requires a column called ‘model’ to be present. Pairwise comparisons are computed according to the grouping specified in by: essentially, the data.frame with all scores gets split into different data.frames according to the values specified in by and relative scores are computed for every individual group separately. The baseline argument allows us to specify a baseline that can be used to scale relative scores (all scores are divided by the baseline relative score). For example, to obtain relative scores separately for different forecast targets, we can run

score(example_quantile) %>%
  summarise_scores(by = c("model", "target_type"), 
                   relative_skill = TRUE, 
                   baseline = "EuroCOVIDhub-ensemble")
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                    model target_type interval_score dispersion underprediction
#> 1: EuroCOVIDhub-baseline       Cases    28483.57465 4102.50094    10284.972826
#> 2: EuroCOVIDhub-baseline      Deaths      159.40387   91.40625        2.098505
#> 3: EuroCOVIDhub-ensemble       Cases    17943.82383 3663.52458     4237.177310
#> 4: EuroCOVIDhub-ensemble      Deaths       41.42249   30.18099        4.103261
#> 5:       UMass-MechBayes      Deaths       52.65195   26.87239       16.800951
#> 6:  epiforecasts-EpiNow2       Cases    20831.55662 5664.37795     3260.355639
#> 7:  epiforecasts-EpiNow2      Deaths       66.64282   31.85692       15.893314
#>    overprediction coverage_deviation        bias   ae_median relative_skill
#> 1:   14096.100883        -0.11211957  0.09796875 38473.60156      1.2947445
#> 2:      65.899117         0.11614130  0.33906250   233.25781      2.2958723
#> 3:   10043.121943        -0.09785326 -0.05640625 24101.07031      0.8156514
#> 4:       7.138247         0.19528533  0.07265625    53.13281      0.5966310
#> 5:       8.978601        -0.02312500 -0.02234375    78.47656      0.7475873
#> 6:   11906.823030        -0.06660326 -0.07890625 27923.81250      0.9469157
#> 7:      18.892583        -0.04287176 -0.00512605   104.74790      0.9765276
#>    scaled_rel_skill
#> 1:         1.587375
#> 2:         3.848060
#> 3:         1.000000
#> 4:         1.000000
#> 5:         1.253014
#> 6:         1.160932
#> 7:         1.636736

Visualising scores

Coloured table

A simple coloured table can be produced based on the scores:

score(example_integer) %>%
  summarise_scores(by = c("model", "target_type"), na.rm = TRUE) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  plot_score_table(by = "target_type") + 
  facet_wrap(~ target_type, nrow = 1)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Score heatmap

We can also summarise one particular metric across different categories using a simple heatmap:

score(example_continuous) %>%
  summarise_scores(by = c("model", "location", "target_type")) %>%
  plot_heatmap(x = "location", metric = "bias") + 
    facet_wrap(~ target_type)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Weighted interval score components

The weighted interval score can be split up into three components: Over-prediction, under-prediction and dispersion. These can be visualised separately in the following way:

score(example_quantile) %>%
  summarise_scores(by = c("target_type", "model")) %>%
  plot_wis() + 
  facet_wrap(~ target_type, scales = "free")
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Calibration

Calibration is a measure statistical consistency between the forecasts and the observed values. The most common way of assessing calibration (more precisely: probabilistic calibration) are PIT histograms. The probability integral transform (PIT) is equal to the cumulative distribution function of a forecast evaluated at the true observed value. Ideally, pit values should be uniformly distributed after the transformation.

We can compute pit values as such:

example_continuous %>%
  pit(by = "model") 
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                      model pit_value
#>   1: EuroCOVIDhub-baseline     0.025
#>   2: EuroCOVIDhub-baseline     0.525
#>   3: EuroCOVIDhub-baseline     0.000
#>   4: EuroCOVIDhub-baseline     0.000
#>   5: EuroCOVIDhub-baseline     0.200
#>  ---                                
#> 883:       UMass-MechBayes     0.950
#> 884:       UMass-MechBayes     0.500
#> 885:       UMass-MechBayes     0.100
#> 886:       UMass-MechBayes     0.450
#> 887:       UMass-MechBayes     0.100

And visualise the results as such:

example_continuous %>%
  pit(by = c("model", "target_type")) %>%
  plot_pit() + 
  facet_grid(model ~ target_type)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Similarly for quantile-based forecasts:

example_quantile[quantile %in% seq(0.1, 0.9, 0.1), ] %>%
  pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)

Another way to look at calibration are interval coverage and quantile coverage. Interval coverage is the percentage of true values that fall inside a given central prediction interval. Quantile coverage is the percentage of observed values that fall below a given quantile level.

In order to plot interval coverage, you need to include “range” in the by argument to summarise_scores(). The green area on the plot marks conservative behaviour, i.e. your empirical coverage is greater than it nominally need be (e.g. 55% of true values covered by all 50% central prediction intervals.)

example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "range")) %>%
  plot_interval_coverage()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

To visualise quantile coverage, you need to include “quantile” in by. Again, the green area corresponds to conservative forecasts, where central prediction intervals would cover more than needed.

example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "quantile")) %>%
  plot_quantile_coverage()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Pairwise comparisons

Relative scores for different models can be computed using pairwise comparisons, a sort of pairwise tournament where all combinations of two models are compared against each other based on the overlapping set of available forecasts common to both models. Internally, a ratio of the mean scores of both models is computed. The relative score of a model is then the geometric mean of all mean score ratios which involve that model. When a baseline is provided, then that baseline is excluded from the relative scores for individual models (which therefore differ slightly from relative scores without a baseline) and all relative scores are scaled by (i.e. divided by) the relative score of the baseline model.

In scoringutils, pairwise comparisons can be made in two ways: Through the standalone function pairwise_comparison() or from within summarise_scores() which simply adds relative scores to an existing set of scores.

example_quantile %>%
  score() %>%
  pairwise_comparison(by = "model", baseline = "EuroCOVIDhub-baseline")
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                     model       compare_against mean_scores_ratio         pval
#>  1: EuroCOVIDhub-baseline  epiforecasts-EpiNow2         1.3703452 9.164893e-18
#>  2: EuroCOVIDhub-baseline EuroCOVIDhub-ensemble         1.5925819 2.608666e-32
#>  3: EuroCOVIDhub-baseline       UMass-MechBayes         3.0275019 2.627464e-20
#>  4: EuroCOVIDhub-baseline EuroCOVIDhub-baseline         1.0000000 1.000000e+00
#>  5: EuroCOVIDhub-ensemble       UMass-MechBayes         0.7867229 1.244731e-04
#>  6: EuroCOVIDhub-ensemble  epiforecasts-EpiNow2         0.8606607 1.881520e-02
#>  7: EuroCOVIDhub-ensemble EuroCOVIDhub-baseline         0.6279112 2.608666e-32
#>  8: EuroCOVIDhub-ensemble EuroCOVIDhub-ensemble         1.0000000 1.000000e+00
#>  9:       UMass-MechBayes EuroCOVIDhub-ensemble         1.2710955 1.244731e-04
#> 10:       UMass-MechBayes  epiforecasts-EpiNow2         0.7439673 7.253878e-03
#> 11:       UMass-MechBayes EuroCOVIDhub-baseline         0.3303053 2.627464e-20
#> 12:       UMass-MechBayes       UMass-MechBayes         1.0000000 1.000000e+00
#> 13:  epiforecasts-EpiNow2       UMass-MechBayes         1.3441452 7.253878e-03
#> 14:  epiforecasts-EpiNow2 EuroCOVIDhub-ensemble         1.1618981 1.881520e-02
#> 15:  epiforecasts-EpiNow2 EuroCOVIDhub-baseline         0.7297431 9.164893e-18
#> 16:  epiforecasts-EpiNow2  epiforecasts-EpiNow2         1.0000000 1.000000e+00
#>         adj_pval relative_skill scaled_rel_skill
#>  1: 3.665957e-17      1.6032604        1.0000000
#>  2: 1.565200e-31      1.6032604        1.0000000
#>  3: 1.313732e-19      1.6032604        1.0000000
#>  4: 1.000000e+00      1.6032604        1.0000000
#>  5: 3.734192e-04      0.8074916        0.5036559
#>  6: 1.881520e-02      0.8074916        0.5036559
#>  7: 1.565200e-31      0.8074916        0.5036559
#>  8: 1.000000e+00      0.8074916        0.5036559
#>  9: 3.734192e-04      0.7475873        0.4662919
#> 10: 1.450776e-02      0.7475873        0.4662919
#> 11: 1.313732e-19      0.7475873        0.4662919
#> 12: 1.000000e+00      0.7475873        0.4662919
#> 13: 1.450776e-02      1.0332277        0.6444541
#> 14: 1.881520e-02      1.0332277        0.6444541
#> 15: 3.665957e-17      1.0332277        0.6444541
#> 16: 1.000000e+00      1.0332277        0.6444541
example_quantile %>%
  score() %>%
  summarise_scores(
    by = "model", relative_skill = TRUE, baseline = "EuroCOVIDhub-baseline"
  )
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                    model interval_score dispersion underprediction
#> 1: EuroCOVIDhub-baseline    14321.48926 2096.95360      5143.53567
#> 2: EuroCOVIDhub-ensemble     8992.62316 1846.85278      2120.64029
#> 3:       UMass-MechBayes       52.65195   26.87239        16.80095
#> 4:  epiforecasts-EpiNow2    10827.40786 2950.73422      1697.23411
#>    overprediction coverage_deviation        bias   ae_median relative_skill
#> 1:    7081.000000         0.00201087  0.21851562 19353.42969      1.6032604
#> 2:    5025.130095         0.04871603  0.00812500 12077.10156      0.8074916
#> 3:       8.978601        -0.02312500 -0.02234375    78.47656      0.7475873
#> 4:    6179.439535        -0.05516986 -0.04336032 14521.10526      1.0332277
#>    scaled_rel_skill
#> 1:        1.0000000
#> 2:        0.5036559
#> 3:        0.4662919
#> 4:        0.6444541

If using the pairwise_comparison() function, we can also visualise pairwise comparisons by showing the mean score ratios between models. By default, smaller values are better and the model we care about is showing on the y axis on the left, while the model against it is compared is shown on the x-axis on the bottom. In the example above, the EuroCOVIDhub-ensemble performs best (it only has values smaller 1), while the EuroCOVIDhub-baseline performs worst (and only has values larger than 1). For cases, the UMass-MechBayes model is of course excluded as there are no case forecasts available and therefore the set of overlapping forecasts is empty.

example_quantile %>%
  score() %>%
  pairwise_comparison(by = c("model", "target_type")) %>%
  plot_pairwise_comparison() +
  facet_wrap(~ target_type)
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Additional analyses and visualisations

Correlation between scores

It may sometimes be interesting to see how different scores correlate with each other. We can examine this using the function correlation(). When dealing with quantile-based forecasts, it is important to call summarise_scorees() before correlation() to summarise over quantiles before computing correlations.

example_quantile %>%
  score() %>%
  summarise_scores() %>%
  correlation()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>    interval_score dispersion underprediction overprediction coverage_deviation
#> 1:           1.00       0.46            0.28           0.94              -0.34
#> 2:           0.46       1.00            0.15           0.32              -0.12
#> 3:           0.28       0.15            1.00          -0.03              -0.33
#> 4:           0.94       0.32           -0.03           1.00              -0.25
#> 5:          -0.34      -0.12           -0.33          -0.25               1.00
#> 6:           0.11       0.11           -0.35           0.22               0.06
#> 7:           0.99       0.54            0.34           0.90              -0.38
#>     bias ae_median             metric
#> 1:  0.11      0.99     interval_score
#> 2:  0.11      0.54         dispersion
#> 3: -0.35      0.34    underprediction
#> 4:  0.22      0.90     overprediction
#> 5:  0.06     -0.38 coverage_deviation
#> 6:  1.00      0.10               bias
#> 7:  0.10      1.00          ae_median

Visualising correlations:

example_quantile %>%
  score() %>%
  summarise_scores() %>%
  correlation() %>%
  plot_correlation()
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Scores by interval ranges

If you would like to see how different forecast interval ranges contribute to average scores, you can visualise scores by interval range:

example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "range", "target_type")) %>%
  plot_ranges() +
  facet_wrap(~ target_type, scales = "free")
#> The following messages were produced when checking inputs:
#> 1.  144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.

Tips and tricks - converting to sample-based forecasts

Different metrics are available for different forecasting formats. In some cases, you may for example have forecasts in a sample-based format, but wish to make use of some of the functionality only available to quantile-based forecasts. For example, you may want to use the decomposition of the weighted interval score, or may like to compute interval coverage values.

You can convert your forecasts into a sample-based format using the function sample_to_quantile(). There is, however, one caveat: Quantiles will be calculated based on the predictive samples, which may introduce a bias if the number of available samples is small.

example_integer %>%
  sample_to_quantile(
    quantiles = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)
  ) %>%
  score() %>%
  add_coverage(by = c("model", "target_type"))
#> The following messages were produced when checking inputs:
#> 1.  3312 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#>                        model target_type location location_name target_end_date
#>     1: EuroCOVIDhub-baseline       Cases       DE       Germany      2021-05-08
#>     2: EuroCOVIDhub-baseline       Cases       DE       Germany      2021-05-08
#>     3: EuroCOVIDhub-baseline       Cases       DE       Germany      2021-05-08
#>     4: EuroCOVIDhub-baseline       Cases       DE       Germany      2021-05-08
#>     5: EuroCOVIDhub-baseline       Cases       DE       Germany      2021-05-08
#>    ---                                                                         
#> 20397:  epiforecasts-EpiNow2      Deaths       IT         Italy      2021-07-24
#> 20398:  epiforecasts-EpiNow2      Deaths       IT         Italy      2021-07-24
#> 20399:  epiforecasts-EpiNow2      Deaths       IT         Italy      2021-07-24
#> 20400:  epiforecasts-EpiNow2      Deaths       IT         Italy      2021-07-24
#> 20401:  epiforecasts-EpiNow2      Deaths       IT         Italy      2021-07-24
#>        forecast_date horizon range interval_score dispersion underprediction
#>     1:    2021-05-03       1     0    29379.00000    0.00000               0
#>     2:    2021-05-03       1    10    27857.47500 3012.07500               0
#>     3:    2021-05-03       1    10    27857.47500 3012.07500               0
#>     4:    2021-05-03       1    20    26443.96000 4109.76000               0
#>     5:    2021-05-03       1    20    26443.96000 4109.76000               0
#>    ---                                                                      
#> 20397:    2021-07-12       2    90       21.21000   21.21000               0
#> 20398:    2021-07-12       2    95       15.00563   15.00563               0
#> 20399:    2021-07-12       2    95       15.00563   15.00563               0
#> 20400:    2021-07-12       2    98        6.28890    6.28890               0
#> 20401:    2021-07-12       2    98        6.28890    6.28890               0
#>        overprediction coverage coverage_deviation bias quantile ae_median
#>     1:        29379.0        0               0.00 0.98    0.500   29379.0
#>     2:        24845.4        0              -0.10 0.98    0.450   29379.0
#>     3:        24845.4        0              -0.10 0.98    0.550   29379.0
#>     4:        22334.2        0              -0.20 0.98    0.400   29379.0
#>     5:        22334.2        0              -0.20 0.98    0.600   29379.0
#>    ---                                                                   
#> 20397:            0.0        1               0.10 0.70    0.950      73.5
#> 20398:            0.0        1               0.05 0.70    0.025      73.5
#> 20399:            0.0        1               0.05 0.70    0.975      73.5
#> 20400:            0.0        1               0.02 0.70    0.010      73.5
#> 20401:            0.0        1               0.02 0.70    0.990      73.5
#>        quantile_coverage coverage_50 coverage_90
#>     1:              TRUE   0.3828125   0.7265625
#>     2:              TRUE   0.3828125   0.7265625
#>     3:              TRUE   0.3828125   0.7265625
#>     4:              TRUE   0.3828125   0.7265625
#>     5:              TRUE   0.3828125   0.7265625
#>    ---                                          
#> 20397:              TRUE   0.5462185   0.9243697
#> 20398:             FALSE   0.5462185   0.9243697
#> 20399:              TRUE   0.5462185   0.9243697
#> 20400:             FALSE   0.5462185   0.9243697
#> 20401:              TRUE   0.5462185   0.9243697

Available metrics

An overview of available metrics can be found in the metrics data set that is included in the package.

metrics
#>                                           Metric               Name
#>  1:                               Absolute error           ae_point
#>  2:                               Absolute error          ae_median
#>  3:                                Squared error           se_point
#>  4:                                Squared error            se_mean
#>  5: (Continuous) ranked probability score (CRPS)               crps
#>  6:                                    Log score          log_score
#>  7:              (Weighted) interval score (WIS)     interval_score
#>  8:                 Dawid-Sebastiani score (DSS)                dss
#>  9:                             Brier score (BS)        brier_score
#> 10:                            Interval coverage           coverage
#> 11:                           Coverage deviation coverage_deviation
#> 12:                            Quantile coverage  quantile_coverage
#> 13:                                   Dispersion         dispersion
#> 14:       Median Absolute Deviation (Dispersion)                mad
#> 15:                      Under-, Over-prediction    underprediction
#> 16:                      Under-, Over-prediction     overprediction
#> 17:         Probability integral transform (PIT)               crps
#> 18:                                   Dispersion         dispersion
#> 19:                                         Bias               bias
#> 20:                             Mean score ratio  mean_scores_ratio
#> 21:                      (Scaled) Relative skill     relative_skill
#> 22:                      (Scaled) Relative skill   scaled_rel_skill
#>                                           Metric               Name
#>                                   Functions Discrete Continuous Binary Quantile
#>  1: score(), ae_point()), ae_median_sample(        +          +      -        +
#>  2: score(), ae_point()), ae_median_sample(        +          +      -        +
#>  3:   score(), se_point(), se_mean_sample()        +          +      -        +
#>  4:   score(), se_point(), se_mean_sample()        +          +      -        +
#>  5:                       score(), ae_point        +          +      -        -
#>  6:   score(), logs_sample(), logs_binary()        -          +      +        -
#>  7:               score(), interval_score()        +          +      -        +
#>  8:                   score(), dss_sample()        +          +      -        -
#>  9:                  score(), brier_score()        -          -      +        -
#> 10:                                 score()        -          -      -        +
#> 11:                                 score()        -          -      -        +
#> 12:                                 score()        +          +      -        -
#> 13:               score(), interval_score()        -          -      -        +
#> 14:                   score(), mad_sample()        +          +      -        -
#> 15:               score(), interval_score()        -          -      -        +
#> 16:               score(), interval_score()        -          -      -        +
#> 17:                          score(), pit()        +          +      -        +
#> 18:               score(), interval_score()        -          -      -        +
#> 19: score(), bias_sample(), bias_quantile()        +          +      -        +
#> 20:                   pairwise_comparison()        ~          ~      ~        ~
#> 21:          score(), pairwise_comparison()        ~          ~      ~        ~
#> 22:          score(), pairwise_comparison()        ~          ~      ~        ~
#>                                   Functions Discrete Continuous Binary Quantile
#>                                                                                                                                                                                 Info
#>  1:                                                                                                                     Suitable for scoring the median of a predictive distribution
#>  2:                                                                                                                     Suitable for scoring the median of a predictive distribution
#>  3:                                                                                                                      Suitable for scoring the mean of a predictive distribution.
#>  4:                                                                                                                      Suitable for scoring the mean of a predictive distribution.
#>  5: Proper scoring rule (smaller is better), takes entire predictive distribution into account (global), penalises over- and under-confidence similarly, stable handling of outliers
#>  6:                 Proper scoring rule, smaller is better, only evaluates predictive density at observed value (local), penalises over-confidence severely, susceptible to outliers
#>  7:                                    Proper scoring rule, smaller is better, similar properties to CRPS and converges to CRPS for an increasing number of equally spaced intervals
#>  8: Proper scoring rule, smaller is better, evaluates forecast based on mean and sd of predictive distribution (global), susceptible to outliers, penalises over-confidence severely
#>  9:                                                          Proper scoring rule, smaller is better, equals CRPS for binary outcomes, penalises over- and under-confidence similarly
#> 10:                       Proportion of observations falling inside a given central prediction interval (= 'empirical interval coverage'). Used to assess probabilistic calibration.
#> 11:                                                                     Average difference between empirical and nominal interval coverage (coverage that should have been realised)
#> 12:                                                               Proportion of observations below a given quantile of the predictive CDF. Used to assess probabilistic calibration.
#> 13:                                                                                                             Dispersion component of WIS, measures width of predictive intervals.
#> 14:                                                                                          Measure for dispersion of a forecast: median of the absolute deviations from the median
#> 15:                                                                                                                  Absolute amount of over-or under-prediction (components of WIS)
#> 16:                                                                                                                  Absolute amount of over-or under-prediction (components of WIS)
#> 17:                                                         PIT transform is the CDF of the predictive distribution evaluated at the observed values. PIT values should be uniform. 
#> 18:                                                                                                             Dispersion component of WIS, measures width of predictive intervals.
#> 19:                                                             Measure of relative tendency to over- or under-predict (aspect of calibration), bounded between -1 and 1 (ideally 0)
#> 20:                                                                                   Compares performance of two models. Properties depend on the metric chosen for the comparison.
#> 21:                               Ranks models based on pairwise comparisons, useful in the context of missing forecasts. Properties depend on the metric chosen for the comparison.
#> 22:                               Ranks models based on pairwise comparisons, useful in the context of missing forecasts. Properties depend on the metric chosen for the comparison.
#>                                                                                                                                                                                 Info

Lower level functions

Most of these metrics are available as lower level functions and extensively documented. Have a look at the help files to understand these better.