Introduction to epidemiological case definitions with diyar

04 December 2021

In epidemiological analyses, an exact distinction between temporal events is not always possible. Therefore, static but reasonable cut-offs are used to distinguish one occurrence of an event from another. This is an important aspect of most case definitions, for example when distinguishing repeat or recurrent infections from the first occurrence of that infection.

This vignette provides a brief introduction to how this is done with episodes().

We begin by reviewing event_dt below. This is a sample dataset of 11 infection events and includes data on the location and type of infection.

event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
event_dt <- data.frame(date = event_dt)
event_dt$attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
event_dt$attr_2 <- c("Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2", "Ward 2", 
                     "Ward 1", "Ward 1", "Ward 3","Ward 3", "Ward 2")

event_dt
#>          date attr_1 attr_2
#> 1  2021-01-01    BSI Ward 1
#> 2  2021-01-02    UTI Ward 1
#> 3  2021-01-03    RTI Ward 3
#> 4  2021-01-04    RTI Ward 3
#> 5  2021-01-05    BSI Ward 2
#> 6  2021-01-06    BSI Ward 2
#> 7  2021-01-07    BSI Ward 1
#> 8  2021-01-08    RTI Ward 1
#> 9  2021-01-09    RTI Ward 3
#> 10 2021-01-10    BSI Ward 3
#> 11 2021-01-11    RTI Ward 2

The simplest case definitions are those where only one event is retained, making every other event a duplicate (D) of the index event (C).

event_dt$ep1 <- episodes(event_dt$date, episode_type = "fixed")
event_dt
#>          date attr_1 attr_2     ep1
#> 1  2021-01-01    BSI Ward 1 E.1 (C)
#> 2  2021-01-02    UTI Ward 1 E.1 (D)
#> 3  2021-01-03    RTI Ward 3 E.1 (D)
#> 4  2021-01-04    RTI Ward 3 E.1 (D)
#> 5  2021-01-05    BSI Ward 2 E.1 (D)
#> 6  2021-01-06    BSI Ward 2 E.1 (D)
#> 7  2021-01-07    BSI Ward 1 E.1 (D)
#> 8  2021-01-08    RTI Ward 1 E.1 (D)
#> 9  2021-01-09    RTI Ward 3 E.1 (D)
#> 10 2021-01-10    BSI Ward 3 E.1 (D)
#> 11 2021-01-11    RTI Ward 2 E.1 (D)

As seen above, we can use episodes() to create a group identifier unique to each occurrence. Each group is referred to as an episode. The identifier is stored as an S4 (epid) class with useful information about each group in its slots. This information can be accessed with as.data.frame() or as.list().

We can build on this case definition by adding a temporal boundary within which events are considered part of the same occurrence. For example, we can specify that only events within 4 days (3 day-difference) of an index are considered part of the same occurrence.

event_dt$ep2 <- episodes(event_dt$date, 3, episode_type = "fixed")
event_dt
#>          date attr_1 attr_2     ep1     ep2
#> 1  2021-01-01    BSI Ward 1 E.1 (C) E.1 (C)
#> 2  2021-01-02    UTI Ward 1 E.1 (D) E.1 (D)
#> 3  2021-01-03    RTI Ward 3 E.1 (D) E.1 (D)
#> 4  2021-01-04    RTI Ward 3 E.1 (D) E.1 (D)
#> 5  2021-01-05    BSI Ward 2 E.1 (D) E.5 (C)
#> 6  2021-01-06    BSI Ward 2 E.1 (D) E.5 (D)
#> 7  2021-01-07    BSI Ward 1 E.1 (D) E.5 (D)
#> 8  2021-01-08    RTI Ward 1 E.1 (D) E.5 (D)
#> 9  2021-01-09    RTI Ward 3 E.1 (D) E.9 (C)
#> 10 2021-01-10    BSI Ward 3 E.1 (D) E.9 (D)
#> 11 2021-01-11    RTI Ward 2 E.1 (D) E.9 (D)

We can see that every event after the fourth day has now been assigned to a different episode, each with its own index event. This type of episode is referred to as a "fixed" episode.
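
To make the rule concrete, the following is a minimal base R sketch of the fixed-episode logic described above. It is not how episodes() is implemented; fixed_episodes_sketch() is a hypothetical helper written only for illustration, and it assumes point events with no strata or custom sorting.

```r
# Hypothetical helper (not part of diyar): label each event with the
# position of its index event, using a fixed window of `case_length` days.
fixed_episodes_sketch <- function(dates, case_length) {
  dates <- as.Date(dates)
  epid <- integer(length(dates))
  index_pos <- NA_integer_
  # Walk through the events from earliest to latest
  for (i in order(dates)) {
    # Start a new episode when the event falls outside the current window
    if (is.na(index_pos) ||
        as.numeric(dates[i] - dates[index_pos]) > case_length) {
      index_pos <- i
    }
    epid[i] <- index_pos
  }
  epid
}

dates <- seq(as.Date("2021-01-01"), as.Date("2021-01-11"), by = 1)
fixed_episodes_sketch(dates, 3)
#> [1] 1 1 1 1 5 5 5 5 9 9 9
```

The three distinct values (1, 5 and 9) correspond to the index events of E.1, E.5 and E.9 in ep2.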

episodes() can also produce two additional types of episodes: "rolling" and "recursive". See examples of each below.

event_dt$ep3 <- episodes(event_dt$date, 3, episode_type = "rolling")
event_dt$ep4 <- episodes(event_dt$date, 3, episode_type = "recursive")
event_dt
#>          date attr_1 attr_2     ep1     ep2     ep3     ep4
#> 1  2021-01-01    BSI Ward 1 E.1 (C) E.1 (C) E.1 (C) E.1 (C)
#> 2  2021-01-02    UTI Ward 1 E.1 (D) E.1 (D) E.1 (D) E.1 (D)
#> 3  2021-01-03    RTI Ward 3 E.1 (D) E.1 (D) E.1 (D) E.1 (D)
#> 4  2021-01-04    RTI Ward 3 E.1 (D) E.1 (D) E.1 (D) E.1 (D)
#> 5  2021-01-05    BSI Ward 2 E.1 (D) E.5 (C) E.1 (R) E.1 (R)
#> 6  2021-01-06    BSI Ward 2 E.1 (D) E.5 (D) E.1 (D) E.1 (D)
#> 7  2021-01-07    BSI Ward 1 E.1 (D) E.5 (D) E.1 (D) E.1 (D)
#> 8  2021-01-08    RTI Ward 1 E.1 (D) E.5 (D) E.1 (R) E.1 (R)
#> 9  2021-01-09    RTI Ward 3 E.1 (D) E.9 (C) E.1 (D) E.1 (D)
#> 10 2021-01-10    BSI Ward 3 E.1 (D) E.9 (D) E.1 (D) E.1 (D)
#> 11 2021-01-11    RTI Ward 2 E.1 (D) E.9 (D) E.1 (R) E.1 (R)

Here, we see the creation (or rather categorisation) of recurrent events (R). These are recurrences of index events as opposed to their duplicates. For example, ep3 initially ends on 2021-01-04. However, for rolling and recursive episodes, the function takes the additional step of looking for other events within 4 days of 2021-01-04. In the output above, the first record within this period (2021-01-05) is categorised as a recurrent event of the 2021-01-04 event, which itself is a duplicate of the 2021-01-01 event, while the remaining records in the period are flagged as duplicates (D). This chaining process continues until there are no more events within the period of recurrence or it is stopped by the rolls_max argument.
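
The chaining can be sketched in base R as follows. This is an illustrative approximation rather than the package's algorithm: rolling_roles_sketch() is a hypothetical helper, it uses a single length for both the case and recurrence windows, it assumes the last event in each window becomes the next reference point, and it ignores rolls_max.

```r
# Hypothetical helper (not part of diyar): label events as index (C),
# duplicate (D) or recurrent (R) under a simple rolling rule.
rolling_roles_sketch <- function(dates, len) {
  dates <- sort(as.Date(dates))
  n <- length(dates)
  epid <- integer(n)
  role <- character(n)
  i <- 1L
  while (i <= n) {
    # Event i starts a new episode
    epid[i] <- i
    role[i] <- "C"
    start <- i
    ref <- i
    repeat {
      # Later events within `len` days of the current reference event
      in_window <- which(seq_len(n) > ref &
                           as.numeric(dates - dates[ref]) <= len)
      if (length(in_window) == 0L) break
      epid[in_window] <- start
      role[in_window] <- "D"
      # Beyond the first window, the first event found is a recurrence
      if (ref != start) role[in_window[1L]] <- "R"
      ref <- max(in_window)
    }
    i <- ref + 1L
  }
  data.frame(date = dates, epid = epid, role = role)
}

res <- rolling_roles_sketch(
  seq(as.Date("2021-01-01"), as.Date("2021-01-11"), by = 1), 3
)
res$role
#> [1] "C" "D" "D" "D" "R" "D" "D" "R" "D" "D" "R"
```

The roles agree with the ep3 column above, and every record is assigned to the episode that starts at record 1.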

Even though ep1 has the same outcome (one episode) as ep3 and ep4, they are very different case definitions; hence, the structures of their episodes differ. This is demonstrated below by using schema() to visualise the identifiers.

schema(event_dt$ep1, seed = 2, show_labels = c("case_nm", "length_arrow"))

schema(event_dt$ep3, seed = 2, show_labels = c("case_nm", "length_arrow"))

schema(event_dt$ep4, seed = 2, show_labels = c("case_nm", "length_arrow"))

You may have noticed that this process starts from the earliest event and proceeds forward in time. If the from_last argument is set to TRUE, the process happens in the opposite direction.

event_dt$ep5 <- episodes(event_dt$date, 3, episode_type = "rolling", from_last = TRUE)
event_dt
#>          date attr_1 attr_2     ep1     ep2     ep3     ep4      ep5
#> 1  2021-01-01    BSI Ward 1 E.1 (C) E.1 (C) E.1 (C) E.1 (C) E.11 (R)
#> 2  2021-01-02    UTI Ward 1 E.1 (D) E.1 (D) E.1 (D) E.1 (D) E.11 (D)
#> 3  2021-01-03    RTI Ward 3 E.1 (D) E.1 (D) E.1 (D) E.1 (D) E.11 (D)
#> 4  2021-01-04    RTI Ward 3 E.1 (D) E.1 (D) E.1 (D) E.1 (D) E.11 (R)
#> 5  2021-01-05    BSI Ward 2 E.1 (D) E.5 (C) E.1 (R) E.1 (R) E.11 (D)
#> 6  2021-01-06    BSI Ward 2 E.1 (D) E.5 (D) E.1 (D) E.1 (D) E.11 (D)
#> 7  2021-01-07    BSI Ward 1 E.1 (D) E.5 (D) E.1 (D) E.1 (D) E.11 (R)
#> 8  2021-01-08    RTI Ward 1 E.1 (D) E.5 (D) E.1 (R) E.1 (R) E.11 (D)
#> 9  2021-01-09    RTI Ward 3 E.1 (D) E.9 (C) E.1 (D) E.1 (D) E.11 (D)
#> 10 2021-01-10    BSI Ward 3 E.1 (D) E.9 (D) E.1 (D) E.1 (D) E.11 (D)
#> 11 2021-01-11    RTI Ward 2 E.1 (D) E.9 (D) E.1 (R) E.1 (R) E.11 (C)

We can see that in ep5 the last record is now the index of the episode.
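
One way to think about reversing the direction is that the time axis is simply flipped before tracking begins. The base R sketch below applies this idea to fixed episodes for simplicity; fixed_index_sketch() is a hypothetical helper and only approximates what from_last does in episodes().

```r
# Hypothetical helper (not part of diyar): fixed-window tracking that can
# run forwards or backwards in time by negating the time axis.
fixed_index_sketch <- function(times, case_length, from_last = FALSE) {
  t <- if (from_last) -as.numeric(times) else as.numeric(times)
  index <- integer(length(t))
  current <- NA_integer_
  for (i in order(t)) {
    # Start a new episode once the gap from the index exceeds the window
    if (is.na(current) || t[i] - t[current] > case_length) current <- i
    index[i] <- current
  }
  index
}

dates <- seq(as.Date("2021-01-01"), by = 1, length.out = 11)
fixed_index_sketch(dates, 3)
#> [1] 1 1 1 1 5 5 5 5 9 9 9
fixed_index_sketch(dates, 3, from_last = TRUE)
#> [1]  3  3  3  7  7  7  7 11 11 11 11
```

When tracking backwards, the latest record (11) becomes the first index event and the windows extend towards earlier dates.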

So far, we’ve discussed the different types of episodes and the direction of time in which they are tracked. The next important part of the tracking process is the selection of index events. This determines which record is used as the reference point when checking for other events within its window of occurrence. By default, this will be the earliest record in the direction of the tracking process, but this can be changed to a custom preference using the custom_sort argument.

To demonstrate this, let’s preferentially use "RTI" infection events as index events for episodes.

custom_pref <- ifelse(event_dt$attr_1 == "RTI", 1, 2)
event_dt$ep6 <- episodes(event_dt$date, 3, custom_sort = custom_pref, 
                         skip_order = 1)
event_dt$ep7 <- episodes(event_dt$date, 3, custom_sort = custom_pref, 
                         skip_order = 1, from_last = TRUE)
schema(event_dt$ep6, seed = 2, show_labels = c("length_arrow"), 
       custom_label = paste0(decode(event_dt$ep6@case_nm), " (", event_dt$attr_1, ")"))

schema(event_dt$ep7, seed = 2, show_labels = c("length_arrow"), 
       custom_label = paste0(decode(event_dt$ep7@case_nm), " (", event_dt$attr_1, ")"))
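
As a rough illustration of how such a preference changes the choice of index events, the base R sketch below ranks the records by custom_pref and then by date. It assumes ties in the preference are broken chronologically in the direction of tracking, as described above; candidate_order is a name introduced here only for illustration.

```r
attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
dates <- seq(as.Date("2021-01-01"), by = 1, length.out = 11)
custom_pref <- ifelse(attr_1 == "RTI", 1, 2)

# Order in which records would be considered as index events when
# tracking forwards in time: preferred (RTI) records first, earliest first
candidate_order <- order(custom_pref, dates)
candidate_order
#> [1]  3  4  8  9 11  1  2  5  6  7 10

# The same ranking when tracking backwards in time
candidate_order_rev <- order(custom_pref, -as.numeric(dates))
candidate_order_rev
#> [1] 11  9  8  4  3 10  7  6  5  2  1
```

Under these assumptions, the first index event is record 3 (the earliest RTI) when tracking forwards and record 11 (the latest RTI) when tracking backwards.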

Another main aspect of the tracking process is the duration between events. This is handled by the case_length and recurrence_length arguments. A length of 3 means that events within 4 days (3-day difference) of the index event will be linked together as part of the same occurrence. Note that this is 4 days in the direction of the tracking process; therefore, events in the opposite direction will not be considered. We can see this above in the schema for ep6.

If records in both directions need to be considered, then a range can be provided as shown with ep8. Additionally, multiple periods from the index event can be specified by providing a list of lengths or ranges as shown in ep9.

event_dt$ep8 <- episodes(event_dt$date, number_line(-3, 3), episode_type = "fixed", 
                         custom_sort = custom_pref, skip_order = 1)
event_dt$ep9 <- episodes(event_dt$date, list(number_line(-2, -1),
                                             number_line(4, 5),
                                             number_line(7, 9)), episode_type = "fixed", 
                         custom_sort = custom_pref, 
                         skip_order = 1, 
                         skip_if_b4_lengths = TRUE)
schema(event_dt$ep8, seed = 2, show_labels = c("length_arrow"), 
       custom_label = paste0(decode(event_dt$ep8@case_nm), " (", event_dt$attr_1, ")"))

schema(event_dt$ep9, seed = 2, show_labels = c("length_arrow", "length_label"))
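
The effect of a range, or a list of ranges, can be checked by hand. The base R sketch below uses plain numeric pairs as stand-ins for number_line() objects and takes record 3 (2021-01-03, the earliest RTI) as an assumed index event. in_any_range() is a hypothetical helper, and the sketch only tests window membership; it does not reproduce skip_if_b4_lengths or any other part of the tracking process.

```r
dates <- seq(as.Date("2021-01-01"), by = 1, length.out = 11)
index_date <- as.Date("2021-01-03")

# Day difference of every record from the index event
diffs <- as.numeric(dates - index_date)

# Hypothetical helper: TRUE when a difference falls in at least one range
in_any_range <- function(diffs, ranges) {
  hits <- vapply(ranges,
                 function(r) diffs >= r[1] & diffs <= r[2],
                 logical(length(diffs)))
  rowSums(hits) > 0
}

# A single range covering 3 days either side of the index event
which(in_any_range(diffs, list(c(-3, 3))))
#> [1] 1 2 3 4 5 6

# Three separate periods relative to the index event
which(in_any_range(diffs, list(c(-2, -1), c(4, 5), c(7, 9))))
#> [1]  1  2  7  8 10 11
```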

It’s possible to track episodes among events with known start and end points, instead of those occurring at a single point in time. It is worth noting that the reference point for such events is their end point. Therefore, using a case_length of 1 when working with such events is interpreted as 2 days (1-day difference) after the end point of the interval. By default, other events that occur on or after the start point but before the end point of the index event will not be captured. If this is required, use index_window() to get the range required to capture such events, and supply it to the case_length or recurrence_length arguments. See the examples below.

# Dummy data of hospital stays
data(hospital_admissions)
dfr <- hospital_admissions[c("admin_dt", "discharge_dt")]
dfr$admin_period <- number_line(dfr$admin_dt, dfr$discharge_dt)
                                  
# Group overlapping hospital stays
dfr$ep_len1 <- index_window(dfr$admin_period)
dfr$ep1 <- episodes(date = dfr$admin_period, case_length = dfr$ep_len1)

# Group overlapping hospital stays and those within 21 days of the end point of an index hospital stay 
dfr$ep_len2 <- expand_number_line(index_window(dfr$admin_period), 20, "right")
dfr$ep2 <- episodes(date = dfr$admin_period, case_length = dfr$ep_len2)
dfr[c("admin_period", "ep_len1", "ep_len2", "ep1", "ep2")]
#>                admin_period  ep_len1   ep_len2     ep1     ep2
#> 1  2019-01-01 == 2019-01-01   0 == 0   0 -> 20 E.2 (D) E.2 (D)
#> 2  2019-01-01 -> 2019-01-10  -9 -> 0  -9 -> 20 E.2 (C) E.2 (C)
#> 3  2019-01-10 -> 2019-01-13  -3 -> 0  -3 -> 20 E.2 (D) E.2 (D)
#> 4  2019-01-05 -> 2019-01-06  -1 -> 0  -1 -> 20 E.2 (D) E.2 (D)
#> 5  2019-01-05 -> 2019-01-15 -10 -> 0 -10 -> 20 E.2 (D) E.2 (D)
#> 6  2019-01-07 -> 2019-01-15  -8 -> 0  -8 -> 20 E.2 (D) E.2 (D)
#> 7  2019-01-04 -> 2019-01-13  -9 -> 0  -9 -> 20 E.2 (D) E.2 (D)
#> 8  2019-01-20 -> 2019-01-30 -10 -> 0 -10 -> 20 E.8 (C) E.2 (D)
#> 9  2019-01-26 -> 2019-01-31  -5 -> 0  -5 -> 20 E.8 (D) E.2 (D)
#> 10 2019-01-01 -> 2019-01-10  -9 -> 0  -9 -> 20 E.2 (D) E.2 (D)
#> 11 2019-01-20 -> 2019-01-30 -10 -> 0 -10 -> 20 E.8 (D) E.2 (D)
schema(dfr$ep1, seed = 2, show_labels = "length_arrow")

schema(dfr$ep2, seed = 2, show_labels = "length_arrow")
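
The ep_len1 and ep_len2 columns above can be reasoned about with simple arithmetic. The base R sketch below reproduces the window bounds for the first three hospital stays, assuming (as the printed output suggests) that the window runs from the start point to the end point of the interval, measured relative to the end point. The variable names are introduced here for illustration only.

```r
start <- as.Date(c("2019-01-01", "2019-01-01", "2019-01-10"))
end   <- as.Date(c("2019-01-01", "2019-01-10", "2019-01-13"))

# Window that covers the interval itself, relative to its end point
window_left  <- -as.numeric(end - start)
window_right <- rep(0, length(start))
window_left
#> [1]  0 -9 -3

# Extending the right side by 20 also captures events that start
# up to 21 days (20-day difference) after the end point
window_right_extended <- window_right + 20
window_right_extended
#> [1] 20 20 20
```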

Currently, the diyar package recognises 7 mutually exclusive and 2 mutually inclusive ways a pair of intervals can overlap. Please see vignette("number_line") for more details about this. Using case_overlap_methods and recurrence_overlap_methods, the episode tracking process can be made to only capture intervals that overlap by specific methods. See the example below.

# Wrapper function for a fixed episode
episodes_wf <- function(x){
  epids <- episodes(date = dfr$admin_period, 
                    sn = dfr$rd_id, 
                    case_length = index_window(dfr$admin_period), 
                    case_overlap_methods = x)
  return(epids)
}

# Methods 
methods <- list(
  # Identical intervals
  exact = "exact",
  # Intervals with their start or end points within another
  across = "across",
  # Intervals with aligned start points
  aligns_start = "aligns_start",
  # Intervals with aligned end points
  aligns_end = "aligns_end",
  # Intervals with start points that align with the end point of another, and vice versa
  chain = "chain",
  # Intervals occurring completely within others
  inbetween = "inbetween",
  # A combination of `chain` and `inbetween` methods
  cb1 = "chain|inbetween",
  # A combination of `exact`, `chain` and `inbetween` methods
  cb2 = "exact|chain|inbetween",
  # A combination of `across`, `chain` and `aligns_end` methods
  cb3 = "across|chain|aligns_end"
)

epids <- lapply(methods, episodes_wf)
names(epids) <- methods

# Use `schema()` to visualise each.
epids
#> $exact
#>  [1] "E.1 (C)" "E.2 (C)" "E.3 (C)" "E.4 (C)" "E.5 (C)" "E.6 (C)" "E.7 (C)"
#>  [8] "E.8 (C)" "E.9 (C)" "E.2 (D)" "E.8 (D)"
#> 
#> $across
#>  [1] "E.01 (C)" "E.02 (C)" "E.03 (C)" "E.04 (C)" "E.02 (D)" "E.02 (D)"
#>  [7] "E.02 (D)" "E.08 (C)" "E.08 (D)" "E.10 (C)" "E.11 (C)"
#> 
#> $aligns_start
#>  [1] "E.02 (D)" "E.02 (C)" "E.03 (C)" "E.05 (D)" "E.05 (C)" "E.06 (C)"
#>  [7] "E.07 (C)" "E.08 (C)" "E.09 (C)" "E.10 (C)" "E.11 (C)"
#> 
#> $aligns_end
#>  [1] "E.01 (C)" "E.02 (C)" "E.07 (D)" "E.04 (C)" "E.05 (C)" "E.05 (D)"
#>  [7] "E.07 (C)" "E.08 (C)" "E.09 (C)" "E.10 (C)" "E.11 (C)"
#> 
#> $chain
#>  [1] "E.01 (C)" "E.02 (C)" "E.02 (D)" "E.04 (C)" "E.05 (C)" "E.06 (C)"
#>  [7] "E.07 (C)" "E.08 (C)" "E.09 (C)" "E.10 (C)" "E.11 (C)"
#> 
#> $inbetween
#>  [1] "E.01 (C)" "E.02 (C)" "E.05 (D)" "E.02 (D)" "E.05 (C)" "E.06 (C)"
#>  [7] "E.07 (C)" "E.08 (C)" "E.09 (C)" "E.10 (C)" "E.11 (C)"
#> 
#> $`chain|inbetween`
#>  [1] "E.01 (C)" "E.02 (C)" "E.02 (D)" "E.02 (D)" "E.05 (C)" "E.06 (C)"
#>  [7] "E.07 (C)" "E.08 (C)" "E.09 (C)" "E.10 (C)" "E.11 (C)"
#> 
#> $`exact|chain|inbetween`
#>  [1] "E.1 (C)" "E.2 (C)" "E.2 (D)" "E.2 (D)" "E.5 (C)" "E.6 (C)" "E.7 (C)"
#>  [8] "E.8 (C)" "E.9 (C)" "E.2 (D)" "E.8 (D)"
#> 
#> $`across|chain|aligns_end`
#>  [1] "E.01 (C)" "E.02 (C)" "E.02 (D)" "E.04 (C)" "E.02 (D)" "E.02 (D)"
#>  [7] "E.02 (D)" "E.08 (C)" "E.08 (D)" "E.10 (C)" "E.11 (C)"

See overlap_methods$options for a list of supported options.
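
For intuition, the base R sketch below classifies a pair of intervals using the descriptions given in the comments of the methods list above. overlap_type_sketch() is a hypothetical, simplified classifier covering only those six methods; the order of its checks is an assumption, and it should not be read as the package's own logic.

```r
# Hypothetical helper (not part of diyar): classify how two intervals,
# [s1, e1] and [s2, e2], overlap. Assumes s1 <= e1 and s2 <= e2.
overlap_type_sketch <- function(s1, e1, s2, e2) {
  if (s1 == s2 && e1 == e2) return("exact")
  if (s1 == s2) return("aligns_start")
  if (e1 == e2) return("aligns_end")
  if (e1 == s2 || e2 == s1) return("chain")
  if ((s1 < s2 && e2 < e1) || (s2 < s1 && e1 < e2)) return("inbetween")
  if ((s1 < s2 && s2 < e1) || (s2 < s1 && s1 < e2)) return("across")
  "none"
}

d <- function(x) as.Date(x)

# Stay 3 starts on the day stay 2 ends
overlap_type_sketch(d("2019-01-10"), d("2019-01-13"), d("2019-01-01"), d("2019-01-10"))
#> [1] "chain"

# Stay 4 lies completely within stay 2
overlap_type_sketch(d("2019-01-05"), d("2019-01-06"), d("2019-01-01"), d("2019-01-10"))
#> [1] "inbetween"

# Stay 5 starts within stay 2 but ends after it
overlap_type_sketch(d("2019-01-05"), d("2019-01-15"), d("2019-01-01"), d("2019-01-10"))
#> [1] "across"
```

These three pairs match the records that join E.02 under the chain, inbetween and across methods in the output above.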

Additional matching criteria (separate from temporal links) can be implemented by passing a sub_criteria object to the case_sub_criteria and recurrence_sub_criteria arguments. See vignette("links") for more details about sub_criteria objects. In the example below, we specify that only index events that are an RTI or occurred in Ward 2 can start an episode.

match_funx1 <- function(x, y) y == "RTI"
match_funx2 <- function(x, y) y == "Ward 2"
# Sub-criteria 1 - Matching source of infection OR patient location
sub_cri_1 <- sub_criteria(event_dt$attr_1, event_dt$attr_2, 
                          operator = "or", 
                          match_funcs = c(match_funx1, match_funx2))

event_dt$ep10 <- episodes(event_dt$date, case_length = 5, case_sub_criteria = sub_cri_1)
schema(event_dt$ep10, seed = 2, show_labels = c("length_arrow"),
       custom_label = paste0(event_dt$attr_1, " in ", event_dt$attr_2))

Alternatively, we can specify that BSI events or those that occurred in Ward 1 should not be linked to any episode. Note that such events can still start episodes unless the index event (y) is also restricted, as was done for event_dt$ep10.

match_funx3 <- function(x, y) !(x$attr_1 == "BSI" | x$attr_2 == "Ward 1")
equal_funx <- function(x, y) TRUE
# Sub-criteria 2 - Exclude BSI events and events that occurred in Ward 1
sub_cri_2 <- sub_criteria(attrs(attr_1 = event_dt$attr_1, 
                                attr_2 = event_dt$attr_2), 
                          operator = "and", 
                          match_funcs = match_funx3, 
                          equal_funcs = equal_funx)

event_dt$ep11 <- episodes(event_dt$date, case_length = 5, case_sub_criteria = sub_cri_2)
schema(event_dt$ep11, seed = 2, show_labels = c("length_arrow"),
       custom_label = paste0(event_dt$attr_1, " in ", event_dt$attr_2))

We can even combine both conditions into a nested condition.

combined_sub_cri <- sub_criteria(sub_cri_1, sub_cri_2, operator = "and")
event_dt$ep12 <- episodes(event_dt$date, case_length = 5, 
                          case_sub_criteria = combined_sub_cri)
schema(event_dt$ep12, seed = 2, show_labels = c("length_arrow"),
       custom_label = paste0(event_dt$attr_1, " in ", event_dt$attr_2))
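
To see which records each condition singles out, the predicates used in the code above can be evaluated directly over the records in base R. This sketch only shows the logical conditions and how "or" and "and" combine them; it does not reproduce how sub_criteria objects are applied to pairs of records during tracking, and the variable names are introduced here for illustration.

```r
attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
attr_2 <- c("Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2", "Ward 2",
            "Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2")

# Condition behind sub_cri_1: an RTI, or located in Ward 2
cond_1 <- attr_1 == "RTI" | attr_2 == "Ward 2"
which(cond_1)
#> [1]  3  4  5  6  8  9 11

# Condition behind sub_cri_2: neither a BSI nor located in Ward 1
cond_2 <- !(attr_1 == "BSI" | attr_2 == "Ward 1")
which(cond_2)
#> [1]  3  4  9 11

# Nesting both with "and" keeps records that satisfy both conditions
which(cond_1 & cond_2)
#> [1]  3  4  9 11
```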

A sub_criteria object can be very useful when complex rules are required as part of the case definition.

Finally, everything we’ve just discussed can be done separately for different subsets of the dataset by using the strata argument. data_source lets us keep track of the source of records in each episode. data_links places limits on episode creation based on their data_source. The group_stats argument creates useful information for each episode, which is stored in the epid object. To demonstrate these, we’ll modify our first example.

event_dt$ep13 <- episodes(event_dt$date, episode_type = "fixed", 
                          strata = event_dt$attr_2, data_source = event_dt$attr_1, 
                          group_stats = TRUE)

as.data.frame(event_dt$ep13)
#>    epid sn wind_nm     case_nm dist_wind_index dist_epid_index epid_total
#> 1     1  1    Case        Case          0 days          0 days          4
#> 2     1  2    Case Duplicate_C          1 days          1 days          4
#> 3     3  3    Case        Case          0 days          0 days          4
#> 4     3  4    Case Duplicate_C          1 days          1 days          4
#> 5     5  5    Case        Case          0 days          0 days          3
#> 6     5  6    Case Duplicate_C          1 days          1 days          3
#> 7     1  7    Case Duplicate_C          6 days          6 days          4
#> 8     1  8    Case Duplicate_C          7 days          7 days          4
#> 9     3  9    Case Duplicate_C          6 days          6 days          4
#> 10    3 10    Case Duplicate_C          7 days          7 days          4
#> 11    5 11    Case Duplicate_C          6 days          6 days          3
#>    iteration wind_id1 epid_start   epid_end epid_length epid_dataset
#> 1          1        1 2021-01-01 2021-01-08      7 days  BSI,RTI,UTI
#> 2          1        1 2021-01-01 2021-01-08      7 days  BSI,RTI,UTI
#> 3          1        3 2021-01-03 2021-01-10      7 days      BSI,RTI
#> 4          1        3 2021-01-03 2021-01-10      7 days      BSI,RTI
#> 5          1        5 2021-01-05 2021-01-11      6 days      BSI,RTI
#> 6          1        5 2021-01-05 2021-01-11      6 days      BSI,RTI
#> 7          1        1 2021-01-01 2021-01-08      7 days  BSI,RTI,UTI
#> 8          1        1 2021-01-01 2021-01-08      7 days  BSI,RTI,UTI
#> 9          1        3 2021-01-03 2021-01-10      7 days      BSI,RTI
#> 10         1        3 2021-01-03 2021-01-10      7 days      BSI,RTI
#> 11         1        5 2021-01-05 2021-01-11      6 days      BSI,RTI
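
The epid, epid_total and epid_dataset columns above can be cross-checked with a few lines of base R. The sketch below assumes, as in this example, that no case_length is supplied, so each stratum forms a single episode indexed at its earliest record. It is a manual check, not a substitute for episodes().

```r
dates <- seq(as.Date("2021-01-01"), by = 1, length.out = 11)
attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
attr_2 <- c("Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2", "Ward 2",
            "Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2")

# Within each stratum (ward), use the position of the earliest record
# as the episode identifier
epid <- ave(seq_along(dates), attr_2,
            FUN = function(i) i[which.min(dates[i])])
epid
#> [1] 1 1 3 3 5 5 1 1 3 3 5

# Number of records in each record's episode
epid_total <- as.vector(table(epid)[as.character(epid)])
epid_total
#> [1] 4 4 4 4 3 3 4 4 4 4 3

# Distinct data sources (infection types) contributing to each episode
epid_dataset <- tapply(attr_1, epid,
                       function(x) paste(sort(unique(x)), collapse = ","))
```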