Data required for epidemiological investigations are usually stored in different databases. Consolidating these datasets is often the first step for these investigations. This vignette provides a brief introduction to the basics of record linkage as implemented by diyar
.
Let’s begin by reviewing missing_staff_id
- a sample dataset containing incomplete staff information.
data(missing_staff_id)
missing_staff_id#> r_id staff_id age initials hair_colour branch_office source_1 source_2
#> 1 1 NA 30 G.D. Brown Republic of Ghana A 3
#> 2 2 NA 30 B.G. Teal France A 1
#> 3 3 NA 30 X.P. <NA> <NA> A 1
#> 4 4 NA 30 X.P. Green <NA> B 1
#> 5 5 NA 30 <NA> Green France A 1
#> 6 6 2 30 G.D. Dark brown Ghana A 1
#> 7 7 2 30 G.D. Brown Republic of Ghana B 2
A unique identifier that distinguishes one entity (staff) from another is often unavailable or incomplete as is the case with staff_id
in this example. links()
is used to create one. The identifier is created as an S4
class (pid
) with useful information about each group in its slots.
The simplest strategy would be to select one attribute as a distinguishing characteristic for each entity. This is the simple deterministic approach to record linkage.
In the example below, we use initials
and hair_colour
as distinguishing characteristics.
$p1 <- links(criteria = missing_staff_id$initials)
missing_staff_id$p2 <- links(criteria = missing_staff_id$hair_colour)
missing_staff_idc("initials", "hair_colour", "p1", "p2")]
missing_staff_id[#> initials hair_colour p1 p2
#> 1 G.D. Brown P.1 (CRI 001) P.1 (CRI 001)
#> 2 B.G. Teal P.2 (No hits) P.2 (No hits)
#> 3 X.P. <NA> P.3 (CRI 001) P.3 (No hits)
#> 4 X.P. Green P.3 (CRI 001) P.4 (CRI 001)
#> 5 <NA> Green P.5 (No hits) P.4 (CRI 001)
#> 6 G.D. Dark brown P.1 (CRI 001) P.6 (No hits)
#> 7 G.D. Brown P.1 (CRI 001) P.1 (CRI 001)
Unsurprisingly, the uniqueness of identifiers p1
and p2
correspond to the uniqueness of the initials
and hair_colour
respectively. In this case both strategies represent different outcomes. For example, p1
identifies records 3 and 4 as the same person, while p2
has it as records 4 and 5.
To maximise coverage, links()
can implement an ordered multistage deterministic approach to record linkage. For example, we can say that records with matching initials
should be linked to each other, then other records with a matching hair_colour
should then be added to the group.
$p3 <- links(criteria = as.list(missing_staff_id[c("initials", "hair_colour")]))
missing_staff_idc("initials", "hair_colour", "p1", "p2", "p3")]
missing_staff_id[#> initials hair_colour p1 p2 p3
#> 1 G.D. Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 B.G. Teal P.2 (No hits) P.2 (No hits) P.2 (No hits)
#> 3 X.P. <NA> P.3 (CRI 001) P.3 (No hits) P.3 (CRI 001)
#> 4 X.P. Green P.3 (CRI 001) P.4 (CRI 001) P.3 (CRI 001)
#> 5 <NA> Green P.5 (No hits) P.4 (CRI 001) P.3 (CRI 002)
#> 6 G.D. Dark brown P.1 (CRI 001) P.6 (No hits) P.1 (CRI 001)
#> 7 G.D. Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
We see that p3
now identifies records 3, 4 and 5 as the same person. The logic here is that since record 4 has the same initial
as record 3 and also has the same hair_colour
as record 5, all three are linked together as records of the same entity. Records 3 and 5 are only considered part of the same entity because of their share link with record 4. If record 4 is removed from this dataset, records 3 and 5 will considered part of separate entities. This process is referred to as record expansion.
During record expansion the following rules are implemented.
tie_sort
argument.At each stage, additional matching criteria can be specified. This is done through a sub_criteria
object. This is an S3
class containing attributes to be compared and functions for the comparisons. A sub_criteria
object is used for evaluated, fuzzy and/or nested matches.
For example, we can compare hair_colour
and branch_office
without any order (priority) to them. This is the equivalent of saying matching hair colour OR/AND
branch office.
<- sub_criteria(missing_staff_id$hair_colour,
scri_1 $branch_office,
missing_staff_idoperator = "or")
<- sub_criteria(missing_staff_id$hair_colour,
scri_2 $branch_office,
missing_staff_idoperator = "and")
$p4 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_1),
recursive = TRUE)
$p5 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_2),
recursive = TRUE)
c("hair_colour", "branch_office", "p4", "p5")]
missing_staff_id[#> hair_colour branch_office p4 p5
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.2 (CRI 001) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
There is no limit to the number of sub_criteria
that can be specified but each sub_criteria
must be paired to a criteria
. Any unpaired sub_criteria
will be ignored.
As mentioned, a sub_criteria
can be nested. For example, scri_3
below is the equivalent of saying (scri_1
; matching hair colour OR
branch office) AND
(matching initials OR
branch office).
<- sub_criteria(scri_1,
scri_3 sub_criteria(missing_staff_id$initials,
$branch_office,
missing_staff_idoperator = "or"),
operator = "and")
$p6 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_3),
recursive = TRUE)
c("hair_colour", "branch_office", "p4", "p5", "p6")]
missing_staff_id[#> hair_colour branch_office p4 p5 p6
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits) P.2 (CRI 001)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.2 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits) P.2 (CRI 001)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
Evaluated matches can be implemented with user-defined functions. The only requirement is that they:
x
and y
, where y
is the value for one observation being compared against the value of all other observations - x
.TRUE
or FALSE
.For example, there are variations of the same hair_colour
and branch_office
values in missing_staff_id
. A quick look and we see that using the last word of each value will improve the linkage result. We can pass a custom function to the sub_criteria
object that will make this comparison. After doing this below (p7
), we see that record 6 has now been linked with records 1 and 7, which was not the case earlier.
# A function to extract the last word in a string
<- function(x) tolower(gsub("^.* ", "", x))
last_word_wf # A logical test using `last_word_wf`.
<- function(x, y) last_word_wf(x) == last_word_wf(y)
last_word_cmp
<- sub_criteria(missing_staff_id$hair_colour,
scri_4 $branch_office,
missing_staff_idmatch_funcs = c(last_word_cmp, last_word_cmp),
operator = "or")
$p7 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_4),
recursive = TRUE)
c("hair_colour", "branch_office", "p4", "p5", "p6", "p7")]
missing_staff_id[#> hair_colour branch_office p4 p5 p6
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits) P.2 (CRI 001)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.2 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits) P.2 (CRI 001)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> p7
#> 1 P.1 (CRI 001)
#> 2 P.2 (CRI 001)
#> 3 P.3 (No hits)
#> 4 P.2 (CRI 001)
#> 5 P.2 (CRI 001)
#> 6 P.1 (CRI 001)
#> 7 P.1 (CRI 001)
A sub_criteria
provides a lot flexibly in terms of how attributes are compared however, it can cost a lot in processing time. This is because links()
is an iterative function, comparing batches of record-pairs in iterations. This keeps memory usage down but at the cost of processing time. There are ways to mitigate this such as using the recursive
and check_duplicate
arguments.
In contrast to this, link_records()
creates every possible record-pair of values for each attribute, which are then compared. This is often faster but costs more in memory usage. With a large enough dataset, we will eventually run out of memory, especially when there is no blocking_attribute
. Below we see how link_records()
is used for the same fuzzy matching.
<- link_records(attribute = list(missing_staff_id$hair_colour,
p8 $branch_office),
missing_staff_idcmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = FALSE)
p8#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (CRI 001)" "P.3 (No hits)" "P.2 (CRI 001)"
#> [5] "P.2 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#>
#> $pid_weights
#> sn_x sn_y cmp.var_1 cmp.var_2 cmp.weight record.match
#> 1 1 2 0 0 0 FALSE
#> 2 1 3 0 0 0 FALSE
#> 3 1 4 0 0 0 FALSE
#> 4 1 5 0 0 0 FALSE
#> 5 1 6 1 1 2 TRUE
#> 6 1 7 1 1 2 TRUE
#> 7 2 3 0 0 0 FALSE
#> 8 2 4 0 0 0 FALSE
#> 9 2 5 0 1 1 TRUE
#> 10 2 6 0 0 0 FALSE
#> 11 2 7 0 0 0 FALSE
#> 12 3 4 0 0 0 FALSE
#> 13 3 5 0 0 0 FALSE
#> 14 3 6 0 0 0 FALSE
#> 15 3 7 0 0 0 FALSE
#> 16 4 5 1 0 1 TRUE
#> 17 4 6 0 0 0 FALSE
#> 18 4 7 0 0 0 FALSE
#> 19 5 6 0 0 0 FALSE
#> 20 5 7 0 0 0 FALSE
#> 21 6 7 1 1 2 TRUE
It’s worth noting that link_records()
does not have the step-wise or nested matching features of links()
however, merge_ids()
can be used to achieve the same result by consolidating the result of multiple instances of link_records()
(pid
objects) in the same way links()
would.
The advantage of link_records()
over links()
is it’s ability to implement a probabilistic approach to record linkage. Its implementation of probabilistic record linkage is based on Fellegi and Sunter (1969) model for deciding if two records belong to the same entity. In summary, m_probabilities
and u_probabilities
, which are the probabilities of a true and false match respectively are used to calculate a final match score for each record-pair. Records below or above a certain score_threshold
are considered matches or non-matches respectively. See help(link_records)
for a more detailed explanation of the method. Below we see the same analysis as above but as a probabilistic record linkage.
<- link_records(attribute = list(missing_staff_id$hair_colour,
p9 $branch_office),
missing_staff_idcmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = TRUE)
p9#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#>
#> $pid_weights
#> sn_x sn_y cmp.var_1 cmp.var_2 cmp.weight prb.var_1 prb.var_2 prb.weight
#> 1 1 2 0 0 0 -3.298333 -3.235597 -6.533930
#> 2 1 3 0 0 0 -2.873027 -2.873027 -5.746054
#> 3 1 4 0 0 0 -3.235597 -2.873027 -6.108624
#> 4 1 5 0 0 0 -3.235597 -3.235597 -6.471194
#> 5 1 6 1 1 2 4.466709 4.466709 8.933417
#> 6 1 7 1 1 2 1.659354 1.659354 3.318708
#> 7 2 3 0 0 0 -3.136062 -2.873027 -6.009089
#> 8 2 4 0 0 0 -3.298333 -2.873027 -6.171360
#> 9 2 5 0 1 1 -3.298333 1.659354 -1.638979
#> 10 2 6 0 0 0 -3.328707 -3.298333 -6.627040
#> 11 2 7 0 0 0 -3.298333 -3.235597 -6.533930
#> 12 3 4 0 0 0 -3.358454 -3.358454 -6.716908
#> 13 3 5 0 0 0 -3.358454 -3.358454 -6.716908
#> 14 3 6 0 0 0 -3.358454 -3.358454 -6.716908
#> 15 3 7 0 0 0 -3.358454 -3.358454 -6.716908
#> 16 4 5 1 0 1 1.659354 -3.358454 -1.699100
#> 17 4 6 0 0 0 -3.298333 -3.358454 -6.656787
#> 18 4 7 0 0 0 -3.235597 -3.358454 -6.594051
#> 19 5 6 0 0 0 -3.298333 -3.298333 -6.596666
#> 20 5 7 0 0 0 -3.235597 -3.235597 -6.471194
#> 21 6 7 1 1 2 4.466709 4.466709 8.933417
#> record.match
#> 1 FALSE
#> 2 FALSE
#> 3 FALSE
#> 4 FALSE
#> 5 TRUE
#> 6 TRUE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
#> 11 FALSE
#> 12 FALSE
#> 13 FALSE
#> 14 FALSE
#> 15 FALSE
#> 16 FALSE
#> 17 FALSE
#> 18 FALSE
#> 19 FALSE
#> 20 FALSE
#> 21 TRUE
As mentioned, with a sufficiently large dataset, we can easily run out of memory. If memory usage is a concern, then use links_wf_probabilistic()
instead. links_wf_probabilistic()
is a wrapper function of links()
with a specific sub_criteria
to achieve probabilistic record linkage. It’s less memory intensive but is usually slower. The reasons for this has already been described above.
<- links_wf_probabilistic(attribute = list(missing_staff_id$hair_colour,
p10 $branch_office),
missing_staff_idcmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = TRUE,
recursive = TRUE)
p10#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#>
#> $pid_weights
#> sn_x sn_y cmp.var_1 cmp.var_2 cmp.weight prb.var_1 prb.var_2 prb.weight
#> 1 1 1 1 1 2 1.733354 1.733354 3.466709
#> 2 2 2 NA NA NA NA NA NA
#> 3 3 3 NA NA NA NA NA NA
#> 4 4 4 NA NA NA NA NA NA
#> 5 5 5 NA NA NA NA NA NA
#> 6 6 1 1 1 2 4.466709 4.466709 8.933417
#> 7 7 1 1 1 2 1.733354 1.733354 3.466709
#> record.match
#> 1 TRUE
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 TRUE
#> 7 TRUE
links()
and link_records()
can perform record linkage across multiple datasets in one instance. To achieve this, values from the different datasets should be passed to the functions as single atomic vectors. The corresponding identifier for each dataset should then be passed to the data_source
argument. The strata
argument is a related feature which provides a blocking attribute, limiting the linkage process to subsets of the dataset. See a demonstration of these below.
<- rbind(missing_staff_id[c(4:5, 7)],
triplicate c(4:5, 7)],
missing_staff_id[c(4:5, 7)])
missing_staff_id[$data_source <- c(rep("set_1", 7), rep("set_2", 7), rep("set_3", 7))
triplicate
$p1 <- links(as.list(triplicate[1:2]),
triplicatedata_source = triplicate$data_source,
strata = triplicate$source_1)
$p2 <- links(as.list(triplicate[1:2]),
triplicatestrata = triplicate$data_source,
data_source = triplicate$source_1)
triplicate#> initials hair_colour source_1 data_source p1 p2
#> 1 G.D. Brown A set_1 P.1 (CRI 001) P.01 (CRI 001)
#> 2 B.G. Teal A set_1 P.2 (CRI 001) P.02 (No hits)
#> 3 X.P. <NA> A set_1 P.3 (CRI 001) P.03 (CRI 001)
#> 4 X.P. Green B set_1 P.4 (CRI 001) P.03 (CRI 001)
#> 5 <NA> Green A set_1 P.5 (CRI 002) P.03 (CRI 002)
#> 6 G.D. Dark brown A set_1 P.1 (CRI 001) P.01 (CRI 001)
#> 7 G.D. Brown B set_1 P.7 (CRI 001) P.01 (CRI 001)
#> 8 G.D. Brown A set_2 P.1 (CRI 001) P.08 (CRI 001)
#> 9 B.G. Teal A set_2 P.2 (CRI 001) P.09 (No hits)
#> 10 X.P. <NA> A set_2 P.3 (CRI 001) P.10 (CRI 001)
#> 11 X.P. Green B set_2 P.4 (CRI 001) P.10 (CRI 001)
#> 12 <NA> Green A set_2 P.5 (CRI 002) P.10 (CRI 002)
#> 13 G.D. Dark brown A set_2 P.1 (CRI 001) P.08 (CRI 001)
#> 14 G.D. Brown B set_2 P.7 (CRI 001) P.08 (CRI 001)
#> 15 G.D. Brown A set_3 P.1 (CRI 001) P.15 (CRI 001)
#> 16 B.G. Teal A set_3 P.2 (CRI 001) P.16 (No hits)
#> 17 X.P. <NA> A set_3 P.3 (CRI 001) P.17 (CRI 001)
#> 18 X.P. Green B set_3 P.4 (CRI 001) P.17 (CRI 001)
#> 19 <NA> Green A set_3 P.5 (CRI 002) P.17 (CRI 002)
#> 20 G.D. Dark brown A set_3 P.1 (CRI 001) P.15 (CRI 001)
#> 21 G.D. Brown B set_3 P.7 (CRI 001) P.15 (CRI 001)
The results are different as per the different linkage strategy. We can use as.data.frame
or as.list
to access the contents of each identifier to inspect the difference.
as.data.frame(triplicate$p1)
#> pid sn pid_cri link_id pid_total iteration pid_dataset
#> 1 1 1 1 1 6 1 set_1,set_2,set_3
#> 2 2 2 1 2 3 1 set_1,set_2,set_3
#> 3 3 3 1 3 3 1 set_1,set_2,set_3
#> 4 4 4 1 4 3 1 set_1,set_2,set_3
#> 5 5 5 2 5 3 1 set_1,set_2,set_3
#> 6 1 6 1 1 6 1 set_1,set_2,set_3
#> 7 7 7 1 7 3 1 set_1,set_2,set_3
#> 8 1 8 1 1 6 1 set_1,set_2,set_3
#> 9 2 9 1 2 3 1 set_1,set_2,set_3
#> 10 3 10 1 3 3 1 set_1,set_2,set_3
#> 11 4 11 1 4 3 1 set_1,set_2,set_3
#> 12 5 12 2 5 3 1 set_1,set_2,set_3
#> 13 1 13 1 1 6 1 set_1,set_2,set_3
#> 14 7 14 1 7 3 1 set_1,set_2,set_3
#> 15 1 15 1 1 6 2 set_1,set_2,set_3
#> 16 2 16 1 2 3 1 set_1,set_2,set_3
#> 17 3 17 1 3 3 1 set_1,set_2,set_3
#> 18 4 18 1 4 3 2 set_1,set_2,set_3
#> 19 5 19 2 5 3 1 set_1,set_2,set_3
#> 20 1 20 1 1 6 2 set_1,set_2,set_3
#> 21 7 21 1 7 3 1 set_1,set_2,set_3
as.data.frame(triplicate$p2)
#> pid sn pid_cri link_id pid_total iteration pid_dataset
#> 1 1 1 1 1 3 1 A,B
#> 2 2 2 0 2 1 2 A
#> 3 3 3 1 3 3 1 A,B
#> 4 3 4 1 3 3 1 A,B
#> 5 3 5 2 4 3 1 A,B
#> 6 1 6 1 1 3 1 A,B
#> 7 1 7 1 1 3 2 A,B
#> 8 8 8 1 8 3 1 A,B
#> 9 9 9 0 9 1 2 A
#> 10 10 10 1 10 3 1 A,B
#> 11 10 11 1 10 3 1 A,B
#> 12 10 12 2 11 3 1 A,B
#> 13 8 13 1 8 3 1 A,B
#> 14 8 14 1 8 3 2 A,B
#> 15 15 15 1 15 3 1 A,B
#> 16 16 16 0 16 1 2 A
#> 17 17 17 1 17 3 1 A,B
#> 18 17 18 1 17 3 1 A,B
#> 19 17 19 2 18 3 1 A,B
#> 20 15 20 1 15 3 1 A,B
#> 21 15 21 1 15 3 2 A,B