Using reclin
for deduplication will be demonstrated using an example
Towns with names containing ‘rdam’ or ‘rdm’ have been selected. This should contain most records concerning the two largest cities in The Netherlands: Amsterdam and Rotterdam.
data("town_names")
head(town_names)
#> name official_name
#> 1 alblasserdam Alblasserdam
#> 2 amsterdam Amsterdam
#> 3 amsterdam-z.o. Amsterdam
#> 4 amsterdam-zuidoost Amsterdam
#> 5 amsterdam z-o Amsterdam
#> 6 amsterdam z.o. Amsterdam
First, we do a little bit of cleanup of the names. We have a lot of names of the form ‘amsterdam z.o.’, ‘amsterdam zo’, etc. Removing non-alphanumeric characters will probably help. Also, some of the O’s are written as 0’s (zeros).
town_names$clean_name <- gsub("[^[:alnum:]]", "", town_names$name)
town_names$clean_name <- gsub("0", "o", town_names$clean_name)
We will now compare all records from the dataset to each other. First, we generate all possible pairs of records. However, as it is not necessary to compare the first record to the second and also the second to the first, we will only select pairs for which the second index is larger than the first. This is done by filter_pairs_for_deduplication
. We compare the names using jaro_winkler
and select records for which the Jaro-Winkler similarity is above 0.88. This value is determined by eye-balling the data. Usually values around 0.9 work.
p <- pair_blocking(town_names, town_names) %>%
filter_pairs_for_deduplication() %>%
compare_pairs("clean_name", default_comparator = jaro_winkler()) %>%
score_simsum() %>%
select_threshold(0.88)
head(p)
#> ldat with 6 rows and 5 columns
#> x y clean_name simsum select
#> 1 1 2 0.6679894 0.6679894 FALSE
#> 2 1 3 0.6208514 0.6208514 FALSE
#> 3 2 3 0.9393939 0.9393939 TRUE
#> 4 1 4 0.5459851 0.5459851 FALSE
#> 5 2 4 0.8431373 0.8431373 FALSE
#> 6 3 4 0.8823529 0.8823529 TRUE
We have now selected some town names that we consider the same: records 2 and 3 (record 3 in output above) are the same, and records 3 and 4 (record 6). However, records 2 and 4 are not classified as belonging to the same record (record 5).
In our final step we want to assign each record in our original data set town_names
into a number of groups, each group containing all records with the same town names. The function deduplicate_equivalance
does that. It will use the ‘rules’ derived above: 2 and 3 belong to the same group
, 3 and 4 belong to the same group
, etc., to assign each record to a group. It will, therefore, also assign records 2 and 4 to the same group. For those familiar with graph theory: it derives all subgraphs and assigns all nodes in a subgraph the same identifier.
res <- deduplicate_equivalence(p)
head(res)
#> name official_name clean_name duplicate_groups
#> 1 alblasserdam Alblasserdam alblasserdam 541
#> 2 amsterdam Amsterdam amsterdam 581
#> 3 amsterdam-z.o. Amsterdam amsterdamzo 581
#> 4 amsterdam-zuidoost Amsterdam amsterdamzuidoost 581
#> 5 amsterdam z-o Amsterdam amsterdamzo 581
#> 6 amsterdam z.o. Amsterdam amsterdamzo 581
As we can see records 2 to 6 are assigned to the same group. We can calculate the number of groups and compare that to the original number of town names:
length(unique(res$duplicate_groups))
#> [1] 53
length(unique(res$duplicate_groups))/nrow(res)
#> [1] 0.09075342
We are only left with 53 town names; a reduction of approximately 90 percent. For this small number of remaining groups it is possible to manually derive the correct names, or, if that would be available, we could use the most frequent name in each group as the group name.
Lets assume that we are able to correctly determine the group names. This means that we assign the most frequent official name to each group:
res <- res %>% group_by(duplicate_groups, official_name) %>% mutate(n = n()) %>%
group_by(duplicate_groups) %>%
mutate(group_name = first(official_name, order_by = desc(n)))
We can then calculate the confusion matrix and calculate the precision and recall:
precision <- res %>% group_by(group_name) %>%
summarise(precision = sum(group_name == official_name)/n())
precision_recall <- res %>% group_by(official_name) %>%
summarise(recall = sum(group_name == official_name)/n()) %>%
left_join(precision, by = c("official_name" = "group_name")) %>%
mutate(precision = ifelse(is.na(precision), 0, precision))
precision_recall
#> # A tibble: 19 × 3
#> official_name recall precision
#> <fct> <dbl> <dbl>
#> 1 Alblasserdam 1 1
#> 2 Amsterdam 0.997 0.964
#> 3 Amsterdam-Duivendrecht 1 1
#> 4 Botlek Rotterdam 0.5 1
#> 5 Diemen 0.167 1
#> 6 Europoort Rotterdam 1 1
#> 7 Hoogvliet Rotterdam 0.725 1
#> 8 Leerdam 1 0.778
#> 9 Naarden 1 1
#> 10 Nieuw Amsterdam 0.75 1
#> 11 Pernis Rotterdam 0.688 1
#> 12 Rotterdam 0.993 0.876
#> 13 Rotterdam Albrandswaard 1 1
#> 14 Rotterdam Poortugaal 1 1
#> 15 Rozenburg 1 1
#> 16 Schiedam 0 0
#> 17 Spaarndam 1 1
#> 18 Veendam 0 0
#> 19 Zwammerdam 0.4 1
Overall precision and recall