The purpose of this package is to assist in working with strings that are effectively equivalent, yet are not quite identical. The functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The clustering performed by these functions implements the “key collision” and “ngram fingerprint” algorithms from the open source tool Open Refine. More info on key collision and ngram fingerprint can be found in the Open Refine documentation on clustering.
In addition, a few add-on features are included to make the clustering/merging functions more useful. These include approximate string matching to allow for merging despite minor misspellings, the option to pass a dictionary vector to dictate edit values, and the option to pass a vector of strings to ignore during the clustering process.
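To make the “key collision” idea concrete, here is a minimal base R sketch. It assumes a simple fingerprint (lowercase, strip punctuation, sort and deduplicate tokens); the helper name `fingerprint` is hypothetical and this is not refinr's internal implementation, which does more (for example, business suffix handling).

```r
# Hypothetical helper for illustration only; not part of refinr.
fingerprint <- function(s) {
  s <- tolower(trimws(s))
  s <- gsub("[[:punct:]]", "", s)        # drop punctuation
  tokens <- strsplit(s, "[[:space:]]+")  # split each string on whitespace
  vapply(tokens, function(tok) {
    # Sort and deduplicate tokens so word order and repeats do not matter.
    paste(sort(unique(tok[nzchar(tok)])), collapse = " ")
  }, character(1))
}

x <- c("Acme Pizza, Inc.", "AcMe PiZzA, Inc.", "inc. pizza acme", "Pizza Hut")
keys <- fingerprint(x)

# Values whose keys collide fall into the same cluster.
clusters <- split(x, keys)
```

Here the first three strings share one key and so land in a single cluster, while "Pizza Hut" stays on its own. A merge step would then replace every member of a cluster with one chosen value.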
This package provides two exported functions, key_collision_merge and n_gram_merge. Below is an explanation of each.
library(refinr)
x <- c("Acme Pizza, Inc.", "AcMe PiZzA, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC")
key_collision_merge(x)
#> [1] "ACME PIZZA COMPANY" "ACME PIZZA COMPANY" "ACME PIZZA COMPANY"
#> [4] "ACME PIZZA COMPANY"
Argument bus_suffix allows the clustering to be insensitive to common business name suffix strings (e.g. “inc”, “llc”, “co”). The default value is TRUE.
# Set bus_suffix to FALSE to see the difference (only the first two strings get merged).
key_collision_merge(x, bus_suffix = FALSE)
#> [1] "AcMe PiZzA, Inc." "AcMe PiZzA, Inc." "ACME PIZZA COMPANY"
#> [4] "acme pizza LLC"
A character vector can be passed to argument dict, which will dictate merge values when a cluster has a match within the dict vector.
key_collision_merge(x, dict = c("Acme Pizza, Incorporated"))
#> [1] "Acme Pizza, Incorporated" "Acme Pizza, Incorporated"
#> [3] "Acme Pizza, Incorporated" "Acme Pizza, Incorporated"
To specify strings to ignore during the clustering process, pass a character vector to argument ignore_strings.
x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))
#> [1] "BAKERSFIELD high" "BAKERSFIELD high" "BAKERSFIELD high"
These args can also be used in combination with each other.
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"), dict = c("Bakersfield High School"))
#> [1] "Bakersfield High School" "Bakersfield High School"
#> [3] "Bakersfield High School"
n_gram_merge works similarly to key_collision_merge; however, it features approximate string matching, which allows for merging of strings that contain slight spelling differences. The stringdist package is used for calculating edit distance between strings. refinr links to the stringdist C API to improve the speed of the functions.
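As a rough illustration of the “ngram fingerprint” idea behind this function, the base R sketch below builds a key from the sorted, deduplicated character n-grams of a string. The helper name `ngram_fingerprint` and its default `n = 2` are assumptions made for this example and do not reflect refinr's internals.

```r
# Hypothetical helper for illustration only; not part of refinr.
ngram_fingerprint <- function(s, n = 2) {
  # Lowercase and keep only letters and digits (drops spaces and punctuation).
  s <- gsub("[^a-z0-9]", "", tolower(s))
  vapply(s, function(str) {
    if (nchar(str) < n) return(str)
    starts <- seq_len(nchar(str) - n + 1)
    grams <- substring(str, starts, starts + n - 1)  # all n-grams
    paste(sort(unique(grams)), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Case and punctuation differences produce the same bigram key.
k1 <- ngram_fingerprint("Acme Pizza, Inc.")
k2 <- ngram_fingerprint("acme pizza inc")

# With n = 1 the key is also insensitive to word order.
u1 <- ngram_fingerprint("Acme Pizza", n = 1)
u2 <- ngram_fingerprint("pizza ACME", n = 1)
```

Strings whose keys are identical can be clustered directly; strings whose keys are merely close are candidates for the approximate matching step, where an edit distance decides whether they merge.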
x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(x)
#> [1] "ACME PIZA COMPANY" "ACME PIZA COMPANY" "ACME PIZA COMPANY"
The performance of the approximate string matching can be adjusted using parameters weight and/or edit_threshold.
n_gram_merge(x, weight = c(d = 1, i = 0.4, s = 0.2, t = 0.2))
#> [1] "ACME PIZA COMPANY" "ACME PIZA COMPANY" "Acme Pizzazza LLC"
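To build intuition for how per-operation weights change an edit distance, the sketch below uses base R's `adist()` as a stand-in. This is an assumption made for illustration: refinr relies on stringdist, whose weight vector also covers transpositions, and the cost values here are arbitrary.

```r
# Base R stand-in for illustration; refinr itself uses stringdist.
# Unit costs: "piza" -> "pizza" needs a single insertion.
d_unit <- adist("piza", "pizza")[1, 1]

# Make deletions twice as expensive as insertions and substitutions.
costly_deletes <- c(insertions = 1, deletions = 2, substitutions = 1)

# Transforming "piza" into "pizza" still needs only one insertion ...
d_insert <- adist("piza", "pizza", costs = costly_deletes)[1, 1]

# ... while the reverse direction needs one (now costlier) deletion.
d_delete <- adist("pizza", "piza", costs = costly_deletes)[1, 1]
```

Lowering the cost of an operation makes strings that differ by that operation look closer, so they are more likely to fall under a merge threshold; raising it has the opposite effect.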
n_gram_merge() is designed to accept any of the arguments that stringdist::stringdistmatrix takes. These args will be passed along to the stringdist functions.
n_gram_merge(x, method = "soundex", useBytes = TRUE)
#> [1] "ACME PIZA COMPANY" "ACME PIZA COMPANY" "Acme Pizzazza LLC"
n_gram_merge also features arguments bus_suffix and ignore_strings, which operate the same way they do for function key_collision_merge.
x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
n_gram_merge(x, ignore_strings = c("high", "school", "highschool"))
#> [1] "BAKERSFIELD high" "BAKERSFIELD high" "BAKERSFIELD high"
library(dplyr)
x <- c(
  "Clemsson University",
  "university-of-clemson",
  "CLEMSON",
  "Clem son, U.",
  "college, clemson u",
  "M.I.T.",
  "Technology, Massachusetts' Institute of",
  "Massachusetts Inst of Technology",
  "UNIVERSITY: mit"
)
ignores <- c("university", "college", "u", "of", "institute", "inst")
x_refin <- x %>%
  key_collision_merge(ignore_strings = ignores) %>%
  n_gram_merge(ignore_strings = ignores)
# Print results.
cat(paste(x_refin, collapse = "<br />"))
CLEMSON
CLEMSON
CLEMSON
CLEMSON
CLEMSON
M.I.T.
Massachusetts Inst of Technology
Massachusetts Inst of Technology
M.I.T.
# Create df for comparing the original values to the edited values.
# This is especially useful for larger input vectors.
inspect_results <- tibble(original_values = x, edited_values = x_refin) %>%
  mutate(equal = original_values == edited_values)
# Display only the values that were edited by refinr.
knitr::kable(
  inspect_results[!inspect_results$equal, c("original_values", "edited_values")],
  format = "html",
  table.attr = "style='width:100%;'"
)
original_values | edited_values |
---|---|
Clemsson University | CLEMSON |
university-of-clemson | CLEMSON |
Clem son, U. | CLEMSON |
college, clemson u | CLEMSON |
Technology, Massachusetts’ Institute of | Massachusetts Inst of Technology |
UNIVERSITY: mit | M.I.T. |