The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner, built on the {forcats} package.

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

* `match_vec()` will translate the values in a single vector
* `match_df()` will translate values in all specified columns of a data frame

Each of these functions has four mandatory arguments:

* `x`: your data. This will be a vector or data frame depending on the function.
* `dictionary`: a data frame with at least two columns specifying keys and values to modify.
* `from`: a character or number specifying which column contains the keys.
* `to`: a character or number specifying which column contains the values.

Mostly, users will be working with `match_df()` to transform values across specific columns. A typical workflow would be to read in your data and your dictionary, and then clean the data:
library("matchmaker")
# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)
# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
stringsAsFactors = FALSE
)
This is the top of our data set, which was generated for example purposes:
id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
---|---|---|---|---|---|---|---|---|---|---|
ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u |
e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui |
b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | oui | |
c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui |
40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n |
46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA |
The dictionary looks like this:
options | values | grp | orders |
---|---|---|---|
y | Yes | readmission | 1 |
n | No | readmission | 2 |
u | Unknown | readmission | 3 |
.missing | Missing | readmission | 4 |
0 | Yes | treated | 1 |
1 | No | treated | 2 |
.missing | Missing | treated | 3 |
1 | Facility 1 | facility | 1 |
2 | Facility 2 | facility | 2 |
3 | Facility 3 | facility | 3 |
4 | Facility 4 | facility | 4 |
5 | Facility 5 | facility | 5 |
6 | Facility 6 | facility | 6 |
7 | Facility 7 | facility | 7 |
8 | Facility 8 | facility | 8 |
9 | Facility 9 | facility | 9 |
10 | Facility 10 | facility | 10 |
.default | Unknown | facility | 11 |
0 | 0-9 | age_group | 1 |
10 | 10-19 | age_group | 2 |
20 | 20-29 | age_group | 3 |
30 | 30-39 | age_group | 4 |
40 | 40-49 | age_group | 5 |
50 | 50+ | age_group | 6 |
high | High | .regex ^lab_result_ | 1 |
norm | Normal | .regex ^lab_result_ | 2 |
inc | Inconclusive | .regex ^lab_result_ | 3 |
y | yes | .global | Inf |
n | no | .global | Inf |
u | unknown | .global | Inf |
unk | unknown | .global | Inf |
oui | yes | .global | Inf |
.missing | missing | .global | Inf |
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp"
)
head(cleaned)
#> id date readmission treated facility age_group lab_result_01
#> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown
#> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive
#> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive
#> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive
#> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal
#> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown
#> lab_result_02 lab_result_03 has_symptoms followup
#> 1 High Inconclusive missing unknown
#> 2 unknown Normal yes yes
#> 3 Normal Inconclusive missing yes
#> 4 Inconclusive unknown yes yes
#> 5 unknown Normal missing no
#> 6 unknown Inconclusive missing missing
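One quick way to check a recoding like this is to cross-tabulate a column before and after cleaning (a small base R sketch, not part of the matchmaker workflow itself):

# each original code should line up with exactly one cleaned value,
# e.g. "y" with "Yes" and NA with "Missing"
table(original = dat$readmission, cleaned = cleaned$readmission, useNA = "ifany")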
In addition to strict one-to-one matching, there are a few reserved keywords that will help with correcting data. These exist for the `from`, `to`, and `by` columns of the dictionary.

## Keys (`from` column)

The `from` column of the dictionary will contain the keys that you want to match in your current data set. These are expected to match exactly, with the exception of three reserved keywords that start with a full stop:

* `.regex [pattern]`: will replace anything matching [pattern]. This is executed before any other replacements are made. The [pattern] should be an unquoted, valid, PERL-flavored regular expression. Any whitespace padding the regular expression is discarded.
* `.missing`: replaces any blank cells or `NA` values.
* `.default`: replaces ALL values that are not defined in the dictionary and are not missing.

Note that any `NA` values in the keys will be interpreted as the string “NA”, because it’s a common mistake for the value “NA” to be imported as missing in R. If you intend for `NA` to indicate missing data, replace it with:

dictionary[[1]][is.na(dictionary[[1]])] <- ".missing"
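For illustration, here is a minimal sketch of these three keywords in action on a toy vector (the `sizes` vector and `size_dict` dictionary are invented for this example):

# a toy vector with several spellings of "small", a blank cell, and an NA
sizes <- c("sm", "Sm", "SM", "med", NA, "", "huge")
size_dict <- data.frame(
  keys   = c(".regex ^[Ss][Mm]$", "med", ".missing", ".default"),
  values = c("Small", "Medium", "Missing", "Other"),
  stringsAsFactors = FALSE
)
# expected: the regex rule is applied first, so "sm", "Sm", and "SM" become "Small";
# "med" becomes "Medium"; NA and "" become "Missing"; and "huge" falls through to
# the default value "Other" (with a warning about the default replacement)
match_vec(sizes, dictionary = size_dict, from = "keys", to = "values")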
## Values (`to` column)

The values will replace their respective keys exactly as they are presented, with one exception. There is currently one recognised keyword that can be placed in the `to` column of your dictionary:

* `.na`: replace keys with missing data. When used in combination with the `.missing` keyword (in column 1), it allows you to differentiate between explicit and implicit missing data.

For example, let’s say you have the following data set of people asked if they like ice cream:
who <- c("Anakin", "Darth", "R2-D2", "Leia", "C-3PO", "Rey", "Obi-Wan", "Luke", "Chewy", "Owen", "Lando")
icecream <- c(letters[1:3], "NO", "N", "yes", "Y", "n", "n", NA, "")
names(icecream) <- who
icecream
#> Anakin Darth R2-D2 Leia C-3PO Rey Obi-Wan Luke Chewy Owen
#> "a" "b" "c" "NO" "N" "yes" "Y" "n" "n" NA
#> Lando
#> ""
You could construct a dictionary that has a one-to-one relationship between the keys and values, like this:
keys | values |
---|---|
yes | Yes |
Y | Yes |
n | No |
N | No |
NO | No |
.missing | .na |
.default | (invalid) |
Once you read in the file (either via `read.csv()` or `readxl::read_excel()` if you use Excel), you can use it as a dictionary.
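The code below refers to this dictionary as `my_dict1`. If you would rather build it directly in R than read it from a file, a construction mirroring the table above might look like this (a sketch for illustration only):

my_dict1 <- data.frame(
  keys   = c("yes", "Y", "n", "N", "NO", ".missing", ".default"),
  values = c("Yes", "Yes", "No", "No", "No", ".na", "(invalid)"),
  stringsAsFactors = FALSE
)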
This dictionary will do three things:

1. convert all recognised spellings of “yes” and “no” to “Yes” and “No”
2. convert anything else (like “a”, “b”, or “c”) to “(invalid)”
3. convert blank cells and `NA` values to explicit missing data

Now we have nice, predictable values, but let’s say Luke really didn’t like ice cream. If he responded “NOOOOOOO” instead of “n”, then the dictionary we specified would convert it to “(invalid)”:
icecream["Luke"] <- "NOOOOOOO"
match_vec(icecream, dictionary = my_dict1, from = "keys", to = "values")
#> Warning: 'NOOOOOOO', 'a', 'b', 'c' were changed to the default value
#> ('(invalid)')
#> Anakin Darth R2-D2 Leia C-3PO Rey
#> "(invalid)" "(invalid)" "(invalid)" "No" "No" "Yes"
#> Obi-Wan Luke Chewy Owen Lando
#> "Yes" "(invalid)" "No" NA NA
We can fix this if we use pattern matching. Here we match any valid spelling of yes/no, allowing trailing letters, so that it captures Luke’s extreme objection to ice cream. To do this we add the `.regex` prefix (note the space after `.regex`):
keys | values |
---|---|
.regex ^[Yy][Ee]?[Ss]*$ | Yes |
.regex ^[Nn][Oo]*$ | No |
.missing | .na |
.default | (invalid) |
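The next example refers to this dictionary as `my_dict2`; again, a direct construction mirroring the table might look like this (a sketch for illustration only):

my_dict2 <- data.frame(
  keys   = c(".regex ^[Yy][Ee]?[Ss]*$", ".regex ^[Nn][Oo]*$", ".missing", ".default"),
  values = c("Yes", "No", ".na", "(invalid)"),
  stringsAsFactors = FALSE
)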
match_vec(icecream, dictionary = my_dict2, from = "keys", to = "values")
#> Warning: 'a', 'b', 'c' were changed to the default value ('(invalid)')
#> Anakin Darth R2-D2 Leia C-3PO Rey
#> "(invalid)" "(invalid)" "(invalid)" "No" "No" "Yes"
#> Obi-Wan Luke Chewy Owen Lando
#> "Yes" "No" "No" NA NA
The drawback to fuzzy matching is that it will convert anything that matches the pattern, including values you did not anticipate, so be very careful when constructing your keys.
## Columns (`by` column)

When using the `match_df()` function, you construct the dictionary the same way as above, with two extra columns that specify the column name in the data frame and the order in which the resulting values should appear (if the column is a factor).

As with `match_vec()`, all the same keywords apply, but now there are also two keywords for the columns:

* `.regex [pattern]`: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression. This is commonly used for recoding results from columns that all start with the same pattern: `^lab_result_` would match `lab_result_QTPCR`, `lab_result_WBC`, and `lab_result_iron`.
* `.global`: defines rules for any column that is a character or factor, as well as any column named in the dictionary. If you want to apply a set of definitions to all valid columns in addition to specified columns, you can include a `.global` group in the `by` column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global definitions. For example, if you have a column for cardinal directions and a definition for `N = North`, then the global definition `N = no` will not override it (a small sketch of this follows just after this list).
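Here is a small sketch of that precedence rule, using a made-up data frame and dictionary (the names `compass` and `compass_dict` are invented for this illustration):

# a column-specific rule ("N" -> "North") sits alongside a .global rule ("N" -> "no")
compass <- data.frame(
  direction    = c("N", "S", "N"),
  has_symptoms = c("n", "y", "n"),
  stringsAsFactors = FALSE
)
compass_dict <- data.frame(
  options = c("N", "S", "N", "y", "n"),
  values  = c("North", "South", "no", "yes", "no"),
  grp     = c("direction", "direction", ".global", ".global", ".global"),
  stringsAsFactors = FALSE
)
# direction should come back as "North"/"South" (the column-specific rules win),
# while has_symptoms is only covered by the .global rules and becomes "no"/"yes"
match_df(compass, compass_dict, from = "options", to = "values", by = "grp")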
### A note on `.regex`

Before you use regex, you should be aware of three special symbols that will help anchor your words and prevent any unintended matching:

* A caret (`^`) should be placed at the beginning of a pattern to show that it’s the beginning of the word. For example, `lab` will match both `lab_result` and `granite_slab`, but `^lab` will only match `lab_result`.
* A dollar sign (`$`) should be placed at the end of a pattern to show that it’s the end of a word. For example, `date` will match both `admission_date` and `date_of_onset`, but `date$` will only match `admission_date`.
* A full stop (`.`) matches any character. Because it’s common in column names imported by R, it’s a good idea to wrap it in square brackets (`[.]`) to tell R that you actually mean a dot. For example, `^lab.r$` will match `lab.r`, `lab_r`, and `labor`, but `^lab[.]r$` will only match `lab.r`.

The best strategy is to use at least one anchor to prevent the pattern from greedily matching unintended columns.
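These anchors are ordinary regular-expression behaviour, so you can test a pattern on your column names with base R before putting it in a dictionary. A quick sketch on some hypothetical column names:

# hypothetical column names, used only to illustrate the anchors above
cols <- c("lab_result", "granite_slab", "admission_date", "date_of_onset", "lab.r")
grep("lab", cols, value = TRUE)       # "lab_result", "granite_slab", "lab.r"
grep("^lab", cols, value = TRUE)      # "lab_result", "lab.r"
grep("date$", cols, value = TRUE)     # "admission_date"
grep("^lab.r$", cols, value = TRUE)   # "lab.r" (but would also match "labor" or "lab_r")
grep("^lab[.]r$", cols, value = TRUE) # "lab.r" only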
In our example from the top, there are three columns that all start with `lab_result_`, so we use the `.regex ^lab_result_` keyword:
# view the lab_result columns:
print(labs <- grep("^lab_result_", names(dat), value = TRUE))
#> [1] "lab_result_01" "lab_result_02" "lab_result_03"
str(dat[labs])
#> 'data.frame': 50 obs. of 3 variables:
#> $ lab_result_01: chr "unk" "inc" "inc" "inc" ...
#> $ lab_result_02: chr "high" "unk" "norm" "inc" ...
#> $ lab_result_03: chr "inc" "norm" "inc" "unk" ...
# show the lab_result part of the dictionary:
print(dict[grep("^[.]regex", dict$grp), ])
#> options values grp orders
#> 25 high High .regex ^lab_result_ 1
#> 26 norm Normal .regex ^lab_result_ 2
#> 27 inc Inconclusive .regex ^lab_result_ 3
# clean the data and compare the result
cleaned <- match_df(dat, dict,
from = "options",
to = "values",
by = "grp",
order = "orders"
)
str(cleaned[labs])
#> 'data.frame': 50 obs. of 3 variables:
#> $ lab_result_01: chr "unknown" "Inconclusive" "Inconclusive" "Inconclusive" ...
#> $ lab_result_02: chr "High" "unknown" "Normal" "Inconclusive" ...
#> $ lab_result_03: chr "Inconclusive" "Normal" "Inconclusive" "unknown" ...
### Using `.global` to clean up all character/factor columns

We’ve actually seen the `.global` keyword in use already. Let’s take one more look at the results from above:
# show the lab_result part of the dictionary:
print(dict[grep("^[.]regex", dict$grp), ])
#> options values grp orders
#> 25 high High .regex ^lab_result_ 1
#> 26 norm Normal .regex ^lab_result_ 2
#> 27 inc Inconclusive .regex ^lab_result_ 3
# show the original data
str(dat[labs])
#> 'data.frame': 50 obs. of 3 variables:
#> $ lab_result_01: chr "unk" "inc" "inc" "inc" ...
#> $ lab_result_02: chr "high" "unk" "norm" "inc" ...
#> $ lab_result_03: chr "inc" "norm" "inc" "unk" ...
# show the modified data
str(cleaned[labs])
#> 'data.frame': 50 obs. of 3 variables:
#> $ lab_result_01: chr "unknown" "Inconclusive" "Inconclusive" "Inconclusive" ...
#> $ lab_result_02: chr "High" "unknown" "Normal" "Inconclusive" ...
#> $ lab_result_03: chr "Inconclusive" "Normal" "Inconclusive" "unknown" ...
Notice above how there are rules for “high”, “norm”, and “inc”, but not for “unk”, which was turned into “unknown”? This is because of the global keywords:
print(dict[grep("^[.](regex|global)", dict$grp), ])
#> options values grp orders
#> 25 high High .regex ^lab_result_ 1
#> 26 norm Normal .regex ^lab_result_ 2
#> 27 inc Inconclusive .regex ^lab_result_ 3
#> 28 y yes .global Inf
#> 29 n no .global Inf
#> 30 u unknown .global Inf
#> 31 unk unknown .global Inf
#> 32 oui yes .global Inf
#> 33 .missing missing .global Inf
The “unk” key was defined in our global dictionary and has been used to translate “unk” to “unknown”. Of course, be very careful with this one: global definitions apply to every character or factor column in the data.
Internally, the match_vec()
function can be quite noisy with warnings for various reasons. Thus, by default, the match_df()
function will keep these quiet, but you can have them printed to your console if you use the warn = TRUE
option:
cleaned <- match_df(dat, dict,
from = "options",
to = "values",
by = "grp",
order = "orders",
warn = TRUE
)
#>
#> ── Warnings were found in the following columns ──
#>
#> ● age_group
#> 1. ⚠ None of the variables in `age_group` were found in the global
#> dictionary. Did you use the correct dictionary?
#> ● facility
#> 1. ⚠ None of the variables in `facility` were found in the global
#> dictionary. Did you use the correct dictionary?
#> 2. ⚠ 'A', 'B', 'C' were changed to the default value ('Unknown')
#> ● readmission
#> 1. ⚠ None of the variables in `readmission` were found in the global
#> dictionary. Did you use the correct dictionary?
#> ● treated
#> 1. ⚠ None of the variables in `treated` were found in the global
#> dictionary. Did you use the correct dictionary?
#> ● id
#> 1. ⚠ None of the variables in `id` were found in `dict`. Did you use the
#> correct dictionary?