The goal of dcmodifydb is to apply modification rules specified with
dcmodify
on a database table, allowing for documented,
reproducable data cleaning adjustments in a database.
dcmodify
separates intent from
execution: a user specifies the what ,
why and how of an automatic data change and uses
dcmodifydb
to execute them on a tbl
database
table.
dcmodidfydb
is optimized and restricted to database
tables that can be accessed within R through DBI
. It uses
the dbplyr
package to translate the data correction rules
in R syntax into SQL syntax. The advantage of this approach is that all
data correction is done within the database, which may be a requirement
of your organisation or because the data table is simply too large to be
held in memory. A disadvantage is that not all R statements can be
translated into SQL statement, so dcmodifydb
is more
restricted than dcmodify
which can use the full R
potential. Nonetheless dcmodifydb
may be sufficient and
efficient for many use cases.
For common error scenario’s see
vignette("scenarios", package="dcmodifydb")
. For the
supported syntax for specifying rules see
vignette("syntax", package="dcmodifydb")
.
dcmodifydb
can be installed with
install.packages("dcmodifydb")
and loaded with:
library(dcmodify)
library(dcmodifydb)
dcmodifydb
works on a database table, so we need a
connection to a table within a database.
We set up a database table with sqlite using the person
data set, but for your use case you should connect to your database.
income | age | gender | year | smokes | cigarettes |
---|---|---|---|---|---|
2000 | 12 | M | 2020 | no | 10 |
2010 | 14 | f | 2019 | yes | 4 |
2010 | 25 | v | 19 | no | NA |
1010 | 65 | M | 20 | yes | NA |
<- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
con ::copy_to(con, person) dplyr
We now retrieve a handle to the person table using
dplyr
<- dplyr::tbl(con, "person")
person_tbl
person_tbl#> # Source: table<person> [4 x 6]
#> # Database: sqlite 3.38.5 [:memory:]
#> income age gender year smokes cigarettes
#> <int> <int> <chr> <int> <chr> <int>
#> 1 2000 12 M 2020 no 10
#> 2 2010 14 f 2019 yes 4
#> 3 2010 25 v 19 no NA
#> 4 1010 65 M 20 yes NA
The person dataset clearly contains some errors that can be
corrected. We specify the corrections using modifier rules and apply
them directly with the functionmodify_so
.
First we correct that children can not have an income and that year must be a long year.
library(dcmodify) # needed for modifying rules
library(dcmodifydb) # needed to translate the rules
modify_so( person_tbl
if (age < 16) income = 0
, if (year < 25) year = year + 2000
,
)#> Warning: `copy` not specified, setting `copy=TRUE`, working on copy of table.
#> # Source: table<dcmodifydb_8543374> [4 x 6]
#> # Database: sqlite 3.38.5 [:memory:]
#> income age gender year smokes cigarettes
#> <int> <int> <chr> <int> <chr> <int>
#> 1 0 12 M 2020 no 10
#> 2 0 14 f 2019 yes 4
#> 3 2010 25 v 2019 no NA
#> 4 1010 65 M 2020 yes NA
Note that the corrections are made on a copy of the table by default, to avoid accidents with the data.
A better approach than directly applying corrections is to store the
rules in a modifier
object and apply them in a separate
step to a data base table.
This makes it easier to maintain, use and document a set of rules.
With dcmodify
one can specify rules with the function
modifier
:
# separate rule set
<- modifier( if (age < 16) income = 0
m if (year < 25) year = year + 2000
, if (cigarettes > 0 ) smokes = "yes"
, if (smokes == "no") cigarettes = 0
, <- if (age < 18) "child" else "adult"
, ageclass <- switch( toupper(gender)
, gender "F" = "F"
, "V" = "F" # common mistake
, "M" = "M"
, "NB"
,
) )
m
is now a set of rules that can be applied to a
data.frame
or tbl
.
print(m)
#> Object of class modifier with 6 elements:
#> M1:
#> if (age < 16) income = 0
#>
#> M2:
#> if (year < 25) year = year + 2000
#>
#> M3:
#> if (cigarettes > 0) smokes = "yes"
#>
#> M4:
#> if (smokes == "no") cigarettes = 0
#>
#> M5:
#> ageclass <- if (age < 18) "child" else "adult"
#>
#> M6:
#> gender <- switch(toupper(gender), F = "F", V = "F", M = "M", "NB")
M3 corrects smokes
is “no” for persons who smoke
cigarettes.
M4 sets an unknown cigarettes
to zero for
non-smokers.
M5 shows an example of deriving a new variable
(ageclass
) from existing variables
(age
).
M6 shows an example of recoding, changing labels of categories to the allowed set of labels.
# modify a copy of the table
modify(person_tbl, m, copy = TRUE)
#> # Source: table<dcmodifydb_2573145> [4 x 7]
#> # Database: sqlite 3.38.5 [:memory:]
#> income age gender year smokes cigarettes ageclass
#> <int> <int> <chr> <int> <chr> <int> <chr>
#> 1 0 12 M 2020 yes 10 child
#> 2 0 14 F 2019 yes 4 child
#> 3 2010 25 F 2019 no 0 adult
#> 4 1010 65 M 2020 yes NA adult
Note that the rules are executed sequentially, in the order that they are gven. For example the order of rule M3 and M4 matters: Rule M3 will change record 1 to a smoker, while rule M4 would set the number of cigarettes to 0. This is intentional: correction rules often have an order in which they have to be applied.
A nice properties of modifier
rules, is that they can
store extra metadata. They have a name
, label
and description
that can be used to describe the intention
and the why of a rule. An easy way of describing these properties is by
exporting the ruleset to yaml and specify the rules using the yaml
file.
export_yaml(m, "corrections.yml")
In the export yml file we can label and describe the rules, but also add new rules. Note that label and description are optional, but very much encouraged.
corrections.yml
rules:
- expr: if (age < 16) income = 0
name: M1
label: 'nochildlabor'
description: 'Children are not allowed to work, so can not have income.'
- expr: if (year < 25) year = year + 2000
name: M2
label: 'longyear'
description: 'Convert 2 digits year into 4 digits.'
- expr: if (cigarettes > 0) smokes = "yes"
name: M3
label: 'smoker'
description: 'If you smoke cigarettes you are a smoker...'
- expr: if (smokes == "no") cigarettes = 0
name: M4
label: 'nosmoke'
description: 'If you dont smoke, the (unknown) number of cigarettes is zero'
- expr: ageclass <- if (age < 18) "child" else "adult"
name: M5
label: 'ageclass'
description: 'Derive ageclass using the age variable'
- expr: |
gender <- switch( toupper(gender)
, F = "F"
, V = "F"
, M = "M"
, "NB"
) name: M6
label: 'gender'
description: 'Map the labels for gender to M/F/NB'
We can load these rules with:
<- modifier(.file = "corrections.yml")
m modify(person_tbl, m, copy = TRUE)
#> # Source: table<dcmodifydb_3802595> [4 x 7]
#> # Database: sqlite 3.38.5 [:memory:]
#> income age gender year smokes cigarettes ageclass
#> <int> <int> <chr> <int> <chr> <int> <chr>
#> 1 0 12 M 2020 yes 10 child
#> 2 0 14 F 2019 yes 4 child
#> 3 2010 25 F 2019 no 0 adult
#> 4 1010 65 M 2020 yes NA adult
modify
translates the modification rules into SQL code
and executes the sql queries on the database. For documentation or
implementation purpose it can be useful to see the generated sql code,
with the documented rules.
dump_sql(m, person_tbl, file = "corrections.sql")
corrections.sql:
-- -------------------------------------
-- Generated with dcmodifydb, do not edit
-- dcmodify version: 0.1.9
-- dcmodifydb version: 0.3.1
-- dplyr version: 1.0.9
-- dbplyr version: 2.2.0
-- from: '/tmp/RtmpVrsVJf/Rinst2ca4b29d44356/dcmodifydb/db/corrections.yml'
-- date: 2022-06-17
-- -------------------------------------
ALTER TABLE `person`
ADD `ageclass` TEXT;
-- M1: nochildlabor
-- Children are not allowed to work, so can not have income.
-- R expression: if (age < 16) income = 0
UPDATE `person`
SET `income` = 0.0
WHERE `age` < 16.0;
-- M2: longyear
-- Convert 2 digits year into 4 digits.
-- R expression: if (year < 25) year = year + 2000
UPDATE `person`
SET `year` = `year` + 2000.0
WHERE `year` < 25.0;
-- M3: smoker
-- If you smoke cigarettes you are a smoker...
-- R expression: if (cigarettes > 0) smokes = "yes"
UPDATE `person`
SET `smokes` = 'yes'
WHERE `cigarettes` > 0.0;
-- M4: nosmoke
-- If you dont smoke, the (unknown) number of cigarettes is zero
-- R expression: if (smokes == "no") cigarettes = 0
UPDATE `person`
SET `cigarettes` = 0.0
WHERE `smokes` = 'no';
-- M5: ageclass
-- Derive ageclass using the age variable
-- R expression: ageclass <- if (age < 18) "child" else "adult"
UPDATE `person`
SET `ageclass` = 'child'
WHERE `age` < 18.0;
UPDATE `person`
SET `ageclass` = 'adult'
WHERE NOT(`age` < 18.0);
-- M6: gender
-- Map the labels for gender to M/F/NB
-- R expression: gender <- switch(toupper(gender), F = "F", V = "F", M = "M", "NB")
UPDATE `person`
SET `gender` = CASE UPPER(`gender`) WHEN ('F') THEN ('F') WHEN ('V') THEN ('F') WHEN ('M') THEN ('M') ELSE ('NB') END
;