The goal of dcmodifydb
is to apply modification rules
specified with dcmodify
on a database table, allowing for
documented, reproducable data cleaning adjustments in a database.
This document provides examples and common error corrections
scenario’s for which dcmodify
can be used.
A good practice is to store the intent and description of each rule:
this enables future (co)workers on the dataset to evaluate the use of
each rule. It is therefore recommended to store rules in a
yaml
format and document each rule accordingly.
<- modifier( if (year < 25) year = year + 2000) m
<- modifier( if (year < 100){
m if (year > 25) {
= year + 1900
year else {
} = year + 2000
year
} })
A value is not measured (NA
), but can be deduced to be
of a certain value, e.g. zero. Suppose a health questionair contains
questions if a person smokes and if so how many cigarretes per day.
Typically if the answer to the first question is “no”, the second
question is not asked an thus “unmeasured”, but can be deduced to be
0.
smokes | cigarettes |
---|---|
TRUE | 10 |
FALSE | NA |
if (smokes == FALSE) cigarretes = 0
#> Warning in `[<-.data.frame`(`*tmp*`, I, , value = structure(list(smokes =
#> FALSE, : provided 3 variables to replace 2 variables
smokes | cigarettes |
---|---|
TRUE | 10 |
FALSE | NA |
if
rule<- modifier( if (age < 12) income = 0) m
The following statements are equivalent. It is wise to choose a syntax that is familiar to the persons specifying the correction rules.
if (age < 12) income = 0 # R noobs
if (age < 12) income <- 0 # a bit more R-y
income[age < 12] <- 0 # very R-y
else
Each if
rule may be followed with an else
or else if
<- modifier(if (age > 67) {retired = TRUE} else {retired = FALSE}) m
The following statements are equivalent. It is wise to choose a syntax that is familiar to the persons specifying the correction rules.
if (age > 67) {retired = TRUE} else {retired = FALSE} # R noobs
if (age > 67) {retired <- TRUE} else {retired <- FALSE} # R-y
retired <- if (age > 67) TRUE else FALSE # very R-y
retired <- age > 67 # very R-y
<- modifier(
m if (age > 67) {
= TRUE
retired = 0
salary
} )
else if