In the iris dataset, replace Sepal.Width with the value 4 wherever it exceeds 4.
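A minimal sketch of such a rule, assuming the dcmodify package is installed and using its modify_so shortcut (described further on). The name capped is chosen here for illustration only.

```r
library(dcmodify)

# Cap Sepal.Width at 4: only records where the condition is TRUE are changed
capped <- modify_so(iris, if (Sepal.Width > 4) Sepal.Width <- 4)

# Largest value before and after applying the rule
max(iris$Sepal.Width)
max(capped$Sepal.Width)
```

The original iris data frame is not altered; the modified copy is returned.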
Data cleaning workflows or scripts typically contain many statements of the ‘if this, do that’ type. Such statements typically encode condensed expert knowledge. With this package, such ‘data modifying rules’ are taken out of the code and instead become parameters to the workflow. This allows you to maintain, document and reason about data modification rules separately from the flow of your programme.
This means you, the expert, can focus on the content and let R do the work.
The workflow of dcmodify is designed to take two concerns off your hands. The first concern is how to implement the many ideas and rules that define how and when to modify data. The second concern is how to apply such rules to your data. We therefore introduce two nouns and one verb that govern the basic workflow: data.frame, modifier and modify.
modifier: This is an object that stores (conditional) data modification rules.
modify: This is a function that applies the rules in a modifier to your data.
Here’s an example using the retailers data set from the validate package.
##   staff turnover other.rev total.rev staff.costs total.costs profit vat
## 1    75       NA        NA      1130          NA       18915  20045  NA
## 2     9     1607        NA      1607         131        1544     63  NA
## 3    NA     6886       -33      6919         324        6493    426  NA
First we define a set of modifying rules, using modifier.
library(dcmodify)
m <- modifier(
    if (other.rev < 0) other.rev <- -1 * other.rev
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)
Next, the rules can be applied to our data.
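A minimal sketch of this step, assuming the validate package is installed (it ships the retailers data) and that m is the modifier defined above. The name modified is chosen here for illustration.

```r
library(dcmodify)

# Load the example data from the validate package
data("retailers", package = "validate")

# Same rules as in the text
m <- modifier(
    if (other.rev < 0) other.rev <- -1 * other.rev
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)

# Apply the rules; the input data frame itself is left unchanged
modified <- modify(retailers, m)

# Compare other.rev before and after for the first three records
head(retailers$other.rev, 3)
head(modified$other.rev, 3)
```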
Alternatively, if you’re a fan of the magrittr package, you can pipe the data into modify, or even define the rules in-line:
library(magrittr)  # provides the %<>% assignment pipe
retailers %<>% modify_so(
    if ( other.rev < 0) other.rev <- -1 * other.rev
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)
Here, the %<>% operator makes sure that the original dataset gets overwritten, and modify_so is a shortcut function for defining modification rules in-line.
The rules you define in a modifier are executed on records where the condition evaluates to TRUE. In R this raises the question of what to do when the condition evaluates to NA for a record. For example, the condition other.rev < 0 in the first rule of m above evaluates to NA in the first record of the retailers dataset. Such cases are handled as if the condition evaluated to FALSE.
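This behaviour can be illustrated with a toy example. The data frame d below is made up for this sketch: its second record has a missing value, so the condition is NA there and, following the rule described above, that record should be left untouched.

```r
library(dcmodify)

# Toy data: the condition x < 0 is TRUE, NA and FALSE for the three records
d <- data.frame(x = c(-1, NA, 2))

# Flip the sign of negative values; the NA condition is treated as FALSE
out <- modify_so(d, if (x < 0) x <- -1 * x)
out$x
```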
Modifier rules can also be defined and stored outside of the R script through the use of YAML files. Defining a YAML file can be done by hand, or by exporting an existing modifier object via export_yaml or as_yaml. Exporting the modifier defined in the Basic workflow section would look as follows:
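A minimal sketch of such an export, assuming the modifier m from the Basic workflow section. The file name myrules.yaml is a placeholder chosen for this example.

```r
library(dcmodify)

# Same rules as in the Basic workflow section
m <- modifier(
    if (other.rev < 0) other.rev <- -1 * other.rev
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)

# Write the rules to a YAML file (file name is an assumption of this sketch)
export_yaml(m, file = "myrules.yaml")

# Inspect what was written
cat(readLines("myrules.yaml"), sep = "\n")
```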
This code will create a YAML file with the following content:
rules:
- expr: if (other.rev < 0) other.rev <- -1 * other.rev
  name: M1
  label: ''
  description: ''
  created: 2021-07-29 16:57:00
  origin: command-line
  meta: []
- expr: if (is.na(staff.costs)) staff.costs <- mean(staff.costs)
  name: M2
  label: ''
  description: ''
  created: 2021-07-29 16:57:00
  origin: command-line
  meta: []
Of all these keys, only rules: and expr: are required; all others are optional.
Once a YAML file is created, modifier can read the modification rules from the file and store them as a modifier object. For this the .file argument is used:
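A minimal sketch, assuming a rule file written by hand with only the required keys. The temporary file path and the toy data frame are stand-ins chosen for this example.

```r
library(dcmodify)

# Write a minimal rule file by hand: only rules: and expr: are given
rule_file <- tempfile(fileext = ".yaml")
writeLines(c(
  "rules:",
  "- expr: if (other.rev < 0) other.rev <- -1 * other.rev"
), rule_file)

# Read the rules from the file into a modifier object
m <- modifier(.file = rule_file)

# Apply the rules to a small made-up data frame
toy <- data.frame(other.rev = c(-5, 3))
out <- modify(toy, m)
out$other.rev
```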
Using separate files for the storage of rules has the advantage that the same set of rules can be easily shared across many different scripts.
You, the user, can assume that the rules are evaluated record-by-record. In reality, the package analyses the rules to make sure they can be evaluated in a vectorized manner. This way explicit (and slow) R loops are avoided as much as possible.
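The difference between the two views can be illustrated in base R. This is only a sketch of what a vectorized evaluation of a single rule could look like, not the package's actual implementation; the data frame d is made up for the example.

```r
d <- data.frame(other.rev = c(NA, -33, 12))

# Record-by-record view: loop over the rows and test the condition each time
looped <- d
for (i in seq_len(nrow(looped))) {
  # isTRUE() makes an NA condition count as FALSE
  if (isTRUE(looped$other.rev[i] < 0)) {
    looped$other.rev[i] <- -1 * looped$other.rev[i]
  }
}

# Vectorized view: evaluate the condition once for all records
vectorized <- d
cond <- vectorized$other.rev < 0
cond[is.na(cond)] <- FALSE  # NA conditions count as FALSE
vectorized$other.rev[cond] <- -1 * vectorized$other.rev[cond]

# Both approaches give the same result
identical(looped, vectorized)
```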
In short, when you call modify or modify_so, the following steps are performed.
The functionality of this package resembles dplyr::mutate, since it also allows one to specify data mutations on data frames (or other tabular data objects). The dplyr package is especially useful for interactive use and also for use in programming through ‘underscored’ functions such as mutate_.
The dcmodify package has been developed with a production line in mind, where similar data sets are processed frequently. By taking the modifying rules out of the software, R programmers can build an application that allows users who are less knowledgeable about programming to specify their modification rules.
It can be interesting to study the effect of a certain set of data modifying rules. The lumberjack package is capable of tracking changes in data.
To start logging data you need to replace the magrittr pipe (%>%) with the lumberjack operator %>>% and insert some logging commands into the pipeline.
library(lumberjack)
# add primary key so cellwise changes can be traced
women$id <- letters[1:15]
out <- women %>>%
start_log( cellwise$new(key="id") ) %>>%
modify_so( if (height < mean(height)) height <- mean(height) ) %>>%
dump_log()
## Dumped a log at cellwise.csv
##   step                     time srcref
## 1    1 2021-09-24 11:29:10 CEST     NA
## 2    1 2021-09-24 11:29:10 CEST     NA
## 3    1 2021-09-24 11:29:10 CEST     NA
## 4    1 2021-09-24 11:29:10 CEST     NA
## 5    1 2021-09-24 11:29:10 CEST     NA
## 6    1 2021-09-24 11:29:10 CEST     NA
##                                                     expression key variable old
## 1 modify_so(if (height < mean(height)) height <- mean(height))   a   height  58
## 2 modify_so(if (height < mean(height)) height <- mean(height))   b   height  59
## 3 modify_so(if (height < mean(height)) height <- mean(height))   c   height  60
## 4 modify_so(if (height < mean(height)) height <- mean(height))   e   height  62
## 5 modify_so(if (height < mean(height)) height <- mean(height))   f   height  63
## 6 modify_so(if (height < mean(height)) height <- mean(height))   g   height  64
##   new
## 1  61
## 2  61
## 3  61
## 4  61
## 5  61
## 6  61
Conditional statements including else are not supported yet. Rules containing if() else are ignored with a warning.