madrat is a framework that can help structuring data preparation in R. It splits the data preparation into separate steps with each having distinctive requirements about the returned data. The following tutorial will describe the first steps with the package together with the specific requirements for each calculation step.
madrat requires a local directory to store data such as downloaded source data, cache files, and output. Running getConfig
in the package for the first time you will be asked for a folder to use and store that setting permanently (if allowed by the user).
library(madrat)
<- getConfig()
cfg #> Initialize madrat config with default settings..
#> madrat mainfolder for data storage not set! Do you want to set it now? (y/n)
After setting that directory, the package is ready to use. If not stated otherwise in the config, all downloaded source data and created output files can be found in the subdirectories sources/
or output/
of the main directory, respectectively.
madrat splits the process of data preparation into the following components. Note for developers: The source code of each component comes with a madrat wrapper function managing the data preparation process and performing some sanity checks on the calculations. The wrapper functions will run user defined functions which are specific to a certain source or calculation.
The first step in data preparation is downloading the source data. downloadSource
will create a folder for the given source and set all local file paths correctly. The user defined download function must contain the code required to download the source data in to the local folder the script is run from. An example for such a function is madrat:::downloadTau
.
:::downloadTau
madrat#> function (subtype = "paper")
#> {
#> settings <- list(paper = list(title = "Tau Factor (cellular, crop-specific)",
#> description = "Cellular (0.5deg), crop-specific land use intensity (tau) for 1995 and 2000",
#> url = "https://zenodo.org/record/4282581/files/tau-paper.zip",
#> doi = "10.5281/zenodo.4282581"), historical = list(title = "Tau Factor (historic trends)",
#> description = "Historic land use intensity (tau) development",
#> url = "https://zenodo.org/record/4282548/files/tau-historical.zip",
#> doi = "10.5281/zenodo.4282548"))
#> meta <- toolSubtypeSelect(subtype, settings)
#> download.file(meta$url, destfile = "tau.zip", quiet = testthat::is_testing())
#> unzip("tau.zip")
#> unlink("tau.zip")
#> return(list(url = meta$url, doi = meta$doi, title = meta$title,
#> description = meta$description, author = person("Jan Philipp",
#> "Dietrich", email = "dietrich@pik-potsdam.de", comment = "https://orcid.org/0000-0002-4309-6431"),
#> unit = "1", version = "1.0", release_date = "2012-05-10",
#> license = "Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)",
#> reference = bibentry("Article", title = "Measuring agricultural land-use intensity - A global analysis using a model-assisted approach",
#> author = c(person("Jan Philipp", "Dietrich", email = "dietrich@pik-potsdam.de",
#> comment = "https://orcid.org/0000-0002-4309-6431"),
#> person("Christoph", "Schmitz"), person("Christoph",
#> "Mueller"), person("Marianela", "Fader"), person("Hermann",
#> "Lotze-Campen"), person("Alexander", "Popp")),
#> year = "2012", journal = "Ecological Modelling",
#> volume = "232", pages = "109-118", url = "https://doi.org/10.1016/j.ecolmodel.2012.03.002",
#> doi = "10.1016/j.ecolmodel.2012.03.002")))
#> }
#> <bytecode: 0x565100e12620>
#> <environment: namespace:madrat>
The name of the user function always has to be a combination of the function type (in this case “download”) and the source or calculation type (in this case “Tau”). The wrapper function always expects the source or calculation type as argument. To run downloadTau through the wrapper, one has to use the following call:
downloadSource("Tau", overwrite = TRUE)
Here we set overwrite = TRUE
to make sure that the data is downloaded in any case. In the default case overwrite = FALSE
data will only be downloaded if there is not already an existing source folder containing the data.
As soon as the data is available in the source folder it can be read in. Reading is performed by readSource
and is split into 1 to 3 steps (depending on the data): read, correct and convert.
In the first step the data is read into R and converted to a magclass object. Except of the conversion no other modifications are performed and the content of the magclass object should be completely identical to the downloaded data.
:::readTau
madrat#> function (subtype = "paper")
#> {
#> files <- c(paper = "tau_data_1995-2000.mz", historical = "tau_xref_history_country.mz")
#> file <- toolSubtypeSelect(subtype, files)
#> x <- read.magpie(file)
#> x[x == -999] <- NA
#> return(x)
#> }
#> <bytecode: 0x565100e08ff8>
#> <environment: namespace:madrat>
If one wishes to only read in data (without conversion), this can be done by running readSource
with the argument convert = FALSE
:
<- readSource("Tau", "paper", convert = FALSE) x
If a data source comes with several files it is sometimes necessary to specify a subtype. In the given example the source data comes with two datasets (“paper” and “historical”). In the example above the subtype “paper” is chosen.
The correction step is optional and can be used to fix issues in the data such as removing duplicates, replacing NAs or other corrections. This step is purely about fixing quality problems in the input data. If this step is required one can create a correct-function such as correctTau
for the data source “Tau”. As the example data “Tau” does not require any of these corrections there is no correct function in the given example data.
To allow for flexible aggregation of data to world regions and for compatibility between different data sources madrat imposes a standard spatial resolution on all data sources. The used standard is the ISO 3166-1 3-digit country code standard. The function getISOlist()
returns a vector of these countries.
After conversion the dataset should provide numbers for all countries listed in that standard. The wrapper function readSource
will throw an error if countries are missing. It is important that a best guess is used for countries which are not directly provided by the given source as everything else might lead to errors or critical biases in the follow up calculations. Support tools such as toolCountryFill
help to interpolate the missing information:
:::convertTau
madrat#> function (x)
#> {
#> "!# @monitor madrat:::sysdata$iso_cell magclass:::ncells"
#> "!# @ignore madrat:::toolAggregate"
#> tau <- x[, , "tau"]
#> xref <- x[, , "xref"]
#> xref[is.na(tau) | is.nan(tau)] <- 10^-10
#> tau[is.na(tau) | is.nan(tau)] <- 1
#> if (ncells(x) == 59199) {
#> iso_cell <- sysdata$iso_cell
#> iso_cell[, 2] <- getCells(x)
#> tau <- toolAggregate(tau, rel = iso_cell, weight = collapseNames(xref))
#> xref <- toolAggregate(xref, rel = iso_cell)
#> }
#> tau <- toolCountryFill(tau, fill = 1, TLS = "IDN", HKG = "CHN",
#> SGP = "CHN", BHR = "QAT")
#> xref <- toolCountryFill(xref, fill = 0, verbosity = 2)
#> return(mbind(tau, xref))
#> }
#> <bytecode: 0x565100e1d4b8>
#> <environment: namespace:madrat>
Read and convert can be run together by running readSource
:
<- readSource("Tau", "paper") x
Same as correct
, also the convert
function is optional, but not providing it indicates to madrat that the resulting data is not on ISO country level and will therefore not be available for aggregation to world regions. Cases in which sources will not have a convert function are datasets without spatial resolution (e.g. providing only a global value) or datasets which should for other reasons not be aggregated to country level. For most cases a convert
function should exist.
As the corrections performed in a correct
function are usually similar to the interpolations performed in a convert
function it is also possible to have these corrections just included in the convert
functions. For this reason most sources usually have a read
and a convert
but not a correct
function.
Besides reading in a data source and preparing it for further usage, data preparation often requires to extract certain information out of the given data sources. In contrast to the steps before this can also mean blending two or more datasets into one output dataset. For this reason madrat distinguishes between the source type, which is always linked to a specific source, and a calculation type, which is always linked to a specific data output.
In the given example the data source “Tau” is used to calculate a data output called “TauTotal”.
:::calcTauTotal
madrat#> function (source = "paper")
#> {
#> tau <- readSource("Tau", source)
#> x <- collapseNames(tau[, , "tau.total"])
#> weight <- collapseNames(tau[, , "xref.total"])
#> return(list(x = x, weight = weight, min = 0, max = 10, structure.temporal = "^y[0-9]{4}$",
#> structure.spatial = "^[A-Z]{3}$", unit = "1", description = "Agricultural Land Use Intensity Tau",
#> note = c("data based on Dietrich J.P., Schmitz C., Müller C., Fader M., Lotze-Campen H., Popp A.,",
#> "Measuring agricultural land-use intensity - A global analysis using a model-assisted approach",
#> "Ecological Modelling, Volume 232, 10 May 2012, Pages 109-118, ISSN 0304-3800, https://doi.org/10.1016/j.ecolmodel.2012.03.002."),
#> source = bibentry("Article", title = "Measuring agricultural land-use intensity - A global analysis using a model-assisted approach",
#> author = c(person("Jan Philipp", "Dietrich"), person("Christoph",
#> "Schmitz"), person("Christoph", "Mueller"), person("Marianela",
#> "Fader"), person("Hermann", "Lotze-Campen"),
#> person("Alexander", "Popp")), year = "2012",
#> journal = "Ecological Modelling", volume = "232",
#> pages = "109-118", url = "https://doi.org/10.1016/j.ecolmodel.2012.03.002",
#> doi = "10.1016/j.ecolmodel.2012.03.002")))
#> }
#> <bytecode: 0x565100df6588>
#> <environment: namespace:madrat>
calc-Functions always have to return a list of objects with some list entries mandatory and others optional. Mandatory entries are the calculated data object in magclass format x
, a weight
for aggregating the data from country level to world regions (can be NULL
if the data should just be summed up), a short description
of the dataset, and the unit
of the dataset. Optional statements are for instance a note
with additional details about the data or min
and max
values for the data which will be used for sanity checking the data coming out of the calculation. A full overview about required and/or allowed list entries can be found in the help to calcOutput
(?calcOutput
).
An output calculation can be run with the wrapper function calcOutput
:
<- calcOutput("TauTotal") x
By default it will return the data aggregated to the world regions set in the madrat configuration. Adding the argument aggregate = FALSE
will return the data in its original resolution and is useful if a calc function is used as source for another calc function.
When preparing data for a certain purpose it is often the case that not only one but several datasets have to be prepared as a collection of data. This is where retrieveData
steps in. It allows to create a collection of datasets and manages their calculation and packaging. The user defined functions matching to the wrapper retrieveData
start with full
in the name:
:::fullEXAMPLE
madrat#> function (rev = 0, dev = "")
#> {
#> writeLines("This is a test", paste0(getConfig("outputfolder"),
#> "/test.txt"))
#> if (rev >= 1) {
#> calcOutput("TauTotal", years = 1995, round = 2, file = "fm_tau1995.cs4")
#> }
#> if (dev == "test") {
#> message("Here you could execute code for a hypothetical development version called \"test\"")
#> }
#> return(list(tag = "customizable_tag"))
#> }
#> <bytecode: 0x565100e01078>
#> <environment: namespace:madrat>
Each function must have the argument rev
which contains a revision number. This can be used to package the data differently based on the requested revision of the data. In the given example the calculation “TauTotal” is only performed for revisions greater or equal 1.
retrieveData("example", rev = 1)
retrieveData
will perform the calculations, create log files and package the produced files together with the log files into a compressed tgz file. The file can be found in the ouput folder of the main directory specified in the madrat config.
To have everything proper functioning there are some coding rules to follow:
calcOutput("TauTotal")
instead of type<-"TauTotal";calcOutput(type)
). This is necessary so that the network of functions can be properly detected by the madrat frameworkOwn functions can be made available to madrat just by sourcing them. They can be made visible to madrat by setting the option globalenv = TRUE
. The following example shows how that can look like.
library(madrat)
# add global environment to madrat search path
setConfig(globalenv = TRUE)
# define simple calc-function
<- function() {
calcPi <- toolCountryFill(NULL, fill = pi)
out return(list(x = out,
weight = out,
unit = "1",
description = "Just pi"))
}
# run calcPi through wrapper function calcOutput
calcOutput("Pi")
In the given example calcPi
is a calculation function which is just assigning the value pi to each country and given each country the same weight for a weighted aggregation (pi). After sourcing the function it can be used through the calc-wrapper function calcOutput("Pi")
. The result is the aggregated data to the default region set up.
The same procedure works also for all other functions such as downloadXYZ
, readXYZ
, correctXYZ
, convertXYZ
and fullXYZ
.
Since version 1.00 madrat allows to link packages to it and make use of its functionality. For linking madrat (in version >= 1.00) has to be added as a package dependency.
: madrat(>= 1.00) Depends
In addition the following lines of code should be added as madrat.R
to the R folder of the package.
### madrat.R
#' @importFrom madrat vcat
<- function(libname, pkgname){
.onLoad ::setConfig(packages = c(madrat::getConfig("packages"), pkgname),
madrat.cfgchecks = FALSE, .verbose = FALSE)
}
#create an own warning function which redirects calls to vcat (package internal)
<- function(...) vcat(0, ...)
warning
# create an own stop function which redirects calls to stop (package internal)
<- function(...) vcat(-1, ...)
stop
# create an own cat function which redirects calls to cat (package internal)
<- function(...) vcat(1, ...) cat
The .onLoad
statement makes sure that the package is linked to madrat as soon as it is loaded. The replacements of warning
, stop
and cat
are required to make use of the specific notification system in madrat, which makes for instance sure that all notes, warnings and error messages will show up in the written log files.
Besides these modifications no further changes are required and functions in the new package will be visible to the madrat
wrapper functions.