Testing the package is an important part of package development. Although we are unit-testing the functions, there is always a situation that is not tested. Please feel free to try the code under different conditions. Please report the bugs as a new issue. In the following, there steps to generate synthetic data and conduct some testing.
In order to test the package, you need to have the code on your system (R (>= 3.5.0)). There are two options:
If you are running the code on a cluster and trying to carry out some test on scaling out, installing it is the best option. However, if you want to dig deeper and debug the code interactively, cloning the code is your choice. If you are planning to test, debug, and commit the changes, use forking. In the following, we explain all these options.
Use devtools::install_github
to install the package. If
you do not specify the ref
, it will install the master (or
main) branch. The master
branch hosts the latest released
code. The latest updates are committed to the develop
branch. For more details please refer Git Branching Model in
the Developers Guide section.
library(devtools)
try(detach("package:CausalGPS", unload = TRUE), silent = TRUE) # if already you have the package, detach and unload it, to have a new install.
install_github("fasrc/CausalGPS", ref="develop")
library(CausalGPS)
The process should run smoothly. Try ?CausalGPS
. It
should open the package description page under the help tab (assuming
you are using RStudio).
Go to the package Github repository, at the
top left (A), choose the branch that you want to clone, then click on
the Code
button (B), then choose Download ZIP
.
The following figure shows the buttons’ location.
After cloning the code, open one of the files using RStudio, then
change the project directory to the project directory
(Session > Set Working Directory > To Project Directory
).
Forking is explained in the Contribution page.
There are some commands that you need to use during testing and debugging the package. You can read about these commands in R Packages book. For debugging and testing, you do not need to know all steps. Here is the list of commands that you need:
document()
if you change Roxygen Skeleton (e.g., you
added new argument to the function, or improved the example), run
documents()
. It will make sure that your internal
documentation is updated.test()
runs all tests that is located inside the
tests/testthat
folder. Any modification to the code should
be followed by running test()
to make sure that you are not
breaking existing functionality.check()
checks many other features (e.g., whether the
package can be installed without problem). You do not need to run
check()
frequently; however, we would suggest running it
once you get the code and once before committing to make sure that there
are no errors, warnings, and notes.load_all()
the program source code is different than
the one which is loaded on your memory (read more here). If you
modify some part of the code, you need to load it on memory to takes the
latest modifications effect. Run load_all()
after
modification and before testing.install()
in most of the cases load_all()
is sufficient. However, sometimes the links between package components
are broken. It especially happens when you significantly change some
part of the code, and it does not pass the tests. In such cases, you may
need to run install()
.You can use any causal inference studies data to test the package. The database needs to have the following attributes:
Y
: Output valuew
: TreatmentC
: covariate matrixThe package can generate synthetic data that can be used to test different features of the package. At the current implementation, the code only generates synthetic numerical data; however, with small innovation, one can add categorical data. In the following, we present some reproducible examples that you can copy and build upon them.
library("CausalGPS")
<- generate_syn_data(sample_size = 10000)
mydata str(mydata)
#'data.frame': 10000 obs. of 8 variables:
# $ Y : num -17 3.66 -45.88 -12.12 2.75 ...
# $ treat: num 9.74 14.88 5.22 10.19 16.65 ...
# $ cf1 : num -0.128 0.155 -1.271 0.551 0.846 ...
# $ cf2 : num 0.2376 0.0466 0.8918 -1.8434 -2.4487 ...
# $ cf3 : num 1.5492 -0.4453 0.0718 0.8309 0.6749 ...
# $ cf4 : num 1.3738 0.8621 0.4735 0.7013 -0.0851 ...
# $ cf5 : num -2 2 1 -2 0 1 1 2 -2 -1 ...
# $ cf6 : num -1.794 -1.822 0.79 -0.591 2.673 ...
In the following example, first, we generate 10000 synthetic data
samples, then we feed them into estimate_gps
function to
estimate their GPS values. Please note that the Y variable is not used
inside the code, and it is only provided to cbind with the generated GPS
values. You can read more about different arguments in the documentation
(?estimate_gps
). In summary, - We want to run a code using
SuperLearner (sl
) prediction model. Inside the SuperLearner
package, we want to use the XGBoost package. As you know, we internally
generate a customized wrapper for the XGBoost package; thus, the correct
way to activate that is by passing m_xgboost
, which stands
for modified xgboost. - We requested for one thread
(nthread = 1
); you can use as much as you want; the package
will use the available one and will ignore the rest of them. XGBoost
uses OpenMP backend to use all cores. Sometimes it becomes a really big
challenge on Mac systems. So if you are using Mac and do not see any
performance improvement, be aware of that. - In params
, we
passed the list of parameters; the function will choose one from each
list at random and will generate a customize wrapper based on them. If
you want to use a specific value, just give a list one number (e.g.,
xgb_max_depth = c(3)). All parameters that start with xgb_
will only change XGBoost hyperparameters.
library("CausalGPS")
<- generate_syn_data(sample_size = 10000)
mydata <- estimate_gps(mydata$Y,
data_with_gps $treat,
mydatac("cf1","cf2","cf3","cf4","cf5","cf6")],
mydata[pred_model = "sl",
internal_use = FALSE,
params = list(xgb_max_depth = c(3,4,5),
xgb_nrounds=c(10,20,30,40,50,60)),
nthread = 1,
sl_lib = c("m_xgboost")
)
Now, let’s add some categorical data. Let’s say our data belongs to 5 different years (2000 observations per year), and each year we have data from 4 different regions, including North, South, East, and West.
library("CausalGPS")
<- generate_syn_data(sample_size = 10000)
mydata
<- c(rep(c("2001"), each=2000),
year rep(c("2002"), each=2000),
rep(c("2003"), each=2000),
rep(c("2004"), each=2000),
rep(c("2005"), each=2000))
<- rep(c(rep("North",each=500),
region rep("South",each=500),
rep("East",each=500),
rep("West",each=500)), each=5)
$year <- as.factor(year)
mydata$region <- as.factor(region)
mydata
<- estimate_gps(mydata$Y,
data_with_gps $treat,
mydatac("cf1","cf2","cf3","cf4","cf5","cf6", "year", "region")],
mydata[pred_model = "sl",
internal_use = FALSE,
params = list(xgb_max_depth = c(3,4,5),
xgb_nrounds=c(10,20,30,40,50,60)),
nthread = 1,
sl_lib = c("m_xgboost")
)
Generating pseudo population (gen_pseudo_pop()
)is one of
the important parts of the package. It internally uses
estimate_gps()
package. An acceptable pseudo population
should pass the covariate balance test. Users choose which test and what
threshold, and how many attempts. If the covariate balance test is met,
the function stops trying and returns the generated population. However,
if it cannot pass the test, it still returns what is generated with a
message to the user indicating the population did not pass the test. At
each iteration, we change the hyperparameters to modify GPS values (to
better or worse) to pass the covariate balance test. After estimating
the GPS values, we need to compile the population. There are three major
approaches to compile pseudo population, including:
Among different methods for testing covariate balance test, only
absolute
approach is implemented.
library("CausalGPS")
<- generate_syn_data(sample_size = 10000)
mydata <- generate_pseudo_pop(mydata$Y,
pseudo_pop $treat,
mydatac("cf1","cf2","cf3","cf4","cf5","cf6")],
mydata[ci_appr = "matching",
pred_model = "sl",
sl_lib = c("m_xgboost"),
params = list(xgb_nrounds=c(10,20,30),
xgb_eta=c(0.1,0.2,0.3)),
nthread = 1,
covar_bl_method = "absolute",
covar_bl_trs = 0.1,
covar_bl_trs_type= "mean",
max_attempt = 1,
matching_fun = "matching_l1",
delta_n = 1,
scale = 0.5)
The second causal inference approach is weighting
. Here
is an example to generate pseudo population using weighting
approach.
<- generate_syn_data(sample_size = 10000)
mydata
<- c(rep(c("2001"), each=2000),
year rep(c("2002"), each=2000),
rep(c("2003"), each=2000),
rep(c("2004"), each=2000),
rep(c("2005"), each=2000))
<- rep(c(rep("North",each=500),
region rep("South",each=500),
rep("East",each=500),
rep("West",each=500)), each=5)
$year <- as.factor(year)
mydata$region <- as.factor(region)
mydata
<- generate_pseudo_pop(mydata$Y,
pseudo_pop $treat,
mydatac("cf1","cf2","cf3","cf4","cf5","cf6","year","region")],
mydata[ci_appr = "weighting",
pred_model = "sl",
sl_lib = c("m_xgboost"),
params = list(xgb_nrounds=c(10,20,30),
xgb_eta=c(0.1,0.2,0.3)),
nthread = 1,
covar_bl_method = "absolute",
covar_bl_trs = 0.1,
covar_bl_trs_type = "mean",
max_attempt = 1
)
After generating a pseudo population, we can process the data for
different purposes. So far, estimating exposure rate function
(estimate_erf
) is implemented.
# library("CausalGPS")
<- generate_syn_data(sample_size = 10000)
mydata <- generate_pseudo_pop(mydata$Y,
pseudo_pop $treat,
mydatac("cf1","cf2","cf3","cf4","cf5","cf6")],
mydata[ci_appr = "matching",
pred_model = "sl",
sl_lib = c("m_xgboost"),
params = list(xgb_nrounds=c(10,20,30),
xgb_eta=c(0.1,0.2,0.3)),
nthread = 1,
covar_bl_method = "absolute",
covar_bl_trs = 0.1,
covar_bl_trs_type= "mean",
max_attempt = 1,
matching_fun = "matching_l1",
delta_n = 1,
scale = 0.5)
<- estimate_npmetric_erf(pseudo_pop$pseudo_pop$Y,
erf_val $pseudo_pop$w,
pseudo_popbw_seq=seq(0.2,2,0.2),
w_vals = seq(2,20,0.5))
The package is being tested on different data samples, some of them are generated during testing, and some others are generated before and just are loaded. These data are located in “R/sysdata.rda” file. If you add new features to the code, you may need to use some external or pre-computed data set to test your functions. In the following, we explain the steps to append data to “R/sysdata.rda”. Please note that CRAN might reject large data sets.
Step 1: Run check()
to make sure that the file as-is
satisfies all test requirements. If it raises an error, warning, or
note, please address them before modifying the data. In some rare cases,
check()
does not pass successfully, however, test() does.
This is sufficient for changing the data; however, you need to address
it before submitting a pull request.
Step 2: Create a backup data from the current data.
# in terminal
/sysdata.rda R/sysdata_backup.rda cp R
rm(list=ls())
load("R/sysdata.rd")
<- ls() list_names
list_names
. Please note
the quotes. You do not need to add data, just the name.<- c(list_names, "mydata") list_names
save(list=list_names, file="R/sysdata.rda")
check()/test()
to make sure that it passes
all tests. If it does not pass all tests, address the problem.check()/pass()
remove the
backup file.