CausalGPS

Installation

library("devtools")
install_github("fasrc/CausalGPS", ref="master")
library("CausalGPS")

Usage

Input parameters:

Y A vector of the observed outcome variable.
w A vector of the observed continuous exposure variable.
c A data.frame or matrix of the observed covariates.
ci_appr The causal inference approach. Possible values are:
- "matching": Matching by GPS
- "weighting": Weighting by GPS
pred_model The prediction model (use "sl" for SuperLearner).
gps_model The model type used to estimate the GPS values: parametric (default) or non-parametric.
use_cov_transform If TRUE, the function uses transformers to meet the covariate balance.
transformers A list of transformers. Each transformer should be a unary function. To use a custom function, pass its name in quotes.
Available transformers:
- pow2: to the power of 2
- pow3: to the power of 3
bin_seq A sequence of w (treatment) values used to generate the pseudo population. If NULL is passed, the default value is used, which is seq(min(w)+delta_n/2, max(w), by=delta_n).
trim_quantiles A numeric vector of length two that gives the trim quantile levels. Both numbers should be in the range [0,1] and in increasing order (default: c(0.01,0.99)).
optimized_compile If TRUE, uses counts to keep track of the number of replicates in the pseudo population.
params A list of parameters that are used internally. Unrelated parameters are ignored.
nthread An integer value that gives the number of threads to be used by internal packages.
... Additional arguments passed to different models.
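
To make the trim_quantiles and bin_seq descriptions concrete, the following base R sketch computes both on a toy exposure vector. The data values are made up for illustration. Whether the package builds the default sequence before or after trimming is not stated here, so the sketch applies the documented formula to the untrimmed vector.

```r
# Toy exposure vector (illustrative values only).
w <- c(0.5, 1.2, 2.0, 3.7, 4.1, 5.9, 7.3, 8.8, 9.4, 10.0)

# Trimming: keep exposures between the lower and upper quantile levels.
trim_quantiles <- c(0.01, 0.99)
bounds <- quantile(w, probs = trim_quantiles)
w_trimmed <- w[w >= bounds[1] & w <= bounds[2]]

# Default bin sequence, following the formula in the bin_seq description.
delta_n <- 1
bin_seq <- seq(min(w) + delta_n / 2, max(w), by = delta_n)

print(w_trimmed)
print(bin_seq)
```

With only ten observations, the 1% and 99% quantiles fall just inside the extremes, so the smallest and largest values are dropped.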

Additional parameters

Causal Inference Approach (ci_appr)

Prediction models (pred_model)

 set.seed(422)
 n <- 10000

 # Generate synthetic data and add categorical covariates.
 mydata <- generate_syn_data(sample_size = n)
 year <- sample(x = c("2001", "2002", "2003", "2004", "2005"), size = n, replace = TRUE)
 region <- sample(x = c("North", "South", "East", "West"), size = n, replace = TRUE)
 mydata$year <- as.factor(year)
 mydata$region <- as.factor(region)
 mydata$cf5 <- as.factor(mydata$cf5)

 # Generate the pseudo population by matching on the GPS.
 pseudo_pop <- generate_pseudo_pop(mydata$Y,
                                   mydata$treat,
                                   mydata[c("cf1", "cf2", "cf3", "cf4", "cf5", "cf6", "year", "region")],
                                   ci_appr = "matching",
                                   pred_model = "sl",
                                   gps_model = "non-parametric",
                                   use_cov_transform = TRUE,
                                   transformers = list("pow2", "pow3", "abs", "scale"),
                                   trim_quantiles = c(0.01, 0.99),
                                   optimized_compile = TRUE,
                                   sl_lib = c("m_xgboost"),
                                   covar_bl_method = "absolute",
                                   covar_bl_trs = 0.1,
                                   covar_bl_trs_type = "mean",
                                   max_attempt = 4,
                                   matching_fun = "matching_l1",
                                   delta_n = 1,
                                   scale = 0.5,
                                   nthread = 1)

 # Plot the resulting pseudo population object.
 plot(pseudo_pop)
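
The transformers argument in the call above is a list of names of unary functions. As a minimal sketch of what that means, the base R code below defines pow2 and pow3 from their descriptions (an assumption, not the package's source), adds a hypothetical custom transformer log1p_abs, and applies each one by name to a toy covariate.

```r
# Assumed definitions, based only on the descriptions "to the power of 2/3".
pow2 <- function(x) x^2
pow3 <- function(x) x^3

# Hypothetical custom transformer: any unary function can be used.
log1p_abs <- function(x) log1p(abs(x))

# Transformers are referred to by name, as in the transformers argument.
transformers <- list("pow2", "pow3", "abs", "log1p_abs")

# Toy covariate (illustrative values only).
x <- c(-2, -1, 0, 1, 2)

# Look up each function by name and apply it to the covariate.
transformed <- lapply(transformers, function(name) {
  f <- get(name, mode = "function")
  f(x)
})
names(transformed) <- unlist(transformers)

str(transformed)
```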

matching_l1 is the Manhattan distance matching approach. The prediction model uses the SuperLearner package; pass "sl" as pred_model to use it. SuperLearner supports different machine learning methods and packages.

params is a list of hyperparameters that users can pass to the third-party libraries in the SuperLearner package. All hyperparameters go into the params list, and prefixes distinguish the parameters of different libraries. The following table shows the external package names, the equivalent names to use in sl_lib, the prefixes to use for their hyperparameters in the params list, and the available hyperparameters.

Package name    sl_lib name    prefix    available hyperparameters
XGBoost         m_xgboost      xgb_      nrounds, eta, max_depth, min_child_weight
ranger          m_ranger       rgr_      num.trees, write.forest, replace, verbose, family
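
The prefix convention in the table can be illustrated with a small base R sketch. The helper select_by_prefix is hypothetical and is not part of CausalGPS; it only shows how a single params list can carry hyperparameters for several libraries and be split by prefix.

```r
# One params list with hyperparameters for two libraries,
# named as prefix + hyperparameter (for example xgb_ + nrounds).
params <- list(xgb_max_depth = c(3, 4, 5),
               xgb_nrounds   = c(10, 20, 30, 40),
               rgr_num.trees = c(100, 200))

# Hypothetical helper: keep entries with the given prefix and strip it.
select_by_prefix <- function(params, prefix) {
  keep <- startsWith(names(params), prefix)
  selected <- params[keep]
  names(selected) <- substring(names(selected), nchar(prefix) + 1)
  selected
}

xgb_params <- select_by_prefix(params, "xgb_")
rgr_params <- select_by_prefix(params, "rgr_")

str(xgb_params)
str(rgr_params)
```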

nthread is the number of available threads (cores). XGBoost needs OpenMP installed on the system to parallelize the processing.
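
As a rough illustration of Manhattan (L1) distance matching, the sketch below finds, for one target exposure level and one target GPS value, the closest observed unit within a caliper of delta_n/2. The helper match_l1 is hypothetical, the data are made up, and the way scale weights the two terms is an assumption of this sketch; any standardization the package may apply is omitted.

```r
# Toy observed data: exposure w and estimated GPS for five units.
obs <- data.frame(w   = c(1.0, 1.4, 2.1, 2.6, 3.0),
                  gps = c(0.30, 0.10, 0.25, 0.40, 0.15))

# Hypothetical helper: return the row index of the closest observed unit
# under a weighted L1 distance on (gps, w), restricted to units whose
# exposure lies within delta_n / 2 of the target exposure level.
# Assumption: `scale` weights the GPS term and 1 - scale the exposure term.
match_l1 <- function(obs, w_target, gps_target, delta_n = 1, scale = 0.5) {
  in_caliper <- which(abs(obs$w - w_target) <= delta_n / 2)
  if (length(in_caliper) == 0) return(NA_integer_)
  d <- scale * abs(obs$gps[in_caliper] - gps_target) +
    (1 - scale) * abs(obs$w[in_caliper] - w_target)
  in_caliper[which.min(d)]
}

# Match at exposure level 1.5 for a unit whose GPS at that level is 0.28.
matched <- match_l1(obs, w_target = 1.5, gps_target = 0.28)
print(obs[matched, ])
```

Setting scale closer to 1 makes the match depend more on the GPS and less on the exposure distance.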

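# Y, w, and c are the outcome, exposure, and covariates described above.
# Hyperparameter names combine the library prefix with the hyperparameter
# name from the table, for example xgb_ + nrounds.
data_with_gps <- estimate_gps(Y,
                              w,
                              c,
                              pred_model = "sl",
                              internal_use = FALSE,
                              params = list(xgb_max_depth = c(3, 4, 5),
                                            xgb_nrounds = c(10, 20, 30, 40)),
                              nthread = 1,
                              sl_lib = c("m_xgboost")
                              )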

If internal_use is set to TRUE, the function returns additional vectors that the selected causal inference approach uses to generate a pseudo population. See ?estimate_gps for more details.

# Signature only (function body omitted).
estimate_npmetric_erf <- function(matched_Y,
                                  matched_w,
                                  matched_counter = NULL,
                                  bw_seq = seq(0.2, 2, 0.2),
                                  w_vals,
                                  nthread)

# Generate a synthetic data set.
syn_data <- generate_syn_data(sample_size = 1000,
                              outcome_sd = 10,
                              gps_spec = 1,
                              cova_spec = 1)

The CausalGPS package logs internal activities to the CausalGPS.log file. The file is located in the source file location, and new entries are appended to it. Users can change the logging file name (and path) and the logging threshold.

The logging mechanism has different thresholds (see the logger package). The two most important thresholds are the INFO and DEBUG levels. INFO, the default level, logs general information about the process. DEBUG, if activated, logs more detailed information that can be used for debugging.
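
The difference between the INFO and DEBUG thresholds can be seen with the logger package directly. This is a sketch under assumptions: it calls logger's own functions rather than any CausalGPS interface (which is not shown here), writes to a temporary file chosen for illustration, and runs only if logger is installed.

```r
# Runs only when the logger package is available.
if (requireNamespace("logger", quietly = TRUE)) {
  # Illustrative log file in a temporary location (not CausalGPS.log).
  log_file <- tempfile(fileext = ".log")
  logger::log_appender(logger::appender_file(log_file))

  # INFO threshold: INFO messages are written, DEBUG messages are dropped.
  logger::log_threshold(logger::INFO)
  logger::log_info("general progress message")
  logger::log_debug("detailed message, dropped at the INFO threshold")

  # DEBUG threshold: DEBUG messages are written as well.
  logger::log_threshold(logger::DEBUG)
  logger::log_debug("detailed message, written at the DEBUG threshold")

  # Read back what reached the file.
  log_lines <- readLines(log_file)
  print(log_lines)
}
```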