Weighted Quantile Sum (WQS) regression is a statistical model for multivariate regression in high-dimensional datasets commonly encountered in environmental exposures, epi/genomics, and metabolomic studies, among others. The model constructs a weighted index estimating the mixed effect of all predictor variables on an outcome, which may then be used in a regression model with relevant covariates to test the association of the index with a dependent variable or outcome. The contribution of each individual predictor to the overall index effect may then be assessed by the relative strength of the weights the model assigns to each variable.
The gWQS package extends WQS regression to applications with continuous and categorical outcomes and implements the random subset WQS and the repeated holdout WQS. In practical terms, the primary outputs of an analysis will be the parameter estimates and significance tests for the overall index effect of predictor variables, and the estimated weights assigned to each predictor, which identify the relevant contribution of each variable to the relationship between the WQS index and the outcome variable.
For additional theoretical background on WQS regression and its extensions, see the references provided below.
gWQS
packageThe main functions of the gWQS
package is gwqs
and gwqsrh
. The first extends WQS regression to applications with continuous, categorical and count outcomes and includes the option rs
that allows to apply a random subset implementation of WQS; the second relies on the gwqs
function and extends the method to a repeated holdout validation procedure. In this vignette we will only show the application of WQS to a continuous outcome. We created the wqs_data
dataset (available once the package is installed and loaded) to show how to use this function. These data reflect 59 exposure concentrations simulated from a distribution of 34 PCB exposures and 25 phthalate biomarkers measured in subjects participating in the NHANES study (2001-2002). Additionally, 8 outcome measures were simulated applying different distributions and fixed beta coefficients to the predictors. In particular y
and yLBX
were simulated from a normal distribution, ybin
and ybinLBX
from a binomial distribution, ymultinom
and ymultinomLBX
from a multinomial distribution and ycount
and ycountLBX
from a Poisson distribution. The sex
variable was also simulated to allow to adjust for a covariate in the model. This dataset can thus be used to test the gWQS
package by analyzing the mixed effect of the simulated chemicals on the different outcomes, with adjustments for covariates.
The following script calls a WQS model for a continuous outcome using the function gwqs
that returns an object of class gwqs
; the three functions gwqs_barplot
, gwqs_scatterplot
and gwqs_fitted_vs_resid
allows to plot the figures shown in figure @ref(fig:model1):
# we save the names of the mixture variables in the variable "PCBs"
PCBs <- names(wqs_data)[1:34]
# we run the model and save the results in the variable "results"
results <- gwqs(yLBX ~ wqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 2, b1_pos = TRUE,
b1_constr = FALSE, family = "gaussian", seed = 2016)
# bar plot
gwqs_barplot(results)
# scatter plot y vs wqs
gwqs_scatterplot(results)
# scatter plot residuals vs fitted values
gwqs_fitted_vs_resid(results)
This WQS model tests the relationship between our dependent variable, y
, and a WQS index estimated from ranking exposure concentrations in deciles (q = 10
); in the gwqs
formula the wqs
term must be included as if a wqs
variable was present in the dataset. The data were divided in 40% of the dataset for training and 60% for validation (validation = 0.6
), and 2 bootstrap samples (b = 2
) for parameter estimation were assigned (in practical applications we suggest at least 100 bootstrap samples to be used). Because WQS provides a unidirectional evaluation of mixture effects, we first examined weights derived from bootstrap models where \(\beta_1\) was positive (b1_pos = TRUE
); we could test for negative associations by setting that parameter to be false (b1_pos = FALSE
). We can also choose to constrain the \(\beta_1\) to be positive (b1_pos = TRUE
and b1_constr = TRUE
) or negative (b1_pos = FALSE
and b1_constr = TRUE
) when we estimate the weights; in the case of example 1 we are not applying a constraint to \(\beta_1\). We linked our model to a gaussian distribution to test for relationships between the continuous outcome and exposures (family = "gaussian"
), and fixed the seed to 2016 for reproducible results (seed = 2016
).
Figure @ref(fig:model1) A is a barplot showing the weights assigned to each variable ordered from the highest weight to the lowest. These results indicate that the variables LBXF07LA
, LBXD02LA
and LBX138LA
are the largest contributors to this mixture effect. The dashed red line represents the cutoff \(\tau\) (by default equal to the inverse of the number of elements in the mixture as suggested in Carrico et al. 2014) to discriminate which element has a significant weight greater than zero.
In plot B of figure @ref(fig:model1) we have a representation of the wqs index vs the outcome (adjusted for the model residual when covariates are included in the model) that shows the direction and the shape of the association between the exposure and the outcome. For example, in this case we can observe a linear and positive relationship between the mixture and the yLBX
variable.
In plot C a diagnostic graph of the residuals vs the fitted values is shown to check if they are randomly spread around zero or if there is a trend. All these plots are built using the ggplot2
package.
To test the statistical significance of the association between the variables in the model, the following code has to be run as for a classical R
regression function:
summary(results)
##
## Call:
## gwqs(formula = yLBX ~ wqs, data = wqs_data, mix_name = PCBs,
## b = 2, b1_pos = TRUE, b1_constr = FALSE, q = 10, validation = 0.6,
## family = "gaussian", seed = 2016)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9642 -0.6802 -0.0257 0.6717 3.4734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.49333 0.31386 -14.32 <2e-16 ***
## wqs 1.01991 0.06753 15.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.210009)
##
## Null deviance: 636.62 on 299 degrees of freedom
## Residual deviance: 360.58 on 298 degrees of freedom
## AIC: 912.54
##
## Number of Fisher Scoring iterations: 2
This result tells us that the association is positive and statistically significant (p<2e-16
).
To have the exact values of the estimated weights we can apply the command results$final_weights
. The following code shows the first six highest weights; the full list of weights can be called by omitting the head function:
head(results$final_weights)
## mix_name mean_weight
## LBX138LA LBX138LA 0.16536064
## LBXF07LA LBXF07LA 0.16257973
## LBXD02LA LBXD02LA 0.13392939
## LBX105LA LBX105LA 0.09774876
## LBXF06LA LBXF06LA 0.05016536
## LBX118LA LBX118LA 0.04785201
These same tables are also shown in the Viewer window through the functions gwqs_summary_tab
and gwqs_weights_tab
respectively. Both these two functions use the package kableExtra
to produce the output. The output (table @ref(tab:sum1) and @ref(tab:w1)) and respective code is shown below:
gwqs_summary_tab(results)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -4.49 | 0.3140 | -14.3 | 0 |
wqs | 1.02 | 0.0675 | 15.1 | 0 |
mf_df <- as.data.frame(signif(coef(summary(results$fit)), 3))
kable_styling(kable(mf_df, row.names = TRUE))
gwqs_weights_tab(results)
mix_name | mean_weight |
---|---|
LBX138LA | 1.65e-01 |
LBXF07LA | 1.63e-01 |
LBXD02LA | 1.34e-01 |
LBX105LA | 9.77e-02 |
LBXF06LA | 5.02e-02 |
LBX118LA | 4.79e-02 |
LBX153LA | 4.21e-02 |
LBX180LA | 3.81e-02 |
LBX157LA | 3.80e-02 |
LBXF05LA | 3.78e-02 |
LBXD01LA | 3.05e-02 |
LBX167LA | 2.75e-02 |
LBX194LA | 2.48e-02 |
LBXF03LA | 1.91e-02 |
LBXF08LA | 1.58e-02 |
LBX099LA | 1.46e-02 |
LBX156LA | 1.36e-02 |
LBX187LA | 1.14e-02 |
LBX170LA | 1.05e-02 |
LBX199LA | 5.97e-03 |
LBXF02LA | 5.40e-03 |
LBX189LA | 5.28e-03 |
LBXPCBLA | 1.46e-03 |
LBX074LA | 2.83e-04 |
LBXTCDLA | 1.84e-05 |
LBXF09LA | 0.00e+00 |
LBXD03LA | 0.00e+00 |
LBXHXCLA | 0.00e+00 |
LBXD05LA | 0.00e+00 |
LBXD07LA | 0.00e+00 |
LBXF04LA | 0.00e+00 |
LBXF01LA | 0.00e+00 |
LBXD04LA | 0.00e+00 |
LBX196LA | 0.00e+00 |
final_weight <- results$final_weights
final_weight[, -1] <- signif(final_weight[, -1], 3)
kable_styling(kable(final_weight, row.names = FALSE))
The gwqs
function gives back other outputs like the vector of the values that indicate whether the solver has converged (0) or not (1) (results$conv
), the matrix with all the estimated weights and the associated \(\beta_1\), standard errors, statistics and p-values for each bootstrap sample (results$bres
), the vector of the estimated wqs
index (results$wqs
), the list of vectors containing the cutoffs used to determine the quantiles of each variable in the mixture (results$qi
), the list of vectors containing the rows of the subjects included in each bootstrap dataset (results$bindex
), the rows identifying the subjects used to estimate the weights in each bootstrap (results$tindex
), the rows identifying the subjects used to estimate the parameters of the final model (results$vindex
), the vector of the values of the objective function at the optima parameter estimates obtained at each bootstrap step (results$objfn_values
) and any messages from the optim
function (results$optim_messages
).
The following script allows to reproduce the figures that are automatically generated using the plots functions:
# bar plot
w_ord <- order(results$final_weights$mean_weight)
mean_weight <- results$final_weights$mean_weight[w_ord]
mix_name <- factor(results$final_weights$mix_name[w_ord],
levels = results$final_weights$mix_name[w_ord])
data_plot <- data.frame(mean_weight, mix_name)
ggplot(data_plot, aes(x = mix_name, y = mean_weight)) +
geom_bar(stat = "identity", color = "black") + theme_bw() +
theme(axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text.x = element_text(color='black'),
legend.position = "none") + coord_flip() +
geom_hline(yintercept = 1/length(PCBs), linetype="dashed", color = "red")
#
# scatter plot y vs wqs
ggplot(results$y_wqs_df, aes(wqs, y_adj)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, size = 1.5) + theme_bw()
#
# scatter plot residuals vs fitted values
fit_df <- data.frame(fitted = fitted(results),
resid = residuals(results, type = "response"))
ggplot(fit_df, aes(x = fitted, y = resid)) + geom_point() +
theme_bw() + xlab("Fitted values") + ylab("Residuals")
Carrico C, Gennings C, Wheeler D, Factor-Litvak P. Characterization of a weighted quantile sum regression for highly correlated data in a risk analysis setting. J Agricul Biol Environ Stat. 2014:1-21. ISSN: 1085-7117. DOI: 10.1007/ s13253-014-0180-3. http://dx.doi.org/10.1007/s13253-014-0180-3.
Czarnota J, Gennings C, Colt JS, De Roos AJ, Cerhan JR, Severson RK, Hartge P, Ward MH, Wheeler D. 2015. Analysis of environmental chemical mixtures and non-Hodgkin lymphoma risk in the NCI-SEER NHL study. Environmental Health Perspectives.
Czarnota J, Gennings C, Wheeler D. 2015. Assessment of weighted quantile sum regression for modeling chemical mixtures and cancer risk. Cancer Informatics, 2015:14(S2) 159-171.
Curtin P, Kellogg J, Cech N, and Gennings C. A random subset implementation of weighted quantile sum (wqsrs) regression for analysis of high-dimensional mixtures. Communications in Statistics - Simulation and Computation, 0(0):1–16, 2019. doi: 10.1080/03610918.2019.1577971.
Tanner EM, Bornehag CG, and Gennings C. Repeated holdout validation for weighted quantile sum regression. MethodsX, 6:2855 – 2860, 2019. doi: https://doi.org/10.1016/j.mex.2019.11.008.
This package was developed at the CHEAR Data Center (Dept. of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai) with funding and support from NIEHS (U2C ES026555-01) with additional support from the Empire State Development Corporation.