Note that you can cite this work as:
Jolicoeur-Martineau, A., Wazana, A., Szekely, E., Steiner, M., Fleming, A. S., Kennedy, J. L., Meaney M. J. & Greenwood, C.M. (2017). Alternating optimization for GxE modelling with weighted genetic and environmental scores: examples from the MAVAN study. arXiv preprint arXiv:1703.08111.
Jolicoeur-Martineau, A., Belsky, J., Szekely, E., Widaman, K. F., Pluess, M., Greenwood, C., & Wazana, A. (2017). Distinguishing differential susceptibility, diathesis-stress and vantage sensitivity: beyond the single gene and environment model. arXiv preprint arXiv:1712.04058.
From version 1.3.0 of the LEGIT package, we introduce a function to do variable selection with elastic net within the alternating optimization framework of LEGIT. Elastic net is a regression model with a penalty term (\(\lambda\)) which penalize parameters so that they don’t become too big. As \(\lambda\) becomes bigger, certain parameters become zero which means that their corresponding variables are dropped from the model. The order in which variables are removed from the model can be interpreted as the reversed order of variable importance. Variables that are less important are removed early (when \(\lambda\) is small) and variables that are more important are removed later (only when \(\lambda\) is large). Please research “lasso” and “elastic net” if you are interested in the details of this method.
In this package, we implement Elastic Net for use on the exogenous variables inside the latent variables (e.g., in a LEGIT model, these are the genetic variants inside \(G\) and the environmental variables inside \(E\)). We also give the option to only apply Elastic Net on certain latent variables (e.g., only searching for genes in \(G\)).
Elastic Net gives us two main benefits:
It is very fast and it works much better than other approaches; we highly recommend using it. With Elastic Net, the task of choosing the genes and environments of a LEGIT model can be fully automatized.
Let’s take a quick look at a two-way example with continuous outcome:
\[\mathbf{g}_j \sim Binomial(n=1,p=.30) \\ j = 1, 2, 3, 4\] \[\mathbf{e}_l \sim Normal(\mu=0,\sigma=1.5) \\ l = 1, 2, 3\] \[\mathbf{g} = .2\mathbf{g}_1 + .15\mathbf{g}_2 - .3\mathbf{g}_3 + .1\mathbf{g}_4 + .05\mathbf{g}_1\mathbf{g}_3 + .2\mathbf{g}_2\mathbf{g}_3 \] \[ \mathbf{e} = -.45\mathbf{e}_1 + .35\mathbf{e}_2 + .2\mathbf{e}_3\] \[y = -1 + 2\mathbf{g} + 3\mathbf{e} + 4\mathbf{ge} + \epsilon \] where \(\epsilon \sim Normal(0,.5)\).
This is a standard GxE model.
set.seed(1)
library(LEGIT)
## Loading required package: formula.tools
= 500
N = example_2way(N, sigma=.5, logit=FALSE, seed=1) train
Now we will add 5 genes which are irrelevant. We expect Elastic Net to delete them first. However, note that \(\mathbf{g}_1\mathbf{g}_3\) has a very small parameter (\(.05\)) so it’s possible that it will be removed early. Let’s add the irrelevant genes and try it out.
= rbinom(N,1,.30)
g1_bad = rbinom(N,1,.30)
g2_bad = rbinom(N,1,.30)
g3_bad = rbinom(N,1,.30)
g4_bad = rbinom(N,1,.30)
g5_bad $G = cbind(train$G, g1_bad, g2_bad, g3_bad, g4_bad, g5_bad)
train= list(G=train$G, E=train$E)
lv # Elastic Net
= elastic_net_var_select(train$data, lv, y ~ G*E)
fit summary(fit)
## Lambda Model index AIC AICc BIC g1 g2 g3 g4 g1_g3
## [1,] 0.684846307 1 1302.6229 1302.8505 1332.1252 1 0 0 0 0
## [2,] 0.568571435 3 1077.8936 1078.1869 1111.6105 1 0 1 0 0
## [3,] 0.357079432 8 991.7229 992.0903 1029.6544 1 0 1 1 0
## [4,] 0.296453617 10 799.7562 800.2061 841.9023 1 1 1 1 0
## [5,] 0.140839486 18 801.6849 802.2259 848.0456 1 1 1 1 0
## [6,] 0.116927415 20 762.6244 763.2650 813.1997 1 1 1 1 0
## [7,] 0.097075195 22 764.2622 765.0112 819.0521 1 1 1 1 0
## [8,] 0.088451302 23 766.2378 767.1038 825.2423 1 1 1 1 1
## [9,] 0.060966051 27 767.5665 768.5582 830.7856 1 1 1 1 1
## [10,] 0.003105695 59 769.3589 770.4852 836.7927 1 1 1 1 1
## [11,] 0.002578402 61 769.3213 770.5910 840.9696 1 1 1 1 1
## g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad e1 e2 e3
## [1,] 0 0 0 0 0 0 1 1 1
## [2,] 0 0 0 0 0 0 1 1 1
## [3,] 0 0 0 0 0 0 1 1 1
## [4,] 0 0 0 0 0 0 1 1 1
## [5,] 0 0 0 0 0 1 1 1 1
## [6,] 1 0 0 0 0 1 1 1 1
## [7,] 1 0 1 0 0 1 1 1 1
## [8,] 1 0 1 0 0 1 1 1 1
## [9,] 1 1 1 0 0 1 1 1 1
## [10,] 1 1 1 1 0 1 1 1 1
## [11,] 1 1 1 1 1 1 1 1 1
We see that the order of variable importance is almost correct with the exception of \(\mathbf{g}_1\mathbf{g}_3\). The model with the lowest BIC and AIC is the one without the irrelevant genes and \(\mathbf{g}_1\mathbf{g}_3\).
Rather than looking in the summary, one can simply grab the best model using the function “best_model”.
best_model(fit, criterion="BIC")
## $results
## AIC AICc BIC
## 762.6244 763.2650 813.1997
##
## $fit
## $fit_main
##
## Call: stats::glm(formula = formula, family = family, data = data, model = FALSE,
## y = FALSE)
##
## Coefficients:
## (Intercept) G E G:E
## -0.8848 0.7765 5.0907 2.5355
##
## Degrees of Freedom: 499 Total (i.e. Null); 496 Residual
## Null Deviance: 5295
## Residual Deviance: 128.2 AIC: 748.6
##
## $fit_latent_var
## $fit_latent_var$G
##
## Call: stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients:
## g1 g2 g3 g4 g2_g3 g5_bad
## 0.223638 0.167358 -0.345294 0.130949 0.129199 0.003562
##
## Degrees of Freedom: 500 Total (i.e. Null); 494 Residual
## Null Deviance: 493.2
## Residual Deviance: 128.2 AIC: 752.6
##
## $fit_latent_var$E
##
## Call: stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients:
## e1 e2 e3
## -0.4337 0.3651 0.2012
##
## Degrees of Freedom: 500 Total (i.e. Null); 497 Residual
## Null Deviance: 5105
## Residual Deviance: 128.2 AIC: 746.6
##
##
## $true_model_parameters
## $true_model_parameters$AIC
## [1] 762.6244
##
## $true_model_parameters$AICc
## [1] 763.265
##
## $true_model_parameters$BIC
## [1] 813.1997
##
## $true_model_parameters$rank
## [1] 11
##
## $true_model_parameters$df.residual
## [1] 489
##
## $true_model_parameters$null.deviance
## [1] 5295.306
##
##
## attr(,"class")
## [1] "IMLEGIT"
##
## $coef
## g1 g2 g3 g4 g1_g3 g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad
## 1 1 1 1 0 1 0 0 0 0 1
## e1 e2 e3
## 1 1 1
##
## $lambda
## [1] 0.1065399
##
## $index
## [1] 21
We can also plot the coefficients of the variables over different values of \(\lambda\).
{r}{r fig1, fig.height = 5, fig.width = 5} plot(fit)
Now, we might want to look at the model with the highest cross-validation \(R^2\) (or equivalently lowest cross-validation error). We can do this easily.
= elastic_net_var_select(train$data, lv, y ~ G*E, cross_validation=TRUE, cv_iter=5, cv_folds=10)
fit summary(fit)
## Lambda Model index AIC AICc BIC cv_R2 cv_Huber
## [1,] 0.684846307 1 1302.6229 1302.8505 1332.1252 0.9247671 0.3606985
## [2,] 0.568571435 3 1077.8936 1078.1869 1111.6105 0.9519509 0.2451501
## [3,] 0.357079432 8 991.7229 992.0903 1029.6544 0.9597983 0.2096525
## [4,] 0.296453617 10 799.7562 800.2061 841.9023 0.9725635 0.1451953
## [5,] 0.140839486 18 801.6849 802.2259 848.0456 0.9723316 0.1464219
## [6,] 0.116927415 20 762.6244 763.2650 813.1997 0.9745380 0.1347863
## [7,] 0.097075195 22 764.2622 765.0112 819.0521 0.9745899 0.1345163
## [8,] 0.088451302 23 766.2378 767.1038 825.2423 0.9742795 0.1361545
## [9,] 0.060966051 27 767.5665 768.5582 830.7856 0.9743323 0.1358815
## [10,] 0.003105695 59 769.3589 770.4852 836.7927 0.9740971 0.1371124
## [11,] 0.002578402 61 769.3213 770.5910 840.9696 0.9742876 0.1361074
## cv_L1 g1 g2 g3 g4 g1_g3 g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad e1
## [1,] 0.6587195 1 0 0 0 0 0 0 0 0 0 0 1
## [2,] 0.5515774 1 0 1 0 0 0 0 0 0 0 0 1
## [3,] 0.5116525 1 0 1 1 0 0 0 0 0 0 0 1
## [4,] 0.4269982 1 1 1 1 0 0 0 0 0 0 0 1
## [5,] 0.4288122 1 1 1 1 0 0 0 0 0 0 1 1
## [6,] 0.4086760 1 1 1 1 0 1 0 0 0 0 1 1
## [7,] 0.4085227 1 1 1 1 0 1 0 1 0 0 1 1
## [8,] 0.4126709 1 1 1 1 1 1 0 1 0 0 1 1
## [9,] 0.4109779 1 1 1 1 1 1 1 1 0 0 1 1
## [10,] 0.4138107 1 1 1 1 1 1 1 1 1 0 1 1
## [11,] 0.4119581 1 1 1 1 1 1 1 1 1 1 1 1
## e2 e3
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
## [4,] 1 1
## [5,] 1 1
## [6,] 1 1
## [7,] 1 1
## [8,] 1 1
## [9,] 1 1
## [10,] 1 1
## [11,] 1 1
best_model(fit, criterion="cv_R2")
## $results
## AIC AICc BIC cv_R2 cv_Huber cv_L1
## 764.2622105 765.0111817 819.0521158 0.9745899 0.1345163 0.4085227
##
## $fit
## $fit_main
##
## Call: stats::glm(formula = formula, family = family, data = data, model = FALSE,
## y = FALSE)
##
## Coefficients:
## (Intercept) G E G:E
## -0.8856 0.7812 5.0903 2.5543
##
## Degrees of Freedom: 499 Total (i.e. Null); 496 Residual
## Null Deviance: 5295
## Residual Deviance: 128.2 AIC: 748.3
##
## $fit_latent_var
## $fit_latent_var$G
##
## Call: stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients:
## g1 g2 g3 g4 g2_g3 g2_bad g5_bad
## 0.223408 0.166512 -0.341788 0.129530 0.127745 0.008002 0.003015
##
## Degrees of Freedom: 500 Total (i.e. Null); 493 Residual
## Null Deviance: 493.3
## Residual Deviance: 128.2 AIC: 754.3
##
## $fit_latent_var$E
##
## Call: stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients:
## e1 e2 e3
## -0.4337 0.3650 0.2013
##
## Degrees of Freedom: 500 Total (i.e. Null); 497 Residual
## Null Deviance: 5105
## Residual Deviance: 128.2 AIC: 746.3
##
##
## $true_model_parameters
## $true_model_parameters$AIC
## [1] 764.2622
##
## $true_model_parameters$AICc
## [1] 765.0112
##
## $true_model_parameters$BIC
## [1] 819.0521
##
## $true_model_parameters$rank
## [1] 12
##
## $true_model_parameters$df.residual
## [1] 488
##
## $true_model_parameters$null.deviance
## [1] 5295.306
##
##
## attr(,"class")
## [1] "IMLEGIT"
##
## $coef
## g1 g2 g3 g4 g1_g3 g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad
## 1 1 1 1 0 1 0 1 0 0 1
## e1 e2 e3
## 1 1 1
##
## $lambda
## [1] 0.09707519
##
## $index
## [1] 22
We see that there is little difference in cross-validation \(R^2\) for most models, so in this case, it is not particularly meaningful.
Let say that you do not want the model selected by “best_model”, but instead want the model with index 8 instead. You can simply do:
= fit$fit[[8]] fit_mychoice
Note that you can apply Elastic only on \(G\) or \(E\) if desired.
# Elastic net only applied on G
= elastic_net_var_select(train$data, lv, y ~ G*E, c(1))
fit # Elastic net only applied on E
= elastic_net_var_select(train$data, lv, y ~ G*E, c(2)) fit
Finally, another thing to keep in mind is that the \(\lambda\) (penalty term) chosen may be badly chosen or may not be enough (if you have more than 100 variables, with the default of 100 \(\lambda\)’s, you will not see all variables being dropped one by one). There are ways to fix these issues.
If not all variables are dropped, even at high \(\lambda\), increase \(\lambda_{max}\) in the following way:
# Most E variables not removed, use lambda_mult > 1 to remove more
= elastic_net_var_select(train$data, lv, y ~ G*E, c(2), lambda_mult=5) fit
If you have too many variables and want more \(\lambda\)’s, do the following:
# Want more lambdas (useful if # of variables is large)
= elastic_net_var_select(train$data, lv, y ~ G*E, n_lambda = 200) fit