Elastic Net

Alexia Jolicoeur-Martineau

2022-01-11

Note that you can cite this work as:

Jolicoeur-Martineau, A., Wazana, A., Szekely, E., Steiner, M., Fleming, A. S., Kennedy, J. L., Meaney M. J. & Greenwood, C.M. (2017). Alternating optimization for GxE modelling with weighted genetic and environmental scores: examples from the MAVAN study. arXiv preprint arXiv:1703.08111.

Jolicoeur-Martineau, A., Belsky, J., Szekely, E., Widaman, K. F., Pluess, M., Greenwood, C., & Wazana, A. (2017). Distinguishing differential susceptibility, diathesis-stress and vantage sensitivity: beyond the single gene and environment model. arXiv preprint arXiv:1712.04058.

Elastic Net

From version 1.3.0 of the LEGIT package, we introduce a function to do variable selection with elastic net within the alternating optimization framework of LEGIT. Elastic net is a regression model with a penalty term (\(\lambda\)) which penalize parameters so that they don’t become too big. As \(\lambda\) becomes bigger, certain parameters become zero which means that their corresponding variables are dropped from the model. The order in which variables are removed from the model can be interpreted as the reversed order of variable importance. Variables that are less important are removed early (when \(\lambda\) is small) and variables that are more important are removed later (only when \(\lambda\) is large). Please research “lasso” and “elastic net” if you are interested in the details of this method.

In this package, we implement Elastic Net for use on the exogenous variables inside the latent variables (e.g., in a LEGIT model, these are the genetic variants inside \(G\) and the environmental variables inside \(E\)). We also give the option to only apply Elastic Net on certain latent variables (e.g., only searching for genes in \(G\)).

Elastic Net gives us two main benefits:

  1. the order of importance of each variable
  2. Given a criterion (AIC, BIC, cross-validation \(R^2\)), it can be used to automatically chose the best model very quickly (only comparing \(p\) models, where \(p\) is the number of variables, as opposed to \(2^p\) models).

It is very fast and it works much better than other approaches; we highly recommend using it. With Elastic Net, the task of choosing the genes and environments of a LEGIT model can be fully automatized.

Example

Let’s take a quick look at a two-way example with continuous outcome:

\[\mathbf{g}_j \sim Binomial(n=1,p=.30) \\ j = 1, 2, 3, 4\] \[\mathbf{e}_l \sim Normal(\mu=0,\sigma=1.5) \\ l = 1, 2, 3\] \[\mathbf{g} = .2\mathbf{g}_1 + .15\mathbf{g}_2 - .3\mathbf{g}_3 + .1\mathbf{g}_4 + .05\mathbf{g}_1\mathbf{g}_3 + .2\mathbf{g}_2\mathbf{g}_3 \] \[ \mathbf{e} = -.45\mathbf{e}_1 + .35\mathbf{e}_2 + .2\mathbf{e}_3\] \[y = -1 + 2\mathbf{g} + 3\mathbf{e} + 4\mathbf{ge} + \epsilon \] where \(\epsilon \sim Normal(0,.5)\).

This is a standard GxE model.

set.seed(1)
library(LEGIT)
## Loading required package: formula.tools
N = 500
train = example_2way(N, sigma=.5, logit=FALSE, seed=1)

Now we will add 5 genes which are irrelevant. We expect Elastic Net to delete them first. However, note that \(\mathbf{g}_1\mathbf{g}_3\) has a very small parameter (\(.05\)) so it’s possible that it will be removed early. Let’s add the irrelevant genes and try it out.

g1_bad = rbinom(N,1,.30)
g2_bad = rbinom(N,1,.30)
g3_bad = rbinom(N,1,.30)
g4_bad = rbinom(N,1,.30)
g5_bad = rbinom(N,1,.30)
train$G = cbind(train$G, g1_bad, g2_bad, g3_bad, g4_bad, g5_bad)
lv = list(G=train$G, E=train$E)
# Elastic Net
fit = elastic_net_var_select(train$data, lv, y ~ G*E)
summary(fit)
##            Lambda Model index       AIC      AICc       BIC g1 g2 g3 g4 g1_g3
##  [1,] 0.684846307           1 1302.6229 1302.8505 1332.1252  1  0  0  0     0
##  [2,] 0.568571435           3 1077.8936 1078.1869 1111.6105  1  0  1  0     0
##  [3,] 0.357079432           8  991.7229  992.0903 1029.6544  1  0  1  1     0
##  [4,] 0.296453617          10  799.7562  800.2061  841.9023  1  1  1  1     0
##  [5,] 0.140839486          18  801.6849  802.2259  848.0456  1  1  1  1     0
##  [6,] 0.116927415          20  762.6244  763.2650  813.1997  1  1  1  1     0
##  [7,] 0.097075195          22  764.2622  765.0112  819.0521  1  1  1  1     0
##  [8,] 0.088451302          23  766.2378  767.1038  825.2423  1  1  1  1     1
##  [9,] 0.060966051          27  767.5665  768.5582  830.7856  1  1  1  1     1
## [10,] 0.003105695          59  769.3589  770.4852  836.7927  1  1  1  1     1
## [11,] 0.002578402          61  769.3213  770.5910  840.9696  1  1  1  1     1
##       g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad e1 e2 e3
##  [1,]     0      0      0      0      0      0  1  1  1
##  [2,]     0      0      0      0      0      0  1  1  1
##  [3,]     0      0      0      0      0      0  1  1  1
##  [4,]     0      0      0      0      0      0  1  1  1
##  [5,]     0      0      0      0      0      1  1  1  1
##  [6,]     1      0      0      0      0      1  1  1  1
##  [7,]     1      0      1      0      0      1  1  1  1
##  [8,]     1      0      1      0      0      1  1  1  1
##  [9,]     1      1      1      0      0      1  1  1  1
## [10,]     1      1      1      1      0      1  1  1  1
## [11,]     1      1      1      1      1      1  1  1  1

We see that the order of variable importance is almost correct with the exception of \(\mathbf{g}_1\mathbf{g}_3\). The model with the lowest BIC and AIC is the one without the irrelevant genes and \(\mathbf{g}_1\mathbf{g}_3\).

Rather than looking in the summary, one can simply grab the best model using the function “best_model”.

best_model(fit, criterion="BIC")
## $results
##      AIC     AICc      BIC 
## 762.6244 763.2650 813.1997 
## 
## $fit
## $fit_main
## 
## Call:  stats::glm(formula = formula, family = family, data = data, model = FALSE, 
##     y = FALSE)
## 
## Coefficients:
## (Intercept)            G            E          G:E  
##     -0.8848       0.7765       5.0907       2.5355  
## 
## Degrees of Freedom: 499 Total (i.e. Null);  496 Residual
## Null Deviance:       5295 
## Residual Deviance: 128.2     AIC: 748.6
## 
## $fit_latent_var
## $fit_latent_var$G
## 
## Call:  stats::glm(formula = formula_step[[i]], family = family, data = data, 
##     model = FALSE, y = FALSE)
## 
## Coefficients:
##        g1         g2         g3         g4      g2_g3     g5_bad  
##  0.223638   0.167358  -0.345294   0.130949   0.129199   0.003562  
## 
## Degrees of Freedom: 500 Total (i.e. Null);  494 Residual
## Null Deviance:       493.2 
## Residual Deviance: 128.2     AIC: 752.6
## 
## $fit_latent_var$E
## 
## Call:  stats::glm(formula = formula_step[[i]], family = family, data = data, 
##     model = FALSE, y = FALSE)
## 
## Coefficients:
##      e1       e2       e3  
## -0.4337   0.3651   0.2012  
## 
## Degrees of Freedom: 500 Total (i.e. Null);  497 Residual
## Null Deviance:       5105 
## Residual Deviance: 128.2     AIC: 746.6
## 
## 
## $true_model_parameters
## $true_model_parameters$AIC
## [1] 762.6244
## 
## $true_model_parameters$AICc
## [1] 763.265
## 
## $true_model_parameters$BIC
## [1] 813.1997
## 
## $true_model_parameters$rank
## [1] 11
## 
## $true_model_parameters$df.residual
## [1] 489
## 
## $true_model_parameters$null.deviance
## [1] 5295.306
## 
## 
## attr(,"class")
## [1] "IMLEGIT"
## 
## $coef
##     g1     g2     g3     g4  g1_g3  g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad 
##      1      1      1      1      0      1      0      0      0      0      1 
##     e1     e2     e3 
##      1      1      1 
## 
## $lambda
## [1] 0.1065399
## 
## $index
## [1] 21

We can also plot the coefficients of the variables over different values of \(\lambda\).

{r}{r fig1, fig.height = 5, fig.width = 5} plot(fit)

Now, we might want to look at the model with the highest cross-validation \(R^2\) (or equivalently lowest cross-validation error). We can do this easily.

fit = elastic_net_var_select(train$data, lv, y ~ G*E, cross_validation=TRUE, cv_iter=5, cv_folds=10)
summary(fit)
##            Lambda Model index       AIC      AICc       BIC     cv_R2  cv_Huber
##  [1,] 0.684846307           1 1302.6229 1302.8505 1332.1252 0.9247671 0.3606985
##  [2,] 0.568571435           3 1077.8936 1078.1869 1111.6105 0.9519509 0.2451501
##  [3,] 0.357079432           8  991.7229  992.0903 1029.6544 0.9597983 0.2096525
##  [4,] 0.296453617          10  799.7562  800.2061  841.9023 0.9725635 0.1451953
##  [5,] 0.140839486          18  801.6849  802.2259  848.0456 0.9723316 0.1464219
##  [6,] 0.116927415          20  762.6244  763.2650  813.1997 0.9745380 0.1347863
##  [7,] 0.097075195          22  764.2622  765.0112  819.0521 0.9745899 0.1345163
##  [8,] 0.088451302          23  766.2378  767.1038  825.2423 0.9742795 0.1361545
##  [9,] 0.060966051          27  767.5665  768.5582  830.7856 0.9743323 0.1358815
## [10,] 0.003105695          59  769.3589  770.4852  836.7927 0.9740971 0.1371124
## [11,] 0.002578402          61  769.3213  770.5910  840.9696 0.9742876 0.1361074
##           cv_L1 g1 g2 g3 g4 g1_g3 g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad e1
##  [1,] 0.6587195  1  0  0  0     0     0      0      0      0      0      0  1
##  [2,] 0.5515774  1  0  1  0     0     0      0      0      0      0      0  1
##  [3,] 0.5116525  1  0  1  1     0     0      0      0      0      0      0  1
##  [4,] 0.4269982  1  1  1  1     0     0      0      0      0      0      0  1
##  [5,] 0.4288122  1  1  1  1     0     0      0      0      0      0      1  1
##  [6,] 0.4086760  1  1  1  1     0     1      0      0      0      0      1  1
##  [7,] 0.4085227  1  1  1  1     0     1      0      1      0      0      1  1
##  [8,] 0.4126709  1  1  1  1     1     1      0      1      0      0      1  1
##  [9,] 0.4109779  1  1  1  1     1     1      1      1      0      0      1  1
## [10,] 0.4138107  1  1  1  1     1     1      1      1      1      0      1  1
## [11,] 0.4119581  1  1  1  1     1     1      1      1      1      1      1  1
##       e2 e3
##  [1,]  1  1
##  [2,]  1  1
##  [3,]  1  1
##  [4,]  1  1
##  [5,]  1  1
##  [6,]  1  1
##  [7,]  1  1
##  [8,]  1  1
##  [9,]  1  1
## [10,]  1  1
## [11,]  1  1
best_model(fit, criterion="cv_R2")
## $results
##         AIC        AICc         BIC       cv_R2    cv_Huber       cv_L1 
## 764.2622105 765.0111817 819.0521158   0.9745899   0.1345163   0.4085227 
## 
## $fit
## $fit_main
## 
## Call:  stats::glm(formula = formula, family = family, data = data, model = FALSE, 
##     y = FALSE)
## 
## Coefficients:
## (Intercept)            G            E          G:E  
##     -0.8856       0.7812       5.0903       2.5543  
## 
## Degrees of Freedom: 499 Total (i.e. Null);  496 Residual
## Null Deviance:       5295 
## Residual Deviance: 128.2     AIC: 748.3
## 
## $fit_latent_var
## $fit_latent_var$G
## 
## Call:  stats::glm(formula = formula_step[[i]], family = family, data = data, 
##     model = FALSE, y = FALSE)
## 
## Coefficients:
##        g1         g2         g3         g4      g2_g3     g2_bad     g5_bad  
##  0.223408   0.166512  -0.341788   0.129530   0.127745   0.008002   0.003015  
## 
## Degrees of Freedom: 500 Total (i.e. Null);  493 Residual
## Null Deviance:       493.3 
## Residual Deviance: 128.2     AIC: 754.3
## 
## $fit_latent_var$E
## 
## Call:  stats::glm(formula = formula_step[[i]], family = family, data = data, 
##     model = FALSE, y = FALSE)
## 
## Coefficients:
##      e1       e2       e3  
## -0.4337   0.3650   0.2013  
## 
## Degrees of Freedom: 500 Total (i.e. Null);  497 Residual
## Null Deviance:       5105 
## Residual Deviance: 128.2     AIC: 746.3
## 
## 
## $true_model_parameters
## $true_model_parameters$AIC
## [1] 764.2622
## 
## $true_model_parameters$AICc
## [1] 765.0112
## 
## $true_model_parameters$BIC
## [1] 819.0521
## 
## $true_model_parameters$rank
## [1] 12
## 
## $true_model_parameters$df.residual
## [1] 488
## 
## $true_model_parameters$null.deviance
## [1] 5295.306
## 
## 
## attr(,"class")
## [1] "IMLEGIT"
## 
## $coef
##     g1     g2     g3     g4  g1_g3  g2_g3 g1_bad g2_bad g3_bad g4_bad g5_bad 
##      1      1      1      1      0      1      0      1      0      0      1 
##     e1     e2     e3 
##      1      1      1 
## 
## $lambda
## [1] 0.09707519
## 
## $index
## [1] 22

We see that there is little difference in cross-validation \(R^2\) for most models, so in this case, it is not particularly meaningful.

Let say that you do not want the model selected by “best_model”, but instead want the model with index 8 instead. You can simply do:

fit_mychoice = fit$fit[[8]]

Note that you can apply Elastic only on \(G\) or \(E\) if desired.

# Elastic net only applied on G
fit = elastic_net_var_select(train$data, lv, y ~ G*E, c(1))
# Elastic net only applied on E
fit = elastic_net_var_select(train$data, lv, y ~ G*E, c(2))

Finally, another thing to keep in mind is that the \(\lambda\) (penalty term) chosen may be badly chosen or may not be enough (if you have more than 100 variables, with the default of 100 \(\lambda\)’s, you will not see all variables being dropped one by one). There are ways to fix these issues.

If not all variables are dropped, even at high \(\lambda\), increase \(\lambda_{max}\) in the following way:

# Most E variables not removed, use lambda_mult > 1 to remove more
fit = elastic_net_var_select(train$data, lv, y ~ G*E, c(2), lambda_mult=5)

If you have too many variables and want more \(\lambda\)’s, do the following:

# Want more lambdas (useful if # of variables is large)
fit = elastic_net_var_select(train$data, lv, y ~ G*E, n_lambda = 200)