The mcvis package provides functions for detecting multi-collinearity (also known as collinearity) in linear regression. In simple terms, the mcvis method graphically identifies the variables that contribute most strongly to collinearity.
Suppose that, in a simple scenario, one predictor \(X_1\) is almost linearly dependent on two other predictors \(X_2\) and \(X_3\), so that \(X_1\) is strongly correlated with both. This dependence among the three variables is sufficient to cause collinearity, which shows up as large variances of the estimated model parameters in linear regression. We illustrate this with a simple example:
## Simulating some data
set.seed(1)
p = 6
n = 100

X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)

y = rnorm(n)
summary(lm(y ~ X))
#>
#> Call:
#> lm(formula = y ~ X)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -2.56042 -0.73579 -0.05585  0.86967  2.20334
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.02084    0.11157   0.187    0.852
#> X1           10.14768   10.34285   0.981    0.329
#> X2          -10.08175   10.33068  -0.976    0.332
#> X3          -10.30688   10.34038  -0.997    0.321
#> X4            0.04175    0.11321   0.369    0.713
#> X5            0.07191    0.09563   0.752    0.454
#> X6           -0.16951    0.11482  -1.476    0.143
#>
#> Residual standard error: 1.094 on 93 degrees of freedom
#> Multiple R-squared: 0.06683, Adjusted R-squared: 0.006628
#> F-statistic: 1.11 on 6 and 93 DF, p-value: 0.3625
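The inflated standard errors of X1, X2 and X3 above can also be quantified with variance inflation factors (VIFs), a classical diagnostic that is separate from the mcvis method. The sketch below re-simulates the same data and uses only base R; it relies on the standard identity that the VIFs are the diagonal entries of the inverse correlation matrix. The names `vif` and `r2_1` are our own and are not part of mcvis.

```r
## Re-simulate the predictors from the example above (same seed, same steps)
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)
colnames(X) = paste0("X", 1:p)

## VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
## column j on the remaining columns; equivalently, the j-th
## diagonal entry of the inverse correlation matrix
vif = diag(solve(cor(X)))
round(vif, 1)

## Cross-check the first entry against the regression definition
r2_1 = summary(lm(X[,1] ~ X[,-1]))$r.squared
all.equal(unname(vif[1]), 1 / (1 - r2_1))
```

Very large VIFs for the first three columns, and values close to one for the rest, tell the same story as the standard errors in the regression summary.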
The mcvis method highlights the major collinearity-causing variables on a bipartite graph. There are three major components of this graph:
+ The top row renders the “tau” statistics. By default, only one tau statistic is shown (\(\tau_p\), where \(p\) is the number of predictors). This tau statistic measures the extent of collinearity in the data and relates to the eigenvalues of the correlation matrix of \(X\).
+ The bottom row renders all original predictors.
+ The two rows are linked through the MC-indices that we have developed, which are represented as lines of different shades and thickness. Darker lines correspond to larger values of the MC-index and indicate the predictors that contribute more to causing collinearity.
If you are interested in how the tau statistics and the resampling-based MC-index are calculated, see our paper: Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics.
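To give a rough feel for the link between collinearity and the eigenvalues of the correlation matrix, the following base-R sketch inspects the eigen-decomposition for the simulated data. It is not the paper's definition of the tau statistic or of the MC-index; it only illustrates that a near-zero eigenvalue signals collinearity and that the matching eigenvector loads on the variables involved. The object names `eig` and `loadings_sq` are illustrative.

```r
## Re-simulate the predictors from the first example (same seed, same steps)
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)
colnames(X) = paste0("X", 1:p)

## Eigen-decomposition of the correlation matrix;
## eigen() returns the eigenvalues in decreasing order
eig = eigen(cor(X), symmetric = TRUE)
eig$values

## Squared loadings of the eigenvector paired with the smallest eigenvalue:
## they sum to one and show which predictors form the near-dependence
loadings_sq = setNames(eig$vectors[, p]^2, colnames(X))
round(loadings_sq, 2)
```

In this simulation the smallest eigenvalue is close to zero and the squared loadings concentrate on X1, X2 and X3, the three variables that were constructed to be linearly dependent.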
library(mcvis)
mcvis_result = mcvis(X = X)
mcvis_result
#>        X1   X2   X3 X4 X5 X6
#> tau6 0.51 0.25 0.24  0  0  0
This matrix of MC-indices is the main numeric output of mcvis, and our visualisation techniques focus on visualising this matrix. Below is a bipartite graph visualising the last row of this matrix, which is the main visualisation plot of mcvis.
plot(mcvis_result)
We also provide an igraph version of the mcvis bipartite graph.
plot(mcvis_result, type = "igraph")
In practice, high correlation between variables is not a necessary criterion for collinearity. In the mplot package (Tarr et al. 2018), a simulated data set was created in which many of the columns are linear combinations of other columns plus noise. In this case, the cause of the collinearity is not clear from the correlation matrix.
The mcvis visualisation plot identifies the 8th variable (x8) as the main cause of collinearity in this data, with the largest MC-index (0.32, closely followed by x6 and x3). Consulting the data-generating code for this simulation, we see that x8 is a linear combination of all the other predictor variables (plus noise). This knowledge can provide important guidance when interpreting model selection results.
## Simulation taken from the `mplot` package.
## Generating data with multi-collinearity.
n=50
set.seed(8) # a seed of 2 also works
x1 = rnorm(n,0.22,2)
x7 = 0.5*x1 + rnorm(n,0,sd=2)
x6 = -0.75*x1 + rnorm(n,0,3)
x3 = -0.5-0.5*x6 + rnorm(n,0,2)
x9 = rnorm(n,0.6,3.5)
x4 = 0.5*x9 + rnorm(n,0,sd=3)
x2 = -0.5 + 0.5*x9 + rnorm(n,0,sd=2)
x5 = -0.5*x2+0.5*x3+0.5*x6-0.5*x9+rnorm(n,0,1.5)
x8 = x1 + x2 -2*x3 - 0.3*x4 + x5 - 1.6*x6 - 1*x7 + x9 +rnorm(n,0,0.5)
y = 0.6*x8 + rnorm(n,0,2)
artificialeg = round(data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9,y),1)
X = artificialeg[,1:9]
round(cor(X), 2)
#>       x1    x2    x3    x4    x5    x6    x7    x8    x9
#> x1  1.00  0.00  0.14 -0.07 -0.02 -0.37  0.46  0.36 -0.22
#> x2  0.00  1.00  0.31  0.30 -0.60  0.00 -0.29  0.24  0.53
#> x3  0.14  0.31  1.00  0.04 -0.28 -0.66 -0.08 -0.01  0.13
#> x4 -0.07  0.30  0.04  1.00 -0.48  0.01  0.02 -0.07  0.62
#> x5 -0.02 -0.60 -0.28 -0.48  1.00  0.38  0.17 -0.30 -0.75
#> x6 -0.37  0.00 -0.66  0.01  0.38  1.00  0.02 -0.50 -0.08
#> x7  0.46 -0.29 -0.08  0.02  0.17  0.02  1.00 -0.43 -0.29
#> x8  0.36  0.24 -0.01 -0.07 -0.30 -0.50 -0.43  1.00  0.27
#> x9 -0.22  0.53  0.13  0.62 -0.75 -0.08 -0.29  0.27  1.00
mcvis_result = mcvis(X)
mcvis_result
#>        x1   x2   x3 x4   x5   x6   x7   x8   x9
#> tau9 0.01 0.01 0.29  0 0.03 0.31 0.02 0.32 0.02
plot(mcvis_result)
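The point that pairwise correlations can hide collinearity can be checked directly with base R, independently of mcvis. The sketch below re-creates the predictors from the simulation above (the response y is omitted because it is generated last and does not affect them) and compares the largest absolute pairwise correlation involving x8 with the R-squared from regressing x8 on all other predictors. The names `cors_x8`, `max_abs_cor` and `r2_x8` are our own.

```r
## Re-create the predictors from the mplot simulation above
n = 50
set.seed(8)
x1 = rnorm(n,0.22,2)
x7 = 0.5*x1 + rnorm(n,0,sd=2)
x6 = -0.75*x1 + rnorm(n,0,3)
x3 = -0.5-0.5*x6 + rnorm(n,0,2)
x9 = rnorm(n,0.6,3.5)
x4 = 0.5*x9 + rnorm(n,0,sd=3)
x2 = -0.5 + 0.5*x9 + rnorm(n,0,sd=2)
x5 = -0.5*x2+0.5*x3+0.5*x6-0.5*x9+rnorm(n,0,1.5)
x8 = x1 + x2 -2*x3 - 0.3*x4 + x5 - 1.6*x6 - 1*x7 + x9 +rnorm(n,0,0.5)
X = round(data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9),1)

## Largest absolute pairwise correlation involving x8
cors_x8 = cor(X)["x8", ]
max_abs_cor = max(abs(cors_x8[names(cors_x8) != "x8"]))

## R-squared from regressing x8 on all other predictors
r2_x8 = summary(lm(x8 ~ ., data = X))$r.squared

c(max_abs_cor = max_abs_cor, r2_x8 = r2_x8)
```

No single pairwise correlation with x8 is strong, yet the other eight predictors jointly explain almost all of its variance, which is exactly the situation a correlation matrix cannot reveal.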
We also offer a Shiny app implementation of mcvis in our package. Suppose that we have an mcvis_result object stored in the memory of R. You can simply call the function shiny_mcvis to load up a Shiny app.
class(mcvis_result)
#> [1] "mcvis"
shiny_mcvis(mcvis_result)
Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics, In Press. URL: https://doi.org/10.1080/10618600.2020.1779729
Tarr, G., Mueller, S., & Welsh, A. H. (2018). mplot: An R Package for Graphical Model Stability and Variable Selection Procedures. Journal of Statistical Software, 83(9), 1-28. URL: https://doi.org/10.18637/jss.v083.i09
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mcvis_1.0.8
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.1 xfun_0.24 bslib_0.2.5.1 purrr_0.3.4
#> [5] reshape2_1.4.4 lattice_0.20-41 colorspace_2.0-0 vctrs_0.3.8
#> [9] generics_0.1.0 htmltools_0.5.1.1 yaml_2.2.1 utf8_1.2.2
#> [13] rlang_0.4.11 jquerylib_0.1.4 pillar_1.6.1 later_1.2.0
#> [17] glue_1.4.2 DBI_1.1.1 RColorBrewer_1.1-2 lifecycle_1.0.0
#> [21] plyr_1.8.6 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
#> [25] psych_2.1.3 evaluate_0.14 labeling_0.4.2 knitr_1.33
#> [29] fastmap_1.1.0 httpuv_1.6.1 parallel_4.0.3 fansi_0.5.0
#> [33] highr_0.9 Rcpp_1.0.6 xtable_1.8-4 scales_1.1.1
#> [37] promises_1.2.0.1 jsonlite_1.7.2 tmvnsim_1.0-2 farver_2.1.0
#> [41] mime_0.11 mnormt_2.0.2 ggplot2_3.3.3 digest_0.6.27
#> [45] stringi_1.7.3 dplyr_1.0.6.9000 shiny_1.6.0 grid_4.0.3
#> [49] tools_4.0.3 magrittr_2.0.1 sass_0.4.0 tibble_3.1.3
#> [53] crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2 assertthat_0.2.1
#> [57] rmarkdown_2.9 R6_2.5.0 igraph_1.2.6 nlme_3.1-152
#> [61] compiler_4.0.3