The mcvis package provides functions for detecting multi-collinearity (also known as collinearity) in linear regression. In simple terms, the mcvis method graphically identifies the variables that contribute most strongly to collinearity.
Suppose that we have a simple scenario in which some predictors are highly correlated: below, X1 is almost exactly the sum of X2 and X3. Such high correlation is sufficient to cause collinearity, which shows up as large variances of the estimated model parameters in linear regression.
## Simulating some data
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)
y = rnorm(n)
summary(lm(y ~ X))
#>
#> Call:
#> lm(formula = y ~ X)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -2.56042 -0.73579 -0.05585  0.86967  2.20334
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.02084    0.11157   0.187    0.852
#> X1           10.14768   10.34285   0.981    0.329
#> X2          -10.08175   10.33068  -0.976    0.332
#> X3          -10.30688   10.34038  -0.997    0.321
#> X4            0.04175    0.11321   0.369    0.713
#> X5            0.07191    0.09563   0.752    0.454
#> X6           -0.16951    0.11482  -1.476    0.143
#>
#> Residual standard error: 1.094 on 93 degrees of freedom
#> Multiple R-squared: 0.06683, Adjusted R-squared: 0.006628
#> F-statistic: 1.11 on 6 and 93 DF, p-value: 0.3625
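One way to quantify the inflated standard errors of X1, X2 and X3 in the output above is the variance inflation factor (VIF). The following is a minimal sketch in base R only; it is not part of mcvis, the names X_sim and vif are illustrative, and it relies on the standard identity that the VIFs are the diagonal elements of the inverse correlation matrix of the predictors.

```r
# Re-create the simulated design matrix from the example above
set.seed(1)
p = 6
n = 100
X_sim = matrix(rnorm(n * p), ncol = p)
X_sim[, 1] = X_sim[, 2] + X_sim[, 3] + rnorm(n, 0, 0.01)
colnames(X_sim) = paste0("X", 1:p)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element
# of the inverse of the predictor correlation matrix
vif = diag(solve(cor(X_sim)))
round(vif, 1)
```

Under this construction, the VIFs of X1, X2 and X3 should be several orders of magnitude larger than those of X4 to X6, which should stay close to 1.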
The mcvis method highlights the major collinearity-causing variables on a bipartite graph. There are three major components of this graph:
+ the top row shows the “tau” statistics, which measure the extent of collinearity in the data. By default, only one tau statistic is shown.
+ the bottom row shows the original variables.
+ the two rows are linked through the MC-indices that we have developed, which are represented as lines of different shades. A darker line implies a larger MC-index value and hence a stronger contribution to collinearity.
If you are interested in how the MC-index is calculated, see our paper: Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics, In Press.
library(mcvis)
mcvis_result = mcvis(X)
mcvis_result
#>       X01  X02  X03  X04  X05  X06
#> tau1 0.49 0.26 0.20 0.00 0.02 0.02
#> tau2 0.37 0.21 0.15 0.17 0.05 0.05
#> tau3 0.41 0.20 0.24 0.05 0.06 0.04
#> tau4 0.15 0.14 0.06 0.23 0.22 0.20
#> tau5 0.04 0.01 0.02 0.46 0.28 0.20
#> tau6 0.51 0.25 0.24 0.00 0.00 0.00
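The printed MC-index matrix can be read row by row: within a row, larger values point to the variables most strongly linked to that tau statistic. As a small illustration, the sketch below copies the rounded values from the output above into a plain matrix (the name mc_index is hypothetical, and this is not how the mcvis object itself is structured) and ranks the variables for tau1.

```r
# Rounded MC-index values copied from the printed output above
mc_index = matrix(
  c(0.49, 0.26, 0.20, 0.00, 0.02, 0.02,
    0.37, 0.21, 0.15, 0.17, 0.05, 0.05,
    0.41, 0.20, 0.24, 0.05, 0.06, 0.04,
    0.15, 0.14, 0.06, 0.23, 0.22, 0.20,
    0.04, 0.01, 0.02, 0.46, 0.28, 0.20,
    0.51, 0.25, 0.24, 0.00, 0.00, 0.00),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("tau", 1:6), sprintf("X%02d", 1:6))
)

# Rank the variables by their MC-index for the first tau statistic
tau1_ranked = sort(mc_index["tau1", ], decreasing = TRUE)
names(tau1_ranked)[1:3]
#> [1] "X01" "X02" "X03"
```

The three top-ranked variables are exactly those involved in the simulated relationship, where X1 was constructed from X2 and X3.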
We also provide a ggplot version of the mcvis graph.
In practice, high correlation between variables is not a necessary criterion for collinearity. The mplot package contains a simulated data set in which each column is a linear combination of other columns. In this case, the cause of the collinearity is not clear from the correlation matrix.
The mcvis visualisation identifies the 8th variable as the main cause of collinearity in this data. Consulting the data generation process of this simulation, we see that x8 is a linear combination of all other predictor variables.
library(mplot)
data("artificialeg")
X = artificialeg[,1:9]
round(cor(X), 2)
#>       x1    x2    x3    x4    x5    x6    x7    x8    x9
#> x1  1.00  0.00  0.14 -0.07 -0.02 -0.37  0.46  0.36 -0.22
#> x2  0.00  1.00  0.31  0.30 -0.60  0.00 -0.29  0.24  0.53
#> x3  0.14  0.31  1.00  0.04 -0.28 -0.66 -0.08 -0.01  0.13
#> x4 -0.07  0.30  0.04  1.00 -0.48  0.01  0.02 -0.07  0.62
#> x5 -0.02 -0.60 -0.28 -0.48  1.00  0.38  0.17 -0.30 -0.75
#> x6 -0.37  0.00 -0.66  0.01  0.38  1.00  0.02 -0.50 -0.08
#> x7  0.46 -0.29 -0.08  0.02  0.17  0.02  1.00 -0.43 -0.29
#> x8  0.36  0.24 -0.01 -0.07 -0.30 -0.50 -0.43  1.00  0.27
#> x9 -0.22  0.53  0.13  0.62 -0.75 -0.08 -0.29  0.27  1.00
mcvis_result = mcvis(X)
plot(mcvis_result)
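To illustrate why a correlation matrix can miss this kind of collinearity, here is a small self-contained sketch in base R. It does not use the artificialeg data; instead it simulates a hypothetical design Z whose last column is almost exactly the sum of eight independent columns, and compares the largest pairwise correlation with the smallest eigenvalue of the correlation matrix. All names in it are illustrative.

```r
# Hypothetical design: z9 is (almost) the sum of z1, ..., z8
set.seed(2)
n = 500
Z = matrix(rnorm(n * 8), ncol = 8)
Z = cbind(Z, rowSums(Z) + rnorm(n, 0, 0.01))
colnames(Z) = paste0("z", 1:9)

R = cor(Z)

# Largest absolute pairwise correlation between distinct columns
max_abs_cor = max(abs(R[upper.tri(R)]))

# An eigenvalue near zero signals a near-exact linear dependence
eig = eigen(R, symmetric = TRUE)
min_eigenvalue = min(eig$values)

round(max_abs_cor, 2)
signif(min_eigenvalue, 2)

# Loadings of the corresponding eigenvector show which columns are involved
round(eig$vectors[, which.min(eig$values)], 2)
```

In this sketch no single pairwise correlation should look alarming, yet the smallest eigenvalue is close to zero and its eigenvector loads on all nine columns. This is the situation in which a method that looks beyond pairwise correlations becomes useful.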
We also offer a Shiny app implementation of mcvis in our package. Suppose that we have an mcvis_result object stored in the memory of R. You can simply call the function shiny_mcvis to load up a Shiny app.
sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mplot_1.0.4 ggplot2_3.3.1 mcvis_1.0.4
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.0 xfun_0.13 purrr_0.3.4
#> [4] reshape2_1.4.4 lattice_0.20-41 colorspace_1.4-1
#> [7] vctrs_0.3.0 generics_0.0.2 htmltools_0.4.0
#> [10] yaml_2.2.1 rlang_0.4.6 later_1.0.0
#> [13] pillar_1.4.4 glue_1.4.1 withr_2.2.0
#> [16] RColorBrewer_1.1-2 rngtools_1.5 doRNG_1.8.2
#> [19] foreach_1.5.0 lifecycle_0.2.0 plyr_1.8.6
#> [22] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
#> [25] codetools_0.2-16 psych_1.9.12.31 evaluate_0.14
#> [28] labeling_0.3 knitr_1.28 fastmap_1.0.1
#> [31] httpuv_1.5.2 parallel_3.6.2 Rcpp_1.0.4.11
#> [34] xtable_1.8-4 promises_1.1.0 scales_1.1.1
#> [37] mime_0.9 farver_2.0.3 mnormt_1.5-6
#> [40] digest_0.6.25 stringi_1.4.6 dplyr_1.0.0
#> [43] shiny_1.4.0.2 grid_3.6.2 tools_3.6.2
#> [46] magrittr_1.5 tibble_3.0.1 crayon_1.3.4
#> [49] pkgconfig_2.0.3 ellipsis_0.3.1 shinydashboard_0.7.1
#> [52] assertthat_0.2.1 rmarkdown_2.1 iterators_1.0.12
#> [55] R6_2.4.1 igraph_1.2.5 nlme_3.1-147
#> [58] compiler_3.6.2