mcvis: Multi-collinearity Visualization

Kevin Wang, Chen Lin and Samuel Mueller

2020-06-03

Introduction

The mcvis package provides functions for detecting multi-collinearity (also known as collinearity) in linear regression. In simple terms, the mcvis method shows graphically which variables contribute most strongly to collinearity.

Basic usage

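Suppose that we have a simple scenario in which one predictor is almost an exact linear combination of two others. In the simulation below, X1 is set to X2 + X3 plus a small amount of noise. Such a near-linear dependence is a sufficient cause of collinearity, which shows up as large variances (and hence large standard errors) of the estimated model parameters in linear regression.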

## Simulating some data
set.seed(1)
p = 6
n = 100

X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)

y = rnorm(n)
summary(lm(y ~ X))
#> 
#> Call:
#> lm(formula = y ~ X)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.56042 -0.73579 -0.05585  0.86967  2.20334 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.02084    0.11157   0.187    0.852
#> X1           10.14768   10.34285   0.981    0.329
#> X2          -10.08175   10.33068  -0.976    0.332
#> X3          -10.30688   10.34038  -0.997    0.321
#> X4            0.04175    0.11321   0.369    0.713
#> X5            0.07191    0.09563   0.752    0.454
#> X6           -0.16951    0.11482  -1.476    0.143
#> 
#> Residual standard error: 1.094 on 93 degrees of freedom
#> Multiple R-squared:  0.06683,    Adjusted R-squared:  0.006628 
#> F-statistic:  1.11 on 6 and 93 DF,  p-value: 0.3625
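The inflated standard errors of X1, X2 and X3 in the summary above can be cross-checked with variance inflation factors (VIFs), a standard diagnostic that is not part of mcvis. The following is a minimal sketch in base R, assuming the same simulation as above; the name `vif` is introduced here for illustration only.

```r
## Re-create the simulated predictors from above
set.seed(1)
p = 6
n = 100
X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)

## VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
## column j on all the other columns
vif = sapply(seq_len(p), function(j) {
  r2 = summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif) = paste0("X", seq_len(p))
round(vif, 1)
```

The first three VIFs should be very large while the remaining three stay close to 1, which matches the pattern of standard errors in the regression output.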

The mcvis method highlights the major collinearity-causing variables on a bipartite graph. There are three major components of this graph:

+ The top row shows the “tau” statistics, which measure the extent of collinearity in the data. By default, only one tau statistic is shown.
+ The bottom row shows the original variables.
+ The two rows are linked through the MC-indices that we have developed, which are represented as lines of different shades. A darker line implies a larger MC-index value and a stronger contribution to collinearity.

If you are interested in how the MC-index is calculated, see our paper: Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics, In Press.

library(mcvis)
mcvis_result = mcvis(X = X)

plot(mcvis_result)


mcvis_result
#>       X01  X02  X03  X04  X05  X06
#> tau1 0.49 0.26 0.20 0.00 0.02 0.02
#> tau2 0.37 0.21 0.15 0.17 0.05 0.05
#> tau3 0.41 0.20 0.24 0.05 0.06 0.04
#> tau4 0.15 0.14 0.06 0.23 0.22 0.20
#> tau5 0.04 0.01 0.02 0.46 0.28 0.20
#> tau6 0.51 0.25 0.24 0.00 0.00 0.00
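One way to read the printed matrix is to take a single tau row and rank the variables by their MC-index. The sketch below copies the tau6 row by hand from the output above rather than extracting it from the mcvis object. The cut-off of 0.2 is an arbitrary value chosen for illustration, not a threshold recommended by the package.

```r
## MC-index values for tau6, copied from the printed output above
tau6 = c(X01 = 0.51, X02 = 0.25, X03 = 0.24,
         X04 = 0.00, X05 = 0.00, X06 = 0.00)

## Rank the variables from the largest to the smallest MC-index
ranked = sort(tau6, decreasing = TRUE)
ranked

## Variables above an illustrative (hypothetical) cut-off of 0.2
flagged = names(tau6)[tau6 > 0.2]
flagged
```

Here the flagged variables are X01, X02 and X03, the three columns involved in the simulated linear dependence.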

We also provide a ggplot version of the mcvis graph.

library(ggplot2)
ggplot(mcvis_result)

(Extension) Why not just look at the correlation matrix?

In practice, high correlation between variables is not a necessary condition for collinearity. The mplot package includes a simulated data set in which each column is a linear combination of other columns. In this case, the cause of the collinearity is not clear from the correlation matrix.

The mcvis plot identifies the 8th variable as the main cause of collinearity in this data. Checking how the data was generated in this simulation, we see that x8 is a linear combination of all other predictor variables.

library(mplot)
data("artificialeg")
X = artificialeg[,1:9]
round(cor(X), 2)
#>       x1    x2    x3    x4    x5    x6    x7    x8    x9
#> x1  1.00  0.00  0.14 -0.07 -0.02 -0.37  0.46  0.36 -0.22
#> x2  0.00  1.00  0.31  0.30 -0.60  0.00 -0.29  0.24  0.53
#> x3  0.14  0.31  1.00  0.04 -0.28 -0.66 -0.08 -0.01  0.13
#> x4 -0.07  0.30  0.04  1.00 -0.48  0.01  0.02 -0.07  0.62
#> x5 -0.02 -0.60 -0.28 -0.48  1.00  0.38  0.17 -0.30 -0.75
#> x6 -0.37  0.00 -0.66  0.01  0.38  1.00  0.02 -0.50 -0.08
#> x7  0.46 -0.29 -0.08  0.02  0.17  0.02  1.00 -0.43 -0.29
#> x8  0.36  0.24 -0.01 -0.07 -0.30 -0.50 -0.43  1.00  0.27
#> x9 -0.22  0.53  0.13  0.62 -0.75 -0.08 -0.29  0.27  1.00

mcvis_result = mcvis(X)
plot(mcvis_result)

ggplot(mcvis_result)
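The same point can be illustrated without the mplot data. The sketch below is a separate, hypothetical simulation (not the way artificialeg was generated): eight independent predictors plus a ninth that is their sum with a little noise. The names `X_sim`, `max_abs_cor` and `vif9` are introduced here for illustration only.

```r
## Eight independent predictors and a ninth that is (almost) their sum
set.seed(2)
n = 200
Z = matrix(rnorm(n * 8), ncol = 8)
x9 = rowSums(Z) + rnorm(n, 0, 0.01)
X_sim = cbind(Z, x9)
colnames(X_sim) = paste0("x", 1:9)

## Largest absolute pairwise correlation: nothing looks alarming
R = cor(X_sim)
max_abs_cor = max(abs(R[upper.tri(R)]))
round(max_abs_cor, 2)

## Yet x9 is almost perfectly explained by the other columns jointly
r2 = summary(lm(X_sim[, 9] ~ X_sim[, -9]))$r.squared
vif9 = 1 / (1 - r2)
vif9
```

No pairwise correlation is close to 1, yet the variance inflation factor of x9 is enormous. The dependence only becomes visible when all columns are considered jointly.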

Shiny implementation

We also offer a Shiny app implementation of mcvis in our package. Suppose that you have a mcvis_result object in your R session. You can simply call the function shiny_mcvis to launch the Shiny app.

class(mcvis_result)
#> [1] "mcvis"
shiny_mcvis(mcvis_result)

Session Info

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] mplot_1.0.4   ggplot2_3.3.1 mcvis_1.0.4  
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0     xfun_0.13            purrr_0.3.4         
#>  [4] reshape2_1.4.4       lattice_0.20-41      colorspace_1.4-1    
#>  [7] vctrs_0.3.0          generics_0.0.2       htmltools_0.4.0     
#> [10] yaml_2.2.1           rlang_0.4.6          later_1.0.0         
#> [13] pillar_1.4.4         glue_1.4.1           withr_2.2.0         
#> [16] RColorBrewer_1.1-2   rngtools_1.5         doRNG_1.8.2         
#> [19] foreach_1.5.0        lifecycle_0.2.0      plyr_1.8.6          
#> [22] stringr_1.4.0        munsell_0.5.0        gtable_0.3.0        
#> [25] codetools_0.2-16     psych_1.9.12.31      evaluate_0.14       
#> [28] labeling_0.3         knitr_1.28           fastmap_1.0.1       
#> [31] httpuv_1.5.2         parallel_3.6.2       Rcpp_1.0.4.11       
#> [34] xtable_1.8-4         promises_1.1.0       scales_1.1.1        
#> [37] mime_0.9             farver_2.0.3         mnormt_1.5-6        
#> [40] digest_0.6.25        stringi_1.4.6        dplyr_1.0.0         
#> [43] shiny_1.4.0.2        grid_3.6.2           tools_3.6.2         
#> [46] magrittr_1.5         tibble_3.0.1         crayon_1.3.4        
#> [49] pkgconfig_2.0.3      ellipsis_0.3.1       shinydashboard_0.7.1
#> [52] assertthat_0.2.1     rmarkdown_2.1        iterators_1.0.12    
#> [55] R6_2.4.1             igraph_1.2.5         nlme_3.1-147        
#> [58] compiler_3.6.2