This package aims at complementing the party package with parallelization and interpretation tools. Among other things, it provides functions for parallelized conditional random forests and variable importances, as well as interpretation tools such as surrogate trees and prototypes.
To install the package from GitHub, execute the following code within R:
if (!require(devtools)) {
  install.packages('devtools')
  library(devtools)
}
install_github("nicolas-robette/moreparty")
To begin with, let’s create a binary classification problem from the iris
data set.
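The code that builds this data set is not shown here. A minimal reconstruction, assuming the target simply indicates whether the species is versicolor (which is consistent with the TRUE/FALSE groups in the prototypes below), could be:

iris2 = iris
iris2$Species = factor(iris$Species == "versicolor")  # assumed recoding: TRUE = versicolor, FALSE = other species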
Now we can fit a conditional inference forest to the data. We use the doParallel package for parallelization, here with 2 cores. The syntax of the fastcforest function is exactly the same as that of cforest from the party package, with an additional option for parallelization.
set.seed(123)
library(moreparty)
library(doParallel)
registerDoParallel(cores=2)  # parallel backend with 2 cores
iris.cf = fastcforest(Species~., data=iris2, parallel=TRUE)  # parallelized conditional forest
stopImplicitCluster()  # release the workers
We may now compute the variable importances.
registerDoParallel(cores=2)
vi = fastvarImp(iris.cf, measure='ACC', parallel=TRUE)  # accuracy-based permutation importance
stopImplicitCluster()
rev(sort(vi))  # from most to least important
## Petal.Width Sepal.Width Petal.Length Sepal.Length
## 0.18247273 0.14970909 0.06509091 0.01960000
Petal width and sepal width seem notably more important than the other two variables.
A surrogate tree is a simple tree that tries to approximate a more complex (and less interpretable) model, such as a random forest, by being fitted to that model's predictions.
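The call that produced the value printed below is not shown. As a rough sketch of the general technique (not necessarily the package's own helper), one could fit a single ctree from party to the forest's predicted probabilities and measure the share of their variance it reproduces; the names df.surro, p.cf and surro.tree are only illustrative:

# Sketch only: approximate the forest's predicted probability of 'TRUE'
# (i.e. versicolor) with a single regression tree and compute its R2.
df.surro = iris2[, names(iris2) != "Species"]
df.surro$p.cf = sapply(predict(iris.cf, type="prob"), function(p) p[2])
surro.tree = party::ctree(p.cf ~ ., data=df.surro)
p.tree = as.numeric(predict(surro.tree))
1 - sum((df.surro$p.cf - p.tree)^2) / sum((df.surro$p.cf - mean(df.surro$p.cf))^2)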
## [1] 0.9269712
This surrogate tree approximates our forest's predictions fairly well, but not perfectly (R2 ≈ 0.93), so it should still be interpreted with some caution.
Prototypes are ‘representative’ cases of a group of data points, here versicolor vs non-versicolor species, according to the proximity matrix derived from the forest.
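The call producing the prototypes below is likewise not reproduced. Assuming it relied on the package's Prototypes() helper together with party's proximity() function, a plausible (but unverified) reconstruction would be:

# Assumed call, for illustration only: derive the proximity matrix from the
# forest, then look for prototypical observations of each class.
prox = proximity(iris.cf)
Prototypes(iris2$Species, iris2[, names(iris2) != "Species"], prox)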
## $`FALSE`
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] "6.75" "3.25" "5.7" "2.3"
## [2,] "5.05" "3.5" "1.45" "0.2"
## [3,] "5.1" "3.45" "1.5" "0.2"
##
## $`TRUE`
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] "5.8" "2.8" "4.25" "1.3"
## [2,] "6.3" "2.95" "4.5" "1.45"
## [3,] "6.2" "2.95" "4.5" "1.4"
The prototypes of the versicolor species all have a sepal length of about 6, a sepal width of about 3, a petal length of about 4.5 and a petal width of about 1.4. The prototypes of the non-versicolor species are more heterogeneous, in particular in terms of petal length and width.