Descriptive and Graphical Analysis of the Variable and Cutpoint Selection inside Random Forests

Lennart Schneider, Achim Zeileis, Carolin Strobl

2020-04-17

Random forests are a widely used ensemble learning method for classification and regression tasks. However, they are typically used as a black-box prediction method that offers little insight into their inner workings.

In this vignette, we illustrate how the stablelearner package can be used to gain insight into this black box by visualizing and summarizing the variable and cutpoint selection of the trees within a random forest.

Recall that, in simple terms, a random forest is an ensemble of trees that is grown by resampling the training data and refitting a tree on each resampled dataset. In contrast to bagging, random forests restrict the number of feature variables randomly sampled as split candidates at each node of a tree (in implementations, this is typically called the mtry argument) to be smaller than the total number of feature variables available. Random forests were introduced by Breiman (2001).

The stablelearner package was originally designed to provide functionality for assessing the stability of tree learners and other supervised statistical learners, both visually (Philipp, Zeileis, and Strobl 2016) and by means of computing similarity measures (Philipp et al. 2018), on the basis of repeatedly resampling the training data and refitting the learner.

However, in this vignette we are interested in visualizing the variable and cutpoint selection of the trees within a random forest. Therefore, contrary to the original design of the stablelearner package, where the aim was to assess the stability of a single original tree, we are not interested in highlighting any single tree, because there simply is no original tree in a random forest and all trees should be treated as equal. As a result, some functions will later require setting the argument original = FALSE. Moreover, this vignette does not cover similarity measures for random forests, which are still work in progress.

In all sections of this vignette, we are going to work with credit data where applicants are rated as "good" or "bad", which will be introduced in Section 1.

In Section 2 we will cover the stablelearner package and how to fit a random forest using the stabletree() function (Section 2.1). In Section 2.2 we show how to summarize and visualize the variable and cutpoint selection of the trees of a random forest.

In the final Section 3, we will demonstrate how the same summary and visualizations can be produced when working with random forests that were already fitted via the cforest() function of the partykit package, the cforest() function of the party package, the randomForest() function of the randomForest package, or the ranger() function of the ranger package.

Note that in the following, functions will be specified with the double colon notation indicating the package they belong to, e.g., stablelearner::stabletree() denotes the stabletree() function from the stablelearner package.

1 Data

In all sections we are going to work with the GermanCredit dataset, which is included in the evtree package:

data("GermanCredit", package = "evtree")

The dataset consists of 1000 observations on 21 variables. For a full description of all variables, see ?evtree::GermanCredit. The random forests we are going to fit in this vignette predict whether a person is classified as "good" or "bad" with respect to the credit_risk variable, using all other available variables as feature variables. To reduce the runtime, we only use a subsample of the data (500 persons):

set.seed(2409)
dat <- droplevels(GermanCredit[sample(1000, size = 500), ])
str(dat)
## 'data.frame':    500 obs. of  21 variables:
##  $ status                 : Factor w/ 4 levels "... < 0 DM","0 <= ... < 200 D"..
##  $ duration               : num  11 12 36 24 27 6 12 12 21 21 ...
##  $ credit_history         : Factor w/ 5 levels "no credits taken/all credits "..
##  $ purpose                : Factor w/ 10 levels "car (new)","car (used)",..: 1..
##  $ amount                 : num  1322 2214 2302 2670 8318 ...
##  $ savings                : Factor w/ 5 levels "... < 100 DM",..: 4 1 1 1 1 5 ..
##  $ employment_duration    : Ord.factor w/ 5 levels "unemployed"<"... < 1 year"..
##  $ installment_rate       : num  4 4 4 4 2 1 2 3 2 1 ...
##  $ personal_status_sex    : Factor w/ 4 levels "male : divorced/separated",..:..
##  $ other_debtors          : Factor w/ 3 levels "none","co-applicant",..: 1 1 1..
##  $ present_residence      : num  4 3 4 4 4 3 1 2 3 4 ...
##  $ property               : Factor w/ 4 levels "real estate",..: 3 2 3 3 4 1 1..
##  $ age                    : num  40 24 31 35 42 44 25 55 35 47 ...
##  $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3..
##  $ housing                : Factor w/ 3 levels "rent","own","for free": 2 2 1 ..
##  $ number_credits         : num  2 1 1 1 2 2 1 1 1 2 ...
##  $ job                    : Factor w/ 4 levels "unemployed/unskilled - non-re"..
##  $ people_liable          : num  1 1 1 1 1 2 1 1 1 1 ...
##  $ telephone              : Factor w/ 2 levels "no","yes": 1 1 1 2 2 1 2 1 2 1..
##  $ foreign_worker         : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 2 1 1..
##  $ credit_risk            : Factor w/ 2 levels "good","bad": 1 1 2 1 2 1 2 1 1..

2 stablelearner

2.1 Growing a random forest in stablelearner

In our first approach, we want to grow a random forest directly in stablelearner. This is possible using conditional inference trees (Hothorn, Hornik, and Zeileis 2006) as base learners relying on the function ctree() of the partykit package. This procedure results in a forest similar to a random forest fitted via cforest() (see ?partykit::cforest).

To achieve this, we have to make sure that our initial ctree, which will be repeatedly refitted on the resampled data, is specified correctly with respect to the resampling method and the number of feature variables randomly sampled as candidates at each node of a tree (argument mtry). By default, partykit::cforest() uses subsampling with a fraction of 0.632 and sets mtry = ceiling(sqrt(nvar)). In our GermanCredit example, this would be 5, as the dataset includes 20 feature variables. Note that setting mtry equal to the total number of feature variables available would result in bagging. In a real analysis, mtry should be tuned by means of, e.g., cross-validation.
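To make this default explicit, the mtry value used below can be computed directly from the number of feature variables in dat (a small check; nvar is just a helper name introduced here):

nvar <- ncol(dat) - 1L  # 20 feature variables (excluding the response credit_risk)
ceiling(sqrt(nvar))     # default mtry used by partykit::cforest()
## [1] 5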

We now fit our initial tree, mimicking the defaults of partykit::cforest() (see ?partykit::cforest and ?partykit::ctree_control for a description of the arguments teststat, testtype, mincriterion and saveinfo). The formula credit_risk ~ . simply indicates that we use all remaining variables of dat as feature variables to predict the credit_risk of a person.

set.seed(2906)
ct_partykit <- partykit::ctree(credit_risk ~ ., data = dat,
  control = partykit::ctree_control(mtry = 5, teststat = "quadratic",
    testtype = "Univariate", mincriterion = 0, saveinfo = FALSE))

We can now proceed to grow our forest based on this initial tree, using stablelearner::stabletree(). We use subsampling with a fraction of v = 0.632 and grow B = 100 trees. We set savetrees = TRUE to be able to extract the individual trees later:

set.seed(2907)
cf_stablelearner <- stablelearner::stabletree(ct_partykit, sampler = stablelearner::subsampling,
  savetrees = TRUE, B = 100, v = 0.632)

Internally, stablelearner::stabletree() does the following: For each of the 100 trees to be generated, the dataset is resampled according to the resampling method specified (in our case subsampling with a fraction of v = 0.632), and the function call of our initial tree (which we labeled ct_partykit) is updated with respect to this resampled dataset and re-evaluated, resulting in a new tree. All 100 trees together then form the forest.
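To illustrate this resample-and-refit logic, here is a minimal sketch (not the actual package code; the helper refit_one() is hypothetical and only mimics the idea of re-evaluating the ctree() specification on a subsample):

refit_one <- function(data, v = 0.632, mtry = 5) {
  ## draw a subsample and refit the same ctree specification on it
  idx <- sample(nrow(data), size = floor(v * nrow(data)))
  partykit::ctree(credit_risk ~ ., data = data[idx, , drop = FALSE],
    control = partykit::ctree_control(mtry = mtry, teststat = "quadratic",
      testtype = "Univariate", mincriterion = 0, saveinfo = FALSE))
}
set.seed(1)
mini_forest <- lapply(1:3, function(b) refit_one(dat))  # a toy "forest" of 3 trees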

2.2 Gaining insight into the forest

The following summary prints the variable selection frequency (freq) as well as the average number of splits in each variable (mean) over all 100 trees. As we do not want to focus on our initial tree (remember that we just grew a forest, where all trees are of equal interest), we set original = FALSE as already mentioned in the introduction:

summary(cf_stablelearner, original = FALSE)
## 
## Call:
## partykit::ctree(formula = credit_risk ~ ., data = dat, control = partykit::ctree_control(mtry = 5, 
##     teststat = "quadratic", testtype = "Univariate", mincriterion = 0, 
##     saveinfo = FALSE))
## 
## Sampler:
## B = 100 
## Method = Subsampling with 63.2% data
## 
## Variable selection overview:
## 
##                         freq mean
## status                  1.00 2.79
## duration                0.91 1.90
## credit_history          0.90 1.68
## employment_duration     0.88 1.66
## savings                 0.79 1.09
## age                     0.77 1.33
## purpose                 0.76 1.27
## installment_rate        0.73 1.23
## housing                 0.73 1.09
## property                0.70 1.05
## job                     0.68 1.01
## telephone               0.67 0.99
## personal_status_sex     0.65 0.92
## amount                  0.62 0.89
## present_residence       0.62 0.94
## other_installment_plans 0.59 0.70
## number_credits          0.57 0.79
## other_debtors           0.52 0.61
## people_liable           0.24 0.26
## foreign_worker          0.06 0.06

For example, looking at the status variable (status of the existing checking account of a person), we see that this variable was selected in all 100 trees (freq = 1.00). Moreover, this variable was often selected more than once for a split, as the average number of splits is almost 2.8.

Plotting the variable selection frequency is achieved via the following (note that cex.names allows us to specify the relative font size of the x-axis labels):

barplot(cf_stablelearner, original = FALSE, cex.names = 0.7)

To get a more detailed view, we can also inspect the variable selections partitioned by tree. The following plot shows, for each of the 100 trees within the forest, whether a variable was selected (colored in dark grey); the variables are ordered on the x-axis so that the top-ranking ones come first:

image(cf_stablelearner, original = FALSE, cex.names = 0.7)

This may allow for interesting observations, e.g., we observe that whenever duration was not selected, both credit_history and employment_duration were almost always selected as splitting variables.

Finally, the plot() function allows us to inspect the cutpoints and resulting partitions for each variable over all 100 trees. Here we focus on the variables status, employment_duration, and duration:

plot(cf_stablelearner, original = FALSE,
  select = c("status", "employment_duration", "duration"))

Looking at the variable status (an unordered categorical variable), we are given a so-called image plot visualizing the partitions of this variable. We observe that the most frequent partition is "... < 0 DM" and "0 <= ... < 200 DM" vs. "... >= 200 DM / salary for at least 1 year" and "no checking account". The light gray color is used when a category was no longer represented by the observations left for partitioning in the particular node.

For ordered categorical variables such as employment_duration, a barplot is given showing the frequency of all possible cutpoints sorted on the x-axis in their natural order. Here, the cutpoint between 1 <= ... < 4 years and 4 <= ... < 7 years is selected more than 80 times.

Lastly, for numerical variables a histogram is given, showing the distribution of cutpoints. We observe that most cutpoints for the variable duration occurred between 0 and 30; however, there appears to be considerable variance.

For a more detailed explanation of the different kinds of plots, Section 3 of Philipp, Zeileis, and Strobl (2016) is very helpful.

In conclusion, the summary and the different plots helped us gain better insight into the variable and cutpoint selection of the 100 trees within our forest. Finally, in case we want to extract individual trees, e.g., the first tree, we can do this via:

cf_stablelearner$tree[[1]]
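Since each of these trees was obtained by re-evaluating the ctree() call on a subsample, the extracted element should be an ordinary partykit tree object, so standard methods such as print(), plot(), and predict() can be applied. A small usage sketch:

## predict the credit risk of the first five persons with the first tree
predict(cf_stablelearner$tree[[1]], newdata = dat[1:5, ])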

From a technical and performance perspective, there is little reason to grow a forest directly in stablelearner, as the cforest() implementations in partykit and especially in party are more efficient. Nevertheless, growing a forest directly in stablelearner allows for more flexibility with respect to, e.g., the resampling method, as we could specify any sampler we want, e.g., bootstrap, subsampling, samplesplitting, jackknife, splithalf, or even custom samplers (see the sketch below). For a discussion of why subsampling should be preferred over bootstrap sampling, see Strobl et al. (2007).
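For illustration, a forest based on bootstrap resampling instead of subsampling could be grown analogously to the call in Section 2.1 (a sketch; bootstrap is one of the samplers shipped with stablelearner, see ?stablelearner::bootstrap for the exact interface):

set.seed(2912)
cf_boot <- stablelearner::stabletree(ct_partykit,
  sampler = stablelearner::bootstrap, savetrees = TRUE, B = 100)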

3 Working with random forests fitted via other packages

In this final section we cover how to work with random forests that have already been fitted via the cforest() function of the partykit package, the cforest() function of the party package, the randomForest() function of the randomForest package, or the ranger() function of the ranger package.

Essentially, we just fit the random forest and then use stablelearner::as.stabletree() to coerce the forest to a stabletree object, which allows us to produce the same summaries and plots as presented above.

Fitting a cforest with 100 trees using partykit is straightforward:

set.seed(2908)
cf_partykit <- partykit::cforest(credit_risk ~ ., data = dat,
  ntree = 100, mtry = 5)

stablelearner::as.stabletree() then allows us to coerce this cforest and we can produce summaries and plots as above (note that for plotting, we can now omit original = FALSE as we use a coerced forest):

cf_partykit_st <- stablelearner::as.stabletree(cf_partykit)
summary(cf_partykit_st, original = FALSE)
barplot(cf_partykit_st, cex.names = 0.7)
image(cf_partykit_st, cex.names = 0.7)
plot(cf_partykit_st, select = c("status", "employment_duration", "duration"))

We do not observe substantial differences compared to growing the forest directly in stablelearner (of course, this is the expected behavior, because we tried to mimic the algorithm of partykit::cforest() in the previous section).

The procedure described above is analogous for forests fitted via party::cforest():

set.seed(2909)
cf_party <- party::cforest(credit_risk ~ ., data = dat,
  control = party::cforest_unbiased(ntree = 100, mtry = 5))
cf_party_st <- stablelearner::as.stabletree(cf_party)
summary(cf_party_st, original = FALSE)
barplot(cf_party_st, cex.names = 0.7)
image(cf_party_st, cex.names = 0.7)
plot(cf_party_st, select = c("status", "employment_duration", "duration"))

Again, we do not observe substantial differences compared to partykit::cforest(). This is the expected behavior, as partykit::cforest() is a (pure R) reimplementation of party::cforest() (C implementation).

For forests fitted via randomForest::randomForest(), we can again do the same as above. However, as these forests do not use conditional inference trees as base learners, we can expect some differences with respect to the results:

set.seed(2910)
rf <- randomForest::randomForest(credit_risk ~ ., data = dat,
  ntree = 100, mtry = 5)
rf_st <- stablelearner::as.stabletree(rf)
summary(rf_st, original = FALSE)
## 
## Call:
## randomForest(formula = credit_risk ~ ., data = dat, ntree = 100, 
##     mtry = 5)
## 
## Sampler:
## B = 100 
## Method = randomForest::randomForest
## 
## Variable selection overview:
## 
##                         freq  mean
## status                  1.00  4.81
## duration                1.00  7.23
## purpose                 1.00  7.69
## amount                  1.00 10.58
## savings                 1.00  4.54
## employment_duration     1.00  5.11
## personal_status_sex     1.00  3.98
## present_residence       1.00  4.07
## age                     1.00  9.44
## credit_history          0.99  4.21
## property                0.99  4.63
## installment_rate        0.97  4.12
## job                     0.97  3.29
## housing                 0.96  2.54
## other_installment_plans 0.94  2.36
## number_credits          0.91  2.22
## other_debtors           0.86  1.84
## telephone               0.85  1.62
## people_liable           0.78  1.44
## foreign_worker          0.27  0.27
barplot(rf_st, cex.names = 0.7)

image(rf_st, cex.names = 0.7)

plot(rf_st, select = c("status", "employment_duration", "duration"))

We observe that for numerical variables the average number of splits is much higher, e.g., amount is split around 10 times per tree on average. This reflects a known drawback of Breiman and Cutler's original random forest algorithm: its split selection is biased towards variables with many possible cutpoints, such as numerical variables. Random forests based on conditional inference trees do not share this bias. For more details, see Hothorn, Hornik, and Zeileis (2006), Strobl et al. (2007), and Strobl, Malley, and Tutz (2009).

Finally, for forests fitted via ranger::ranger(), the procedure is again the same:

set.seed(2911)
rf_ranger <- ranger::ranger(credit_risk ~ ., data = dat,
  num.trees = 100, mtry = 5)
rf_ranger_st <- stablelearner::as.stabletree(rf_ranger)
summary(rf_ranger_st, original = FALSE)
barplot(rf_ranger_st, cex.names = 0.7)
image(rf_ranger_st, cex.names = 0.7)
plot(rf_ranger_st, select = c("status", "employment_duration", "duration"))

As a final comment on performance, note that, just like stablelearner::stabletree(), stablelearner::as.stabletree() allows for parallel computation (see the arguments applyfun and cores). This may be helpful when coercing large random forests.
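For example, the coercion of the randomForest fit from above could be parallelized as follows (a sketch, assuming a machine with at least two cores available):

rf_st_parallel <- stablelearner::as.stabletree(rf, cores = 2)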

References

Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. doi:10.1023/a:1010933404324.

Hothorn, T., K. Hornik, and A. Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3): 651–74. doi:10.1198/106186006x133933.

Philipp, M., T. Rusch, K. Hornik, and C. Strobl. 2018. “Measuring the Stability of Results from Supervised Statistical Learning.” Journal of Computational and Graphical Statistics 27 (4): 685–700. doi:10.1080/10618600.2018.1473779.

Philipp, M., A. Zeileis, and C. Strobl. 2016. “A Toolkit for Stability Assessment of Tree-Based Learners.” In Proceedings of COMPSTAT 2016 – 22nd International Conference on Computational Statistics, edited by A. Colubi, A. Blanco, and C. Gatu, 315–25. The International Statistical Institute/International Association for Statistical Computing.

Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (25). doi:10.1186/1471-2105-8-25.

Strobl, C., J. Malley, and G. Tutz. 2009. “An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests.” Psychological Methods 14 (4): 323–48. doi:10.1037/a0016973.