Example of global variable importance

Anna Kozak

2020-07-02


In this vignette, we present a global variable importance measure based on Partial Dependence Profiles (PDP) for a random forest regression model.

1 Dataset

We work on the apartments dataset from the DALEX package.
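The preview below shows the first rows of the data; a minimal way to reproduce it (we load DALEX here, since it provides both the data and the explain() function used later):

library("DALEX")
head(apartments)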

#>   m2.price construction.year surface floor no.rooms    district
#> 1     5897              1953      25     3        1 Srodmiescie
#> 2     1818              1992     143     9        5     Bielany
#> 3     3643              1937      56     1        2       Praga
#> 4     3517              1995      93     7        3      Ochota
#> 5     3013              1992     144     6        5     Mokotow
#> 6     5795              1926      61     6        2 Srodmiescie

2 Random forest regression model

Now we define a random forest regression model and use the explain() function from DALEX.

library("randomForest")
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
                                      no.rooms, data = apartments)
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[,2:5], y = apartmentsTest$m2.price)
#> Preparation of a new explainer is initiated
#>   -> model label       :  randomForest  (  default  )
#>   -> data              :  9000  rows  4  cols 
#>   -> target variable   :  9000  values 
#>   -> predict function  :  yhat.randomForest  will be used (  default  )
#>   -> predicted values  :  numerical, min =  2125.441 , mean =  3514.379 , max =  5326.192  
#>   -> model_info        :  package randomForest , ver. 4.6.14 , task regression (  default  ) 
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -1225.81 , mean =  -2.855209 , max =  2162.728  
#>   A new explainer has been created! 

3 Calculate Partial Dependence Profiles

Let's see the Partial Dependence Profiles calculated with the DALEX::model_profile() function. The PDP can also be calculated with DALEX::variable_profile() or ingredients::partial_dependence(); see the sketch after the plot.

profiles <- model_profile(explainer_rf)
plot(profiles) 
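A minimal sketch of the equivalent calls mentioned above (default arguments, output not shown; the object name profiles_alt is only illustrative):

profiles_alt <- ingredients::partial_dependence(explainer_rf)  # or: DALEX::variable_profile(explainer_rf)
plot(profiles_alt)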

4 Calculate measure of global variable importance

Now we calculate a measure of global variable importance via oscillations of the Partial Dependence Profiles.

library("vivo")
measure <- global_variable_importance(profiles)
plot(measure)

The most important variable is surface, followed by no.rooms, floor, and construction.year.
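To see the numeric values behind the plot, the measure object can simply be printed (output not shown here):

measure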

5 Comparison of the importance of variables for two or more models

Let's create a linear regression model and its explainer object.

apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
                                      no.rooms, data = apartments)
explainer_lm <- explain(apartments_lm_model,
                        data = apartmentsTest[,2:5], y = apartmentsTest$m2.price)
#> Preparation of a new explainer is initiated
#>   -> model label       :  lm  (  default  )
#>   -> data              :  9000  rows  4  cols 
#>   -> target variable   :  9000  values 
#>   -> predict function  :  yhat.lm  will be used (  default  )
#>   -> predicted values  :  numerical, min =  2231.8 , mean =  3507.346 , max =  4769.053  
#>   -> model_info        :  package stats , ver. 3.6.3 , task regression (  default  ) 
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -733.2516 , mean =  4.177813 , max =  2107.979  
#>   A new explainer has been created! 

We calculate the Partial Dependence Profiles and the importance measure for the linear model.

profiles_lm <- model_profile(explainer_lm)

measure_lm <- global_variable_importance(profiles_lm)
plot(measure_lm, measure, type = "lines")

Now we can compare the order of variable importance across the two models.
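The same comparison can also be drawn with the default bar layout by simply omitting the type argument; a minimal sketch, assuming the same defaults as in plot(measure) above:

plot(measure_lm, measure)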