Selection of Model Parameters

Variable selection, also known as feature selection in machine learning, aims to identify a subset of predictors that simplifies the model. This can benefit model interpretation, shorten fitting time, and improve generalization by reducing overfitting.

There are many different methods, and the one that is appropriate for a given problem depends on the model type, the data, the objective, and the theoretical rationale.

The parameters package implements a helper, select_parameters(), that automatically picks a method deemed appropriate for the provided model, runs the variable selection, and returns the simplified model (from which you can also extract the selected formula and re-use it elsewhere).
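
In its simplest form, the workflow looks like the sketch below (the mtcars model is just a hypothetical example, not part of the analyses in this vignette): fit a full model, hand it to select_parameters(), and work with the simplified model it returns.

library(parameters)
library(magrittr)  # provides the %>% pipe used throughout this vignette

# Hypothetical example: fit a model with all predictors, then let
# select_parameters() pick a selection method and return the simplified model
full_model <- lm(mpg ~ ., data = mtcars)
simple_model <- select_parameters(full_model)
formula(simple_model)  # the selected formula, re-usable elsewhere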

Simple linear regression

Fit a powerful model

If you are familiar with R and its formula interface, you know that including a dot (.) in a formula stands for “all the remaining variables”. Fewer people know that you can just as easily add all the two-way interaction terms as well, using the .*. notation (equivalent to .^2).

Let’s try that with a linear regression predicting Sepal.Length, using the iris dataset included by default in R.

model <- lm(Sepal.Length ~ .*., data = iris)
summary(model)
#> 
#> Call:
#> lm(formula = Sepal.Length ~ . * ., data = iris)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5730 -0.2034  0.0035  0.1997  0.5702 
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)                     3.14635    1.40272    2.24    0.027 *
#> Sepal.Width                     0.48464    0.41852    1.16    0.249  
#> Petal.Length                   -0.50461    0.87589   -0.58    0.566  
#> Petal.Width                     3.53596    1.63832    2.16    0.033 *
#> Speciesversicolor              -2.81723    1.84173   -1.53    0.129  
#> Speciesvirginica               -6.98585    3.56973   -1.96    0.053 .
#> grp2                           -0.97614    0.78226   -1.25    0.214  
#> grp3                           -0.82032    0.75569   -1.09    0.280  
#> Sepal.Width:Petal.Length        0.16021    0.23508    0.68    0.497  
#> Sepal.Width:Petal.Width        -0.73885    0.46392   -1.59    0.114  
#> Sepal.Width:Speciesversicolor  -0.06240    0.69994   -0.09    0.929  
#> Sepal.Width:Speciesvirginica    0.15652    1.06683    0.15    0.884  
#> Sepal.Width:grp2                0.21824    0.24099    0.91    0.367  
#> Sepal.Width:grp3                0.28766    0.23779    1.21    0.229  
#> Petal.Length:Petal.Width       -0.29280    0.36826   -0.80    0.428  
#> Petal.Length:Speciesversicolor  1.07990    0.54672    1.98    0.050 .
#> Petal.Length:Speciesvirginica   1.50030    0.74199    2.02    0.045 *
#> Petal.Length:grp2               0.23747    0.19687    1.21    0.230  
#> Petal.Length:grp3              -0.04170    0.17893   -0.23    0.816  
#> Petal.Width:Speciesversicolor  -0.14775    1.40148   -0.11    0.916  
#> Petal.Width:Speciesvirginica    0.48417    1.67788    0.29    0.773  
#> Petal.Width:grp2               -0.76517    0.40421   -1.89    0.061 .
#> Petal.Width:grp3                0.00258    0.39882    0.01    0.995  
#> Speciesversicolor:grp2          0.24501    0.67784    0.36    0.718  
#> Speciesvirginica:grp2           0.48118    0.94497    0.51    0.612  
#> Speciesversicolor:grp3          0.41806    0.62824    0.67    0.507  
#> Speciesvirginica:grp3           0.23647    0.87030    0.27    0.786  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.28 on 123 degrees of freedom
#> Multiple R-squared:  0.903,  Adjusted R-squared:  0.882 
#> F-statistic: 43.9 on 26 and 123 DF,  p-value: <2e-16

Wow, that’s a lot of parameters! And almost none of them is significant…

Which is weird, considering that gorgeous R2! 0.882! I wish I had that in my research…

Too many parameters?

As you might know, a model that performs too well is not always a good thing. For instance, it can be a marker of overfitting: the model corresponds too closely to one particular set of data, and may therefore fail to predict future observations reliably. In multiple regression, it can also fall prey to Freedman’s paradox: some predictors that actually have no relation to the dependent variable being predicted will be spuriously found to be statistically significant.
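
Freedman’s paradox is easy to reproduce with a quick simulation (a sketch, not part of the original analysis): even when every predictor is pure noise, a few of them will typically come out “significant” just by chance.

# Sketch: 100 observations, 50 pure-noise predictors, a pure-noise outcome
set.seed(123)
noise <- as.data.frame(matrix(rnorm(100 * 51), nrow = 100))
names(noise)[1] <- "y"
fit <- lm(y ~ ., data = noise)
# Number of predictors flagged as "significant" at the 5% level
sum(summary(fit)$coefficients[-1, "Pr(>|t|)"] < 0.05)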

Let’s run a few checks using the performance package:

library(performance)

check_normality(model)
#> OK: residuals appear as normally distributed (p = 0.115).
check_heteroscedasticity(model)
#> OK: Error variance appears to be homoscedastic (p = 0.348).
check_autocorrelation(model)
#> OK: Residuals appear to be independent and not autocorrelated (p = 0.446).
check_collinearity(model)
#> # Check for Multicollinearity
#> 
#> High Correlation
#> 
#>                 Parameter        VIF Increased SE
#>               Sepal.Width      61.39         7.84
#>              Petal.Length    4410.44        66.41
#>               Petal.Width    2876.94        53.64
#>                   Species  611985.94       782.30
#>                       grp   23782.23       154.21
#>  Sepal.Width:Petal.Length    2796.78        52.88
#>   Sepal.Width:Petal.Width    2159.27        46.47
#>       Sepal.Width:Species  461389.69       679.26
#>           Sepal.Width:grp   22277.15       149.26
#>  Petal.Length:Petal.Width    5555.71        74.54
#>      Petal.Length:Species 1872684.46      1368.46
#>          Petal.Length:grp   40317.76       200.79
#>       Petal.Width:Species  616688.46       785.30
#>           Petal.Width:grp   12613.55       112.31
#>               Species:grp  282568.74       531.57

The main issue with this model seems to be the very high multicollinearity. This suggests that the model may not be able to give valid estimates for any individual predictor, nor to tell us which predictors are redundant with respect to the others.
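
Part of that collinearity is structural (interaction terms are naturally correlated with their components), but the numeric predictors in iris are also strongly correlated with one another; for instance, Petal.Length and Petal.Width correlate at about 0.96. You can check this directly (a quick sanity check, not from the original analysis):

# Correlation matrix of the numeric predictors
cor(iris[sapply(iris, is.numeric)])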

Parameter selection

Time to do some variable selection. This can easily be done using the select_parameters() function, which automatically selects the best variables and updates the model accordingly. One convenient way of using it is in a tidy pipeline (using %>%), piping the simplified model straight into summary().

lm(Sepal.Length ~ .*., data = iris) %>% 
  select_parameters() %>% 
  summary()
#> 
#> Call:
#> lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + 
#>     Species + grp + Sepal.Width:Petal.Width + Petal.Length:Species + 
#>     Petal.Length:grp + Petal.Width:grp, data = iris)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.6912 -0.1841 -0.0084  0.2045  0.6284 
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     2.20278    0.54238    4.06  8.2e-05 ***
#> Sepal.Width                     0.83841    0.12868    6.52  1.3e-09 ***
#> Petal.Length                   -0.02234    0.28169   -0.08  0.93690    
#> Petal.Width                     1.18753    0.43904    2.70  0.00772 ** 
#> Speciesversicolor              -1.22571    0.53650   -2.28  0.02389 *  
#> Speciesvirginica               -3.30198    0.64766   -5.10  1.1e-06 ***
#> grp2                           -0.35680    0.18083   -1.97  0.05053 .  
#> grp3                            0.07139    0.18136    0.39  0.69446    
#> Sepal.Width:Petal.Width        -0.34861    0.10225   -3.41  0.00086 ***
#> Petal.Length:Speciesversicolor  0.57694    0.26857    2.15  0.03348 *  
#> Petal.Length:Speciesvirginica   0.90715    0.27149    3.34  0.00108 ** 
#> Petal.Length:grp2               0.24474    0.12851    1.90  0.05899 .  
#> Petal.Length:grp3               0.00486    0.12514    0.04  0.96909    
#> Petal.Width:grp2               -0.56926    0.29398   -1.94  0.05490 .  
#> Petal.Width:grp3                0.01301    0.29155    0.04  0.96448    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.28 on 135 degrees of freedom
#> Multiple R-squared:  0.893,  Adjusted R-squared:  0.881 
#> F-statistic: 80.1 on 14 and 135 DF,  p-value: <2e-16

That’s still a lot of parameters but, as you can see, almost all of them are now significant, and the R2 did not change much.
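
For a more formal comparison of the full and the reduced model, you could use compare_performance() from the performance package (a sketch; model_full and model_reduced are hypothetical names, not objects created above):

library(parameters)
library(performance)

# Sketch: compare fit indices of the full model and the selected model
model_full <- lm(Sepal.Length ~ .*., data = iris)
model_reduced <- select_parameters(model_full)
compare_performance(model_full, model_reduced)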

Appealing as they are, please note that these automated selection methods are heavily criticized, and should not be used in place of theoretical reasoning (i.e., you should have justified hypotheses about the parameters of your model).

Mixed and Bayesian models

For simple linear regressions as above, the selection is made using the step() function from base R, which performs stepwise selection. However, this procedure is not available for other types of models, such as mixed or Bayesian models.
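
For reference, this is roughly what happens under the hood for an lm model (a sketch; select_parameters() may use different arguments or defaults):

# Roughly equivalent stepwise selection done "by hand" with base R
full_model <- lm(Sepal.Length ~ .*., data = iris)
stepwise <- step(full_model, direction = "both", trace = 0)
formula(stepwise)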

Mixed models

For mixed models (of class merMod), stepwise selection is based on cAIC4::stepcAIC(). This step function only searches for the “best” model with respect to the random-effects structure, i.e., select_parameters() adds or excludes random effects until the conditional AIC (cAIC) can’t be improved further.

library(lme4)
data("qol_cancer")

# multiple models are checked; however, the initial model
# already seems to be the best one...
lmer(
  QoL ~ time + phq4 + age + (1 + time | hospital / ID),
  data = qol_cancer
) %>% 
  select_parameters() %>%
  summary()
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: QoL ~ time + phq4 + age + (1 + time | ID:hospital)
#>    Data: qol_cancer
#> 
#> REML criterion at convergence: 4647
#> 
#> Scaled residuals: 
#>    Min     1Q Median     3Q    Max 
#> -3.580 -0.393  0.078  0.510  2.504 
#> 
#> Random effects:
#>  Groups      Name        Variance Std.Dev. Corr 
#>  ID:hospital (Intercept) 203.3    14.26         
#>              time         12.7     3.56    -0.72
#>  Residual                143.1    11.96         
#> Number of obs: 564, groups:  ID:hospital, 188
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  71.5309     1.6919   42.28
#> time          1.1489     0.6695    1.72
#> phq4         -4.7711     0.3236  -14.75
#> age          -0.0185     0.1889   -0.10
#> 
#> Correlation of Fixed Effects:
#>      (Intr) time   phq4  
#> time -0.844              
#> phq4  0.033 -0.027       
#> age  -0.021 -0.003  0.107
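
If you want to inspect the selection criterion yourself, the cAIC4 package can compute the conditional AIC of a fitted merMod directly (a sketch, assuming cAIC4 is installed; “selected” is a hypothetical name):

library(cAIC4)

# Conditional AIC of the model retained by select_parameters()
selected <- lmer(
  QoL ~ time + phq4 + age + (1 + time | hospital / ID),
  data = qol_cancer
) %>%
  select_parameters()
cAIC(selected)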

Bayesian models

For Bayesian models, select_parameters() uses the projpred package, which performs projection predictive variable selection, to select the best parameters for the model.

library(rstanarm)
model <- stan_glm(
  mpg ~ ., data = mtcars,
  iter = 500, refresh = 0, verbose = FALSE
)
select_parameters(model, cross_validation = TRUE)
#> stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ wt
#>  observations: 32
#>  predictors:   2
#> ------
#>             Median MAD_SD
#> (Intercept) 37.3    1.8  
#> wt          -5.3    0.6  
#> 
#> Auxiliary parameter(s):
#>       Median MAD_SD
#> sigma 3.1    0.4   
#> 
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
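
As a follow-up, you could check whether the drastically reduced model (mpg ~ wt) predicts about as well as the full model, for instance via approximate leave-one-out cross-validation (a sketch, not part of the original vignette; “reduced” is a hypothetical name):

library(loo)

# Sketch: compare predictive performance of the full and selected models
reduced <- select_parameters(model, cross_validation = TRUE)
loo_compare(loo(model), loo(reduced))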