Different Bayesian Models

(These documents take a long time to create, so only the code is shown here. The full version is at https://tidymodels.github.io/tidyposterior.)

The data set noisy_example contains the results for a series of regression models that were created from a small dataset with considerable variability. For resampling, 10 repeats of 10-fold cross-validation were used to estimate performance. We will compare models using the root mean squared error (RMSE) metric.

library(tidyposterior)
data("noisy_example")

library(tidyverse)

rmses <- noisy_example %>%
   select(id, id2, contains("RMSE")) %>%
   setNames(tolower(gsub("_RMSE$", "", names(.))))

# By removing the `splits` column, `rmses` is a basic data frame instead of
# an `rset` object. 
stacked_rmse <- gather(rmses, key = model, value = statistic, -id, -id2)

mean_rmse <- stacked_rmse %>%
  group_by(model) %>%
  summarise(statistic = mean(statistic))

library(ggplot2)

ggplot(stacked_rmse, 
       aes(
         x = model,
         y = statistic,
         group = paste(id, id2),
         col = paste(id, id2))
       ) + 
  geom_line(alpha = .75) + 
  theme(legend.position = "none")

ggplot(stacked_rmse, aes(col = model, x = statistic)) + 
  geom_line(stat = "density", trim = FALSE) + 
  theme(legend.position = "top")

A few observations about these data:

A few different Bayesian models will be fit to these data.

A First Model

It might make sense to use a probability model that is consistent with the characteristics of the data (in terms of skewness). Instead of using a symmetric distribution for the data (such as Gaussian), a potentially right skewed probability model might make more sense. A Gamma distribution is a reasonable choice and can be fit using the generalized linear model embedded in perf_mod. This also requires a link function to be chosen to model the data. The canonical link for this distribution is the inverse transformation and this will be our choice.

To fit this model, the family argument to stan_glmer can be passed in. The default link is the inverse and no extra transformation will be used.

gamma_model <- perf_mod(rmses, family = Gamma(), seed = 74)

# Get the posterior distributions of the mean parameters:
gamma_post <- tidy(gamma_model, seed = 3750)
gamma_mean <- summary(gamma_post)
gamma_mean

Are these values consistent with the data? Let’s look at the posterior distribution and overlay the observed and predicted mean RMSE values.

ggplot(gamma_post) + 
  geom_point(data = gamma_mean, aes(y = mean), alpha = .5) + 
  geom_point(data = mean_rmse, aes(y = statistic), 
             col = "red", pch = 4, cex= 3)

The observed mean is not close to the center of the (skewed) posterior distributions. Let’s try something else.

Transforming the Data

Another approach is to transform the RMSE values to something model symmetric and model the data on a different scale. A log transform will be used here using the built-in object ln_trans. In using this option, the posterior distributions are computed on the log scale and is automatically back-transformed into the original units. By not passing family to the function, we are using a Gaussian model.

log_linear_model <- perf_mod(rmses, transform = ln_trans, seed = 74)

There were some message regarding sampling and divergent transitions. We could use the shinystan or coda packages to look into this model.

log_linear_post <- tidy(log_linear_model, seed = 3750)

log_linear_mean <- summary(log_linear_post)
log_linear_mean

ggplot(log_linear_post) + 
  geom_point(data = log_linear_mean, aes(y = mean), alpha = .5) + 
  geom_point(data = mean_rmse, aes(y = statistic), 
             col = "red", pch = 4, cex= 3)

The posteriors are a lot less skewed but the observed and estimated means are still fairly far away from one another. Since these differences are in the same direction, this would not appear to be related to the shrinkage properties of Bayesian models.

A Simple Gaussian Model

Let’s try the easiest model that used a linear function and assumes a Gaussian distirbution for the RMSE estimates.

linear_model <- perf_mod(rmses, seed = 74)

linear_post <- tidy(linear_model, seed = 3750)
linear_mean <- summary(linear_post)

ggplot(linear_post) + 
  geom_point(data = linear_mean, aes(y = mean), alpha = .5) + 
  geom_point(data = mean_rmse, aes(y = statistic), 
             col = "red", pch = 4, cex= 3)

These are right on target. Despite the skewness of the original data, a simple linear model did best here. In hindsight, this makes sense since we are modeling summary statistics as our outcome. Even if we believe these to be potentially skewed distributions, the central limit theorem is kicking in here and the estimates are tending to normality.

We can compare models using the contrast_models function. The function has arguments for two sets of models to compare but if these are left to their default (NULL), all pair-wise combinations are used. Let’s say that an RMSE difference of 1 unit is important.

all_contrasts <- contrast_models(linear_model, seed = 8967)
ggplot(all_contrasts, size = 1)
summary(all_contrasts, size = 1)

Based on our effect size of a single unit, the only pair that are practically equivalent are MARS and bagged trees. Since cubist has the smallest RMSE, it is not unreasonable to say that this model provides uniformly better results than the others shown here.

One Final Note

The Bayesian models have population parameters for the model effects (akin to “fixed” effects in mixed models) as well as variance parameter(s) related to the resamples. The posteriors computed by this package only reflect the mean parameters and should only be used to make inferences about this data set generally. This posterior calculation could not be used to predict the level of performance for a model on a new resample of the data. In this case, the variance parameters come into play and the posterior would be much wider.

In essence, the posteriors shown here are measuring the average performance value instead of a resample-specific value.