Let’s start off with a simple example. We will simulate 100 observations of a normally distributed outcome variable (mean = 0 and SD = 1), along with a grouping variable made of zeros and ones. Importantly, the mean difference between these two groups (zeros vs. ones) is 1.
data <- bayestestR::simulate_difference(
  n = 100,
  d = 1,
  names = c("Group", "Outcome")
)
summary(data)
> Group Outcome
> 0:50 Min. :-2.55
> 1:50 1st Qu.:-0.74
> Median : 0.00
> Mean : 0.00
> 3rd Qu.: 0.74
> Max. : 2.55
Now, as we are interested in the difference between these two groups, let’s first visualize it, and then test it using a t-test.
library(ggplot2)
ggplot(data, aes(x = Group, y = Outcome, fill = Group)) +
  geom_boxplot() +
  see::theme_modern()
ttest <- t.test(Outcome ~ Group, data = data, var.equal = TRUE)
ttest_pars <- parameters::parameters(ttest)
ttest_pars
> Parameter | Group | Mean_Group1 | Mean_Group2 | Difference | t | df | p | 95% CI | Method
> ----------------------------------------------------------------------------------------------------------------------
> Outcome | Group | -0.50 | 0.50 | 1.00 | -5.32 | 98 | < .001 | [-1.37, -0.63] | Two Sample t-test
As we can see, this confirms our simulation specifications: the difference between the two groups is indeed 1.
Let’s now compute a traditional Cohen’s d using the effectsize package. While this d should be close to 1, it should theoretically be a tiny bit larger, because it takes into account the (pooled) SD of the whole Outcome variable (across both groups), which, because of the group difference, is a bit larger than 1:
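We can check this by computing the SD of the outcome, e.g. with base R’s sd():
# SD of the Outcome variable, computed across both groups
sd(data$Outcome)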
> [1] 1.1
We can compute the Cohen’s d, for instance with effectsize’s cohens_d() function:
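# Standardized difference between the two groups
effectsize::cohens_d(Outcome ~ Group, data = data)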
> Cohen's d | 95% CI
> --------------------------
> -1.06 | [-1.48, -0.64]
As expected, it’s pretty close to 1: the difference between the two groups is about one SD. Interestingly, one can also estimate the Cohen’s d directly from the result of the t-test, using the t statistic: for two independent groups of equal size, d = 2t / sqrt(df). We can carry out this conversion using the effectsize package:
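# Convert the t statistic (and its df) into a Cohen's d
effectsize::t_to_d(t = ttest$statistic, df_error = ttest$parameter)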
> d | 95% CI
> ----------------------
> -1.07 | [-1.50, -0.65]
Fortunately, the two estimates are quite close.
Another way of investigating this difference is through the lens of a logistic regression, in which the group variable (Group) becomes the response and Outcome the predictor. Let’s fit such a model and investigate its parameters:
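# Logistic regression: predict group membership from the outcome
model <- glm(Group ~ Outcome, data = data, family = "binomial")
parameters::parameters(model)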
> Parameter | Coefficient | SE | 95% CI | z | p
> --------------------------------------------------------------------
> (Intercept) | 1.23e-16 | 0.23 | [-0.45, 0.45] | 5.41e-16 | > .999
> Outcome | 1.13 | 0.27 | [ 0.65, 1.70] | 4.25 | < .001
How can we interpret this output? The coefficients of a logistic model are expressed in log-odds, i.e., the logarithm of the odds of belonging to group 1 rather than group 0. Log-odds are not very intuitive, but they can easily be mapped back to more familiar scales.
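For instance, exponentiating a coefficient gives it on the odds scale, and the logistic function (plogis() in base R) converts log-odds into probabilities:
# Odds scale
exp(coef(model))
# Probability scale: the intercept, passed through the logistic
# function, is the probability of Group being 1 when Outcome is 0
plogis(coef(model)["(Intercept)"])
Using the modelbased package, one can also easily visualize this model: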
data_grid <- modelbased::estimate_link(model)
ggplot(data_grid, aes(x = Outcome, y = Predicted)) +
  geom_ribbon(aes(ymin = CI_low, ymax = CI_high), alpha = 0.2) +
  geom_line(color = "red", size = 1) +
  see::theme_modern()
We can see that the probability of Group being 1 (vs. 0) increases as Outcome increases. This is another way of saying that there is a difference in Outcome between the two groups. We can visualize all of this together as follows:
ggplot(data, aes(x = Group, y = Outcome, fill = Group)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  # Add the predicted probability curve, rotated: the probability
  # (shifted by 1) is drawn on the x-axis, the outcome on the y-axis
  geom_line(
    data = data_grid,
    aes(x = Predicted + 1, y = Outcome),
    inherit.aes = FALSE,
    color = "red", size = 1
  ) +
  see::theme_modern()
You can notice that the red predicted probability line passes through Outcome = 0 right between the two groups, i.e., where the predicted probability is 0.5. This means that at Outcome = 0, an observation is equally likely to belong to either group: it is the “middle” of the difference between them.
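We can verify this by predicting the probability of belonging to group 1 for an observation at Outcome = 0, e.g. with base R’s predict():
# Predicted probability of Group = 1 at Outcome = 0
predict(model, newdata = data.frame(Outcome = 0), type = "response")
Given the intercept of (essentially) zero, this returns a probability of 0.5.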