For this vignette, we will create and use a synthetic dataset.
library(dplyr)
set.seed(54321)
N = 40
c1 <- rnorm(N, mean = 100, sd = 25)
c2 <- rnorm(N, mean = 100, sd = 50)
g1 <- rnorm(N, mean = 120, sd = 25)
g2 <- rnorm(N, mean = 80, sd = 50)
g3 <- rnorm(N, mean = 100, sd = 12)
g4 <- rnorm(N, mean = 100, sd = 50)
gender <- c(rep('Male', N/2), rep('Female', N/2))
dummy <- rep("Dummy", N)
id <- 1:N
wide.data <-
tibble::tibble(
Control1 = c1, Control2 = c2,
Group1 = g1, Group2 = g2, Group3 = g3, Group4 = g4,
Dummy = dummy,
Gender = gender, ID = id)
my.data <-
wide.data %>%
tidyr::gather(key = Group, value = Measurement, -ID, -Gender, -Dummy)
head(my.data)
## # A tibble: 6 x 5
## Dummy Gender ID Group Measurement
## <chr> <chr> <int> <chr> <dbl>
## 1 Dummy Male 1 Control1 95.5
## 2 Dummy Male 2 Control1 76.8
## 3 Dummy Male 3 Control1 80.4
## 4 Dummy Male 4 Control1 58.7
## 5 Dummy Male 5 Control1 89.8
## 6 Dummy Male 6 Control1 72.6
This dataset is a tidy dataset, where each observation (datapoint) is a row, and each variable (or associated metadata) is a column. dabestr requires that data be in this form, as do other popular R packages for data visualization and analysis.
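As a minimal sketch of what "tidy" means here, the toy example below reshapes a small wide table into long form. It uses tidyr::pivot_longer() rather than the gather() call shown above; the data frame and its values are made up purely for illustration.

```r
library(tidyr)

# A hypothetical wide table: one column per group, one row per subject.
wide <- data.frame(
  ID = 1:3,
  Control1 = c(10, 12, 11),
  Group1 = c(14, 15, 13)
)

# Reshape so that each row holds a single measurement.
long <- pivot_longer(wide,
                     cols = c(Control1, Group1),
                     names_to = "Group",
                     values_to = "Measurement")

# 3 subjects x 2 groups gives 6 rows, one per observation.
nrow(long)
names(long)
```

After reshaping, the group label and the measured value each live in a single column, which is the layout the x and y arguments of dabest expect.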
The dabest function is the main workhorse of the dabestr package. To create a two-group estimation plot (aka a Gardner-Altman plot), we must first specify the following: the x and y columns; whether the comparison is paired (paired = TRUE or paired = FALSE); and idx, the groups to be compared.
library(dabestr)
## Loading required package: magrittr
two.group.unpaired <-
my.data %>%
dabest(Group, Measurement,
# The idx below passes "Control1" as the control group,
# and "Group1" as the test group. The mean difference
# will be computed as mean(Group1) - mean(Control1).
idx = c("Control1", "Group1"),
paired = FALSE)
# Calling the object automatically prints out a summary.
two.group.unpaired
## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
##
## Good morning!
## The current time is 11:27 AM on Monday July 13, 2020.
##
## Dataset : .
## The first five rows are:
## # A tibble: 5 x 5
## Dummy Gender ID Group Measurement
## <chr> <chr> <int> <fct> <dbl>
## 1 Dummy Male 1 Control1 95.5
## 2 Dummy Male 2 Control1 76.8
## 3 Dummy Male 3 Control1 80.4
## 4 Dummy Male 4 Control1 58.7
## 5 Dummy Male 5 Control1 89.8
##
## X Variable : Group
## Y Variable : Measurement
##
## Effect sizes(s) will be computed for:
## 1. Group1 minus Control1
To compute the mean difference between Group1 and Control1, we apply the mean_diff() function to the dabest object created above.
two.group.unpaired.meandiff <- mean_diff(two.group.unpaired)
# Calling the above object produces a textual summary of the computed effect size.
two.group.unpaired.meandiff
## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
##
## Good morning!
## The current time is 11:27 AM on Monday July 13, 2020.
##
## Dataset : .
## X Variable : Group
## Y Variable : Measurement
##
## Unpaired mean difference of Group1 (n = 40) minus Control1 (n = 40)
## 19.2 [95CI 7.62; 30.6]
##
##
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.
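As of dabestr v0.3.0, there are five effect sizes available:
The mean difference, given by mean_diff().
The median difference, given by median_diff().
Cohen’s d, given by cohens_d().
Hedges’ g, given by hedges_g().
Cliff’s delta, given by cliffs_delta().
To create a two-group estimation plot (aka a Gardner-Altman plot) from this data, simply use plot(dabest_effsize.object).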
plot(two.group.unpaired.meandiff, color.column = Gender)
This is known as a Gardner-Altman estimation plot, after Martin J. Gardner and Douglas Altman, who were the first to publish it in 1986.
The key features of the Gardner-Altman estimation plot are:
The estimation plot produced by dabestr differs from the one first introduced by Gardner and Altman in one important aspect: dabestr derives the 95% CI through nonparametric bootstrap resampling. This enables visualization of the confidence interval as a graded sampling distribution.
The 95% CI presented is bias-corrected and accelerated (i.e. a BCa bootstrap). You can read more about bootstrap resampling and BCa correction here.
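To make the idea of nonparametric bootstrap resampling concrete, here is a minimal base R sketch. It computes a plain percentile interval, which is simpler than the BCa interval that dabestr reports, so the numbers will not match the package output. The variable names and simulated data are assumptions made for illustration only.

```r
set.seed(1)

# Hypothetical control and test samples.
control <- rnorm(40, mean = 100, sd = 25)
test <- rnorm(40, mean = 120, sd = 25)

# The observed effect size: test mean minus control mean.
observed <- mean(test) - mean(control)

# Resample each group with replacement and recompute the difference.
n_boot <- 5000
boot_diffs <- replicate(n_boot, {
  mean(sample(test, replace = TRUE)) - mean(sample(control, replace = TRUE))
})

# Percentile 95% CI: the 2.5th and 97.5th percentiles of the resampled
# differences. (BCa additionally adjusts these percentiles for bias and skew.)
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))

observed
ci
```

The vector boot_diffs is the sampling distribution that the estimation plot draws as a shaded curve next to the effect size.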
You can also obtain Gardner-Altman plots for the median difference, Cohen’s d, and Hedges’ g effect sizes. Below we demonstrate how to obtain one for the Hedges’ g of the loaded two.group.unpaired dataset.
two.group.unpaired %>% hedges_g() %>% plot(color.column = Gender)
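For intuition about what the standardized effect sizes measure, the sketch below computes Cohen’s d and Hedges’ g by hand using the common textbook formulas (pooled standard deviation, and the approximate small-sample correction factor). This is an illustration with made-up numbers; dabestr's internal computation may differ in detail.

```r
# Hypothetical samples, chosen so the arithmetic is easy to follow.
control <- c(95, 100, 105, 110, 90)
test <- c(115, 120, 125, 130, 110)

n1 <- length(control)
n2 <- length(test)

# Pooled standard deviation of the two groups.
pooled_sd <- sqrt(((n1 - 1) * var(control) + (n2 - 1) * var(test)) /
                    (n1 + n2 - 2))

# Cohen's d: the mean difference in units of the pooled SD.
d <- (mean(test) - mean(control)) / pooled_sd

# Hedges' g: Cohen's d shrunk by an approximate small-sample correction.
correction <- 1 - 3 / (4 * (n1 + n2) - 9)
g <- d * correction

d
g
```

Because the correction factor is below 1, Hedges’ g is always slightly smaller in magnitude than Cohen’s d, and the two converge as the sample sizes grow.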
If you have paired or repeated observations, you must specify the id.col, a column in the data that indicates the identity of each paired observation. This will produce a Tufte slopegraph instead of a swarmplot.
two.group.paired <-
my.data %>%
dabest(Group, Measurement,
idx = c("Control1", "Group1"),
paired = TRUE, id.col = ID)
# The summary indicates this is a paired comparison.
two.group.paired
## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
##
## Good morning!
## The current time is 11:27 AM on Monday July 13, 2020.
##
## Dataset : .
## The first five rows are:
## # A tibble: 5 x 5
## Dummy Gender ID Group Measurement
## <chr> <chr> <int> <fct> <dbl>
## 1 Dummy Male 1 Control1 95.5
## 2 Dummy Male 2 Control1 76.8
## 3 Dummy Male 3 Control1 80.4
## 4 Dummy Male 4 Control1 58.7
## 5 Dummy Male 5 Control1 89.8
##
## X Variable : Group
## Y Variable : Measurement
##
## Paired effect size(s) will be computed for:
## 1. Group1 minus Control1
# Create a paired plot.
two.group.paired %>%
mean_diff() %>%
plot(color.column = Gender)
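The role of the ID column can be illustrated with a small base R sketch: in a paired design, the effect size is built from within-subject differences, so each subject's two measurements must be matched up. The data below are made up, and the percentile bootstrap shown is a simplification, not dabestr's internal procedure.

```r
set.seed(2)

# Hypothetical paired measurements; position i in each vector is subject i.
before <- c(10, 12, 11, 15)
after <- c(13, 14, 12, 18)

# Within-subject differences keep each pair together.
diffs <- after - before
paired_mean_diff <- mean(diffs)

# Bootstrap by resampling subjects (i.e. their differences), not raw values.
boot_means <- replicate(2000, mean(sample(diffs, replace = TRUE)))
ci <- quantile(boot_means, probs = c(0.025, 0.975))

paired_mean_diff
ci
```

The point estimate equals the unpaired mean difference, but the interval is driven by the spread of the within-subject differences rather than the spread of each group.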
To create a multi-two group plot, one will need to specify a list, with each element of the list corresponding to one two-group comparison.
multi.two.group.unpaired <-
my.data %>%
dabest(Group, Measurement,
idx = list(c("Control1", "Group1"),
c("Control2", "Group2")),
paired = FALSE)
# Compute the mean difference.
multi.two.group.unpaired.meandiff <- mean_diff(multi.two.group.unpaired)
# Create a multi-two group plot.
multi.two.group.unpaired.meandiff %>%
plot(color.column = Gender)
This is a Cumming estimation plot. It is heavily influenced by the plot designs of Geoff Cumming in his 2012 text Understanding the New Statistics. The effect size and 95% CIs are plotted on a separate axis that is now positioned below the raw data. In addition, summary measurements are displayed as gapped lines to the right of each group. These vertical lines are identical to conventional mean ± standard deviation error bars. Here, the mean of each group is indicated as a gap in the line, drawing inspiration from Edward Tufte’s low data-ink ratio dictum.
By default, dabestr plots the mean ± standard deviation of each group as a gapped line beside each group. Setting group.summaries = 'median_quartiles' plots the median and the 25th and 75th percentiles of each group instead. If group.summaries = NULL, the summaries are not shown.
plot(multi.two.group.unpaired.meandiff,
color.column = Gender,
group.summaries = "median_quartiles")
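The two kinds of group summary can be computed by hand to see what the gapped lines represent. The base R sketch below uses a small made-up data frame; the column names mirror those used in this vignette, but the values are hypothetical.

```r
# Hypothetical long-format data with two groups.
df <- data.frame(
  Group = rep(c("Control1", "Group1"), each = 4),
  Measurement = c(1, 2, 3, 4, 2, 4, 6, 8)
)

# Summaries behind the default gapped line: mean (the gap) and
# mean +/- one standard deviation (the line ends).
means <- tapply(df$Measurement, df$Group, mean)
sds <- tapply(df$Measurement, df$Group, sd)

# Summaries behind "median_quartiles": median (the gap) and the
# 25th and 75th percentiles (the line ends).
quartiles <- tapply(df$Measurement, df$Group, quantile,
                    probs = c(0.25, 0.5, 0.75))

means
sds
quartiles[["Control1"]]
```

The median and quartiles are less sensitive to outliers than the mean and standard deviation, which is one reason to prefer them for skewed data.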
One can also produce a multi-paired plot.
multi.two.group.paired <-
my.data %>%
dabest(Group, Measurement,
idx = list(c("Control1", "Group1"),
c("Control2", "Group2")),
paired = TRUE, id.col = ID
)
multi.two.group.paired.mean_diff <- mean_diff(multi.two.group.paired)
plot(multi.two.group.paired.mean_diff,
color.column = Gender,
slopegraph = TRUE)
multi.group <-
my.data %>%
dabest(Group, Measurement,
idx = list(c("Control1", "Group1", "Group3"),
c("Control2", "Group2", "Group4")),
paired = FALSE
)
multi.group.mean_diff <- multi.group %>% mean_diff()
plot(multi.group.mean_diff, color.column = Gender)
You can control several graphical aspects of the estimation plot.
Use the rawplot.ylim and effsize.ylim parameters to supply custom y-limits for the rawplot and the delta plot, respectively.
plot(multi.group.mean_diff,
color.column = Gender,
rawplot.ylim = c(-100, 200),
effsize.ylim = c(-60, 60)
)
You can control the size of the dots used to create the rawplot data with rawplot.markersize. The default size (in points) is 2.
To obtain an aesthetically pleasing plot, you should use this option in tandem with the rawplot.groupwidth option. This sets the maximum amount that each group of datapoints is allowed to spread in the x-direction. The default is 0.3.
plot(multi.group.mean_diff,
color.column = Gender,
rawplot.markersize = 1,
rawplot.groupwidth = 0.4
)
The rawplot.ylabel and effsize.ylabel parameters control the y-axis titles for the rawplot and the delta plot, respectively.
plot(multi.group.mean_diff,
color.column = Gender,
rawplot.ylabel = "Rawplot Title?",
effsize.ylabel = "My delta plot!"
)
The axes.title.fontsize parameter determines the font size of both the rawplot and delta plot y-axis titles.
plot(multi.group.mean_diff,
color.column = Gender,
axes.title.fontsize = 10 # default is 14.
)
The palette parameter accepts either the name of one of the RColorBrewer palettes or a vector of colors. The default palette applied is “Set1”.
plot(multi.group.mean_diff,
color.column = Gender,
palette = "Dark2" # The default is "Set1".
)
plot(multi.group.mean_diff,
color.column = Gender,
# A custom palette consisting of a vector of colors,
# specified as RGB hexcode, or as a R named color.
# See all 657 named R colors with `colors()`.
palette = c("#FFA500", "sienna4")
)
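Before passing a custom palette, it can help to confirm that every entry is a color R recognizes. The sketch below uses only base R (grDevices); the helper is_valid_color() is a hypothetical name introduced here for illustration and is not part of dabestr.

```r
custom_palette <- c("#FFA500", "sienna4")

# Hypothetical helper: TRUE if R can convert the string to an RGB color.
is_valid_color <- function(x) {
  tryCatch({
    grDevices::col2rgb(x)
    TRUE
  }, error = function(e) FALSE)
}

# Check each entry of the palette.
valid <- vapply(custom_palette, is_valid_color, logical(1))
valid

# Named colors can also be looked up directly in the built-in list.
"sienna4" %in% grDevices::colors()
```

A misspelled color name would show up as FALSE here, rather than as an error at plotting time.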
You can use the theme parameter to pass along any ggplot2 theme. The default ggplot2 theme is theme_classic().
plot(multi.group.mean_diff,
color.column = Gender,
theme = ggplot2::theme_gray()
)
Read more about how the estimation plot combines statistical rigour and visual design. You might also be interested in finding out more about bootstrap confidence intervals.