User Guide

‘ggpmisc’ 0.3.5

Pedro J. Aphalo

2020-06-01

Preliminaries

We load all the packages used in the examples.

library(ggpmisc)
## Loading required package: ggplot2
library(ggrepel)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:xts':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse

As we will use text and labels on the plotting area we change the default theme to an uncluttered one.

old_theme <- theme_set(theme_bw())

Introduction

Many of the functions, including ggplot statistics and geoms, included in package ‘ggpmisc’ had their origin in my need to produce plots for use in teaching or research reports. Some of them are generally useful, such as stat_poly_eq() and stat_quadrant_counts()but others like stat_fit_deviations() are squarely aimed and producing learning material. Function try_tibble() opens the door to easily converting time series stored in objects of several different classes into data frames. This is, for example, useful for plotting time series data with ggplot(). New ggplot() method specializations for classes ts and xts make the call to try_tibble() and conversion automatic.

ggplot methods for time series

ggplot() methods for classes "ts" and "xts" automate plotting of time series data, as x and y aesthetics are mapped to time and the variable of the time series, respectively. For plotting time series data stored in objects of other classes, see the conversion functions try_tibble() and try_data_frame() in the last section of this vignette.

class(lynx)
## [1] "ts"
ggplot(lynx) + geom_line()

It is possible to control the conversion of the time variable to numeric or datetime. As we will see below this affects the scale used by default as well as the formatting of values when converted to character strings or printed.

ggplot(lynx, as.numeric = FALSE) + geom_line()

geom_table and stat_fmt_table

The geometry geom_table() plots a data frame or tibble, nested in a tibble passed as data argument, using aesthetics x and y for positioning, and label for the data frame containing the data for the table. The table is created as a ‘grid’ grob and added as usual to the ggplot object. In contrast to “standard” geoms, this geom by default does not inherit the globally mapped aesthetics.

tb <- mpg %>%
  group_by(cyl) %>%
  summarise(hwy = median(hwy), cty = median(cty))
data.tb <- tibble(x = 7, y = 44, tb = list(tb))
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_table(data = data.tb, aes(x, y, label = tb)) +
  geom_point() 

Table themes are supported through parameter table.theme and if variables or constants are mapped to the colour, fill, size, family, aesthetics they override the corresponding theme settings. The display of rownames and colnames can be enabled or disable through parameter table.rownames and table.colnames and the horizontal justification of text in the core of the table through parameter table.hjust.

Parameter table.theme accepts as arguments NULL for use of the current default, a ttheme constructor function such as those defined in package ‘gridExtra’, or the variations on them defined in this package. The current default can be set with function ttheme_set().

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_table(data = data.tb, aes(x, y, label = tb),
             table.theme = ttheme_gtsimple,
             table.hjust = 0, colour = "darkred", fill = "#FFFFBB") +
  geom_point() 

Using stat_fmt_tb() we can rename columns of the tibble, and/or select a subset of columns or a slice of rows as shown below. To provide a complete example we also replace the names of the x, y and colour aesthetics.

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_table(data = data.tb, aes(x, y, label = tb),
             table.theme = ttheme_gtlight,
             size = 3, colour = "darkblue",
             stat = "fmt_tb", tb.vars = c(Cylinders = "cyl", "MPG" = "hwy")) +
  labs(x = "Engine displacement (l)", y = "Fuel use efficiency (MPG)",
       colour = "Engine cylinders\n(number)") +
  geom_point() +
  theme_bw()

Parsed text, using plot math syntax is supported in the table, with fall-back to plain text in case of parsing errors, on a cell by cell basis. Here we plot the MPG for city traffic and we can see that the plotting area expands to include the coordinates at which the table is anchored. Justification is by default set to "inward" which ensures that the table is fully within the plotting region.

tb.pm <- tibble(Parameter = c("frac(beta[1], a^2)", "frac(beta[2], a^3)"),
                Value = c("10^2.4", "10^3.532"))
data.tb <- tibble(x = 7, y = 44, tb = list(tb.pm))
ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  geom_table(data = data.tb, aes(x, y, label = tb), parse = TRUE) +
  theme_bw()

As implemented, there is no limitation to the number of insets, and faceting is respected. If the base plot shows a map, multiple small tables could be superimposed on different countries or regions. The size of the insets is set relative to the main plot, so the combined plot can be scaled.

Please see section Normalised Parent Coordinates below for a description of geom_table_npc().

geom_plot

The geom_plot() geometry plots ggplot objects, nested in a tibble passed as data argument, using aesthetics x and y for positioning, and label for the ggplot object containing the definition of the plot to be nested. A Grob is created with ‘ggplotGrob’ and added as usual to the ggplot object.

As an example we produce plot where the inset plot is a zoomed-in detail from the main plot. In this case the main and inset plots start as the same plot.

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point() 

data.tb <- tibble(x = 7, y = 44, 
                  plot = list(p + 
                                coord_cartesian(xlim = c(4.9, 6.2), ylim = c(13, 21)) +
                                labs(x = NULL, y = NULL) +
                                theme_bw(8) +
                                scale_colour_discrete(guide = FALSE)))

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_plot(data = data.tb, aes(x, y, label = plot)) +
  annotate(geom = "rect", 
           xmin = 4.9, xmax = 6.2, ymin = 13, ymax = 21,
           linetype = "dotted", fill = NA, colour = "black") +
  geom_point() 

In general, the inset plot can be any ggplot object, allowing the creation of very different combinations of main plot and inset plot. Here we use the inset to summaries as in the previous example of an inset table.

p <- ggplot(mpg, aes(factor(cyl), hwy, fill = factor(cyl))) +
  stat_summary(geom = "col", fun = mean, width = 2/3) +
  labs(x = "Number of cylinders", y = NULL, title = "Means") +
  scale_fill_discrete(guide = FALSE)

data.tb <- tibble(x = 7, y = 44, 
                  plot = list(p +
                                theme_bw(8)))

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_plot(data = data.tb, aes(x, y, label = plot)) +
  geom_point() +
  labs(x = "Engine displacement (l)", y = "Fuel use efficiency (MPG)",
       colour = "Engine cylinders\n(number)") +
  theme_bw()

As implemented, there is no limitation to the number of insets, and faceting is respected. If the base plot shows a map, multiple small plots could be superimposed on different countries or regions. The size of the insets is set relative to the main plot, so the combined plot can be scaled. A possible unintuitive but useful feature, is that the theme is linked to each plot.

Please see section Normalised Parent Coordinates below for a description of geom_plot_npc().

geom_grob

The geom_grob() geometry plots grobs (graphical objects as created with ‘grid’), nested in a tibble passed as data argument, using aesthetics x and y for positioning, and label for the Grob object. While geom_table() and geom_plot() take as values to plot tibbles or data frames, and ggplots, respectively, and convert them into Grobs before adding them to the plot, geom_grob() expects Grobs ready to be rendered.

file.name <- 
  system.file("extdata", "Isoquercitin.png", 
              package = "ggpmisc", mustWork = TRUE)
Isoquercitin <- magick::image_read(file.name)
grobs.tb <- tibble(x = c(0, 10, 20, 40), y = c(4, 5, 6, 9),
                   width = c(0.05, 0.05, 0.01, 1),
                   height =  c(0.05, 0.05, 0.01, 0.3),
                   grob = list(grid::circleGrob(), 
                               grid::rectGrob(), 
                               grid::textGrob("I am a Grob"),
                               grid::rasterGrob(image = Isoquercitin)))

ggplot() +
  geom_grob(data = grobs.tb, 
            aes(x, y, label = grob, vp.width = width, vp.height = height),
            hjust = 0.7, vjust = 0.55) +
  scale_y_continuous(expand = expansion(mult = 0.3, add = 0)) +
  scale_x_continuous(expand = expansion(mult = 0.2, add = 0)) +
 theme_bw(12)

geom_grob() is designed thinking that its main use will in graphical annotations, although one could use it for infographics with multiple copies of each grob, this would go against the grammar of graphics. In this implementation grobs cannot to mapped to an aesthetic through a scale.

As implemented, there is no limitation to the number of insets, and faceting is respected. If the base plot shows a map, multiple simple grobs (e.g. national flags) could be superimposed on different countries. The size of the insets is set relative to the main plot, so the combined plot can be scaled.

Please see next section for a description of geom_grob_npc().

geom_vhlines

This is a convenience geometry that adds both vertical and horizontal guide lines on the same plot layer, using the same syntax as geom_hline() and geom_vline() from package ‘ggplot2’.

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_vhlines(xintercept = c(2.75, 4), yintercept = 27, linetype = "dashed") +
  geom_point() +
  labs(x = "Engine displacement (l)", y = "Fuel use efficiency (MPG)",
       colour = "Engine cylinders\n(number)") +
  theme_bw()

Normalised Parent Coordinates

R’s ‘grid’ package defines several units that can be used to describe the locations of plot elements. In ‘ggplot2’ the x and y aesthetics are directly mapped to "native" or data units. For consistent location of annotations with respect to the plotting area we need to rely on "npc" which are expressed relative to the size of the grid viewport, as the plotting area in a ggplot if implemented as viewport.

To support this we have implemented scales for two new aesthetics, npcx and npcy. These are very simple continuous scales which do not support any transformation or change to their limits, which are meaningless for "npc" units. Variables mapped to these aesthetics can be either numerical with values in the range zero to one or character. A limited set of strings are recognised and converted to “npc” units: "top", "bottom", "left", "right" and "centre" (or its synonyms "center" and "middle"). Justification defaults to “inward”.

To make these scales useful we need also to define geometries that use these new aesthetics. Package ‘ggpmisc’ currently provides geom_text_npc(), geom_label_npc(), geom_table_npc(), geom_plot_npc() and geom_grob_npc().

These are used internally by several of the statistics described below. They can also be useful on their own right in those cases when we want to add annotations at specific positions within the plotting area, while the usual x and y aesthetics should be used whenever the positions of plot elements represent data values. When writing scripts or functions that may be applied to different data sets they help in keeping the code concise and reusable.

As an example let’s imagine that we want to add an institutional logo to a plot. Its position has nothing to do with the data mapped to x and y, so it is conceptually better to use "npc" coordinates. The big practical advantage is that this also allows to keep this part of the plot definition independent of the data being plotted, giving a major advantage in the case of plots with facets with free scale limits.

We assume here that the tiger is being used as a logo, and we simply map all aesthetics to constant values.

file.name <- 
  system.file("extdata", "Robinin.png", 
              package = "ggpmisc", mustWork = TRUE)
Robinin <- magick::image_read(file.name)

set.seed(123456)
data.tb <- tibble(x = 1:20, y = (1:20) + rnorm(20, 0, 10))

ggplot(data.tb, aes(x, y)) +
  geom_grob_npc(label = list(grid::rasterGrob(image = Robinin)), 
                npcx = 0.01, npcy = 0.95, hjust = 0, vp.width = 0.5) +
  geom_point() +
  expand_limits(y = 45, x = 0)

Alternatively we produce the same plot by creating a data frame to contain the grob and the coordinates, and then map these variable to the aesthetics using aes().

flavo.tb <- tibble(x = 0.02,
                   y = 0.95,
                   width = 1/2,
                   height = 1/4,
                   grob = list(grid::rasterGrob(image = Robinin)))

ggplot(data.tb, aes(x, y)) +
  geom_grob_npc(data = flavo.tb, 
                aes(label = grob, npcx = x, npcy = y, 
                    vp.width = width, vp.height = height)) +
  geom_point() +
  expand_limits(y = 55, x = 0)

Two geometries are based on existing ‘ggplot2’ geometries. They are almost like wrappers built on top of geom_text() and geom_label(). We give an example using geom_text_npc() to produce a “classic” labelling for facets matching the style of theme_classic() and traditional scientific journals’ design.

corner_letters.tb <- tibble(label = LETTERS[1:4],
                            x = "left", 
                            y = "top",
                            cyl = c(4,5,6,8))
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~cyl, scales = "free") +
  geom_text_npc(data = corner_letters.tb,
                aes(npcx = x, npcy = y, label = label)) +
  theme_classic() +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank())

Marginal markings

‘ggplot2’ provides geom_rug(), geom_vline() and geom_hline(). Rug plots are intended to be used to represent distributions along the margins of plot. geom_vline() and geom_hline()are normally used to separate regions in a plot or to highlight important values along the x or y axis. When creating plots it is sometimes useful to put small marks along the axes, just inside the plotting area, similar to those in a rug plot, but like geom_vline() and geom_hline() in their purpose.

Three geometries provide such markers: geom_margin_point(), geom_margin_arrow(), and geom_margin_grob(). They behave similarly to geom_vline() and geom_hline() and their positions are determined also by the xintercept and yintercept aesthetics.

In the example below we indicate the group medians along the x axis with filled triangles.

data.tb <- mpg %>%
  group_by(cyl) %>%
  summarise(hwy = median(hwy), displ = median(displ))
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_x_margin_point(data = data.tb,
                      aes(xintercept = displ, fill = factor(cyl))) +
  expand_limits(y = 10) +
  geom_point() 

stat_peaks and stat_valleys

Two statistics make it possible to highlight and/or label peaks and valleys (local maxima and local minima) in a curve. Here we use a time series, using POSIXct for time and the default formatting of labels. We also instruct geom_text() not to render overlapping labels.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", vjust = -0.5, 
             check_overlap = TRUE) +
  ylim(-100, 7300)

Using numeric values for time and the default formatting of labels. We can also pass other aesthetics recognised the geom_text() such as angle. Here we also highlight and label the valleys.

ggplot(lynx) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", hjust = -0.2, vjust = 0.5, 
             angle = 90, check_overlap = TRUE) +
  stat_valleys(colour = "blue") +
  stat_valleys(geom = "text", colour = "blue", hjust = 1.2, vjust = 0.5, 
             angle = 90, check_overlap = TRUE) +
  expand_limits(y = c(-900, 8000))

Using POSIXct for time but supplying a format string. In addition marking both peaks and valleys.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", hjust = -0.2, vjust = 0.5, 
             angle = 90, check_overlap = TRUE, x.label.fmt = "%Y") +
  stat_valleys(colour = "blue") +
  stat_valleys(geom = "text", colour = "blue", hjust = 1.2, vjust = 0.5, 
             angle = 90, check_overlap = TRUE, x.label.fmt = "%Y") +
  expand_limits(y = c(-900, 8000))

Using geom_rug for the peaks and valleys.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "rug", colour = "red") +
  stat_valleys(colour = "blue") +
  stat_valleys(geom = "rug", colour = "blue")

stat_quadrant_counts and geom_quadrant_lines

This statistic automates the annotation of plots with number of observations, either by quadrant, by pairs of quadrants or the four quadrants taken together (whole plotting area). Its companion geometry, geom_quadrant_lines() is used in the examples to highlight the quadrants.

We generate some artificial data.

set.seed(4321)
# generate artificial data
x <- -99:100
y <- x + rnorm(length(x), mean = 0, sd = abs(x))
my.data <- data.frame(x, 
                      y, 
                      group = c("A", "B"))

Using defaults except for colour.

ggplot(my.data, aes(x, y)) +
  geom_quadrant_lines(colour = "red") +
  stat_quadrant_counts(colour = "red") +
  geom_point() +
  expand_limits(y = c(-250, 250))

Pooling quadrants along the x-axis. (pool.along = "y" pools along y.)

ggplot(my.data, aes(x, y)) +
  geom_quadrant_lines(colour = "red", pool.along = "x") +
  stat_quadrant_counts(colour = "red", pool.along = "x") +
  geom_point()

Manual positioning of the text annotations and pooling of all four quadrants, and overriding the default formatting for the label.

ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_quadrant_counts(quadrants = 0L, label.x = "left", 
                       aes(label = sprintf("%i observations", stat(count))))

Annotation of only specific quadrants.

ggplot(my.data, aes(x, y)) +
  geom_quadrant_lines(colour = "red") +
  stat_quadrant_counts(colour = "red", quadrants = c(2, 4)) +
  geom_point()

Using facets, even with free scale limits, the labels are placed consistently. This achieved by the default use of geom_text_npc() or as shown below by use of `geom_label_npc(). We expand the y limits to ensure that no observations are occluded by the labels.

ggplot(my.data, aes(x, y, colour = group)) +
  geom_quadrant_lines() +
  stat_quadrant_counts(geom = "label_npc") +
  geom_point() +
  expand_limits(y = c(-260, 260)) +
  facet_wrap(~group)

stat_poly_eq

We generate another set of artificial data.

set.seed(4321)
# generate artificial data
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x, 
                      y, 
                      group = c("A", "B"), 
                      y2 = y * c(0.5,2),
                      block = c("a", "a", "b", "b"),
                      wt = sqrt(x))

First one example using defaults. To ensure that the formula used in both stats is the same, we first save it to a variable.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = formula) +
  stat_poly_eq(formula = formula, parse = TRUE)

A ready formatted equation is also returned as a string that needs to be parsed into an expression for display.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), formula = formula, 
               parse = TRUE)

stat_poly_eq() makes available several character strings in the returned data frame in separate columns: eq.label, rr.label, adj.rr.label, AIC.label, BIC.label, f.value.label and p.value.label. One of these, rr.label, is used by default, but aes() can be used to map a different one to the label aesthetic. Here we show the adjusted coefficient of determination and AIC. We call stat_poly_eq() twice to be able to separately control the positions of the two labels.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(adj.rr.label)), formula = formula, 
               parse = TRUE) +
  stat_poly_eq(aes(label = stat(AIC.label)),
               label.x = "right", label.y = "bottom", size = 3,     
               formula = formula, 
               parse = TRUE)

Within aes() it is also possible to compute new labels based on those returned plus “arbitrary” text. The labels returned by default are meant to be parsed into expressions, so any text added should be valid for a string that will be parsed. Inserting a comma plus white space in an expression requires some trickery in the argument passed to sep. Do note the need to escape the embedded quotation marks as \". Combining the labels in this way ensures correct alignment. To insert only white space sep = "~~~~~" can be used, with each tilde character, ~, adding a rather small amount of white space. We shown only here, not to clutter the remaining examples, how to change the axis labels to ensure consistency with the typesetting of the equation.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label =  paste(stat(eq.label), stat(adj.rr.label), 
                                  sep = "*\", \"*")),
               formula = formula, parse = TRUE) +
  labs(x = expression(italic(x)), y = expression(italic(y)))

Above we combined two character-string labels, but it is possible to add additional ones and even character strings constants. In this example we use several labels instead of just two, and separate them various character strings. We also change the size of the text in the label.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label =  paste(stat(eq.label), "*\" with \"*", 
                                  stat(rr.label), "*\", \"*", 
                                  stat(f.value.label), "*\", and \"*",
                                  stat(p.value.label), "*\".\"",
                                  sep = "")),
               formula = formula, parse = TRUE, size = 3)

It is also possible to format the text using plain(), italic(), bold() or bolditalic() as described in plotmath.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label =  paste(stat(eq.label), stat(adj.rr.label), 
                                  sep = "~~italic(\"with\")~~")),
               formula = formula, parse = TRUE)

As these are expressions, to include two lines of text, we need either to add stat_poly_eq() twice, passing an argument to label.y to reposition one of the labels (as shown above) or use (as shown here) atop() within the expression to create a single label with two lines of text.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = paste("atop(", stat(AIC.label), ",", stat(BIC.label), ")", sep = "")), 
               formula = formula, 
               parse = TRUE)

Next, one example how to remove the left hand side (lhs).

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)),
               eq.with.lhs = FALSE,
               formula = formula, parse = TRUE)

Replacing the lhs.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)),
               eq.with.lhs = "italic(hat(y))~`=`~",
               formula = formula, parse = TRUE)

And a final example replacing both the lhs and the variable symbol used on the rhs.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  labs(x = expression(italic(z)), y = expression(italic(h)) ) + 
  stat_poly_eq(aes(label = stat(eq.label)),
               eq.with.lhs = "italic(h)~`=`~",
               eq.x.rhs = "~italic(z)",
               formula = formula, parse = TRUE)

As any valid R expression can be used, Greek letters are also supported, as well as the inclusion in the label of variable transformations used in the model formula or applied to the data. In addition, in the next example we insert white space in between the parameter estimates and the variable symbol in the equation.

formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(my.data, aes(x, log10(y + 1e6))) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)),
               eq.with.lhs = "plain(log)[10](italic(delta)+10^6)~`=`~",
               eq.x.rhs = "~Omega",
               formula = formula, parse = TRUE) +
  labs(y = expression(plain(log)[10](italic(delta)+10^6)), x = expression(Omega))

A couple of additional examples of polynomials of different orders, and specified in different ways.

Higher order polynomial.

formula <- y ~ poly(x, 5, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), formula = formula, parse = TRUE)

Intercept forced to zero.

formula <- y ~ x + I(x^2) + I(x^3) - 1
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), formula = formula, 
               parse = TRUE)

We give below several examples to demonstrate how other components of the ggplot object affect the behaviour of this statistic.

Facets work as expected either with fixed or free scales. Although below we had to adjust the size of the font used for the equation.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), size = 3,
               formula = formula, parse = TRUE) +
  facet_wrap(~group)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), size = 3,
               formula = formula, parse = TRUE) +
  facet_wrap(~group, scales = "free_y")

Grouping, in this example using the colour aesthetic also works as expected. We can use justification and supply an absolute location for the equation, but the default frequently works well as in the example below.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)),
               formula = formula, parse = TRUE)

Label positions relative to the ranges of the x and y scales are also supported, both through string constants and numeric values in the range 0 to 1, when using the default geom_text_npc().

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)),
               formula = formula, parse = TRUE,
               label.x = "centre")

The default locations are now based on normalized coordinates, and consequently these defaults work even when the range of the x and y scales varies from panel to panel.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, fill = block)) +
  geom_point(shape = 21, size = 3) +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(rr.label)), size = 3,
               geom = "label_npc", alpha = 0.33,
               formula = formula, parse = TRUE) +
  facet_wrap(~group, scales = "free_y")

If grouping is not the same within each panel created by faceting the automatic location of labels results in “spillover” of the automatic positioning of labels, and consistent positioning requires passing explicitly positions for each individual group. In the plot below the simultaneous mapping to both colour and fill aesthetics creates four groups, two with data in panel A and the other two in panel B.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group, fill = block)) +
  geom_point(shape = 21, size = 3) +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(rr.label)), size = 3, alpha = 0.2,
               geom = "label_npc", label.y = c(0.95, 0.85, 0.95, 0.85),
               formula = formula, parse = TRUE) +
  facet_wrap(~group, scales = "free_y")

It is possible to use geom_text() and geom_label() but in this case, if label coordinates are given explicitly they should be expressed in native data coordinates. When multiple labels need to be positioned a vector of coordinates can be used as shown here for label.x and label.y.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(geom = "text", aes(label = stat(eq.label)),
               label.x = c(100, 90), label.y = c(-1e4, 2.1e6), hjust = "inward",
               formula = formula, parse = TRUE)

Note: Automatic positioning using geom_text() and geom_label() is not supported when faceting uses free scales. In this case geom_text_npc() and/or geom_label_npc() must be used.

stat_fit_residuals

I had the need to quickly plot residuals matching fits plotted with geom_smooth() using grouping and facets, so a new (simple) statistic was born.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y, colour = group)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  stat_fit_residuals(formula = formula)

stat_fit_deviations

As I also had the need to highlight residuals in slides and notes to be used in teaching, another statistic was born.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_deviations(formula = formula, colour = "red") +
  geom_point()

The geometry used by default is geom_segment() to which additional aesthetics can be mapped. Here we add arrowheads.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_smooth(method = "lm", formula = formula) +
  geom_point() +
  stat_fit_deviations(formula = formula, colour = "red",
                      arrow = arrow(length = unit(0.015, "npc"), 
                                   ends = "both"))

stat_fit_glance

Package ‘broom’ provides consistently formatted output from different model fitting functions. These makes possible to write model-annotation statistic that are very flexible put that require more effort with the definition of the character strings to map to the label aesthetic.

As we have above given some simple examples, we here exemplify this statistic in a plot with grouping, and assemble a label for the P-value using a string parsed into a expression. We also change the default position of the labels.

# formula <- y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly() correctly!
formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_glance(method = "lm", 
                  method.args = list(formula = formula),
                  label.x = "right",
                  label.y = "bottom",
                  aes(label = paste("italic(P)*\"-value = \"*", 
                                    signif(stat(p.value), digits = 4), sep = "")),
                  parse = TRUE)

It is also possible to fit a non-linear model with method = "nls", and any other model for which a glance() method exists. Do consult the documentation for package ‘broom’. Here we fit the Michaelis-Menten equation to reaction rate versus concentration data.

micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_glance(method = "nls", 
                  method.args = list(formula = micmen.formula),
                  aes(label = paste("AIC = ", signif(stat(AIC), digits = 3), 
                                    ", BIC = ", signif(stat(BIC), digits = 3),
                                    sep = "")),
                  label.x = "centre", label.y = "bottom")

stat_fit_tidy

This stat makes it possible to add the equation for any fitted model for which broom::tidy() is implemented. Alternatively, individual values such as estimates for the fitted parameters, standard errors, or P-values can be added to a plot. However, the user has to explicitly construct the labels within aes(). This statistic respects grouping based on aesthetics, and reshapes the output of tidy() so that the values for a given group are in a single row in the returned data.

micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tidy(method = "nls", 
                method.args = list(formula = micmen.formula),
                label.x = "right",
                label.y = "bottom",
                aes(label = paste("V[m]~`=`~", signif(stat(Vm_estimate), digits = 3),
                                  "%+-%", signif(stat(Vm_se), digits = 2),
                                  "~~~~K~`=`~", signif(stat(K_estimate), digits = 3),
                                  "%+-%", signif(stat(K_se), digits = 2),
                                  sep = "")),
                parse = TRUE)

micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tidy(method = "nls", 
                method.args = list(formula = micmen.formula),
                size = 3,
                label.x = "center",
                label.y = "bottom",
                vstep = 0.18,
                aes(label = paste("V~`=`~frac(", signif(stat(Vm_estimate), digits = 2), "~C,",
                                  signif(stat(K_estimate), digits = 2), "+C)",
                                  sep = "")),
                parse = TRUE) +
  labs(x = "C", y = "V")

stat_fit_tb

This stat makes it possible to add summary or ANOVA tables for any fitted model for which broom::tidy() is implemented. The output from tidy() is embedded as a single list value within the returned data, an object of class tibble. This statistic ignores grouping based on aesthetics. This allows fitting models when x or y is a factor (as in such cases ggplot splits the data into groups, one for each level of the factor, which is needed for example for stat_summary() to work as expected). By default, the "table" geometry is used. The use of geom_table() is described in a separate section of this User Guide. Table themes and mapped aesthetic are supported.

The default output of stat_fit_tb is the default output from tidy(mf) where mf is the fitted model.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.vars = c(Parameter = "term", 
                          Estimate = "estimate", 
                          "s.e." = "std.error", 
                          "italic(t)" = "statistic", 
                          "italic(P)" = "p.value"),
              label.y = "top", label.x = "left",
              parse = TRUE)

When tb.type = "fit.anova" the output returned is that from tidy(anova(mf)) where mf is the fitted model.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.type = "fit.anova",
              tb.vars = c(Effect = "term", 
                          "df",
                          "M.S." = "meansq", 
                          "italic(F)" = "statistic", 
                          "italic(P)" = "p.value"),
               label.y = "top", label.x = "left",
              parse = TRUE)

When tb.type = "fit.coefs" the output returned is that of tidy(mf) after selecting the term and estimate columns.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.type = "fit.coefs",
              label.y = "center", label.x = "left")

Faceting works as expected, but grouping is ignored as mentioned above. In this case, the colour aesthetic is not applied to the text of the tables. Furthermore, if label.x.npc or label.y.npc are passed numeric vectors of length > 1, the values are obeyed by the different panels.

micmen.formula <- y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  facet_wrap(~state) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tb(method = "nls",
              method.args = list(formula = micmen.formula),
              tb.type = "fit.coefs",
              label.x = 0.9,
              label.y = c(0.75, 0.2)) +
  theme(legend.position = "none") +
  labs(x = "C", y = "V")

The data in the example below are split by ggplot into six groups based on the levels of the feed factor. However, as stat_fit_tb() ignores groupings, we can still fit a linear model to all the data in the panel.

ggplot(chickwts, aes(factor(feed), weight)) +
  stat_summary(fun.data = "mean_se") +
  stat_fit_tb(tb.type = "fit.anova",
              label.x = "center",
              label.y = "bottom") +
  expand_limits(y = 0)

We can flip the system of coordinates, if desired.

ggplot(chickwts, aes(factor(feed), weight)) +
  stat_summary(fun.data = "mean_se") +
  stat_fit_tb(tb.type = "fit.anova", label.x = "left", size = 3) +
  scale_x_discrete(expand = expansion(mult = c(0.2, 0.5))) +
  coord_flip()

It is also possible to rotate the table using angle.

ggplot(chickwts, aes(factor(feed), weight)) +
  stat_summary(fun.data = "mean_se") +
  stat_fit_tb(tb.type = "fit.anova",
              angle = 90, size = 3,
              label.x = "right", label.y = "center",
              hjust = 0.5, vjust = 0,
              tb.vars = c(Effect = "term", 
                          "df",
                          "M.S." = "meansq", 
                          "italic(F)" = "statistic", 
                          "italic(P)" = "p.value"),
              parse = TRUE) +
  scale_x_discrete(expand = expansion(mult = c(0.1, 0.35))) +
  expand_limits(y = 0)

stat_fit_augment

Experimental! Use ggplot2::stat_smooth instead of stat_fit_augment if possible.

For a single panel and no grouping, there is little advantage in using this statistic compared to the examples in the documentation of package ‘broom’. With grouping and faceting stat_fit_augment may occasionally be more convenient than ggplot2::stat_smooth because of its flexibility.

# formula <- y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly correctly!
formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_fit_augment(method = "lm",
                   method.args = list(formula = formula))
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
  geom_point() +
  stat_fit_augment(method = "lm", 
                   method.args = list(formula = formula))

We can override the variable returned as y to be any of the variables in the data frame returned by broom::augment while still preserving the original y values.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  stat_fit_augment(method = "lm",
                   method.args = list(formula = formula),
                   geom = "point",
                   y.out = ".resid")

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
  stat_fit_augment(method = "lm",
                   method.args = list(formula = formula),
                   geom = "point",
                   y.out = ".std.resid")

We can use any model fitting method for which augment is implemented.

args <- list(formula = y ~ k * e ^ x,
             start = list(k = 1, e = 2))
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  stat_fit_augment(method = "nls",
                   method.args = args)

args <- list(formula = y ~ k * e ^ x,
             start = list(k = 1, e = 2))
ggplot(mtcars, aes(wt, mpg)) +
  stat_fit_augment(method = "nls",
                   method.args = args,
                   geom = "point",
                   y.out = ".resid")

args <- list(model = y ~ SSlogis(x, Asym, xmid, scal),
             fixed = Asym + xmid + scal ~1,
             random = Asym ~1 | group,
             start = c(Asym = 200, xmid = 725, scal = 350))
ggplot(Orange, aes(age, circumference, colour = Tree)) +
  geom_point() +
  stat_fit_augment(method = "nlme",
                   method.args = args,
                   augment.args = list(data = quote(data)))

stat_dens2d_labels and stat_dens2d_filter

These stats had their origin in an enhancement suggestion for ‘ggrepel’ from Hadley Wickham and discussion with Kamil Slowikowski (ggrepel’s author) and others. In fact the code is based on code Kamil gave during the discussion, but simplified and taking a few further ideas from ggplot::stat_dens2d.

Warning! Which observations are selected by the algorithm used, based on MASS:kde2d, depends strongly on the values of parameters h and n. You may need to alter the defaults by passing explicit arguments to these stats. Beware, though, that what are good values, may depend on individual data sets even if they include the same number of observations. For the selection of observations to work cleanly, the argument for n must create a dense grid. Much larger values of n than in the examples in the documentation of MASS::kde2d and ggplot2::stat_dens2d will be needed in most cases.

Some random data with random labels.

random_string <- function(len = 6) {
paste(sample(letters, len, replace = TRUE), collapse = "")
}

# Make random data.
set.seed(1001)
d <- tibble::tibble(
  x = rnorm(100),
  y = rnorm(100),
  group = rep(c("A", "B"), c(50, 50)),
  lab = replicate(100, { random_string() })
)

stat_dens2d_filter

The stat stat_dens2d_filter filters observations, in other words passes to the geom a subset of the data received as input. The default value for geom is "point".

Using defaults except for the colour aesthetic. Highlight 1/10 of observations from lowest density areas of the plot panel.

ggplot(data = d, aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(colour = "red")

Highlighting 1/4 of the observations by under-plotting with larger black points.

ggplot(data = d, aes(x, y, colour = group)) +
   stat_dens2d_filter(keep.fraction = 0.25,
                      size = 3,
                      colour = "black") +
   geom_point()

A different way of highlighting 1/4 of the observations, using over-plotting with a ‘hollow’ shape. We also shift one group with respect to the other.

ggplot(data = d, aes(x + rep(c(-2,2), rep(50,2)), 
                     y, colour = group)) +
   geom_point() +
   stat_dens2d_filter(shape = 1, size = 3,
                      keep.fraction = 0.25)

Highlight 1/4 of observations from lowest density areas of the plot, with density considered separately for each individual group. In this case grouping is based on the colour aesthetic.

ggplot(data = d, aes(x + rep(c(-2,2), rep(50,2)), 
                     y, colour = group)) +
   geom_point() +
   stat_dens2d_filter_g(shape = 1, size = 3,
                      keep.fraction = 0.25)

Add text labels to 1/10 of the observations. The “text_repel” geom sees only these observations.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() + 
  stat_dens2d_filter(geom = "text_repel")

Add text labels to 1/2 of the observations.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_filter(geom = "text_repel", keep.fraction = 0.5)

stat_dens2d_labels

The stat stat_dens2d_labels replaces the values of the label (aesthetic) variable in data in the high density regions of the panel with the argument passed to label.fill before passing them to the geom. The default value for geom is "text". The default value of label.fill is "" which results in empty labels, while using NA as fill label results in observations being omitted. Using NA as label.fill is not too different from using stat_dens2d_filter as long as the geom used requires a label aesthetic.

Label 1/10 of observations from lowest density areas of the plot panels.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_labels()

Add text labels to 45% of the observations.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_labels(keep.fraction = 0.45)

When using geom "text" we can statically adjust the positioning of labels, but this is rarely enough even when keeping 1/4 of the labels.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_labels(keep.fraction = 0.25,
                     vjust = -0.3)

Using the geoms from package ‘ggrepel’ avoids clashes among labels or on top of data points. This works with versions 0.6.0 and newer of ‘ggrepel’. One example with geom_text_repel follows.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "text_repel", 
                     keep.fraction = 0.45)

With geom_label_repel one needs to use a smaller value for keep.fracton, or a smaller size, as labels use more space on the plot than the test alone.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "label_repel", 
                     keep.fraction = 0.25)

Additional arguments can be used to change the angle and position of the text, but may give unexpected output when labels are long as the repulsion algorithm “sees” always a rectangular bounding box that is not rotated. With short labels or angles that are multiples of 90 degrees, there is no such problem.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
geom_point() +
stat_dens2d_labels(geom = "text_repel",
keep.fraction = 0.25, angle = 90)

Using NA as fill makes the observations with labels set to NA to be skipped completely, possibly leading to text overlapping the points corresponding to unlabelled observations around the boundary of the regions where labels are kept or discarded. We use here alpha so that the overlaps can be seen.

ggplot(data = d, aes(x, y, label = lab, colour = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "label_repel", 
                     keep.fraction = 0.35, 
                     alpha = 0.5,
                     label.fill = NA)

try_tibble

Time series

Several different formats for storing time series data are used in R. Here we use in the examples objects of class ts but several other classes are supported as try.xts() is used internally. The first example is a quarterly series.

class(austres)
## [1] "ts"
austres.df <- try_tibble(austres)
class(austres.df)
## [1] "tbl_df"     "tbl"        "data.frame"
lapply(austres.df, "class")
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $x
## [1] "numeric"
head(austres.df, 4)
## # A tibble: 4 x 2
##   time                     x
##   <dttm>               <dbl>
## 1 1971-04-01 00:00:00 13067.
## 2 1971-07-01 00:00:00 13130.
## 3 1971-10-01 00:00:00 13198.
## 4 1972-01-01 00:00:00 13254.

The next chunk demonstrates that numeric times are expressed as decimal years in the returned data frame.

austres.df <- try_tibble(austres, as.numeric = TRUE)
lapply(austres.df, "class")
## $time
## [1] "numeric"
## 
## $x
## [1] "numeric"
head(austres.df, 4)
## # A tibble: 4 x 2
##    time      x
##   <dbl>  <dbl>
## 1 1971. 13067.
## 2 1972. 13130.
## 3 1972. 13198.
## 4 1972  13254.

This second example is for a series of yearly values.

class(lynx)
## [1] "ts"
lynx.df <- try_tibble(lynx)
class(lynx.df)
## [1] "tbl_df"     "tbl"        "data.frame"
lapply(lynx.df, "class")
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $x
## [1] "numeric"
head(lynx.df, 3)
## # A tibble: 3 x 2
##   time                    x
##   <dttm>              <dbl>
## 1 1820-02-01 00:00:00   269
## 2 1821-02-01 00:00:00   321
## 3 1822-02-01 00:00:00   585

Above there is a small rounding error of 1 s for these old dates. We can correct this by rounding to year.

lynx.df <- try_tibble(lynx, "year")
head(lynx.df, 3)
## # A tibble: 3 x 2
##   time                    x
##   <dttm>              <dbl>
## 1 1821-01-01 00:00:00   269
## 2 1822-01-01 00:00:00   321
## 3 1823-01-01 00:00:00   585

In addition we can convert the POSIXct values into numeric values in calendar years plus a decimal fraction.

lynx_n.df <- try_tibble(lynx, "year", as.numeric = TRUE)
lapply(lynx_n.df, "class")
## $time
## [1] "numeric"
## 
## $x
## [1] "numeric"
head(lynx_n.df, 3)
## # A tibble: 3 x 2
##    time     x
##   <dbl> <dbl>
## 1  1821   269
## 2  1822   321
## 3  1823   585

Other classes

try_tibble() attempts to handle gracefully objects that are not time series.

try_tibble(1:5)
## # A tibble: 5 x 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
try_tibble(letters[1:5])
## # A tibble: 5 x 1
##   x    
##   <chr>
## 1 a    
## 2 b    
## 3 c    
## 4 d    
## 5 e
try_tibble(factor(letters[1:5]))
## # A tibble: 5 x 1
##   x    
##   <fct>
## 1 a    
## 2 b    
## 3 c    
## 4 d    
## 5 e
try_tibble(list(x = rep(1,5), y = 1:5))
## # A tibble: 5 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     1     3
## 4     1     4
## 5     1     5
try_tibble(data.frame(x = rep(1,5), y = 1:5))
## # A tibble: 5 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     1     3
## 4     1     4
## 5     1     5
try_tibble(matrix(1:10, ncol = 2))
## # A tibble: 5 x 2
##      V1    V2
##   <int> <int>
## 1     1     6
## 2     2     7
## 3     3     8
## 4     4     9
## 5     5    10

Volcano plots

Volcano plots are scatter plots with some peculiarities. They are frequently used for transcriptomics and metabolomics. They are used to show P-values or FDR (false discovery rate) as a function of effect size, which can be either an increase or a decrease. Effect sizes are expressed as fold-changes compared to a control or reference condition. Colours are frequently used to highlight significant responses. Counts of significant increases and decreases are frequently also added.

Creating volcano plots from scratch using ‘ggplot2’ can be time consuming, but allows flexibility in design. Rather than providing a ‘canned’ function to produce volcano plots, package ‘ggpmisc’ provides several building blocks that facilitate the production of volcano and quadrant plots: Wrappers on scales with suitable defaults plus helper functions to create factors from numeric outcomes.

Outcomes encoded as -1, 0 or 1, as seen in the tibble below need to be converted into factors with suitable labels for levels. This can be easily achieved with function outcome2factor().

head(volcano_example.df) 
##         tag     gene outcome       logFC     PValue genotype
## 1 AT1G01040     ASU1       0 -0.15284466 0.35266997      Ler
## 2 AT1G01290     ASG4       0 -0.30057068 0.05471732      Ler
## 3 AT1G01560 ATSBT1.1       0 -0.57783350 0.06681310      Ler
## 4 AT1G01790   AtSAM1       0 -0.04729662 0.74054263      Ler
## 5 AT1G02130  AtTRM82       0 -0.14279891 0.29597519      Ler
## 6 AT1G02560    PRP39       0  0.23320752 0.07487043      Ler

Most frequently fold-change data is available log-transformed, using either 2 or 10 as base. In general it is more informative to use tick labels expressed as un-transformed fold-change.

By default scale_x_logFC() and scale_y_logFC() expect the logFC data log2-transformed, but this can be overridden. The default use of untransformed fold-change for tick labels can also be overridden. Scale scale_x_logFC() in addition by default expands the scale limits to make them symmetric around zero. If %unit is included in the name character string, suitable units are appended as shown in the example below.

Scales scale_y_Pvalue() and scale_x_Pvalue() have suitable defaults for name and labels, while scale_colour_outcome() provides suitable defaults for the colours mapped to the outcomes. To change the labels and title of the key or guide pass suitable arguments to parameters name and labels of these scales. These x and y scales by default squish off-limits (out-of-bounds) observations towards the limits.

volcano_example.df %>%
  mutate(., outcome.fct = outcome2factor(outcome)) %>%
  ggplot(., aes(logFC, PValue, colour = outcome.fct)) +
  geom_point() +
  scale_x_logFC(name = "Transcript abundance%unit") +
  scale_y_Pvalue() +
  scale_colour_outcome() +
  stat_quadrant_counts(data = . %>% filter(outcome != 0))

By default outcome2factor() creates a factor with three levels as in the example above, but this default can be overridden as shown below.

volcano_example.df %>%
  mutate(., outcome.fct = outcome2factor(outcome, n.levels = 2)) %>%
  ggplot(., aes(logFC, PValue, colour = outcome.fct)) +
  geom_point() +
  scale_x_logFC(name = "Transcript abundance%unit") +
  scale_y_Pvalue() +
  scale_colour_outcome() +
  stat_quadrant_counts(data = . %>% filter(outcome != 0))

Quadrant plots

Quadrant plots are scatter plots with comparable variables on both axes and usually include the four quadrants. When used for transcriptomics and metabolomics, they are used to compare responses expressed as fold-change to two different conditions or treatments. They are useful when many responses are measured as in transcriptomics (many different genes) or metabolomics (many different metabolites).

A single panel quadrant plot is as easy to produce as a volcano plot using scales scale_x_logFC() and scale_y_logFC(). The data includes two different outcomes and two different log fold-change variables. See the previous section on volcano plots for details. In this example we use as shape a filled circle and map one of the outcomes to colour and the other to fill, using the two matched scales scale_colour_outcome() and scale_fill_outcome().

head(quadrant_example.df)
##         tag          gene outcome.x outcome.y    logFC.x     logFC.y genotype
## 1 AT5G12290         AtMC9         0         0 -0.3226849  0.32887081      Ler
## 2 AT5G47040        ATFRO2         0         0 -0.1067945 -0.18828653      Ler
## 3 AT5G57560 14-3-3EPSILON         0         0  0.2232457  0.74447768      Ler
## 4 AT5G24110       ATPTEN1         0         0 -0.7253487  0.06952586      Ler
## 5 AT2G41290         RHF2A         0         0  0.4435049 -0.32347741      Ler
## 6 AT4G36650          GAE5         0         0 -0.1410215  0.18697968      Ler

In this plot we do not include those genes whose change in transcript abundance is uncertain under both x and y conditions.

quadrant_example.df %>%
  mutate(.,
         outcome.x.fct = outcome2factor(outcome.x),
         outcome.y.fct = outcome2factor(outcome.y),
         outcome.xy.fct = xy_outcomes2factor(outcome.x, outcome.y)) %>%
         filter(outcome.xy.fct != "none") %>%
  ggplot(., aes(logFC.x, logFC.y, colour = outcome.x.fct, fill = outcome.y.fct)) +
  geom_quadrant_lines(linetype = "dotted") +
  stat_quadrant_counts(size = 3, colour = "white") +
  geom_point(shape = "circle filled") +
  scale_x_logFC(name = "Transcript abundance for x%unit") +
  scale_y_logFC(name = "Transcript abundance for y%unit") +
  scale_colour_outcome() +
  scale_fill_outcome() +
  theme_dark()

To plot in separate panels those observations that are significant along both x and y axes, x axis, y axis, or none, with quadrants merged takes more effort. We first define two helper functions to add counts and quadrant lines to each of the four panels.

all_quadrant_counts <- function(...) {
  list(  
    stat_quadrant_counts(data = . %>% filter(outcome.xy.fct == "xy"), ...),
    stat_quadrant_counts(data = . %>% filter(outcome.xy.fct == "x"), pool.along = "y", ...),
    stat_quadrant_counts(data = . %>% filter(outcome.xy.fct == "y"), pool.along = "x", ...),
    stat_quadrant_counts(data = . %>% filter(outcome.xy.fct == "none"), quadrants = 0L, ...)
  )
}
all_quadrant_lines <- function(...) { 
  list(
    geom_hline(data =  data.frame(outcome.xy.fct = factor(c("xy", "x", "y", "none"),
                                                          levels = c("xy", "x", "y", "none")),
                                  yintercept = c(0, NA, 0, NA)),
               aes_(yintercept = ~yintercept),
               na.rm = TRUE,
               ...),
    geom_vline(data =  data.frame(outcome.xy.fct = factor(c("xy", "x", "y", "none"),
                                                          levels = c("xy", "x", "y", "none")),
                                  xintercept = c(0, 0, NA, NA)),
               aes_(xintercept = ~xintercept),
               na.rm = TRUE,
               ...)
  )
}

And use these functions to build the final plot, in this case including all genes.

quadrant_example.df %>%
  mutate(.,
         outcome.x.fct = outcome2factor(outcome.x),
         outcome.y.fct = outcome2factor(outcome.y),
         outcome.xy.fct = xy_outcomes2factor(outcome.x, outcome.y)) %>%
  ggplot(., aes(logFC.x, logFC.y, colour = outcome.x.fct, fill = outcome.y.fct)) +
  geom_point(shape = 21) +
  all_quadrant_lines(linetype = "dotted") +
  all_quadrant_counts(size = 3, colour = "white") +
  scale_x_logFC(name = "Transcript abundance for x%unit") +
  scale_y_logFC(name = "Transcript abundance for y%unit") +
  scale_colour_outcome() +
  scale_fill_outcome() +
  facet_wrap(~outcome.xy.fct) +
  theme_dark()

Appendix: Additional examples of the use density filtering

We define a function to simplify the generation of random data sets.

make_data_tbl <- function(nrow = 100, rfun = rnorm, ...) {
  if (nrow %% 2) {
    nrow <- nrow + 1
  }
  
  set.seed(1001)
  
  tibble::tibble(
    x = rfun(nrow, ...),
    y = rfun(nrow, ...),
    group = rep(c("A", "B"), c(nrow / 2, nrow / 2))
  )
}

As we will draw many points on the plotting area we change the default theme to an uncluttered one.

old_theme <- theme_set(theme_bw())

In all the examples in this vignette we use colours to demonstrate which data points are selected, but any other suitable aesthetic and discrete scale can be used instead. With keep.sparse = FALSE we keep 1/3 of the observations in the denser region of the plot. Although here we first plot all data points and later overplot the selected ones this is not necessary.

ggplot(data = make_data_tbl(300), aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(colour = "red", 
                     keep.sparse = FALSE, 
                     keep.fraction = 1/3)

Here we highlight the observations split in three group equal groups, each of a different density of observations.

ggplot(data = make_data_tbl(300), aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(colour = "red", 
                     keep.sparse = FALSE, 
                     keep.fraction = 1/3)+
  stat_dens2d_filter(colour = "blue", 
                     keep.fraction = 1/3)

The algorithm seems to work well also with other distributions, in this example the uniform distribution.

ggplot(data = make_data_tbl(300, rfun = runif), aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(colour = "red", keep.fraction = 1/2)

One example with the gamma distribution, which is asymmetric.

ggplot(data = make_data_tbl(300, rfun = rgamma, shape = 2), 
       aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(colour = "red", keep.fraction = 1/3)