User Guide: 2 Manipulating ggplots

‘gginnards’ 0.0.3

Pedro J. Aphalo

2019-11-26

Introduction

The functions described here are not expected to be useful in everyday plotting because, when using the grammar of graphics, one can simply change the order in which layers are added to a ggplot, or remove unused variables from the data before passing them as an argument to the ggplot() constructor.

However, if one uses high level methods like autoplot() or other functions that automatically produce a full plot using ‘ggplot2’ internally, one may need to add, move or delete layers so as to profit from such canned methods and retain enough flexibility.

Some time ago I needed to manipulate the layers of a ggplot, and found a matching question on Stack Overflow. I used the answers found there as the starting point for writing the functions described in the first part of this vignette.

In a ggplot object, layers reside in a list, and their positions in the list determine the plotting order when generating the graphical output. The grammar of graphics treats the list of layers as a stack using only push operations. In other words, the most recently added layer always resides at the end of the list and, during rendering, is plotted over all layers added before it. The functions described in this vignette allow overriding the normal syntax at the cost of breaking the expectations of the grammar. These functions are, as stated above, to be used only in exceptional cases. This notwithstanding, they are rather easy to use and the user interface is consistent across all of them. Moreover, they are designed to return objects that are identical to objects created using the normal syntax rules of the grammar of graphics. The table below lists the names and purpose of these functions.

Function           Use
delete_layers()    delete one or more layers
append_layers()    append layers at a specific position
move_layers()      move layers to an absolute position
shift_layers()     move layers to a relative position
which_layers()     obtain the index positions of layers
extract_layers()   extract matched or indexed layers
num_layers()       obtain number of layers
top_layer()        obtain position of top layer
bottom_layer()     obtain position of bottom layer

Although their definitions do not rely on code internal to ‘ggplot2’, they rely on the internal structure of objects belonging to class gg and ggplot. Consequently, long-term backwards and forward compatibility cannot be guaranteed, or even expected.
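As a minimal sketch of the stack-like behaviour described above, using only ‘ggplot2’ and a small made-up data frame (df, p0, p1 and p2 are names introduced here only for illustration), one can watch the list of layers grow as layers are added with +.

```r
library(ggplot2)

# made-up data, used only for this illustration
df <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))

p0 <- ggplot(df, aes(x, y))  # no layers yet
p1 <- p0 + geom_line()       # first layer pushed
p2 <- p1 + geom_point()      # second layer pushed on top

length(p0$layers)
## [1] 0
length(p2$layers)
## [1] 2
# the most recently added layer sits at the end of the list
inherits(p2$layers[[2]]$geom, "GeomPoint")
## [1] TRUE
```

Because the list is only ever extended at its end, changing the stacking order after the fact requires editing the list itself, which is what the functions in the table do.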

Preliminaries

library(ggplot2)
library(gginnards)
library(tibble)
library(magrittr)
library(stringr)

We generate some artificial data and create a data frame with them.

set.seed(4321)
# generate artificial data
my.data <- data.frame(
  group = factor(rep(letters[1:4], each = 30)),
  panel = factor(rep(LETTERS[1:2], each = 60)),
  y = rnorm(40), # 40 values, recycled to fill the 120 rows
  unused = "garbage"
)

We add attributes to the data frame with the fake data.

attr(my.data, "my.atr.char") <- "my.atr.value"
attr(my.data, "my.atr.num") <- 12345678

We change the default theme to an uncluttered one.

old_theme <- theme_set(theme_bw())

We generate a plot to be used later to demonstrate the use of the functions.

p <- ggplot(my.data, aes(group, y)) + 
  geom_point() +
  stat_summary(fun.data = mean_se, colour = "cornflowerblue", size = 1.3) +
  facet_wrap(~panel, scales = "free_x", labeller = label_both)
p

Exploring how ggplots are stored

To display summary textual information about a gg object we use method summary() from package ‘ggplot2’, while methods print() and plot() will display the actual plot.

summary(p)
## data: group, panel, y, unused [120x4]
## mapping:  x = ~group, y = ~y
## faceting: <ggproto object: Class FacetWrap, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetWrap, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity 
## 
## geom_pointrange: na.rm = FALSE
## stat_summary: fun.data = function (x, mult = 1) 
## {
##     x <- stats::na.omit(x)
##     se <- mult * sqrt(stats::var(x)/length(x))
##     mean <- mean(x)
##     new_data_frame(list(y = mean, ymin = mean - se, ymax = mean + se), n = 1)
## }, fun.y = NULL, fun.ymax = NULL, fun.ymin = NULL, fun.args = list(), na.rm = FALSE
## position_identity

Layers in a ggplot object are stored in a list as nameless members. This means that they have to be accessed using numerical indexes, and that we need to use some indirect way of finding the indexes corresponding to the layers of interest.

names(p$layers)
## NULL
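One indirect way of finding such indexes, sketched here with base R and not necessarily how ‘gginnards’ implements it, is to test the class of the geometry stored in each layer. The plot p.demo and its data are made up for this illustration.

```r
library(ggplot2)

p.demo <- ggplot(data.frame(x = 1:3, y = c(1, 3, 2)), aes(x, y)) +
  geom_line() +
  geom_point()

# TRUE for each layer whose geometry is, or inherits from, GeomPoint
is.point <- vapply(p.demo$layers,
                   function(layer) inherits(layer$geom, "GeomPoint"),
                   logical(1))
which(is.point)
## [1] 2
```

The same test could be applied to layer$stat or layer$position to select layers by statistic or position class.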

The output of summary() is compact.

summary(p$layers)
##      Length Class         Mode       
## [1,] 11     LayerInstance environment
## [2,] 11     LayerInstance environment

The default print() method for a list of layers displays only a small part of the information in a layer.

print(p$layers)
## [[1]]
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity 
## 
## [[2]]
## geom_pointrange: na.rm = FALSE
## stat_summary: fun.data = function (x, mult = 1) 
## {
##     x <- stats::na.omit(x)
##     se <- mult * sqrt(stats::var(x)/length(x))
##     mean <- mean(x)
##     new_data_frame(list(y = mean, ymin = mean - se, ymax = mean + se), n = 1)
## }, fun.y = NULL, fun.ymax = NULL, fun.ymin = NULL, fun.args = list(), na.rm = FALSE
## position_identity

To see all the fields, we need to use str(), which we use here for a single layer.

str(p$layers[[1]])
## Classes 'LayerInstance', 'Layer', 'ggproto', 'gg' <ggproto object: Class LayerInstance, Layer, gg>
##     aes_params: list
##     compute_aesthetics: function
##     compute_geom_1: function
##     compute_geom_2: function
##     compute_position: function
##     compute_statistic: function
##     data: waiver
##     draw_geom: function
##     finish_statistics: function
##     geom: <ggproto object: Class GeomPoint, Geom, gg>
##         aesthetics: function
##         default_aes: uneval
##         draw_group: function
##         draw_key: function
##         draw_layer: function
##         draw_panel: function
##         extra_params: na.rm
##         handle_na: function
##         non_missing_aes: size shape colour
##         optional_aes: 
##         parameters: function
##         required_aes: x y
##         setup_data: function
##         use_defaults: function
##         super:  <ggproto object: Class Geom, gg>
##     geom_params: list
##     inherit.aes: TRUE
##     layer_data: function
##     map_statistic: function
##     mapping: NULL
##     position: <ggproto object: Class PositionIdentity, Position, gg>
##         compute_layer: function
##         compute_panel: function
##         required_aes: 
##         setup_data: function
##         setup_params: function
##         super:  <ggproto object: Class Position, gg>
##     print: function
##     setup_layer: function
##     show.legend: NA
##     stat: <ggproto object: Class StatIdentity, Stat, gg>
##         aesthetics: function
##         compute_group: function
##         compute_layer: function
##         compute_panel: function
##         default_aes: uneval
##         extra_params: na.rm
##         finish_layer: function
##         non_missing_aes: 
##         parameters: function
##         required_aes: 
##         retransform: TRUE
##         setup_data: function
##         setup_params: function
##         super:  <ggproto object: Class Stat, gg>
##     stat_params: list
##     super:  <ggproto object: Class Layer, gg>

Manipulation of plot layers

We start with which_layers(), as it simply returns a vector of indexes into the list of layers. The last statement is useless here, but demonstrates selection by positional index which, like matching to a class, is how layers are selected in all the functions described in this document. We can see that each layer, as described in the first volume of this User Guide, contains one geometry and one statistic.

which_layers(p, "GeomPoint")
## [1] 1
which_layers(p, "StatIdentity")
## [1] 1
which_layers(p, "GeomPointrange")
## [1] 2
which_layers(p, "StatSummary")
## [1] 2
which_layers(p, idx = 1L)
## [1] 1

We can also easily extract matching layers with extract_layers(). Here one layer is returned, and displayed using the default print() method. Method str() can also be used as shown above.

extract_layers(p, "GeomPoint")
## [[1]]
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

With delete_layers() we can remove layers from a plot, selecting them using the match to a class, as shown here, or by a positional index as shown next.

delete_layers(p, "GeomPoint")

delete_layers(p, idx = 1L)

delete_layers(p, "StatSummary")
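The calls above return a new plot and leave p unchanged, so the result must be assigned if it is to be reused. A small self-contained check, with a made-up plot p.demo standing in for p:

```r
library(ggplot2)
library(gginnards)

p.demo <- ggplot(data.frame(x = 1:3, y = c(1, 3, 2)), aes(x, y)) +
  geom_line() +
  geom_point()

# keep the returned plot under a new name
p.lines <- delete_layers(p.demo, "GeomPoint")

num_layers(p.lines) # layers left in the returned plot
num_layers(p.demo)  # layers in the original plot
```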

With move_layers() we can alter the stacking order of layers. The layers to move are selected in the same way as in the examples above, while position gives where to move the layers to. Two character strings, "top" and "bottom", are accepted as position argument, as well as integers. In the latter case, the layers are appended after the supplied position, counted with reference to the list of layers not being moved.

move_layers(p, "GeomPoint", position = "top")

The equivalent operation can be done using a relative position. A positive value for shift is interpreted as an upward displacement and a negative one as a downward displacement.

shift_layers(p, "GeomPoint", shift = +1)
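That the two calls are equivalent for a two-layer plot can be checked by comparing where the moved layer ends up. This sketch uses a made-up plot p.demo, and calls only functions already introduced in this guide.

```r
library(ggplot2)
library(gginnards)

# points are added first, so they start at the bottom of the stack
p.demo <- ggplot(data.frame(x = 1:3, y = c(1, 3, 2)), aes(x, y)) +
  geom_point() +
  geom_line()

p.moved   <- move_layers(p.demo, "GeomPoint", position = "top")
p.shifted <- shift_layers(p.demo, "GeomPoint", shift = +1)

# position of the layer of points in each of the returned plots
which_layers(p.moved, "GeomPoint")
which_layers(p.shifted, "GeomPoint")
```

With more than two layers the two calls differ: an absolute move to "top" always ends at the top, while a shift of +1 moves the layer up by a single position.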

Here we show how to add a layer behind all other layers.

append_layers(p, geom_line(colour = "orange", size = 1), position = "bottom")

It is also possible to append the new layer immediately above an arbitrary existing layer using a numeric index, which, as shown here, can also be obtained by matching to a class name. In this example we insert a new layer in-between two layers already present in the plot. As with the + operator of the Grammar of Graphics, parameter object also accepts a list of layers as argument (no example shown).

append_layers(p, object = geom_line(colour = "orange", size = 1), 
              position = which_layers(p, "GeomPoint"))
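The list form mentioned above could look as follows. This is a sketch with a made-up plot p.demo; it only assumes, as stated in the text, that object accepts a list of layers.

```r
library(ggplot2)
library(gginnards)

p.demo <- ggplot(data.frame(x = 1:3, y = c(1, 3, 2)), aes(x, y)) +
  geom_point()

# two new layers passed together as a list
p.more <- append_layers(p.demo,
                        object = list(geom_line(colour = "orange"),
                                      geom_area(alpha = 0.2)),
                        position = "bottom")

num_layers(p.more)
# where the original layer of points ended up
which_layers(p.more, "GeomPoint")
```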

Annotations add layers, so they can be manipulated in the same way as other layers.

p1 <- p + 
  annotate("text", label = "text label", x = 1.1, y = 0, hjust = 0)
p1

delete_layers(p1, "GeomText")

Replacing scales, coordinates, whole themes and data

Elements that are normally added to a ggplot with operator +, such as scales, themes and aesthetics, can be replaced with the %+% operator. The situation with layers is different, as a plot may contain multiple layers and layers are nameless. With layers, %+% is not a replacement operator.

num_layers(p)
## [1] 2
num_layers(p %+% geom_point(colour = "blue"))
## [1] 3
num_layers(p + geom_point(colour = "blue"))
## [1] 3
p1 <- p + theme_bw()
p1
p1 + theme_void()
p1 %+% theme_void()
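Replacing the default data, mentioned in the heading of this section, follows the same pattern: a data frame on the right-hand side of %+% takes the place of the data stored in the plot, while the layers are kept. A minimal sketch with made-up data frames df1 and df2, assuming a ‘ggplot2’ version contemporary with this guide:

```r
library(ggplot2)

# two made-up data frames with the same variables
df1 <- data.frame(x = 1:3, y = c(1, 4, 9))
df2 <- data.frame(x = 1:4, y = c(2, 3, 5, 7))

p.a <- ggplot(df1, aes(x, y)) + geom_point()
p.b <- p.a %+% df2  # same layers and mapping, new default data

nrow(p.a$data)
## [1] 3
nrow(p.b$data)
## [1] 4
length(p.b$layers)
## [1] 1
```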

Editing theme elements

Method summary() is available for themes.

summary(theme_bw())

However, to see the actual values stored, we need to use str(). To avoid excessive output we first find the names of the elements of the theme and then look at how the default text settings are stored.

names(theme_bw())
str(theme_bw()$text)

Themes can be modified using theme(). See the ‘ggplot2’ documentation for details.

Removing unused data

The argument passed through data to ggplot() or a layer is stored in whole in the ggplot object, even the data columns not mapped to any aesthetic. In most cases this does not matter, but in the case of huge datasets, the use of RAM and disk space can add up, and occasionally printing of each plot can slow down. The reason for storing the whole data set is that it is always possible to add layers with the grammar of graphics to an existing plot and consequently only the user can know which variables can be removed or not.

One obvious way of not storing unused data in ggplot objects is for the user to select the required variables and pass only these to the ggplot() constructor or layers. A less efficient alternative, but possibly easier to use for some users, is for users to drop the unused variables when they consider that a plot is ready. We show here how to do this, with a function that started as a self-imposed exercise.

To simplify the embedded data objects we need to find which variables are mapped to aesthetics and which are not. Here is a naive attempt at handling mappings to expressions involving computations and multiple variables per mapping, as well as facets. It is naive in that it ignores mappings within layers and finds faceting variables only where the code below looks for them, in p$facet$params$facets.

mapped.vars <- 
  gsub("[~*\\%^]", " ", as.character(p$mapping)) %>%
  str_split(boundary("word")) %>%
  unlist() %>%
  c(names(p$facet$params$facets))
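An alternative to text processing, sketched here as an illustration rather than as the approach used by the package, is to let R parse the mapped expressions and extract the variable names with all.vars(). The sketch assumes that each element of the mapping is stored as a quosure, as in ‘ggplot2’ (>= 3.0.0), and uses rlang::quo_get_expr() to obtain the expression. The data frame df and plot p.map are made up, and like the code above the sketch looks only at the plot-level mapping.

```r
library(ggplot2)

# made-up data with one column that is never mapped
df <- data.frame(a = 1:4, b = c(2, 4, 6, 8),
                 g = c("u", "u", "v", "v"), unused = 0)

p.map <- ggplot(df, aes(x = a, y = b / a, colour = g)) +
  geom_point()

# names of all symbols used in the plot-level mapping
mapped <- unique(unlist(
  lapply(p.map$mapping, function(q) all.vars(rlang::quo_get_expr(q)))
))

# columns of the data that the mapping never refers to
setdiff(names(p.map$data), mapped)
## [1] "unused"
```

Note that all.vars() returns every symbol in an expression, including names that are not columns of the data, so the result can contain extra names; for deciding which columns to keep this errs on the safe side.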

We need also to find which variables are present in the data.

data.vars <- names(p$data)

Next we identify which variables in data are not used, and delete them.

unused.vars <- setdiff(data.vars, mapped.vars)
keep.idxs <- which(!data.vars %in% unused.vars)
p1 <- p
# drop = FALSE keeps a data frame even if a single column remains
p1$data <- p$data[ , keep.idxs, drop = FALSE]

For a data set this small, removing a single column saves very little space.

object.size(my.data)
## 5488 bytes
object.size(p)
## 11784 bytes
object.size(p1)
## 10352 bytes
names(my.data)
## [1] "group"  "panel"  "y"      "unused"
names(p$data)
## [1] "group"  "panel"  "y"      "unused"
names(p1$data)
## [1] "group" "panel" "y"

The plot has not changed.

p1

We can assemble all the code into a function for convenience, and expand the code to also recognize mappings within layers and variables used in faceting. Such a function, only cursorily tested, is included in the package as drop_vars(). Given its design, the most likely failure mode is keeping too many variables rather than removing too many.

drop_vars(p)

When saving ggplot objects to disk, not carrying along unused data can be beneficial. Of course, removing unused data means that they will not be available at a later time if we want to add more layers to the same saved ggplot object.

It was not clear to me when R makes a copy of the data embedded in a ggplot object and when it does not. R’s policy is to copy data objects lazily, that is, only when they are modified. Does the ‘ggplot2’ code modify the argument passed to its data parameter, triggering a real copy operation, or not? We can check this with the help of package ‘pryr’.

pryr::address(my.data)
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
## [1] "0x1c441fb0"
z <- p$data
pryr::address(z)
## [1] "0x1c441fb0"

In this case, R has not created a copy. So, from the point of view of total memory usage, deleting the unused columns in p is not always beneficial. If the object is saved to disk, or if my.data is modified in any way after p was created, a copy of my.data will be created at that later time. In this simple example we modify the value of an attribute.

attr(my.data, "my.atr.num") <- 1324567
pryr::address(z)
## [1] "0x1c441fb0"
pryr::address(my.data)
## [1] "0x1cccf460"
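A consequence that can be checked without ‘pryr’: because of copy-on-modify semantics, changes made to the original data frame after the plot was created do not propagate to the data embedded in the plot. A minimal sketch with a made-up data frame df and plot p.demo:

```r
library(ggplot2)

df <- data.frame(x = 1:3, y = c(2, 1, 3))
p.demo <- ggplot(df, aes(x, y)) + geom_point()

# modify the original only after the plot object exists
attr(df, "note") <- "added later"
df$y <- df$y * 10

# the data embedded in the plot still hold the earlier state
attr(p.demo$data, "note")
## NULL
p.demo$data$y
## [1] 2 1 3
```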

Attributes of the embedded data object

‘ggplot2’ version 3.1.0 and later preserves most attributes of the object passed as argument to the data parameter of the ggplot() constructor. The class of the object seems to be modified if it is derived from data frame or tibble, but other attributes are retained in the copy stored in the gg object.

data_attributes(p)
## $names
## [1] "group"  "panel"  "y"      "unused"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120
## 
## $my.atr.char
## [1] "my.atr.value"
## 
## $my.atr.num
## [1] 12345678

Another interesting question is whether these user attributes are copied when data are passed to geometries and statistics. We can find out with geom_debug() that they are not.

p + geom_debug(summary.fun = attributes)
## $names
## [1] "x"     "y"     "PANEL" "group"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60
## 
## $class
## [1] "data.frame"
## 
## $names
## [1] "x"     "y"     "PANEL" "group"
## 
## $row.names
##  [1]  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
## [20]  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98
## [39]  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
## [58] 118 119 120
## 
## $class
## [1] "data.frame"

Coda

There are many other things that we could explore about ggplot objects, but a package to be submitted to CRAN cannot have too many pages of documentation, so we hope this package and its documentation can serve as a starting point for further exploration.