{slider} is implemented with a new convention that began in {vctrs}, treating a data frame as a vector of rows. This makes slide() a row-wise iterator over a data frame, which can be useful in solving some long-standing pain points in the tidyverse.
The point of this vignette is to go through a few examples of a row-oriented workflow. The examples are adapted from Jenny Bryan’s talk on row-oriented workflows with purrr, to show how this workflow is improved with slide().
Let’s first explore using slide() as a row-wise iterator in general. We’ll start with this simple data frame.
example <- tibble(
  x = 1:4,
  y = letters[1:4]
)
example
#> # A tibble: 4 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
If we were to pass the x column to slide(), it would iterate over that column using the window specified by .before, .after, and .complete. With the defaults, the result is similar to that of purrr::map().
slide(example$x, ~.x)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
#>
#> [[4]]
#> [1] 4
slide(example$x, ~.x, .before = 2)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1 2
#>
#> [[3]]
#> [1] 1 2 3
#>
#> [[4]]
#> [1] 2 3 4
When applied to the entire example data frame, map() treats it as a list and iterates over the columns. slide(), on the other hand, iterates over rows. This is consistent with the vctrs idea of size, which is the length of an atomic vector, but the number of rows of a data frame or matrix. slide() always returns an object with the same size as its input. Because the number of rows in example is 4, the output size is 4.
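The difference between length and size can be checked directly. The following is a minimal sketch, assuming the slider, tibble, vctrs, and purrr packages are installed; the variable names are only illustrative.

```r
library(slider)
library(tibble)

# Same toy data frame as in the text
example <- tibble(
  x = 1:4,
  y = letters[1:4]
)

# Base R sees a data frame as a list of columns ...
n_columns <- length(example)

# ... while vctrs measures its size as the number of rows
n_rows <- vctrs::vec_size(example)

# purrr::map() follows the list view, slide() follows the vctrs view
map_result <- purrr::map(example, ~.x)
slide_result <- slide(example, ~.x)

length(map_result) == n_columns
length(slide_result) == n_rows
```

Both comparisons should come out TRUE: map() produces one element per column, while slide() produces one element per row.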
slide(example, ~.x)
#> [[1]]
#> # A tibble: 1 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#>
#> [[2]]
#> # A tibble: 1 x 2
#> x y
#> <int> <chr>
#> 1 2 b
#>
#> [[3]]
#> # A tibble: 1 x 2
#> x y
#> <int> <chr>
#> 1 3 c
#>
#> [[4]]
#> # A tibble: 1 x 2
#> x y
#> <int> <chr>
#> 1 4 d
You can still use the other arguments to slide() to control the window size.
# Current row + 2 before
slide(example, ~.x, .before = 2)
#> [[1]]
#> # A tibble: 1 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#>
#> [[2]]
#> # A tibble: 2 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#>
#> [[3]]
#> # A tibble: 3 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#>
#> [[4]]
#> # A tibble: 3 x 2
#> x y
#> <int> <chr>
#> 1 2 b
#> 2 3 c
#> 3 4 d
# Center aligned, with no partial results
slide(example, ~.x, .before = 1, .after = 1, .complete = TRUE)
#> [[1]]
#> NULL
#>
#> [[2]]
#> # A tibble: 3 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#>
#> [[3]]
#> # A tibble: 3 x 2
#> x y
#> <int> <chr>
#> 1 2 b
#> 2 3 c
#> 3 4 d
#>
#> [[4]]
#> NULL
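A compact way to see what .complete does is to count the rows in each window rather than printing them. This is only a sketch, assuming slider and tibble are installed; it uses slide_int(), the integer-returning variant of slide(), on the same example data.

```r
library(slider)
library(tibble)

example <- tibble(x = 1:4, y = letters[1:4])

# Number of rows in each centered window; partial windows are allowed
partial_sizes <- slide_int(example, nrow, .before = 1, .after = 1)
partial_sizes
#> [1] 2 3 3 2

# Same windows, but incomplete ones are not evaluated and come back as NA
complete_sizes <- slide_int(
  example, nrow,
  .before = 1, .after = 1, .complete = TRUE
)
complete_sizes
#> [1] NA  3  3 NA
```

The typed variants return a missing value where slide() returns NULL, which is why the first and last positions are NA here.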
Often, using slide() with its defaults will be enough, as it’s common to iterate over just one row at a time.
A nice use of a tibble is as a structured way to store parameter combinations. For example, we could store multiple rows of parameter combinations where each row could be supplied to runif() to generate different types of uniform random variables.
parameters <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)
parameters
#> # A tibble: 3 x 3
#> n min max
#> <int> <dbl> <dbl>
#> 1 1 0 1
#> 2 2 10 100
#> 3 3 100 1000
With slide() you can pass these parameters on to runif() by iterating over parameters row-wise. This gives you access to .x inside the function, which is a data frame containing the current row. Because it is a data frame, you can access each column by name. Notice that the columns of the data frame are not required to have the same names as the arguments of runif().
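The call described above might look like the following. This is a sketch rather than a canonical form: the seed is arbitrary and only there to make the draws reproducible, and `draws` is just an illustrative name.

```r
library(slider)
library(tibble)

parameters <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)

# Seed chosen arbitrarily so that the draws are reproducible
set.seed(123)

# .x is a one-row data frame, so its columns are accessed by name
draws <- slide(parameters, ~runif(.x$n, .x$min, .x$max))

# One list element per row, holding 1, 2, and 3 draws respectively
lengths(draws)
```

Because the columns are looked up by name inside the function, the tibble could just as well use names such as `lower` and `upper`, as long as the lambda refers to them accordingly.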
For these examples, we will consider a company data set containing the day a sale was made, the number of calls, n_calls, that were placed on that day, and the number of sales that resulted from those calls.
company <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = sample(100, 10),
  n_calls = sales + sample(1000, 10)
)
company
#> # A tibble: 10 x 3
#> day sales n_calls
#> <dbl> <int> <int>
#> 1 1 50 1039
#> 2 1 43 398
#> 3 1 14 854
#> 4 1 25 51
#> 5 1 90 609
#> 6 2 91 517
#> 7 2 69 718
#> 8 2 95 861
#> 9 2 57 268
#> 10 2 9 941
When slide()-ing inside of a mutate() call, there are a few scenarios that can arise. First, you might want to slide over a single column. This is easy enough in both the ungrouped and grouped cases.
company %>%
  mutate(sales_roll = slide_dbl(sales, mean, .before = 2, .complete = TRUE))
#> # A tibble: 10 x 4
#> day sales n_calls sales_roll
#> <dbl> <int> <int> <dbl>
#> 1 1 50 1039 NA
#> 2 1 43 398 NA
#> 3 1 14 854 35.7
#> 4 1 25 51 27.3
#> 5 1 90 609 43
#> 6 2 91 517 68.7
#> 7 2 69 718 83.3
#> 8 2 95 861 85
#> 9 2 57 268 73.7
#> 10 2 9 941 53.7
company %>%
  group_by(day) %>%
  mutate(sales_roll = slide_dbl(sales, mean, .before = 2, .complete = TRUE))
#> # A tibble: 10 x 4
#> # Groups: day [2]
#> day sales n_calls sales_roll
#> <dbl> <int> <int> <dbl>
#> 1 1 50 1039 NA
#> 2 1 43 398 NA
#> 3 1 14 854 35.7
#> 4 1 25 51 27.3
#> 5 1 90 609 43
#> 6 2 91 517 NA
#> 7 2 69 718 NA
#> 8 2 95 861 85
#> 9 2 57 268 73.7
#> 10 2 9 941 53.7
If the function you want to apply when sliding takes a data frame as input, things get more complicated. One way to accomplish this is to take advantage of the . placeholder that the magrittr pipe, %>%, makes available. As an example, imagine you want to perform a rolling regression with sales as your outcome and n_calls as a predictor.
company %>%
  mutate(
    regressions = slide(
      .x = .,
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> # A tibble: 10 x 4
#> day sales n_calls regressions
#> <dbl> <int> <int> <list>
#> 1 1 50 1039 <NULL>
#> 2 1 43 398 <NULL>
#> 3 1 14 854 <lm>
#> 4 1 25 51 <lm>
#> 5 1 90 609 <lm>
#> 6 2 91 517 <lm>
#> 7 2 69 718 <lm>
#> 8 2 95 861 <lm>
#> 9 2 57 268 <lm>
#> 10 2 9 941 <lm>
But here be dragons! The . you have access to has two problems.
First, if you try to add columns in the mutate() and expect them to be available in ., you will be disappointed, because . is the data frame as it was piped in, before mutate() added anything to it. In the example below, lm() couldn’t find log_n_calls in the slice of . available through .x, so it looked in the surrounding environment, found the entire length-10 log_n_calls vector that we created, and tried to pass that through to the regression.
company %>%
  mutate(
    log_n_calls = log(n_calls),
    regressions = slide(
      .x = .,
      .f = ~lm(sales ~ log_n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> Error: Problem with `mutate()` input `regressions`.
#> x variable lengths differ (found for 'log_n_calls')
#> ℹ Input `regressions` is `slide(...)`.
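The same fallback can be reproduced outside of dplyr with base R alone. In this sketch, `window` and `x_long` are hypothetical names standing in for the three-row slice and the full-length column; nothing here is specific to slider.

```r
# A small data frame that does NOT contain a column named `x_long`
window <- data.frame(y = c(1, 2, 4))

# A longer vector with that name lives in the calling environment instead
x_long <- 1:10

# lm() falls back to the formula's environment for `x_long`,
# finds the length-10 vector, and fails on the length mismatch
message_text <- tryCatch(
  {
    lm(y ~ x_long, data = window)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

message_text
```

The captured message should report that the variable lengths differ, which is the same complaint that surfaces through mutate() above.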
To show this more simply, let’s just try to access that log_n_calls column inside our slide function. We immediately get a slew of warnings because it doesn’t exist.
company %>%
  mutate(
    log_n_calls = log(n_calls),
    example = slide(., ~.x$log_n_calls)
  )
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> # A tibble: 10 x 5
#> day sales n_calls log_n_calls example
#> <dbl> <int> <int> <dbl> <list>
#> 1 1 50 1039 6.95 <NULL>
#> 2 1 43 398 5.99 <NULL>
#> 3 1 14 854 6.75 <NULL>
#> 4 1 25 51 3.93 <NULL>
#> 5 1 90 609 6.41 <NULL>
#> 6 2 91 517 6.25 <NULL>
#> 7 2 69 718 6.58 <NULL>
#> 8 2 95 861 6.76 <NULL>
#> 9 2 57 268 5.59 <NULL>
#> 10 2 9 941 6.85 <NULL>
Second, even if you don’t create new columns in your mutate(), there is a high chance that you’ll use dplyr for its group_by() capability. If you try to use . with a grouped data frame, you’ll have issues as well because . won’t correspond to the current group, but will instead be the entire data frame.
company %>%
  group_by(day) %>%
  mutate(
    regressions = slide(
      .x = .,
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> Error: Problem with `mutate()` input `regressions`.
#> x Input `regressions` can't be recycled to size 5.
#> ℹ Input `regressions` is `slide(.x = ., .f = ~lm(sales ~ n_calls, .x), .before = 2, .complete = TRUE)`.
#> ℹ Input `regressions` must be size 5 or 1, not 10.
#> ℹ The error occurred in group 1: day = 1.
The “problem” is that dplyr currently does not give us an easy way to access the data frame that we are processing. We can access individual columns by name, but the data frame object as a whole is not available. I’m optimistic that this will get easier in the coming months, but in the meantime here are a few solutions that you can use.
Rather than trying to pass the data frame on with ., you can construct it on the fly from the individual vectors that you do have access to. This means you also have access to any columns that were created in the same mutate() call.
company %>%
  mutate(
    log_n_calls = log(n_calls),
    regressions = slide(
      .x = tibble(sales = sales, log_n_calls = log_n_calls),
      .f = ~lm(sales ~ log_n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> # A tibble: 10 x 5
#> day sales n_calls log_n_calls regressions
#> <dbl> <int> <int> <dbl> <list>
#> 1 1 50 1039 6.95 <NULL>
#> 2 1 43 398 5.99 <NULL>
#> 3 1 14 854 6.75 <lm>
#> 4 1 25 51 3.93 <lm>
#> 5 1 90 609 6.41 <lm>
#> 6 2 91 517 6.25 <lm>
#> 7 2 69 718 6.58 <lm>
#> 8 2 95 861 6.76 <lm>
#> 9 2 57 268 5.59 <lm>
#> 10 2 9 941 6.85 <lm>
This also works for the grouped example.
company %>%
  group_by(day) %>%
  mutate(
    regressions = slide(
      .x = tibble(sales = sales, n_calls = n_calls),
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> # A tibble: 10 x 4
#> # Groups: day [2]
#> day sales n_calls regressions
#> <dbl> <int> <int> <list>
#> 1 1 50 1039 <NULL>
#> 2 1 43 398 <NULL>
#> 3 1 14 854 <lm>
#> 4 1 25 51 <lm>
#> 5 1 90 609 <lm>
#> 6 2 91 517 <NULL>
#> 7 2 69 718 <NULL>
#> 8 2 95 861 <lm>
#> 9 2 57 268 <lm>
#> 10 2 9 941 <lm>
Depending on your comfort with the new dplyr::group_modify() function, it can provide a solution that doesn’t require you to explicitly create a tibble of the columns you require. group_modify() allows us to apply a function to each group, giving us access to the current group’s data frame as the first argument to the function, and the “key” as the second argument. We don’t need to worry about the key for this example, but we need to “swallow” it with ... in our function.
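To make the two arguments concrete, here is a small sketch on made-up data. The `toy` tibble and the `describe_group()` helper are hypothetical and exist only to show what group_modify() hands to the function; dplyr and tibble are assumed to be installed.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  day = c(1, 1, 2, 2, 2),
  sales = c(10, 20, 30, 40, 50)
)

# .data_group holds the rows of the current group (without `day`),
# .key is a one-row tibble holding the group's value of `day`
describe_group <- function(.data_group, .key) {
  tibble(
    n_rows = nrow(.data_group),
    columns = paste(names(.data_group), collapse = ", "),
    key_columns = paste(names(.key), collapse = ", ")
  )
}

group_summary <- toy %>%
  group_by(day) %>%
  group_modify(describe_group)

group_summary
```

The result has one row per group, and group_modify() adds the grouping column back to the front, so the function itself never has to handle `day`.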
The way I tackle these problems is to construct a function that works on one data frame group, then apply it to all of them.
single_group_regressions <- function(.data_group, ...) {
  regressions <- slide(
    .x = .data_group,
    .f = ~lm(sales ~ n_calls, .x),
    .before = 2,
    .complete = TRUE
  )
  mutate(.data_group, regressions = regressions)
}
Test it on one group.
day_one <- filter(company, day == 1)
single_group_regressions(day_one)
#> # A tibble: 5 x 4
#> day sales n_calls regressions
#> <dbl> <int> <int> <list>
#> 1 1 50 1039 <NULL>
#> 2 1 43 398 <NULL>
#> 3 1 14 854 <lm>
#> 4 1 25 51 <lm>
#> 5 1 90 609 <lm>
Now apply it to all groups with group_modify()!
company %>%
  group_by(day) %>%
  group_modify(single_group_regressions)
#> # A tibble: 10 x 4
#> # Groups: day [2]
#> day sales n_calls regressions
#> <dbl> <int> <int> <list>
#> 1 1 50 1039 <NULL>
#> 2 1 43 398 <NULL>
#> 3 1 14 854 <lm>
#> 4 1 25 51 <lm>
#> 5 1 90 609 <lm>
#> 6 2 91 517 <NULL>
#> 7 2 69 718 <NULL>
#> 8 2 95 861 <lm>
#> 9 2 57 268 <lm>
#> 10 2 9 941 <lm>
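One more option may be available depending on your dplyr version. The sketch below assumes dplyr 1.1.0 or later, where pick() returns a data frame of the selected columns for the current group; that version requirement is an assumption of this example and not something the rest of this vignette relies on. The `company_toy` data is a fixed stand-in for company so that the result does not depend on random sampling.

```r
library(slider)
library(dplyr)
library(tibble)

# Deterministic stand-in for the `company` data
company_toy <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = c(50, 43, 14, 25, 90, 91, 69, 95, 57, 9),
  n_calls = c(1039, 398, 854, 51, 609, 517, 718, 861, 268, 941)
)

rolled <- company_toy %>%
  group_by(day) %>%
  mutate(
    regressions = slide(
      # pick() returns a data frame of the selected columns,
      # restricted to the rows of the current group
      .x = pick(sales, n_calls),
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )

rolled
```

As with the other grouped solutions, the first two rows of each day should hold NULL, since a complete window of three rows is not yet available there.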