Row-wise iteration with slider

library(slider)
library(dplyr, warn.conflicts = FALSE)

{slider} is implemented with a new convention that began in {vctrs}, treating a data frame as a vector of rows. This makes slide() a row-wise iterator over a data frame, which can be useful in solving some long-standing pain points in the tidyverse.

The point of this vignette is to go through a few examples of a row-oriented workflow. The examples are adapted from Jenny Bryan’s talk on row-oriented workflows with purrr, to show how this workflow is improved with slide().

Row-wise iteration

Let’s first explore using slide() as a row-wise iterator in general. We’ll start with this simple data frame.

example <- tibble(
  x = 1:4,
  y = letters[1:4]
)

example
#> # A tibble: 4 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 c    
#> 4     4 d

If we were to pass the x column to slide(), it would iterate over that using the window specified by .before, .after, and .complete. With the defaults, the window is just the current element, so the result looks like that of purrr::map().

slide(example$x, ~.x)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> [1] 3
#> 
#> [[4]]
#> [1] 4

slide(example$x, ~.x, .before = 2)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 2 3 4
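As a small sketch of how the window arguments combine with a summary function, the same vector can be summed over each window with slide_dbl(). The variable names here are illustrative only.

```r
library(slider)

x <- 1:4

# Sum of the current element and up to two elements before it.
rolling_sum <- slide_dbl(x, sum, .before = 2)

# Same window, but a result is only computed where all 3 elements exist;
# incomplete windows are filled with NA.
rolling_sum_complete <- slide_dbl(x, sum, .before = 2, .complete = TRUE)

rolling_sum
rolling_sum_complete
```

The first two windows are partial (sizes 1 and 2), which is why they become NA once .complete = TRUE is set.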

When applied to the entire example data frame, map() treats it as a list and iterates over the columns. slide(), on the other hand, iterates over rows. This is consistent with the vctrs idea of size, which is the length of an atomic vector but the number of rows of a data frame or matrix. slide() always returns an object with the same size as its input, so because example has 4 rows, the output has size 4.

slide(example, ~.x)
#> [[1]]
#> # A tibble: 1 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 
#> [[2]]
#> # A tibble: 1 x 2
#>       x y    
#>   <int> <chr>
#> 1     2 b    
#> 
#> [[3]]
#> # A tibble: 1 x 2
#>       x y    
#>   <int> <chr>
#> 1     3 c    
#> 
#> [[4]]
#> # A tibble: 1 x 2
#>       x y    
#>   <int> <chr>
#> 1     4 d
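To make the contrast with map() concrete, the following sketch compares the number of results each function produces for the same data frame. It assumes purrr and vctrs are installed alongside slider.

```r
library(slider)
library(purrr)
library(vctrs)
library(dplyr, warn.conflicts = FALSE)

example <- tibble(x = 1:4, y = letters[1:4])

by_column <- map(example, ~.x)   # iterates over the 2 columns
by_row <- slide(example, ~.x)    # iterates over the 4 rows

length(by_column)
length(by_row)

# The vctrs size of a data frame is its number of rows,
# which is what slide() matches.
vec_size(example)
```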

You can still use the other arguments to slide() to control the window size.

# Current row + 2 before
slide(example, ~.x, .before = 2)
#> [[1]]
#> # A tibble: 1 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 
#> [[2]]
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 c    
#> 
#> [[4]]
#> # A tibble: 3 x 2
#>       x y    
#>   <int> <chr>
#> 1     2 b    
#> 2     3 c    
#> 3     4 d

# Center aligned, with no partial results
slide(example, ~.x, .before = 1, .after = 1, .complete = TRUE)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> # A tibble: 3 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 c    
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>       x y    
#>   <int> <chr>
#> 1     2 b    
#> 2     3 c    
#> 3     4 d    
#> 
#> [[4]]
#> NULL

Often, using slide() with its defaults will be enough, as it’s common to iterate over just one row at a time.

Calling functions with various parameter combinations

A nice use of a tibble is as a structured way to store parameter combinations. For example, we could store multiple rows of parameter combinations where each row could be supplied to runif() to generate different types of uniform random variables.

parameters <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)

parameters
#> # A tibble: 3 x 3
#>       n   min   max
#>   <int> <dbl> <dbl>
#> 1     1     0     1
#> 2     2    10   100
#> 3     3   100  1000

With slide() you can pass these parameters on to runif() by iterating over parameters row-wise, giving you access to .x inside of the function, which is a data frame of the current row. Because it is a data frame, you have access to each column by name. Notice how there is no restriction that the columns of the data frame be the same as the argument names of runif().

set.seed(123)

slide(parameters, ~runif(.x$n, .x$min, .x$max))
#> [[1]]
#> [1] 0.2875775
#> 
#> [[2]]
#> [1] 80.94746 46.80792
#> 
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
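To illustrate the point about column names, here is a sketch in which the columns are deliberately named differently from the arguments of runif(). The names draw_specs, count, lower, and upper are hypothetical and chosen only for this example.

```r
library(slider)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical column names that do not match runif()'s n / min / max.
draw_specs <- tibble(
  count = 1:3,
  lower = c(0, 10, 100),
  upper = c(1, 100, 1000)
)

set.seed(123)

# Each .x is a one-row data frame, so columns are mapped to arguments by hand.
draws <- slide(
  draw_specs,
  ~runif(n = .x$count, min = .x$lower, max = .x$upper)
)

# One vector per row, with as many draws as that row's `count`.
lengths(draws)
```

Because the mapping from columns to arguments is written out explicitly, extra columns in the data frame are simply ignored.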

Sliding inside a mutate()

For these examples, we will consider a company data set containing the day a sale was made, the number of calls, n_calls, that were placed on that day, and the number of sales that resulted from those calls.

company <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = sample(100, 10),
  n_calls = sales + sample(1000, 10)
)

company
#> # A tibble: 10 x 3
#>      day sales n_calls
#>    <dbl> <int>   <int>
#>  1     1    50    1039
#>  2     1    43     398
#>  3     1    14     854
#>  4     1    25      51
#>  5     1    90     609
#>  6     2    91     517
#>  7     2    69     718
#>  8     2    95     861
#>  9     2    57     268
#> 10     2     9     941

When slide()-ing inside of a mutate() call, there are a few scenarios that can arise. First, you might want to slide over a single column. This is easy enough in both the ungrouped and grouped case.

company %>%
  mutate(sales_roll = slide_dbl(sales, mean, .before = 2, .complete = TRUE))
#> # A tibble: 10 x 4
#>      day sales n_calls sales_roll
#>    <dbl> <int>   <int>      <dbl>
#>  1     1    50    1039       NA  
#>  2     1    43     398       NA  
#>  3     1    14     854       35.7
#>  4     1    25      51       27.3
#>  5     1    90     609       43  
#>  6     2    91     517       68.7
#>  7     2    69     718       83.3
#>  8     2    95     861       85  
#>  9     2    57     268       73.7
#> 10     2     9     941       53.7

company %>%
  group_by(day) %>%
  mutate(sales_roll = slide_dbl(sales, mean, .before = 2, .complete = TRUE))
#> # A tibble: 10 x 4
#> # Groups:   day [2]
#>      day sales n_calls sales_roll
#>    <dbl> <int>   <int>      <dbl>
#>  1     1    50    1039       NA  
#>  2     1    43     398       NA  
#>  3     1    14     854       35.7
#>  4     1    25      51       27.3
#>  5     1    90     609       43  
#>  6     2    91     517       NA  
#>  7     2    69     718       NA  
#>  8     2    95     861       85  
#>  9     2    57     268       73.7
#> 10     2     9     941       53.7
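The rolling means for day 1 can be checked by hand. This sketch copies the day 1 sales values from the table above and compares slide_dbl() against explicitly computed window means.

```r
library(slider)

# Day 1 sales, copied from the company table above.
sales_day_one <- c(50, 43, 14, 25, 90)

rolled <- slide_dbl(sales_day_one, mean, .before = 2, .complete = TRUE)

# The same windows written out explicitly. The first two are incomplete.
by_hand <- c(
  NA,
  NA,
  mean(c(50, 43, 14)),
  mean(c(43, 14, 25)),
  mean(c(14, 25, 90))
)

all.equal(rolled, by_hand)
```

In the grouped case the window restarts at the first row of each group, which is why rows 6 and 7 are NA there but not in the ungrouped result.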

If the function you want to apply when sliding takes a data frame as input, things get more complicated. One way to accomplish this is by utilizing the fact that you have access to . in the magrittr %>%. As an example, imagine you want to perform a rolling regression with sales as your outcome and n_calls as a predictor.

company %>%
  mutate(
    regressions = slide(
      .x = ., 
      .f = ~lm(sales ~ n_calls, .x), 
      .before = 2, 
      .complete = TRUE
    )
  )
#> # A tibble: 10 x 4
#>      day sales n_calls regressions
#>    <dbl> <int>   <int> <list>     
#>  1     1    50    1039 <NULL>     
#>  2     1    43     398 <NULL>     
#>  3     1    14     854 <lm>       
#>  4     1    25      51 <lm>       
#>  5     1    90     609 <lm>       
#>  6     2    91     517 <lm>       
#>  7     2    69     718 <lm>       
#>  8     2    95     861 <lm>       
#>  9     2    57     268 <lm>       
#> 10     2     9     941 <lm>
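Once the models are stored in a list, a natural next step is to pull a number out of each one. The sketch below is one possible approach, not part of slider itself: it uses base R's vapply() and treats the NULL entries from incomplete windows as NA. The toy data reuses the day 1 values from the company table.

```r
library(slider)
library(dplyr, warn.conflicts = FALSE)

toy <- tibble(
  sales = c(50, 43, 14, 25, 90),
  n_calls = c(1039, 398, 854, 51, 609)
)

fits <- slide(
  toy,
  ~lm(sales ~ n_calls, data = .x),
  .before = 2,
  .complete = TRUE
)

# Extract the slope on n_calls, returning NA where no model was fit.
slopes <- vapply(
  fits,
  function(fit) if (is.null(fit)) NA_real_ else coef(fit)[["n_calls"]],
  numeric(1)
)

slopes
```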

But here be dragons! The . you have access to has two problems.

No updating as you add new columns.

If you add columns in the mutate() call and expect them to be available in ., you will be disappointed. In the example below, lm() couldn’t find log_n_calls in the slice of . available through .x, so it looked in the surrounding environment, found the entire length-10 log_n_calls vector that we had just created, and tried to use that in the regression.

company %>%
  mutate(
    log_n_calls = log(n_calls),
    regressions = slide(
      .x = ., 
      .f = ~lm(sales ~ log_n_calls, .x), 
      .before = 2, 
      .complete = TRUE
    )
  )
#> Error: Problem with `mutate()` input `regressions`.
#> x variable lengths differ (found for 'log_n_calls')
#> ℹ Input `regressions` is `slide(...)`.

To show this more simply, let’s just try to access that log_n_calls column inside our slide function. We immediately get a slew of warnings because the column doesn’t exist in ..

company %>%
  mutate(
    log_n_calls = log(n_calls),
    example = slide(., ~.x$log_n_calls)
  )
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> Warning: Problem with `mutate()` input `example`.
#> x Unknown or uninitialised column: `log_n_calls`.
#> ℹ Input `example` is `slide(., ~.x$log_n_calls)`.
#> Warning: Unknown or uninitialised column: `log_n_calls`.
#> # A tibble: 10 x 5
#>      day sales n_calls log_n_calls example
#>    <dbl> <int>   <int>       <dbl> <list> 
#>  1     1    50    1039        6.95 <NULL> 
#>  2     1    43     398        5.99 <NULL> 
#>  3     1    14     854        6.75 <NULL> 
#>  4     1    25      51        3.93 <NULL> 
#>  5     1    90     609        6.41 <NULL> 
#>  6     2    91     517        6.25 <NULL> 
#>  7     2    69     718        6.58 <NULL> 
#>  8     2    95     861        6.76 <NULL> 
#>  9     2    57     268        5.59 <NULL> 
#> 10     2     9     941        6.85 <NULL>
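A quick way to see that . is a snapshot taken before mutate() runs is to ask it for its column names from inside the same call. This is a minimal sketch with a made-up three-row tibble.

```r
library(dplyr, warn.conflicts = FALSE)

toy <- tibble(n_calls = c(1039, 398, 854))

result <- toy %>%
  mutate(
    log_n_calls = log(n_calls),
    # `.` is still the data frame as it was on the left-hand side of %>%,
    # so the column created on the previous line is not in it.
    dot_has_new_column = "log_n_calls" %in% names(.)
  )

result$dot_has_new_column
```

The new column does exist in the result, but it was never visible through . while mutate() was running.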

No respect of groups

Even if you don’t create new columns in your mutate(), there is a high chance that you’ll use dplyr for its group_by() capability. If you try to use . with a grouped data frame, you’ll have issues as well because . won’t correspond to the current group, but will instead be the entire data frame.

company %>%
  group_by(day) %>%
  mutate(
    regressions = slide(
      .x = ., 
      .f = ~lm(sales ~ n_calls, .x), 
      .before = 2, 
      .complete = TRUE
    )
  )
#> Error: Problem with `mutate()` input `regressions`.
#> x Input `regressions` can't be recycled to size 5.
#> ℹ Input `regressions` is `slide(.x = ., .f = ~lm(sales ~ n_calls, .x), .before = 2, .complete = TRUE)`.
#> ℹ Input `regressions` must be size 5 or 1, not 10.
#> ℹ The error occurred in group 1: day = 1.
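The size mismatch in that error can be reproduced without any regression. The following sketch, using a made-up tibble with the same grouping structure, compares the number of rows in . with the size of the current group reported by n().

```r
library(dplyr, warn.conflicts = FALSE)

toy <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = 1:10
)

sizes <- toy %>%
  group_by(day) %>%
  mutate(
    rows_in_dot = nrow(.),  # the whole data frame, regardless of group
    rows_in_group = n()     # the current group only
  )

distinct(sizes, day, rows_in_dot, rows_in_group)
```

Sliding over . therefore produces 10 results for a group that only has room for 5, which is what the error message reports.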

Solution (sort of)

The “problem” is that currently dplyr does not give us any way to easily access the current data frame that we are processing. We can access individual columns by name, but the entire data frame object is impossible to get access to. I’m optimistic that this will get easier in the coming months, but in the meantime here are a few solutions that you can use.

Rather than trying to pass the data frame on with ., you can construct it on the fly from the individual vectors that you do have access to. This means you would have access to any columns that were created in the same mutate() call.

company %>%
  mutate(
    log_n_calls = log(n_calls),
    regressions = slide(
      .x = tibble(sales = sales, log_n_calls = log_n_calls), 
      .f = ~lm(sales ~ log_n_calls, .x), 
      .before = 2, 
      .complete = TRUE
    )
  )
#> # A tibble: 10 x 5
#>      day sales n_calls log_n_calls regressions
#>    <dbl> <int>   <int>       <dbl> <list>     
#>  1     1    50    1039        6.95 <NULL>     
#>  2     1    43     398        5.99 <NULL>     
#>  3     1    14     854        6.75 <lm>       
#>  4     1    25      51        3.93 <lm>       
#>  5     1    90     609        6.41 <lm>       
#>  6     2    91     517        6.25 <lm>       
#>  7     2    69     718        6.58 <lm>       
#>  8     2    95     861        6.76 <lm>       
#>  9     2    57     268        5.59 <lm>       
#> 10     2     9     941        6.85 <lm>

This also works for the grouped example.

company %>%
  group_by(day) %>%
  mutate(
    regressions = slide(
      .x = tibble(sales = sales, n_calls = n_calls), 
      .f = ~lm(sales ~ n_calls, .x), 
      .before = 2, 
      .complete = TRUE
    )
  )
#> # A tibble: 10 x 4
#> # Groups:   day [2]
#>      day sales n_calls regressions
#>    <dbl> <int>   <int> <list>     
#>  1     1    50    1039 <NULL>     
#>  2     1    43     398 <NULL>     
#>  3     1    14     854 <lm>       
#>  4     1    25      51 <lm>       
#>  5     1    90     609 <lm>       
#>  6     2    91     517 <NULL>     
#>  7     2    69     718 <NULL>     
#>  8     2    95     861 <lm>       
#>  9     2    57     268 <lm>       
#> 10     2     9     941 <lm>

Depending on your comfort with the new dplyr::group_modify() function, it can provide a solution that doesn’t require you to explicitly create a tibble of the columns you require. group_modify() allows us to apply a function on each group, giving us access to the current data frame as the first argument to the function, and the “key” as the second argument. We don’t need to worry about the key for this example, but we need to “swallow” it with ... in our function.

The way I tackle these problems is to construct a function that works on one data frame group, then apply it to all of them.

single_group_regressions <- function(.data_group, ...) {
  regressions <- slide(
    .x = .data_group, 
    .f = ~lm(sales ~ n_calls, .x), 
    .before = 2, 
    .complete = TRUE
  )
  
  mutate(.data_group, regressions = regressions)
}

Test it on one group.

day_one <- filter(company, day == 1)
single_group_regressions(day_one)
#> # A tibble: 5 x 4
#>     day sales n_calls regressions
#>   <dbl> <int>   <int> <list>     
#> 1     1    50    1039 <NULL>     
#> 2     1    43     398 <NULL>     
#> 3     1    14     854 <lm>       
#> 4     1    25      51 <lm>       
#> 5     1    90     609 <lm>

Now apply it to all groups with group_modify()!

company %>%
  group_by(day) %>%
  group_modify(single_group_regressions)
#> # A tibble: 10 x 4
#> # Groups:   day [2]
#>      day sales n_calls regressions
#>    <dbl> <int>   <int> <list>     
#>  1     1    50    1039 <NULL>     
#>  2     1    43     398 <NULL>     
#>  3     1    14     854 <lm>       
#>  4     1    25      51 <lm>       
#>  5     1    90     609 <lm>       
#>  6     2    91     517 <NULL>     
#>  7     2    69     718 <NULL>     
#>  8     2    95     861 <lm>       
#>  9     2    57     268 <lm>       
#> 10     2     9     941 <lm>
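The regression example swallows the key with ..., but the key can also be used. As a sketch of what the second argument looks like, the hypothetical helper below (summarise_group is not a dplyr function) names it .key and reads the grouping value from it.

```r
library(dplyr, warn.conflicts = FALSE)

toy <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = 1:10
)

# Hypothetical helper: the first argument is the current group's data,
# the second is a one-row tibble holding that group's key.
summarise_group <- function(.data_group, .key) {
  tibble(
    label = paste0("day-", .key$day),
    n_rows = nrow(.data_group)
  )
}

labelled <- toy %>%
  group_by(day) %>%
  group_modify(summarise_group)

labelled
```

group_modify() adds the grouping column back to the result, so the function only needs to return the new columns.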