Looping

Introduction

This package contains a flexible framework for extending the pipe into a loop. The basic idea is this: I often want to access an unnamed intermediate result in a pipe. Why? A basic strategy for working with data frames is to focus on one aspect of the data frame, make some changes, and then reincorporate those changes into the original data frame. This workflow is best understood through illustration.

Note

This tutorial assumes familiarity with Hadley Wickham’s dplyr and magrittr packages. If you don’t know what I’m talking about, go look them up. Your life is about to get a whole lot easier.

Set-up

Load dplyr and magrittr for chaining, knitr for table output, and, of course, loopr.

library(loopr)
library(dplyr)
library(magrittr)
library(knitr)

Define our loop object.

loop = loopClass$new()

Set up an extremely simple data frame for illustration.

id = c(1, 2, 3, 4)
toFix = c(0, 0, 1, 1)
group = c(1, 1, 1, 0)
example = data_frame(id, toFix, group)
kable(example)

id	toFix	group
1	0	1
2	0	1
3	1	1
4	1	0

Stack

loopr relies on a stack framework. Let’s initialize one.

stack = stackClass$new()

We can push data onto the stack like this. The names are optional.

stack$push(1, name = "first")

## [1] 1

stack$push(2, name = "second")

## [1] 2

stack$push(3, name = "third")

## [1] 3

We can peek at the top of the stack:

stack$peek

## [1] 3

or at the whole thing.

stack$stack %>%
  as.data.frame %>%
  kable

bottom	first	second	third
NA	1	2	3

We can find the height of the stack as well:

stack$height

## [1] 4

We can also pop off items from the stack:

stack$pop

## [1] 3

stack$pop

## [1] 2

stack$pop

## [1] 1

Now the stack is empty.

stack$stack

## $bottom
## [1] NA
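
To make the push/peek/pop behaviour concrete, here is a minimal base R sketch of a stack. It is an illustration only, not loopr's stackClass: newStack and its members are hypothetical names, and they are ordinary functions called with parentheses. Like the stack above, it keeps a bottom sentinel, so the height is one more than the number of pushed items.

```r
# Hypothetical closure-based stack; NOT loopr's stackClass implementation.
newStack <- function() {
  items <- list(bottom = NA)  # sentinel entry, like the `bottom` element above

  push <- function(x, name = "") {
    # Append x as a new (optionally named) element on top of the stack.
    items <<- c(items, stats::setNames(list(x), name))
    invisible(x)
  }
  peek <- function() items[[length(items)]]  # top item, stack unchanged
  pop <- function() {
    n <- length(items)
    if (n == 1) stop("stack is empty")
    top <- items[[n]]
    items <<- items[-n]  # drop the top item
    top
  }
  height <- function() length(items)  # counts the sentinel too

  list(push = push, peek = peek, pop = pop, height = height)
}

s <- newStack()
s$push(1, name = "first")
s$push(2, name = "second")
s$push(3, name = "third")
top <- s$pop()  # removes and returns the last item pushed
```

After the single pop, "second" is on top again and the height is back to 3 (two items plus the sentinel).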

Loop

Why is this important? A loop object inherits from stack.

Begin

The begin method is simply a copy of push. After the loop begins, you can focus on any part of your data while still having access to the original data.

"first" %>%
loop$begin()

## [1] "first"

End

To end the loop, you need to merge the data at the beginning of the loop with the data at the end. There are two ending methods defined in loopr: end and cross. An ending method takes a function, pops the top item off the loop stack and passes it as the first argument to that function, and passes its own first (or chained) argument as the second.

"second" %>%
  loop$end(paste)

## [1] "first second"

Cross

cross is nearly identical, but the order of the arguments gets reversed.

"first" %>%
  loop$begin()

## [1] "first"

"second" %>%
  loop$cross(paste)

## [1] "second first"

This is much easier to explain in code than in words.

end(endData, FUN, ...) = FUN(stack$pop, endData, ...)

cross(crossData, FUN, ...) = FUN(crossData, stack$pop, ...)
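
The two definitions above can be turned into a small runnable base R sketch. The names saved, loopBegin, loopEnd and loopCross are hypothetical stand-ins for the loop object and its methods; the real methods live on loopClass and may differ in detail.

```r
# Hypothetical stand-in for the loop's stack; NOT loopr's implementation.
saved <- list()

loopBegin <- function(data) {
  saved[[length(saved) + 1]] <<- data  # push the data
  data                                 # pass it on unchanged
}
popSaved <- function() {
  n <- length(saved)
  top <- saved[[n]]
  saved[[n]] <<- NULL                  # remove the top item
  top
}
# end: popped data first, chained data second
loopEnd <- function(endData, FUN, ...) FUN(popSaved(), endData, ...)
# cross: chained data first, popped data second
loopCross <- function(crossData, FUN, ...) FUN(crossData, popSaved(), ...)

loopBegin("first")
endResult <- loopEnd("second", paste)

loopBegin("first")
crossResult <- loopCross("second", paste)
```

Here endResult is "first second" and crossResult is "second first", mirroring the output shown for loop$end and loop$cross.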

Ending functions

There are two useful ending functions that are included in this package: insert and amend. Why are special ending functions needed? In general, traditional join functions are not well suited to the focus-modify-restore workflow. We need insert and amend to prioritize information in modified data over information in the original data.

insert

insert is the slightly simpler case. Let’s use our example data again.

Create a set of data to insert.

insertData =
  example %>%
  filter(toFix == 0) %>%
  mutate(toFix = 1) %>%
  select(-group)

kable(insertData)

id	toFix
1	1
2	1

Now let’s insert it back into the original data.

insert(example, insertData, by = "id") %>%
  kable

id	toFix	group
1	1	NA
2	1	NA
3	1	1
4	1	0

What happened? Where the by variables matched, insert excised all rows from example and inserted insertData. At the end, data was sorted by the by variable. The by variable (or variables) must be included in the function call.
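
For readers who want to see the mechanics, here is a rough base R sketch of the behaviour just described. insertSketch is a hypothetical helper, not loopr's insert: it assumes a single by column and assumes that insertData has no columns that data lacks.

```r
example <- data.frame(id = c(1, 2, 3, 4),
                      toFix = c(0, 0, 1, 1),
                      group = c(1, 1, 1, 0))
insertData <- data.frame(id = c(1, 2), toFix = c(1, 1))

# Hypothetical helper illustrating the idea; NOT loopr's insert.
insertSketch <- function(data, insertData, by) {
  # Excise rows of data whose key also appears in insertData.
  kept <- data[!(data[[by]] %in% insertData[[by]]), , drop = FALSE]
  # Pad insertData with NA for any column it lacks (here: group).
  for (col in setdiff(names(data), names(insertData))) {
    insertData[[col]] <- NA
  }
  # Stack the two pieces and sort by the key.
  combined <- rbind(kept, insertData[names(data)])
  combined <- combined[order(combined[[by]]), , drop = FALSE]
  rownames(combined) <- NULL
  combined
}

result <- insertSketch(example, insertData, by = "id")
```

As in the table above, rows 1 and 2 of result come entirely from insertData, so their group is NA.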

amend

Let’s take a look at the slightly more complicated ending function: amend. To understand amend, we first need to understand the underlying column update function.

amendColumns

amendColumns updates an old set of columns with all non-NA values from a matching new set of columns.

Build example data.

oldColumn1 = c(0, 0)
newColumn1 = c(1, NA)
oldColumn2 = c(0, 0)
newColumn2 = c(NA, 1)
columnData = data_frame(oldColumn1, newColumn1, oldColumn2, newColumn2)
kable(columnData)

oldColumn1	newColumn1	oldColumn2	newColumn2
0	1	0	NA
0	NA	0	1

Now run amendColumns.

columnData %>%
  amendColumns(
    c("oldColumn1", "oldColumn2"), 
    c("newColumn1", "newColumn2")) %>%
  kable

oldColumn1	oldColumn2
1	0
0	1
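
The update rule for a single pair of columns can be sketched in one line of base R. amendVector is a hypothetical name used only for this illustration; it is not a loopr function.

```r
# Hypothetical one-pair version of the rule: keep the old value only
# where the new value is NA; otherwise take the new value.
amendVector <- function(old, new) ifelse(is.na(new), old, new)

amended1 <- amendVector(c(0, 0), c(1, NA))  # oldColumn1 updated by newColumn1
amended2 <- amendVector(c(0, 0), c(NA, 1))  # oldColumn2 updated by newColumn2
```

amended1 is c(1, 0) and amended2 is c(0, 1), matching the two columns in the table above.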

fillColumns

There is also a matching function called fillColumns. In this function, NAs in newColumn are filled with values from oldColumn; nothing else changes.

oldColumn = c(0, 0)
newColumn = c(1, NA)
columnData %>%
  fillColumns(c("newColumn1", "newColumn2"),
              c("oldColumn1", "oldColumn2")) %>%
  kable

newColumn1	newColumn2
1	0
0	1

amend

amend is simply dplyr::full_join followed by amendColumns, which overwrites non-key columns from the original dataset with same-named columns from the new dataset. In this case, toFix from amendData overwrites toFix from example.

amendData = insertData

example %>%
  amend(amendData, by = "id") %>%
  kable

## Amending columns: toFix

id	toFix	group
1	1	1
2	1	1
3	1	1
4	1	0

If by is omitted, it defaults to the grouping variables in data.

example %>% 
  group_by(id) %>%
  amend(amendData) %>%
  kable

## Amending columns: toFix

id	toFix	group
1	1	1
2	1	1
3	1	1
4	1	0

A warning: amend internally uses the suffix "toFix". If this suffix is already used in your data, modify the suffix argument.
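
To show why a suffix is involved at all, here is a rough base R sketch of the join-then-overwrite idea. It uses merge(all = TRUE) as a stand-in for dplyr::full_join, and amendSketch and the suffixes ".old" and ".new" are hypothetical choices for this illustration, not loopr's internals.

```r
example <- data.frame(id = c(1, 2, 3, 4),
                      toFix = c(0, 0, 1, 1),
                      group = c(1, 1, 1, 0))
amendData <- data.frame(id = c(1, 2), toFix = c(1, 1))

# Hypothetical helper illustrating the idea; NOT loopr's amend.
amendSketch <- function(data, amendData, by) {
  # Non-key columns present in both data sets get amended.
  shared <- setdiff(intersect(names(data), names(amendData)), by)
  # The full join marks each clashing column with a suffix.
  joined <- merge(data, amendData, by = by, all = TRUE,
                  suffixes = c(".old", ".new"))
  for (col in shared) {
    old <- joined[[paste0(col, ".old")]]
    new <- joined[[paste0(col, ".new")]]
    # Non-NA new values win; old values survive elsewhere.
    joined[[col]] <- ifelse(is.na(new), old, new)
    joined[[paste0(col, ".old")]] <- NULL
    joined[[paste0(col, ".new")]] <- NULL
  }
  joined
}

amended <- amendSketch(example, amendData, by = "id")
```

If a column in the data already ended in one of the join suffixes, the suffixed names could collide, which is the situation the warning above is about.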

Illustration

Now that we understand how it works, let’s use our loop!

Remind ourselves of what the example data looks like.

kable(example)

id	toFix	group
1	0	1
2	0	1
3	1	1
4	1	0

Conditional mutation

Here, we convert toFix to 0 when group is 0.

example %>%
  ungroup %>%
  loop$begin() %>%
    filter(group == 0) %>%
    mutate(toFix = 0) %>%
  loop$end(insert, by = "id") %>%
  kable

id	toFix	group
1	0	1
2	0	1
3	1	1
4	0	0

In general, insert is best suited to filter/slice type operations.

Merged summarize

Here, we summarize toFix in each of the two groups, reverse the results, and then reintegrate the summary into the original data.

example %>%
  group_by(group) %>%
  loop$begin() %>%
    summarize(toFix = mean(toFix)) %>%
    mutate(group = rev(group)) %>%
  loop$end(amend) %>%
  kable

## Amending columns: toFix

group	id	toFix
0	4	0.3333333
1	1	1.0000000
1	2	1.0000000
1	3	1.0000000

In general, amend is best suited to summarize/do type operations.

This is only the tip of the iceberg. Do not feel limited to using amend and insert as ending functions. A whole host of others could be useful: join functions, merge functions, even setNames.

setNames

Here, we will suffix the names of all the variables within the context of a chain.

example %>%
  mutate(group = group + 1) %>%
  loop$begin() %>%
    names %>%
    paste0("Suffix") %>%
  loop$end(setNames) %>%
  kable

idSuffix	toFixSuffix	groupSuffix
1	0	2
2	0	2
3	1	2
4	1	1

bind_rows

Here, we will double the data.

example %>%
  mutate(replication = 1) %>%
  loop$begin() %>%
    mutate(replication = 2) %>%
  loop$end(bind_rows) %>%
  kable

id	toFix	group	replication
1	0	1	1
2	0	1	1
3	1	1	1
4	1	0	1
1	0	1	2
2	0	1	2
3	1	1	2
4	1	0	2

Loops within loops

Loops within loops are in fact quite possible. I would be cautious using them. It can be exhilarating, but make sure to indent each loop carefully. Also, it is a good idea to give a name to each loop. This allows one to interpret loop$stack for debugging. Here is a quick example that filters the data, replicates the columns, and then re-merges.

example %>%
  loop$begin(name = "original") %>%
    filter(group == 1) %>%
    loop$begin(name = "filtered") %>%
       names %>%
       paste0("Extra") %>%
    loop$end(setNames) %>%
    rename(id = idExtra) %>%
  loop$end(amend, by = "id") %>%
  kable

## Joining by: "id"

id	toFix	group	toFixExtra	groupExtra
1	0	1	0	1
2	0	1	0	1
3	1	1	1	1
4	1	0	NA	NA

Looping

Brandon Taylor

2015-05-27
