This package contains a flexible framework for extending the pipe into a loop. The basic idea is this: I often run into the problem of wanting to access an unnamed intermediate in a pipe. Why? A basic strategy of working with data frames is to focus on a certain aspect of the data frame, make some changes, and then reincorporate these changes into the original data frame. This work-flow is best understood through illustration.
This tutorial assumes familiarity with Hadley Wickham’s `dplyr` and `magrittr` packages. If you don’t know what I’m talking about, go look them up. Your life is about to get a whole lot easier.
Import useful libraries for chaining, `knitr` for table output, and of course, `loopr`.
library(loopr)
library(dplyr)
library(magrittr)
library(knitr)
Define our loop object.
loop = loopClass$new()
Set up an extremely simple data frame for illustration.
id = c(1, 2, 3, 4)
toFix = c(0, 0, 1, 1)
group = c(1, 1, 1, 0)
example = data_frame(id, toFix, group)
kable(example)
id | toFix | group |
---|---|---|
1 | 0 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
`loopr` relies on a stack framework. Let’s initialize one.
stack = stackClass$new()
We can `push` data onto the `stack` like this. The names are optional.
stack$push(1, name = "first")
## [1] 1
stack$push(2, name = "second")
## [1] 2
stack$push(3, name = "third")
## [1] 3
We can `peek` at the top of the `stack`:
stack$peek
## [1] 3
or at the whole thing.
stack$stack %>%
as.data.frame %>%
kable
bottom | first | second | third |
---|---|---|---|
NA | 1 | 2 | 3 |
We can find the `height` of the `stack` as well:
stack$height
## [1] 4
We can also `pop` items off the `stack`:
stack$pop
## [1] 3
stack$pop
## [1] 2
stack$pop
## [1] 1
Now the `stack` is empty.
stack$stack
## $bottom
## [1] NA
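The push, peek, pop, and height behaviour shown above can be mimicked with a small closure-based stack in base R. This is only a sketch under stated assumptions: `make_stack` is a hypothetical helper, not `loopr`'s `stackClass`. Its methods are called with parentheses, and it keeps no `bottom` placeholder, so its height after three pushes is 3 rather than 4.

```r
# Hypothetical helper: a closure-based stack, NOT loopr's stackClass.
make_stack <- function() {
  items <- list()
  list(
    # Add an item to the top of the stack; the name is optional.
    push = function(x, name = "") {
      items <<- c(items, stats::setNames(list(x), name))
      invisible(x)
    },
    # Look at the top item without removing it.
    peek = function() items[[length(items)]],
    # Remove the top item and return it.
    pop = function() {
      stopifnot(length(items) > 0)
      top <- items[[length(items)]]
      items <<- items[-length(items)]
      top
    },
    # Number of items currently stored.
    height = function() length(items)
  )
}

sketchStack <- make_stack()
sketchStack$push(1, name = "first")
sketchStack$push(2, name = "second")
sketchStack$push(3, name = "third")
heightAfterPush <- sketchStack$height()
topItem <- sketchStack$peek()
popped <- c(sketchStack$pop(), sketchStack$pop(), sketchStack$pop())
heightAfterPop <- sketchStack$height()
```

Items come back in the reverse of the order they were pushed, which is what lets nested loops unwind from the inside out.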
Why is this important? A `loop` object inherits from `stack`.
The `begin` method is simply a copy of `push`. After the loop begins, you can focus on any part of your data while still having access to the original data.
"first" %>%
loop$begin()
## [1] "first"
To end the loop, you need to merge the data at the beginning of the loop with the data at the end. There are two ending methods defined in `loopr`: `end` and `cross`. Ending the loop takes a function, passes a `pop` from the `loop` `stack` as the first argument to that function, and passes its own first argument (or chained argument) as the second.
"second" %>%
loop$end(paste)
## [1] "first second"
`cross` is nearly identical, but the order of the arguments gets reversed.
"first" %>%
loop$begin()
## [1] "first"
"second" %>%
loop$cross(paste)
## [1] "second first"
This is much easier to explain in code than in words.
end(endData, FUN, ...) = FUN(stack$pop, endData, ...)
cross(crossData, FUN, ...) = FUN(crossData, stack$pop, ...)
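The two definitions above can be turned into a runnable base R sketch. The helper `make_loop` and its internal `saved` list are hypothetical names introduced for illustration; this is not `loopr`'s `loopClass`, only a minimal model of the same begin/end/cross contract.

```r
# Hypothetical helper: a minimal model of begin/end/cross, NOT loopr's loopClass.
make_loop <- function() {
  saved <- list()
  # Remove and return the most recently saved data.
  pop <- function() {
    stopifnot(length(saved) > 0)
    top <- saved[[length(saved)]]
    saved <<- saved[-length(saved)]
    top
  }
  list(
    # Save the data and pass it along unchanged.
    begin = function(data) {
      saved <<- c(saved, list(data))
      data
    },
    # Saved data goes first, incoming data second.
    end = function(endData, FUN, ...) FUN(pop(), endData, ...),
    # Incoming data goes first, saved data second.
    cross = function(crossData, FUN, ...) FUN(crossData, pop(), ...)
  )
}

sketchLoop <- make_loop()
sketchLoop$begin("first")
ended <- sketchLoop$end("second", paste)     # paste("first", "second")
sketchLoop$begin("first")
crossed <- sketchLoop$cross("second", paste) # paste("second", "first")
```

Because `begin` returns its input, it can sit in the middle of a chain without interrupting it.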
There are two useful ending functions that are included in this package: `insert` and `amend`. Why are special ending functions needed? In general, traditional join functions are not well suited to the focus-modify-restore work-flow. We need `insert` and `amend` to prioritize information in modified data over information in the original data.
`insert` is the slightly simpler case. Let’s use our example data again.
Create a set of data to `insert`.
insertData =
example %>%
filter(toFix == 0) %>%
mutate(toFix = 1) %>%
select(-group)
kable(insertData)
id | toFix |
---|---|
1 | 1 |
2 | 1 |
Now let’s `insert` it back into the original data.
insert(example, insertData, by = "id") %>%
kable
id | toFix | group |
---|---|---|
1 | 1 | NA |
2 | 1 | NA |
3 | 1 | 1 |
4 | 1 | 0 |
What happened? Where the `by` variables matched, `insert` excised all rows from `example` and inserted `insertData`. Because `insertData` has no `group` column, the inserted rows have `NA` for `group`. At the end, data was sorted by the `by` variable. The `by` variable (or variables) must be included in the function call.
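The excise-insert-sort behaviour described above can be approximated with standard `dplyr` verbs. This is a sketch, assuming the description is complete: `insert_sketch` is a hypothetical stand-in, not the package's own `insert`, and it uses `tibble()` in place of the older `data_frame()`.

```r
library(dplyr)

# Hypothetical stand-in for insert(); NOT loopr's implementation.
insert_sketch <- function(data, insertData, by) {
  data %>%
    anti_join(insertData, by = by) %>%  # drop rows whose keys appear in insertData
    bind_rows(insertData) %>%           # add replacement rows; absent columns become NA
    arrange(across(all_of(by)))         # sort by the key column(s)
}

example <- tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))
insertData <- tibble(id = c(1, 2), toFix = c(1, 1))

inserted <- insert_sketch(example, insertData, by = "id")
```

Rows 1 and 2 come entirely from `insertData`, which is why their `group` values are lost rather than carried over.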
Let’s take a look at the slightly more complicated ending function: `amend`. To understand amend, we first need to understand the underlying column update function.
`amendColumns` updates an old set of columns with all non-`NA` values from a matching new set of columns.
Build example data.
oldColumn1 = c(0, 0);
newColumn1 = c(1, NA)
oldColumn2 = c(0, 0);
newColumn2 = c(NA, 1)
columnData = data_frame(oldColumn1, newColumn1, oldColumn2, newColumn2)
kable(columnData)
oldColumn1 | newColumn1 | oldColumn2 | newColumn2 |
---|---|---|---|
0 | 1 | 0 | NA |
0 | NA | 0 | 1 |
Now run `amendColumns`.
columnData %>%
amendColumns(
c("oldColumn1", "oldColumn2"),
c("newColumn1", "newColumn2")) %>%
kable
oldColumn1 | oldColumn2 |
---|---|
1 | 0 |
0 | 1 |
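The update rule can be written out in a few lines of base R. This is a sketch of the rule as described, not the package's code: `amend_columns_sketch` is a hypothetical helper, and dropping the new columns afterwards is inferred from the output table above.

```r
# Hypothetical helper illustrating the update rule; NOT loopr's amendColumns().
amend_columns_sketch <- function(data, oldNames, newNames) {
  for (i in seq_along(oldNames)) {
    old <- data[[oldNames[i]]]
    new <- data[[newNames[i]]]
    # Take the new value wherever it is not NA; otherwise keep the old value.
    data[[oldNames[i]]] <- ifelse(is.na(new), old, new)
    # Drop the new column once it has been folded into the old one.
    data[[newNames[i]]] <- NULL
  }
  data
}

columnData <- data.frame(
  oldColumn1 = c(0, 0), newColumn1 = c(1, NA),
  oldColumn2 = c(0, 0), newColumn2 = c(NA, 1)
)

amendedColumns <- amend_columns_sketch(
  columnData,
  c("oldColumn1", "oldColumn2"),
  c("newColumn1", "newColumn2")
)
```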
There is also a matching function called fillColumns. In this function, `NA`s in `newColumn` are replaced with values from `oldColumn`, and nothing else changes.
oldColumn = c(0, 0)
newColumn = c(1, NA)
columnData %>%
fillColumns(c("newColumn1", "newColumn2"),
c("oldColumn1", "oldColumn2")) %>%
kable
newColumn1 | newColumn2 |
---|---|
1 | 0 |
0 | 1 |
`amend` is simply `dplyr::full_join` followed by `amendColumns` to overwrite non-key columns from the original dataset with matching-named columns from the new dataset. In this case, `toFix` from `amendData` overwrites `toFix` from `example`.
amendData = insertData
example %>%
amend(amendData, by = "id") %>%
kable
## Amending columns: toFix
id | toFix | group |
---|---|---|
1 | 1 | 1 |
2 | 1 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
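For this particular example, the join-then-update recipe can be spelled out by hand with `dplyr`. This is a sketch, not the package's implementation: the `"Old"` and `"New"` suffixes are arbitrary choices made here, and the column names are hard-coded for the example data.

```r
library(dplyr)

example <- tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))
amendData <- tibble(id = c(1, 2), toFix = c(1, 1))

amendedByHand <- example %>%
  # Keep every row from both tables; the shared column toFix gets suffixed.
  full_join(amendData, by = "id", suffix = c("Old", "New")) %>%
  # Prefer the new value where it exists, otherwise fall back to the old one.
  mutate(toFix = coalesce(toFixNew, toFixOld)) %>%
  select(id, toFix, group)
```

Unlike the `insert` result earlier, `group` survives for rows 1 and 2, because the original rows are updated column by column instead of being replaced wholesale.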
If it is not included, `by` defaults to the grouping variables in data.
example %>%
group_by(id) %>%
amend(amendData) %>%
kable
## Amending columns: toFix
id | toFix | group |
---|---|---|
1 | 1 | 1 |
2 | 1 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
A warning: `amend` internally uses the suffix `"toFix"`. If this suffix is already used in your data, modify the `suffix` argument.
Now that we understand how it works, let’s use our `loop`!
Remind ourselves of what the `example` data looks like.
kable(example)
id | toFix | group |
---|---|---|
1 | 0 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
Here, we convert `toFix` to 0 when `group` is 0.
example %>%
ungroup %>%
loop$begin() %>%
filter(group == 0) %>%
mutate(toFix = 0) %>%
loop$end(insert, by = "id") %>%
kable
id | toFix | group |
---|---|---|
1 | 0 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 0 | 0 |
In general, `insert` is best suited to `filter`/`slice` type operations.
Here, we summarize toFix in each of the two groups, reverse the results, and then reintegrate the summary into the original data.
example %>%
group_by(group) %>%
loop$begin() %>%
summarize(toFix = mean(toFix)) %>%
mutate(group = rev(group)) %>%
loop$end(amend) %>%
kable
## Amending columns: toFix
group | id | toFix |
---|---|---|
0 | 4 | 0.3333333 |
1 | 1 | 1.0000000 |
1 | 2 | 1.0000000 |
1 | 3 | 1.0000000 |
In general, `amend` is best suited to `summarize`/`do` type operations.
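To see where the numbers in the summary example come from, the same steps can be traced with named intermediates and a hand-written join. This is a sketch only: the `"Old"`/`"New"` suffixes and the name `summaryData` are choices made here, and the row and column order may differ from what `amend` returns.

```r
library(dplyr)

example <- tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))

# Mean of toFix per group, then swap which group each mean belongs to.
summaryData <- example %>%
  group_by(group) %>%
  summarize(toFix = mean(toFix)) %>%
  mutate(group = rev(group))

# Join the swapped summary back on group and let its values win.
summaryAmended <- example %>%
  full_join(summaryData, by = "group", suffix = c("Old", "New")) %>%
  mutate(toFix = coalesce(toFixNew, toFixOld)) %>%
  select(id, toFix, group)
```

Group 1 originally averages 1/3 and group 0 averages 1; after the reversal, the three rows in group 1 receive 1 and the single row in group 0 receives 1/3.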
This is only the tip of the iceberg. Do not feel limited to using `amend` and `insert` as ending functions. A whole host of others could be useful: join functions, merge functions, even setNames.
Here, we will suffix the names of all the variables within the context of a chain.
example %>%
mutate(group = group + 1) %>%
loop$begin() %>%
names %>%
paste0("Suffix") %>%
loop$end(setNames) %>%
kable
idSuffix | toFixSuffix | groupSuffix |
---|---|---|
1 | 0 | 2 |
2 | 0 | 2 |
3 | 1 | 2 |
4 | 1 | 1 |
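Written without the loop, the same chain needs a named intermediate, which is exactly what the loop avoids. The following base R sketch shows the equivalent call; the names `started` and `newNames` are introduced here for illustration.

```r
example <- data.frame(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))

# The data as it stands when the loop begins (this is what gets saved).
started <- transform(example, group = group + 1)

# The value that reaches the end of the loop through the chain.
newNames <- paste0(names(started), "Suffix")

# Ending with setNames amounts to FUN(saved data, chained value).
suffixed <- setNames(started, newNames)
```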
Here, we will double the data.
example %>%
mutate(replication = 1) %>%
loop$begin() %>%
mutate(replication = 2) %>%
loop$end(bind_rows) %>%
kable
id | toFix | group | replication |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 0 | 1 | 1 |
3 | 1 | 1 | 1 |
4 | 1 | 0 | 1 |
1 | 0 | 1 | 2 |
2 | 0 | 1 | 2 |
3 | 1 | 1 | 2 |
4 | 1 | 0 | 2 |
Loops within loops are in fact quite possible, but I would be cautious using them. It can be exhilarating, but make sure to indent each loop carefully. Also, it is a good idea to give a name to each loop. This allows one to interpret `loop$stack` for debugging. Here is a quick example that filters the data, replicates the columns, and then re-merges.
example %>%
loop$begin(name = "original") %>%
filter(group == 1) %>%
loop$begin(name = "filtered") %>%
names %>%
paste0("Extra") %>%
loop$end(setNames) %>%
rename(id = idExtra) %>%
loop$end(amend, by = "id") %>%
kable
## Joining by: "id"
id | toFix | group | toFixExtra | groupExtra |
---|---|---|---|---|
1 | 0 | 1 | 0 | 1 |
2 | 0 | 1 | 0 | 1 |
3 | 1 | 1 | 1 | 1 |
4 | 1 | 0 | NA | NA |
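Unrolling the nested loops into named intermediates can help when checking what each level saved. The sketch below assumes that, because the two tables share no non-key columns here, the final `amend` step reduces to a plain `dplyr::full_join`. The names `original`, `filtered`, and `renamed` mirror the loop names but are otherwise arbitrary.

```r
library(dplyr)

example <- tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))

original <- example                       # saved by the outer loop
filtered <- filter(original, group == 1)  # saved by the inner loop

# Inner loop: rename every column of the filtered data, then restore the key name.
renamed <- filtered %>%
  setNames(paste0(names(filtered), "Extra")) %>%
  rename(id = idExtra)

# Outer loop: merge the renamed copy back onto the original by id.
remerged <- full_join(original, renamed, by = "id")
```

The row with `id` 4 was filtered out before renaming, so it has `NA` in both of the `Extra` columns.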