Abstract

This vignette is an introduction to the package groupdata2.
groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data.
We will go through finding and creating groups automatically with the ‘l_starts’ method.

For a more extensive description of groupdata2, please see the vignette ‘Description of groupdata2’.

Contact author at r-pkgs@ludvigolsen.dk

Introduction

In this vignette, we will use the ‘l_starts’ method with group() to transfer information from one dataset to another. We will use the automatic grouping option, which finds the group starts by itself.

Attach packages

library(groupdata2)
library(dplyr) # %>%
library(knitr) # kable

Data

Three participants were asked to solve a task. They took turns, but each could complete multiple runs of the task before taking a break and letting the next participant take over. Each participant had 2 turns. Let’s call each turn a session, i.e. there were 6 sessions in total. A team of experts rated how well the participant did throughout the entire session. This meant that a participant with some bad runs had to choose between saving energy for their other session and trying to improve the rating of the current session. For each run of the task, we recorded how many errors the participant made.


df_observations <- data.frame(
  "run" = 1:30,
  "participant" = c(1,1,1,1,
             2,2,2,2,2,2,
             3,3,3,3,
             1,1,1,1,1,1,1,
             2,2,2,
             3,3,3,3,3,3),
  "errors" = c(3,2,5,3,
               0,0,1,1,0,1,
               6,4,3,1,
               2,1,3,2,1,1,0,
               0,0,1,
               3,3,4,2,2,1)
)

# Show the first 20 rows of the data frame
df_observations %>% head(20) %>% kable()

run	participant	errors
1	1	3
2	1	2
3	1	5
4	1	3
5	2	0
6	2	0
7	2	1
8	2	1
9	2	0
10	2	1
11	3	6
12	3	4
13	3	3
14	3	1
15	1	2
16	1	1
17	1	3
18	1	2
19	1	1
20	1	1


df_ratings <- data.frame(
  "session" = c(1:6),
  "rating" = c(3,8,2,5,9,4)
)

df_ratings %>% kable()

session	rating
1	3
2	8
3	2
4	5
5	9
6	4

We would like to get the ratings into the data frame with the observations. To do this, we will first create a session column and then look up the rating for each session. We will create the session column using group() with the ‘l_starts’ method. This method takes group start values, finds those values in a specified column, and creates groups that begin at the start values. To show this, let’s try it out with manually specified start values before having group() find them automatically.


group(df_observations, n = c(1,2,3,1,2,3), method = 'l_starts', 
      starts_col = 'participant', col_name = 'session') %>% 
  kable()

run	participant	errors	session
1	1	3	1
2	1	2	1
3	1	5	1
4	1	3	1
5	2	0	2
6	2	0	2
7	2	1	2
8	2	1	2
9	2	0	2
10	2	1	2
11	3	6	3
12	3	4	3
13	3	3	3
14	3	1	3
15	1	2	4
16	1	1	4
17	1	3	4
18	1	2	4
19	1	1	4
20	1	1	4
21	1	0	4
22	2	0	5
23	2	0	5
24	2	1	5
25	3	3	6
26	3	3	6
27	3	4	6
28	3	2	6
29	3	2	6
30	3	1	6

group() went through the participant column and searched for one value from n at a time. When it encountered the value, it noted the row index and continued down the column, searching for the next value in n. In the end, it started groups at the found row indices, from top to bottom. Since our data has the same value in the participant column for the entire session, we can actually get group() to find these group starts automatically. It will go through the given column, and whenever it encounters a new value, i.e. one that is different from the value in the previous row, it starts a new group.
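
The two searches described above can be sketched in a few lines of base R. This is only an illustration of the idea, not groupdata2’s actual implementation, and the helper names find_start_indices() and auto_start_indices() are hypothetical.

```r
# Hypothetical helpers illustrating the idea; not groupdata2's implementation

# The participant column from the data above
participant <- rep(c(1, 2, 3, 1, 2, 3), times = c(4, 6, 4, 7, 3, 6))

# Search with explicit start values: find each value in turn,
# looking only below the row where the previous value was found
find_start_indices <- function(x, starts) {
  indices <- integer(length(starts))
  from <- 1
  for (i in seq_along(starts)) {
    hits <- which(x == starts[i] & seq_along(x) >= from)
    if (length(hits) == 0) stop("Start value not found: ", starts[i])
    indices[i] <- hits[1]
    from <- hits[1] + 1
  }
  indices
}

# Automatic search: a group starts at the first row and wherever
# the value differs from the value in the previous row
auto_start_indices <- function(x) {
  which(c(TRUE, x[-1] != x[-length(x)]))
}

manual <- find_start_indices(participant, c(1, 2, 3, 1, 2, 3))
auto <- auto_start_indices(participant)
manual  # 1 5 11 15 22 25
auto    # 1 5 11 15 22 25

# Turn the start indices into a group number for every row
session <- findInterval(seq_along(participant), auto)
```

Both searches arrive at the same start rows for this data, because every session consists of one unbroken run of the same participant value.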

df_observations <- group(df_observations, n = 'auto', 
                         method = 'l_starts',
                         starts_col = 'participant', 
                         col_name = 'session') 

df_observations %>% 
  kable()

run	participant	errors	session
1	1	3	1
2	1	2	1
3	1	5	1
4	1	3	1
5	2	0	2
6	2	0	2
7	2	1	2
8	2	1	2
9	2	0	2
10	2	1	2
11	3	6	3
12	3	4	3
13	3	3	3
14	3	1	3
15	1	2	4
16	1	1	4
17	1	3	4
18	1	2	4
19	1	1	4
20	1	1	4
21	1	0	4
22	2	0	5
23	2	0	5
24	2	1	5
25	3	3	6
26	3	3	6
27	3	4	6
28	3	2	6
29	3	2	6
30	3	1	6

And it works! :)
If you just want the group starts, you can use the function find_starts().
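
As a brief, hedged example of that, the sketch below rebuilds the participant column from the data above and calls find_starts() twice: once for the start values and once, with return_index = TRUE, for the row indices of the starts. The data frame name df_example is only used for this illustration.

```r
library(groupdata2)

# Rebuild the participant column in a small data frame (illustrative name)
df_example <- data.frame(
  "participant" = rep(c(1, 2, 3, 1, 2, 3), times = c(4, 6, 4, 7, 3, 6))
)

# The values at which a new group starts
start_values <- find_starts(df_example, col = "participant")

# The row indices at which a new group starts
start_indices <- find_starts(df_example, col = "participant",
                             return_index = TRUE)

start_values
start_indices
```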

Now that we have the session information, we can transfer the ratings from the ratings data frame.

df_merged <- merge(df_observations, df_ratings, by = 'session')

# Show head of df_merged
df_merged %>% head(15) %>% kable()

session	run	participant	errors	rating
1	1	1	3	3
1	2	1	2	3
1	3	1	5	3
1	4	1	3	3
2	5	2	0	8
2	6	2	0	8
2	7	2	1	8
2	8	2	1	8
2	9	2	0	8
2	10	2	1	8
3	11	3	6	2
3	12	3	4	2
3	13	3	3	2
3	14	3	1	2
4	15	1	2	5
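
For readers who would like to see the lookup spelled out, here is a minimal base R sketch of what the merge achieves: each observation row receives the rating whose session matches its own. It uses match() alongside merge(), and the small data frames obs and ratings are illustrative stand-ins for the ones above.

```r
# Illustrative stand-ins for the observations and ratings data frames
obs <- data.frame(
  "run" = 1:6,
  "session" = c(1, 1, 2, 2, 3, 3)
)
ratings <- data.frame(
  "session" = 1:3,
  "rating" = c(3, 8, 2)
)

# For each observation, find the row in `ratings` with the same session
lookup <- match(obs$session, ratings$session)
obs$rating <- ratings$rating[lookup]
obs$rating  # 3 3 8 8 2 2

# merge() pairs runs and ratings in the same way
merged <- merge(obs[, c("run", "session")], ratings, by = "session")
```

By default, merge() keeps only rows with a match in both data frames, whereas the match() lookup keeps every observation and leaves NA where a session has no rating.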

Now, we can find the average number of errors per session and see if they correlate with the experts’ ratings.

avg_errors <- df_merged %>% 
  group_by(session) %>% 
  dplyr::summarize("avg_errors" = mean(errors))
#> `summarise()` ungrouping output (override with `.groups` argument)

avg_errors %>% kable()

session	avg_errors
1	3.2500000
2	0.5000000
3	3.5000000
4	1.4285714
5	0.3333333
6	2.5000000
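
These averages can be checked independently in base R. The sketch below rebuilds the errors and session columns from the data above and uses tapply(); it is a cross-check, not part of the groupdata2 workflow.

```r
# Errors per run, laid out one session per line
errors <- c(3, 2, 5, 3,
            0, 0, 1, 1, 0, 1,
            6, 4, 3, 1,
            2, 1, 3, 2, 1, 1, 0,
            0, 0, 1,
            3, 3, 4, 2, 2, 1)

# Session number for each run (session lengths: 4, 6, 4, 7, 3, 6)
session <- rep(1:6, times = c(4, 6, 4, 7, 3, 6))

# Mean number of errors within each session
avg_by_session <- tapply(errors, session, mean)
avg_by_session
```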

Let’s transfer the averages to the merged data frame. Once again, we just use merge(). Since we have just one rating per session, we keep only the first row of each session.

df_summarized <- merge(df_merged, avg_errors, by = 'session') %>% 
  group_by(session) %>%  # For each session
  filter(row_number()==1) %>%  # Get first row
  select(-errors) # Remove errors column as we use avg_errors now

df_summarized %>% kable()

session	run	participant	rating	avg_errors
1	1	1	3	3.2500000
2	5	2	8	0.5000000
3	11	3	2	3.5000000
4	15	1	5	1.4285714
5	22	2	9	0.3333333
6	25	3	4	2.5000000

We now have 1 row per session with the participant, the rating and the average errors. If we wanted to know how many runs a session contained, we could derive it from the ‘run’ column, which now holds the first run of each session.
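
As a rough sketch of how the run counts could be obtained, the base R code below counts rows per session in the full data and, alternatively, derives the same counts from the first run of each session. The session column is rebuilt from the data above, and the variable names are illustrative.

```r
# Session number and run number for each of the 30 runs
session <- rep(1:6, times = c(4, 6, 4, 7, 3, 6))
run <- seq_along(session)

# Option 1: count the rows in each session
runs_per_session <- as.vector(table(session))
runs_per_session  # 4 6 4 7 3 6

# Option 2: use only the first run of each session; the count is the gap
# to the next session's first run (the total closes off the last session)
first_run <- run[!duplicated(session)]
runs_from_starts <- diff(c(first_run, length(run) + 1))
runs_from_starts  # 4 6 4 7 3 6
```

The second option needs the total number of runs, since the first run of the last session does not reveal where that session ends.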

Now let’s check if there’s a correlation between ratings and average errors.

cor(df_summarized$rating, df_summarized$avg_errors)
#> [1] -0.9739425

It seems they are strongly negatively correlated, so sessions with fewer errors on average received higher ratings, and vice versa.
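
To see where that number comes from, the Pearson correlation can be computed by hand from the six session values shown above. This is a worked check in base R; keep in mind that six sessions is a small sample, so the estimate is uncertain.

```r
# One rating and one average error count per session
rating <- c(3, 8, 2, 5, 9, 4)
avg_errors <- c(13 / 4, 3 / 6, 14 / 4, 10 / 7, 1 / 3, 15 / 6)

# Deviations from the means
d_rating <- rating - mean(rating)
d_errors <- avg_errors - mean(avg_errors)

# Pearson's r: summed products of deviations, scaled by the
# square root of the product of the summed squared deviations
r_manual <- sum(d_rating * d_errors) /
  sqrt(sum(d_rating^2) * sum(d_errors^2))

r_manual
cor(rating, avg_errors)  # built-in version, agrees with r_manual
```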

Automatic groups with groupdata2

Ludvig Renbo Olsen

2020-06-15
