Abstract

groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data.
Create balanced folds for cross-validation or divide a time series into windows.
Balance group sizes with up- and downsampling.
This vignette contains descriptions of functions and methods, along with simple examples of usage. For a gentler introduction to groupdata2, please see Introduction to groupdata2.

Contact author at r-pkgs@ludvigolsen.dk

Grouping Methods

There are currently 9 methods for grouping the data.

It is possible to create groups based on number of groups (default), group size, list of group sizes, list of group start positions, step size or prime number to start at. These can be passed as whole number(s) or percentage(s), while method ‘l_starts’ can also use ‘auto’.

Here, we will take a look at the different methods.

Method: ‘greedy’

‘greedy’ uses group size for dividing up the data.
Greedy means that each group grabs as many elements as possible (up to the specified size). The last group might therefore receive fewer elements, but all groups other than the last are guaranteed to have the specified size.

Example

We have a vector with 57 values. We want to have group sizes of 10.

The greedy splitter will return groups with this many values in them:
10, 10, 10, 10, 10, 7

By setting force_equal to TRUE, we discard the last group if it contains fewer values than the other groups.

Example

We have a vector with 57 values. We want to have group sizes of 10.

The greedy splitter with force_equal set to TRUE will return groups with this many values in them:
10, 10, 10, 10, 10

meaning that 7 values have been discarded.
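The arithmetic behind the two examples can be sketched outside of R. The Python function below is an illustration only, not the package's implementation, and the name greedy_sizes is made up for this sketch.

```python
def greedy_sizes(n_elements, group_size, force_equal=False):
    """Return the group sizes obtained by filling groups of `group_size` in order."""
    full_groups, remainder = divmod(n_elements, group_size)
    sizes = [group_size] * full_groups
    # Leftover elements form a smaller last group unless they are discarded.
    if remainder and not force_equal:
        sizes.append(remainder)
    return sizes


print(greedy_sizes(57, 10))                    # [10, 10, 10, 10, 10, 7]
print(greedy_sizes(57, 10, force_equal=True))  # [10, 10, 10, 10, 10]
```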

Method: ‘n_dist’ (Default)

‘n_dist’ uses a specified number of groups to divide up the data.
First it creates equal groups as large as possible. Then, if there are any excess data points, it distributes them across the groups.

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_dist’ with default settings would return groups with this many values in them:

11, 11, 12, 11, 12

By setting force_equal to TRUE, ‘n_dist’ will create the largest possible, equally sized groups by discarding excess data elements.

Example

‘n_dist’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.
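One rule that reproduces the sizes above is to give element i (counting from 1) the group index ceiling(i * n / N), where N is the number of elements and n the number of groups. Whether the package computes it exactly this way is an assumption; the Python sketch below, with the hypothetical name n_dist_sizes, only shows that the rule spreads the excess across the groups.

```python
def n_dist_sizes(n_elements, n_groups, force_equal=False):
    """Group sizes when excess elements are spread across the groups."""
    if force_equal:
        # Largest possible equal size; the excess is discarded.
        return [n_elements // n_groups] * n_groups
    sizes = [0] * n_groups
    for i in range(1, n_elements + 1):
        # ceiling(i * n_groups / n_elements), computed with integers only
        group = -(-i * n_groups // n_elements)
        sizes[group - 1] += 1
    return sizes


print(n_dist_sizes(57, 5))                    # [11, 11, 12, 11, 12]
print(n_dist_sizes(57, 5, force_equal=True))  # [11, 11, 11, 11, 11]
```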

Method: ‘n_fill’

‘n_fill’ uses a specified number of groups to divide up the data.
First it creates equal groups as large as possible. Then, if there are any excess data points, it places them in the first groups.
With descending set to TRUE, the excess data points are placed in the last groups instead.

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_fill’ with default settings would return groups with this many values in them:

12, 12, 11, 11, 11

By setting force_equal to TRUE, ‘n_fill’ will create the largest possible, equally sized groups by discarding excess data elements.

Example

‘n_fill’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.
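As a rough illustration of the filling rule, including the descending option, here is a small Python sketch. It is not the package's code, and n_fill_sizes is a name chosen for this example.

```python
def n_fill_sizes(n_elements, n_groups, force_equal=False, descending=False):
    """Group sizes when excess elements go to the first (or last) groups."""
    base, excess = divmod(n_elements, n_groups)
    sizes = [base] * n_groups
    if force_equal:
        return sizes  # the excess elements are discarded
    for k in range(excess):
        # One extra element per group, starting from the chosen end.
        index = n_groups - 1 - k if descending else k
        sizes[index] += 1
    return sizes


print(n_fill_sizes(57, 5))                   # [12, 12, 11, 11, 11]
print(n_fill_sizes(57, 5, descending=True))  # [11, 11, 11, 12, 12]
```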

Method: ‘n_last’

‘n_last’ uses a specified number of groups to divide up the data.

With default settings, it tries to make the groups as equally sized as possible, but notice that the last group might contain fewer or more elements if the length of the data is not divisible by the number of groups. All groups but the last are guaranteed to contain the same number of elements.

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_last’ with default settings would return groups with this many values in them:

11, 11, 11, 11, 13

By setting force_equal to TRUE, ‘n_last’ will create the largest possible, equally sized groups by discarding excess data elements.

Example

‘n_last’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.

Notice that ‘n_last’ will always return the given number of groups. It will never return a group with zero elements. For some situations that means that the last group will contain a lot of elements. Asked to divide a vector with 57 elements into 20 groups, the first 19 groups will contain 2 elements, while the last group will itself contain 19 elements. Had we instead asked it to divide the vector into 19 groups, we would have had 3 elements in all groups.
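The numbers in this section can be checked with a short Python sketch. It assumes that the shared group size is the data length divided by the number of groups, rounded down, which matches the examples here; the function name n_last_sizes is hypothetical and the code is not taken from the package.

```python
def n_last_sizes(n_elements, n_groups, force_equal=False):
    """Group sizes when the last group absorbs everything that is left over."""
    base = n_elements // n_groups  # assumed size of all groups but the last
    if force_equal:
        return [base] * n_groups
    last = n_elements - base * (n_groups - 1)
    return [base] * (n_groups - 1) + [last]


print(n_last_sizes(57, 5))        # [11, 11, 11, 11, 13]
print(n_last_sizes(57, 20)[-1])   # 19
print(set(n_last_sizes(57, 19)))  # {3}
```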

Method: ‘n_rand’

‘n_rand’ uses a specified number of groups to divide up the data.
First it creates equal groups as large as possible. Then, if there are any excess data points, it places them randomly in the groups.
N.B.: It only places one extra element per group.

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_rand’ with default settings could return groups with this many values in them:

12, 11, 11, 11, 12

By setting force_equal to TRUE, ‘n_rand’ will create the largest possible, equally sized groups by discarding excess data elements.

Example

‘n_rand’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.
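A minimal Python sketch of the random placement follows. It is an illustration rather than the package's implementation; the name n_rand_sizes and the seed parameter are assumptions made for reproducibility.

```python
import random


def n_rand_sizes(n_elements, n_groups, force_equal=False, seed=None):
    """Group sizes when excess elements are placed in randomly chosen groups."""
    base, excess = divmod(n_elements, n_groups)
    sizes = [base] * n_groups
    if force_equal:
        return sizes  # the excess elements are discarded
    rng = random.Random(seed)
    # sample() picks distinct groups, so no group gets more than one extra element.
    for index in rng.sample(range(n_groups), excess):
        sizes[index] += 1
    return sizes


sizes = n_rand_sizes(57, 5, seed=1)
print(sorted(sizes))  # [11, 11, 11, 12, 12]
```

Which two groups receive the extra element depends on the random draw, so only the sorted sizes are stable.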

Method: ‘l_sizes’

‘l_sizes’ divides up the data by a list of group sizes.
Excess data points are placed in an extra group at the end.

n is a list/vector of group sizes

Example

We have a vector with 57 values. We want to get back 3 groups containing 20%, 30% and 50% of the data points.

‘l_sizes’ with n = c(0.2, 0.3) would return groups with this many values in them (the excess data points form the third group):

11, 17, 29

By setting force_equal to TRUE, ‘l_sizes’ discards any excess elements.

Example

‘l_sizes’ with n = c(0.2, 0.3) and force_equal set to TRUE would return groups with this many values in them:

11, 17

meaning that 29 values have been discarded.
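The sizes 11 and 17 are consistent with rounding 57 * 0.2 = 11.4 and 57 * 0.3 = 17.1 down. The Python sketch below makes that rounding assumption explicit; it is not the package's code and the name l_sizes_counts is hypothetical.

```python
import math


def l_sizes_counts(n_elements, sizes, force_equal=False):
    """Turn a list of sizes (counts or shares) into group sizes."""
    counts = []
    for size in sizes:
        if 0 < size < 1:
            # Shares are converted to counts by rounding down (an assumption).
            counts.append(math.floor(n_elements * size))
        else:
            counts.append(int(size))
    excess = n_elements - sum(counts)
    # Excess data points form an extra group at the end unless discarded.
    if excess > 0 and not force_equal:
        counts.append(excess)
    return counts


print(l_sizes_counts(57, [0.2, 0.3]))                    # [11, 17, 29]
print(l_sizes_counts(57, [0.2, 0.3], force_equal=True))  # [11, 17]
```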

Method: ‘l_starts’

‘l_starts’ starts new groups at specified values of the vector.

n is a list of starting positions. Skip values by c(value, skip_to_number) where skip_to_number is the nth appearance of the value in the vector. Groups automatically start from first data point.

If passing n = ‘auto’, the starting positions are automatically found with find_starts().

If data is a data frame, starts_col must be set to indicate the column in which to match the starts.
Set starts_col to ‘index’ or ‘.index’ to match against row names. ‘index’ first looks for a column named ‘index’ in data, while ‘.index’ ignores any column in data named ‘.index’.

Example

We have a vector with 57 values ranging from 1 to 57 (1:57). We want to get back groups starting at specific values in the vector.

‘l_starts’ with n = c(1, 3, 7, 25, 50) would return groups with this many values in them:

2, 4, 18, 25, 8

force_equal does not have any effect with method ‘l_starts’.

Skipping

Groups can start at the nth appearance of the value by using c(value, skip_to_number).

Example

We have a vector with the values c(“a”, “e”, “o”, “a”, “e”, “o”) and want to start groups at the first “a”, the first following “e” and the second following “o”.

‘l_starts’ with n = list(“a”, “e”, c(“o”, 2)) would return groups with this many values in them:

1, 4, 1
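Both ‘l_starts’ examples, with and without skipping, can be traced with the Python sketch below. It assumes that each start is searched for after the previously matched start and that skip_to_number counts appearances from that point; the function name l_starts_sizes and the use of tuples for c(value, skip_to_number) are choices of this sketch, not the package's API.

```python
def l_starts_sizes(values, starts):
    """Group sizes for starts given as values or (value, skip_to_number) tuples."""
    boundaries = [0]  # groups automatically start from the first data point
    position = 0      # index from which the next start is searched
    for start in starts:
        value, skip_to = start if isinstance(start, tuple) else (start, 1)
        seen = 0
        for index in range(position, len(values)):
            if values[index] == value:
                seen += 1
                if seen == skip_to:
                    break
        else:
            raise ValueError(f"start value {value!r} not found")
        if index != boundaries[-1]:
            boundaries.append(index)
        position = index + 1
    boundaries.append(len(values))
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


print(l_starts_sizes(list(range(1, 58)), [1, 3, 7, 25, 50]))
# [2, 4, 18, 25, 8]
print(l_starts_sizes(["a", "e", "o", "a", "e", "o"], ["a", "e", ("o", 2)]))
# [1, 4, 1]
```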

Automatically find group starts

Using the find_starts() function, ‘l_starts’ is capable of finding the beginning of groups automatically.
A group start is a value which differs from the previous value.

Example

We have a vector with the values c(“a”, “a”, “o”, “o”, “o”, “a”, “a”) and want to automatically discover groups of data and group them.

‘l_starts’ with n = ‘auto’ would return groups with this many values in them:

2, 3, 2

find_starts()

find_starts() finds group starts in a given vector.
A group start is a value which differs from the previous value.
Setting return_index to TRUE returns indices of group starts.

Example

We have a vector with the values c(“a”, “a”, “o”, “o”, “o”, “a”, “a”) and want to automatically discover group starts.

find_starts() would return these group starts:

“a”, “o”, “a”
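The rule "a group start is a value which differs from the previous value" is easy to state in code. The following Python sketch is an illustration only; the name find_starts_sketch is made up, and indices are reported from 1 to mirror R's indexing.

```python
def find_starts_sketch(values, return_index=False):
    """Return the values (or 1-based positions) at which a new group starts."""
    starts = []
    for index, value in enumerate(values):
        # The first element always starts a group; later ones only if they differ.
        if index == 0 or value != values[index - 1]:
            starts.append(index + 1 if return_index else value)
    return starts


data = ["a", "a", "o", "o", "o", "a", "a"]
print(find_starts_sketch(data))                     # ['a', 'o', 'a']
print(find_starts_sketch(data, return_index=True))  # [1, 3, 6]
```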

find_missing_starts()

find_missing_starts() tells you the values and (optionally) skip_to numbers that would be recursively removed when using the ‘l_starts’ method with the remove_missing_starts argument set to TRUE.
Set return_skip_numbers to FALSE to get only the missing values without the skip_to numbers.

Example

We have a vector with the values c(“a”, “a”, “o”, “o”, “o”, “a”, “a”) and a vector of starting positions c(“a”,“d”,“o”,“p”,“a”).

find_missing_starts() would return this list of values and skip_to numbers:

list(c(“d”,1), c(“p”,1))
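One way to read this example: each start is searched for after the previously found start, and a start that cannot be found is reported and skipped. The Python sketch below follows that reading, which is an assumption about the matching order; missing_starts_sketch is a hypothetical name and tuples stand in for c(value, skip_to_number).

```python
def missing_starts_sketch(values, starts, return_skip_numbers=True):
    """Report the starts that cannot be matched in `values`, in order."""
    missing = []
    position = 0  # index after the most recently found start
    for start in starts:
        value, skip_to = start if isinstance(start, tuple) else (start, 1)
        matches = [i for i in range(position, len(values)) if values[i] == value]
        if len(matches) >= skip_to:
            position = matches[skip_to - 1] + 1
        else:
            # Not found: report it and keep searching from the same position.
            missing.append((value, skip_to) if return_skip_numbers else value)
    return missing


data = ["a", "a", "o", "o", "o", "a", "a"]
print(missing_starts_sketch(data, ["a", "d", "o", "p", "a"]))
# [('d', 1), ('p', 1)]
```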

Method: ‘staircase’

‘staircase’ uses step_size to divide up the data.
For each group, the group size will be step size multiplied with the group index.

Example

We have a vector with 57 values. We specify a step size of 5.

‘staircase’ with default settings would return groups with this many values in them:

5, 10, 15, 20, 7

By setting force_equal to TRUE, ‘staircase’ will discard the last group if it does not contain the expected number of values (step size multiplied by group index).

Example

‘staircase’ with force_equal set to TRUE would return groups with this many values in them:

5, 10, 15, 20

meaning that 7 values have been discarded.
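The staircase sizes follow directly from "step size multiplied by group index". A small Python sketch of that rule is given here for illustration; it is not the package's code, and staircase_sizes is a name invented for the example.

```python
def staircase_sizes(n_elements, step_size, force_equal=False):
    """Group sizes that grow by `step_size` with every group."""
    sizes = []
    remaining = n_elements
    group_index = 1
    while remaining > 0:
        expected = step_size * group_index
        if remaining < expected:
            # Incomplete last group: kept by default, dropped with force_equal.
            if not force_equal:
                sizes.append(remaining)
            break
        sizes.append(expected)
        remaining -= expected
        group_index += 1
    return sizes


print(staircase_sizes(57, 5))                    # [5, 10, 15, 20, 7]
print(staircase_sizes(57, 5, force_equal=True))  # [5, 10, 15, 20]
```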

Find remainder - %staircase%

When using the staircase method, the last group might not have the size of the second-to-last group + step size.
Use %staircase% to find the remainder.

If the last group has the size of the second last group + step size, %staircase% will return 0.

Example

%staircase% on a vector with size 57 and step size of 5 would look like this:

57 %staircase% 5

and return:

7

meaning that the last group would contain 7 values.
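The remainder can be worked out by subtracting ever larger steps until the next step no longer fits. This Python sketch illustrates the calculation under the assumption of a positive step size; staircase_remainder is a hypothetical name, not the R operator itself.

```python
def staircase_remainder(n_elements, step_size):
    """Number of elements left for an incomplete last group (0 if none)."""
    expected = step_size
    while n_elements >= expected:
        n_elements -= expected
        expected += step_size  # the next group is one step larger
    return n_elements


print(staircase_remainder(57, 5))  # 7
print(staircase_remainder(50, 5))  # 0
```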

Method: ‘primes’

‘primes’ creates groups with sizes of prime numbers in a staircasing design. n is the prime number to start at (size of first group).

Prime numbers are generated with the ‘numbers’ package by Hans Werner Borchers.

Example

We have a vector with 57 values. We specify n (start at) as 5.

‘primes’ with default settings would return groups with this many values in them:

5, 7, 11, 13, 17, 4

By setting force_equal to TRUE, ‘primes’ will discard the last group if it does not contain the expected number of values.

Example

‘primes’ with force_equal set to TRUE would return groups with this many values in them:

5, 7, 11, 13, 17

meaning that 4 values have been discarded.
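To make the sequence of sizes concrete, here is a Python sketch that walks through consecutive primes. It assumes that the starting value is itself prime and uses simple trial division instead of the ‘numbers’ package; the names is_prime and primes_sizes are made up for this illustration.

```python
def is_prime(number):
    """Trial-division primality check, sufficient for small group sizes."""
    if number < 2:
        return False
    divisor = 2
    while divisor * divisor <= number:
        if number % divisor == 0:
            return False
        divisor += 1
    return True


def primes_sizes(n_elements, start_at, force_equal=False):
    """Group sizes that follow consecutive primes, beginning at `start_at`."""
    sizes = []
    remaining = n_elements
    size = start_at
    while remaining > 0:
        if remaining < size:
            # Incomplete last group: kept by default, dropped with force_equal.
            if not force_equal:
                sizes.append(remaining)
            break
        sizes.append(size)
        remaining -= size
        size += 1
        while not is_prime(size):
            size += 1
    return sizes


print(primes_sizes(57, 5))                    # [5, 7, 11, 13, 17, 4]
print(primes_sizes(57, 5, force_equal=True))  # [5, 7, 11, 13, 17]
```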

Find remainder - %primes%

When using the primes method, the last group might not have the size of the associated prime number, if there are not enough elements. Use %primes% to find the remainder.

Returns 0 if the last group has the size of the associated prime number.

Example

%primes% on a vector with size 57 and n (start at) as 5 would look like this:

57 %primes% 5

and return:

4

meaning that the last group would contain 4 values.
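The same remainder can be reproduced by subtracting consecutive primes until the next one no longer fits. The Python sketch below assumes a prime starting value; primes_remainder and next_prime are hypothetical helper names, and the code does not call the R operator.

```python
def primes_remainder(n_elements, start_at):
    """Number of elements left for an incomplete last group (0 if none)."""

    def next_prime(number):
        candidate = number + 1
        # Advance until no divisor between 2 and sqrt(candidate) is found.
        while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
            candidate += 1
        return candidate

    size = start_at
    while n_elements >= size:
        n_elements -= size
        size = next_prime(size)
    return n_elements


print(primes_remainder(57, 5))  # 4
print(primes_remainder(53, 5))  # 0
```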

Balancing ID Methods

There are currently 4 methods for balancing on ID level in balance().

ID method: ‘n_ids’

Balances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.

ID method: ‘n_rows_c’

Attempts to level the number of rows per category, while only removing/adding entire IDs.
This is done in 2 steps:

1. If a category needs to add all its rows one or more times, the data is repeated.
2. Iteratively, the ID with the number of rows closest to the lacking/excessive number of rows is added/removed. This happens until adding/removing the closest ID would lead to a size further from the target size than the current size. If multiple IDs are closest, one is randomly sampled.

ID method: ‘distributed’

Distributes the lacking/excess rows equally between the IDs. If the number to distribute can not be equally divided, some IDs will have 1 row more/less than the others.
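The division into equal shares plus a leftover of one row for some IDs can be sketched as follows. This Python code only illustrates the arithmetic: which IDs receive the extra row is a choice of the sketch (the first ones), the case where an ID has too few rows to give up is not handled, and distribute_rows is a hypothetical name.

```python
def distribute_rows(rows_per_id, target_total):
    """Spread the lacking or excess rows of a category equally over its IDs."""
    ids = list(rows_per_id)
    difference = target_total - sum(rows_per_id.values())
    sign = 1 if difference >= 0 else -1
    share, leftover = divmod(abs(difference), len(ids))
    new_counts = {}
    for position, id_ in enumerate(ids):
        # The first `leftover` IDs change by one row more than the others.
        change = share + (1 if position < leftover else 0)
        new_counts[id_] = rows_per_id[id_] + sign * change
    return new_counts


print(distribute_rows({"a": 5, "b": 3, "c": 4}, 17))  # {'a': 7, 'b': 5, 'c': 5}
print(distribute_rows({"a": 5, "b": 3, "c": 4}, 8))   # {'a': 3, 'b': 2, 'c': 3}
```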

ID method: ‘nested’

Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.

Arguments

Grouping arguments

These are the arguments for group_factor(), group(), splt(), fold(), partition()

data

Type: data frame or vector

The data to process.

Used in: group_factor(), group(), splt(), fold(), partition()

n

Type: integer, numeric, character, or list

n represents either number of groups (default), group size, list of group sizes, list of group starts, step size or prime number to start at, depending on which method is specified.
n can be given as whole number(s) (n > 1) or as percentage(s) (0 < n < 1).

Method l_starts allows n = ‘auto’.

Used in: group_factor(), group(), splt()

method

Type: character

Choose which method to use when dividing up the data.
Available methods: greedy, n_dist, n_fill, n_last, n_rand, l_starts, l_sizes, staircase, or primes

Used in: group_factor(), group(), splt(), fold()

starts_col

Type: character

Name of column with values to match in method ‘l_starts’ when data is a data frame.

Pass ‘index’ or ‘.index’ to use row names. ‘index’ first looks for a column named ‘index’ in data, while ‘.index’ ignores any column in data named ‘.index’.

Used in: group_factor(), group(), splt()

force_equal

Type: logical (TRUE or FALSE)

If you need groups with the exact same size, set force_equal to TRUE.
Implementation is different in the different methods. Read more in their sections above.
Be aware that this setting discards excess datapoints!

Used in: group_factor(), group(), splt(), partition()

allow_zero

Type: logical (TRUE or FALSE)

If you set n to 0, you get an error.
If you don’t want this behavior, you can set allow_zero to TRUE, and (depending on the function) you will get the following output:

group_factor() will return the factor with NAs instead of numbers. It will be the same length as expected.

group() will return the expected data frame with NAs instead of a grouping factor.

splt() functions will return the given data (data frame or vector) in the same list format as if it had been split.

Used in: group_factor(), group(), splt()

descending

Type: logical (TRUE or FALSE)

In methods like ‘n_fill’ where it makes sense to change the direction of the method, you can use this argument.
In ‘n_fill’ it fills up the excess data points starting from the last group instead of the first.
N.B. Only some of the methods can use this argument.

Used in: group_factor(), group(), splt()

randomize

Type: logical (TRUE or FALSE)

After creating the grouping factor using the chosen method, it is possible to randomly reorganize it before returning it. Notice that this applies to all the functions that allow for the argument, as group() and splt() use the grouping factor!

Used in: group_factor(), group(), splt()

N.B. fold() and partition() always use some randomization.

col_name

Type: character

Name of added grouping factor column. Allows multiple grouping factors in a data frame.

Used in: group()

remove_missing_starts

Type: logical (TRUE or FALSE)

Recursively remove elements from the list of starts that are not found. For method ‘l_starts’ only.

Used in: group_factor(), group(), splt()

k

Type: integer or numeric

k represents either number of folds (default), fold size, or step size, depending on which method is specified.
k can be given as a whole number (k > 1) or as a percentage (0 < k < 1).

Used in: fold()

p

Type: integer or numeric

Size(s) of partition(s). Passed as vector if specifying multiple partitions.
p can be given as whole number(s) (p > 1) or as percentage(s) (0 < p < 1).

Used in: partition()

cat_col

Type: categorical vector or factor (passed as column name)

Categorical variable to balance between the groups.

E.g. when predicting a binary variable (‘a’ or ‘b’), we usually want both classes represented in every fold and partition.

N.B. If also passing id_col, cat_col should be a constant within IDs.
E.g. a participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds.

Used in: fold(), partition()

num_col

Type: numerical vector (passed as column name)

Numerical variable to balance between groups.

N.B. When used with id_col, values for each ID are aggregated using id_aggregation_fn before being balanced.
N.B. When passing num_col, the method argument is not used.

Used in: fold(), partition()

id_col

Type: Factor (passed as column name)

Factor with IDs. This will be used to keep all rows with an ID in the same group (if possible).

E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same fold/partition.

Used in: fold(), partition()

id_aggregation_fn

Type: Function

Function for aggregating values in num_col for each ID, before balancing by num_col.

N.B. Only used when num_col and id_col are both specified.

Used in: fold(), partition()

extreme_pairing_levels

Type: integer or numeric

How many levels of extreme pairing to do when balancing groups by num_col.

Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If extreme_pairing_levels > 1, this is done “recursively” on the extreme pairs.
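For a single level, the ordering described above (smallest, largest, second smallest, second largest, etc.) can be sketched in a few lines of Python. This shows the ordering only, not how the package assigns the pairs to groups, and extreme_order is a name made up for the example.

```python
def extreme_order(values):
    """Order values as smallest, largest, second smallest, second largest, ..."""
    ordered = sorted(values)
    result = []
    low, high = 0, len(ordered) - 1
    while low <= high:
        result.append(ordered[low])
        if low != high:  # with an odd count, the middle value is added once
            result.append(ordered[high])
        low += 1
        high -= 1
    return result


print(extreme_order([3, 1, 4, 2, 5]))  # [1, 5, 2, 4, 3]
```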

N.B. Values greater than 1 work best with large datasets. Always check if an increase actually makes the groups more balanced. There are examples of how to do this, and more detailed descriptions of the implementations, in the functions’ help files (?fold and ?partition).

Used in: fold(), partition()

num_fold_cols

Type: integer or numeric

Number of fold columns to create. This is useful for repeated cross-validation. If num_fold_cols > 1, columns will be named “.folds_1”, “.folds_2”, etc. Otherwise simply “.folds”.

N.B. If unique_fold_cols_only is TRUE, we can end up with fewer columns than specified, see max_iters.

N.B. If data has existing fold columns, see handle_existing_fold_cols.

Used in: fold()

unique_fold_cols_only

Type: logical (TRUE or FALSE)

Check if the fold columns are identical and keep only the unique columns.

N.B. As the column comparisons can be time consuming, we can run this part in parallel. See parallel.

N.B. We can end up with fewer columns than specified in num_fold_cols, see max_iters.

N.B. Only used when num_fold_cols > 1 or data has existing fold columns.

Used in: fold()

max_iters

Type: integer or numeric

Maximum number of attempts at reaching num_fold_cols unique fold columns.

When only keeping the unique fold columns, we risk having fewer columns than expected. Hence, we repeatedly create the missing columns and remove those that are not unique. This is done until we have num_fold_cols unique fold columns, or we have attempted max_iters times. In some cases, it is not possible to create num_fold_cols unique combinations of the dataset, e.g. when specifying cat_col, id_col and num_col.
max_iters specifies when to stop trying.

N.B. We can end up with fewer columns than specified in num_fold_cols.

N.B. Only used when num_fold_cols > 1.

Used in: fold()
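The retry loop described for max_iters can be pictured with the Python sketch below. It is a simplified illustration, not the package's algorithm: folds are created by shuffling near-equal fold labels, and two columns are treated as identical when they differ only in how the folds are named (an assumption of this sketch). The function name unique_fold_columns and the seed parameter are hypothetical.

```python
import random


def unique_fold_columns(n_rows, n_folds, num_fold_cols, max_iters, seed=None):
    """Create up to `num_fold_cols` unique fold columns in at most `max_iters` attempts."""
    rng = random.Random(seed)
    base = [i % n_folds for i in range(n_rows)]  # near-equal fold sizes
    unique = {}
    for attempt in range(max_iters):
        missing = num_fold_cols - len(unique)
        if missing <= 0:
            break
        for _ in range(missing):
            column = base[:]
            rng.shuffle(column)
            # Relabel folds by first appearance so that renamed folds compare equal.
            relabel = {}
            key = tuple(relabel.setdefault(fold, len(relabel)) for fold in column)
            unique.setdefault(key, column)  # duplicates are dropped
    return list(unique.values())


# Two rows can only be split into two folds in one way, so a request for
# three unique columns stops after max_iters attempts with a single column.
print(len(unique_fold_columns(2, 2, num_fold_cols=3, max_iters=5, seed=1)))  # 1
```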

handle_existing_fold_cols

Type: Character

How to handle existing fold columns. Either “keep_warn”, “keep”, or “remove”.

To add extra fold columns, use “keep” or “keep_warn”. Note that existing fold columns might be renamed.

To replace the existing fold columns, use “remove”.

Used in: fold()

parallel

Type: logical (TRUE or FALSE)

Whether to parallelize the fold column comparisons, when unique_fold_cols_only is TRUE.

N.B. Requires a registered parallel backend, e.g. one registered with doParallel::registerDoParallel.

Used in: fold()

list_out

Type: logical (TRUE or FALSE)

Return list of partitions (TRUE) or a grouped data frame (FALSE).

Used in: partition()

Balancing arguments

These are the arguments for balance(), upsample(), downsample()