pivot()
vs Base R aggregate()
Aggregation is the process forming groups of data based on the levels of categorical variables, then computing some statistical value, such as a mean, for each group on a numeric variable. For example, compute the mean Salary for each combination of the levels of Gender and Department. Refer to each group, the cross-classification of all the specified categorical variables, as a cell.
The lessR pivot()
function performs this aggregation by relying upon the base R function aggregate()
. By default, pivot()
generates a long-form data frame pivot table (Excel terminology), which can then be directly input into analysis functions as a standard data frame. The levels for all the grouping variables are listed in the rows, but if there are specified column grouping variables, pivot()
relies upon base R reshape()
to form a 2-d table for direct viewing instead of a data table to input into further analysis functions.
pivot()
provides additional features than aggregate()
provides.
value
over which to aggregate, the sample size for each cell is provided with the number of missing values.value
aggregated.Also, there is a different interface as the variables, for rows and columns, are each specified as a vector.
The following table lists many of the available statistical functions by which to aggregate over the cells.
Statistic | Meaning |
---|---|
sum |
sum |
mean |
arithmetic mean |
median |
median |
min |
minimum |
max |
maximum |
sd |
standard deviation |
var |
variance |
IQR |
inter-quartile range |
mad |
mean absolute deviation |
tabulate |
count of each cell only |
Some statistical functions are available that return multiple values.
Statistic | Meaning |
---|---|
range |
minimum, maximum |
quantile |
range + quartiles |
summary |
quantile + mean |
These later three functions output their values as an R matrix, which then replaces the values of the variable that is aggregated in the resulting output data frame.
The following pivot()
parameters specify the data, the statistic for the aggregation, the value(s) over which to aggregate, and the corresponding cells that contain the aggregated values. The first four parameter values listed below are required.
data
: The data frame that includes the variables of interest.compute
: The function for which to perform the aggregation.value
: The variable(s) for which to summarize, i.e., aggregate.rows_by
: The categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values.cols_by
: The optional categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table.Multiple values of parameters value
, rows_by
, and up to two cols_by
variables may be specified. Express the multiple categorical variables over which to pivot as a vector, such as with the c()
function. If cols_by
is not specified, the result is a linear table as a long-form data frame that can be input into other data analysis procedures.
By default, missing values are eliminated from the aggregation. If there are no values for a cell for which to perform the aggregation, and it is desired to have the aggregated value listed as missing, then specify na_value
as TRUE
. If any of the levels of the by
variables are missing, to report those missing cells, specify na_by
as TRUE
. Set na_remove
to TRUE
to have any aggregated value defined as missing if any individual data cell for value
is missing.
To illustrate, use the Employee data set included in lessR, here read into the d data frame. Begin with no missing data.
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 15 ... 1 2 10
## 2 Gender character 37 0 2 M M M ... F F M
## 3 Dept character 36 1 5 ADMN SALE SALE ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low low ... high low high
## 6 Plan integer 37 0 3 1 1 3 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 96 ... 83 59 80
## 8 Post integer 37 0 22 92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
Two categorical variables in the d data frame are Dept and Salary. A continuous variable is Salary. Create the long-form pivot table as a data frame that expresses the mean of Salary for all combinations of Dept and Salary.
## Dept Gender n miss Salary
## 1 ACCT F 3 0 63237.16
## 2 ADMN F 4 0 81434.00
## 3 FINC F 1 0 57139.90
## 4 MKTG F 5 0 64496.02
## 5 SALE F 5 0 64188.25
## 6 ACCT M 2 0 59626.20
## 7 ADMN M 2 0 80963.35
## 8 FINC M 3 0 72967.60
## 9 MKTG M 1 0 99062.66
## 10 SALE M 10 0 86150.97
The output of pivot()
is a data frame of the aggregated variables. This can be saved for further analysis. Here, perform the same analysis, but list the required parameter values in order without the parameter names.
Because the output of pivot()
with no cols_by
variables is a standard R data frame, typical operations such as sorting can be applied, here using the lessR function Sort()
.
##
## Sort Specification
## 1 --> ascending
## Dept Gender n miss Salary
## 1 ACCT F 3 0 63237.16
## 6 ACCT M 2 0 59626.20
## 2 ADMN F 4 0 81434.00
## 7 ADMN M 2 0 80963.35
## 3 FINC F 1 0 57139.90
## 8 FINC M 3 0 72967.60
## 4 MKTG F 5 0 64496.02
## 9 MKTG M 1 0 99062.66
## 5 SALE F 5 0 64188.25
## 10 SALE M 10 0 86150.97
Multiple value
variables for which to aggregate over can also be specified. Round the numerical aggregated results to two decimal digits.
## Dept Gender n miss Years Salary
## 1 ACCT F 3 0 4.67 63237.16
## 2 ADMN F 4 0 7.50 81434.00
## 3 FINC F 1 0 7.00 57139.90
## 4 MKTG F 5 0 8.20 64496.02
## 5 SALE F 5 0 6.60 64188.25
## 6 ACCT M 2 0 7.00 59626.20
## 7 ADMN M 2 0 15.50 80963.34
## 8 FINC M 3 0 11.33 72967.60
## 9 MKTG M 1 0 18.00 99062.66
## 10 SALE M 10 0 12.33 86150.97
The following illustrates the pipe operator from the magrittr package with pivot()
. The package is not included with lessR, so separately load with library()
, either by itself or as part of the tidyverse package. As such, the following code is not run.
Perform the pivot computations on the d data frame. Then output the aggregated results to the a data frame.
Specify up to two cols_by
categorical variables, to create a two-dimensional table with the specified columns. First, one cols_by
variable, Gender. Specifying one or two categorical variables as cols_by
variables moves them from their default position in the rows to the columns, which changes the output structure from a long-form data frame to a cross-classification table with categorical variables in the rows and columns.
Table: mean of Salary
Gender F M
Dept
------- --------- ---------
ACCT 63237.16 59626.20
ADMN 81434.00 80963.34
FINC 57139.90 72967.60
MKTG 64496.02 99062.66
SALE 64188.25 86150.97
Here two cols_by
variables, specified as a vector.
Table: mean of Salary
Gender F M
Plan 1 2 3 1 2 3
Dept
------- --------- --------- --------- --------- ---------- ---------
MKTG 56772.95 66426.79 NA NA NA 99062.66
SALE 60941.54 66352.73 NA 89393.40 82442.74 80131.91
ACCT NA 63237.16 NA 69547.60 NA 49704.79
ADMN NA 67724.21 122563.4 53788.26 108138.43 NA
FINC NA 57139.90 NA 61937.62 NA 95027.55
Tabulation is counting. With tabulation there is no value
variable per se. Instead, the number of data values in each cell are tabulated, that is, counted.
## Dept Gender n
## 1 ACCT F 3
## 2 ADMN F 4
## 3 FINC F 1
## 4 MKTG F 5
## 5 SALE F 5
## 6 ACCT M 2
## 7 ADMN M 2
## 8 FINC M 3
## 9 MKTG M 1
## 10 SALE M 10
The pivot table follows with the missing data analysis when there is only one numerical value over which to aggregate. The variable Years has one missing value.
## by n miss Years
## 1 ACCT 5 0 5.600000
## 2 ADMN 6 0 10.166667
## 3 FINC 4 0 10.250000
## 4 MKTG 6 0 9.833333
## 5 SALE 14 1 10.285714
Set na_remove
to FALSE
to not remove any missing data in a cell with values to be aggregated. The resulting aggregated value will be missing if any of the constituent data values are missing. The corresponding level does not appear because cells with aggregated missing values are not shown.
## by n miss Years
## 1 ACCT 5 0 5.600000
## 2 ADMN 6 0 10.166667
## 3 FINC 4 0 10.250000
## 4 MKTG 6 0 9.833333
The na_remove
parameter specifies the value of the base R parameter na.rm
for computations such as for the mean
. See ?mean
for its definition.
Here include all the cells, even those with a missing aggregated value.
## by n miss Years
## 1 ACCT 5 0 5.600000
## 2 ADMN 6 0 10.166667
## 3 FINC 4 0 10.250000
## 4 MKTG 6 0 9.833333
## 5 SALE 14 1 NA
To account for missing values of the categorical by
variables, set na_by
to TRUE
.
## Dept Gender n miss Years
## 1 ACCT F 3 0 4.666667
## 2 ADMN F 4 0 7.500000
## 3 FINC F 1 0 7.000000
## 4 MKTG F 5 0 8.200000
## 5 SALE F 5 0 6.600000
## 6 <NA> F 1 0 5.000000
## 7 ACCT M 2 0 7.000000
## 8 ADMN M 2 0 15.500000
## 9 FINC M 3 0 11.333333
## 10 MKTG M 1 0 18.000000
## 11 SALE M 9 1 12.333333
Specify the count for levels that include missing cells. Invoke na_by
for tabulation.
## Dept Gender n
## 1 ACCT F 3
## 2 ADMN F 4
## 3 FINC F 1
## 4 MKTG F 5
## 5 SALE F 5
## 6 <NA> F 1
## 7 ACCT M 2
## 8 ADMN M 2
## 9 FINC M 3
## 10 MKTG M 1
## 11 SALE M 10
## 12 <NA> M 0