cdata

John Mount, Win-Vector LLC

2020-02-01

The cdata package is a demonstration of the “coordinatized data” theory and includes an implementation of the “fluid data” methodology.

Briefly, cdata supplies data transform operators that:

A quick example:

library("cdata")

# first few rows of the iris data as an example
d <- wrapr::build_frame(
   "Sepal.Length"  , "Sepal.Width", "Petal.Length", "Petal.Width", "Species" |
     5.1           , 3.5          , 1.4           , 0.2          , "setosa"  |
     4.9           , 3            , 1.4           , 0.2          , "setosa"  |
     4.7           , 3.2          , 1.3           , 0.2          , "setosa"  |
     4.6           , 3.1          , 1.5           , 0.2          , "setosa"  |
     5             , 3.6          , 1.4           , 0.2          , "setosa"  |
     5.4           , 3.9          , 1.7           , 0.4          , "setosa"  )
d$iris_id <- seq_len(nrow(d))

knitr::kable(d)
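| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | iris_id |
|-------------:|------------:|-------------:|------------:|:--------|--------:|
|          5.1 |         3.5 |          1.4 |         0.2 | setosa  |       1 |
|          4.9 |         3.0 |          1.4 |         0.2 | setosa  |       2 |
|          4.7 |         3.2 |          1.3 |         0.2 | setosa  |       3 |
|          4.6 |         3.1 |          1.5 |         0.2 | setosa  |       4 |
|          5.0 |         3.6 |          1.4 |         0.2 | setosa  |       5 |
|          5.4 |         3.9 |          1.7 |         0.4 | setosa  |       6 |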

Now suppose we want to take the above “all facts about each iris are in a single row” representation and convert it into a per-iris record block with the following structure.

record_example <- wrapr::qchar_frame(
   "plant_part"  , "measurement", "value"      |
     "sepal"     , "width"      , Sepal.Width  |
     "sepal"     , "length"     , Sepal.Length |
     "petal"     , "width"      , Petal.Width  |
     "petal"     , "length"     , Petal.Length )

knitr::kable(record_example)
| plant_part | measurement | value        |
|:-----------|:------------|:-------------|
| sepal      | width       | Sepal.Width  |
| sepal      | length      | Sepal.Length |
| petal      | width       | Petal.Width  |
| petal      | length      | Petal.Length |

The above sort of transformation may seem exotic, but it is fairly common when we want to plot many aspects of a record at the same time.

To specify our transformation we combine the record example with information about how records are keyed (recordKeys showing which rows go together to form a record, and controlTableKeys specifying the internal structure of a data record).

layout <- rowrecs_to_blocks_spec(
  record_example,
  controlTableKeys = c("plant_part", "measurement"),
  recordKeys = c("iris_id", "Species"))

print(layout)
## {
##  row_record <- wrapr::qchar_frame(
##    "iris_id"  , "Species", "Sepal.Width", "Sepal.Length", "Petal.Width", "Petal.Length" |
##      .        , .        , Sepal.Width  , Sepal.Length  , Petal.Width  , Petal.Length   )
##  row_keys <- c('iris_id', 'Species')
## 
##  # becomes
## 
##  block_record <- wrapr::qchar_frame(
##    "iris_id"  , "Species", "plant_part", "measurement", "value"      |
##      .        , .        , "sepal"     , "width"      , Sepal.Width  |
##      .        , .        , "sepal"     , "length"     , Sepal.Length |
##      .        , .        , "petal"     , "width"      , Petal.Width  |
##      .        , .        , "petal"     , "length"     , Petal.Length )
##  block_keys <- c('iris_id', 'Species', 'plant_part', 'measurement')
## 
##  # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
## }

In the above we have used the common and useful data-organizing trick of specifying a dependent column (Species being a function of iris_id) as an additional key.
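The dependent-key trick is only safe when the extra column really is determined by the primary key. A minimal base R sketch of one way to check that before listing both columns in recordKeys; the helper name `is_function_of` is hypothetical and is not part of cdata:

```r
# Hypothetical helper (not part of cdata): returns TRUE when every value of
# column `key` is paired with exactly one value of column `dependent` in `df`.
is_function_of <- function(df, key, dependent) {
  # keep one row per distinct (key, dependent) combination
  pairs <- unique(df[, c(key, dependent), drop = FALSE])
  # a key value appearing twice means it maps to two different dependents
  !any(duplicated(pairs[[key]]))
}

d <- data.frame(
  iris_id = c(1L, 2L, 3L),
  Species = c("setosa", "setosa", "setosa"),
  stringsAsFactors = FALSE)

is_function_of(d, "iris_id", "Species")
```

If the check fails, the column is not a dependent of the key and adding it to the record keys would split what was meant to be a single record.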

This layout then specifies and implements the data transform. We can transform the data by sending it to the layout.

d_transformed <- d %.>% 
  layout

knitr::kable(d_transformed)
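| iris_id | Species | plant_part | measurement | value |
|--------:|:--------|:-----------|:------------|------:|
|       1 | setosa  | sepal      | width       |   3.5 |
|       1 | setosa  | sepal      | length      |   5.1 |
|       1 | setosa  | petal      | width       |   0.2 |
|       1 | setosa  | petal      | length      |   1.4 |
|       2 | setosa  | sepal      | width       |   3.0 |
|       2 | setosa  | sepal      | length      |   4.9 |
|       2 | setosa  | petal      | width       |   0.2 |
|       2 | setosa  | petal      | length      |   1.4 |
|       3 | setosa  | sepal      | width       |   3.2 |
|       3 | setosa  | sepal      | length      |   4.7 |
|       3 | setosa  | petal      | width       |   0.2 |
|       3 | setosa  | petal      | length      |   1.3 |
|       4 | setosa  | sepal      | width       |   3.1 |
|       4 | setosa  | sepal      | length      |   4.6 |
|       4 | setosa  | petal      | width       |   0.2 |
|       4 | setosa  | petal      | length      |   1.5 |
|       5 | setosa  | sepal      | width       |   3.6 |
|       5 | setosa  | sepal      | length      |   5.0 |
|       5 | setosa  | petal      | width       |   0.2 |
|       5 | setosa  | petal      | length      |   1.4 |
|       6 | setosa  | sepal      | width       |   3.9 |
|       6 | setosa  | sepal      | length      |   5.4 |
|       6 | setosa  | petal      | width       |   0.4 |
|       6 | setosa  | petal      | length      |   1.7 |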
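For readers who prefer not to use the dot arrow pipe, the same layout can be applied with an ordinary function call. The sketch below assumes the installed cdata exports `layout_by()` taking the layout first and the data second; the two-row data frame is a cut-down stand-in for the example data above.

```r
library("cdata")

# two row-records, reduced from the iris example
d <- data.frame(
  iris_id = c(1L, 2L),
  Species = c("setosa", "setosa"),
  Sepal.Length = c(5.1, 4.9),
  Sepal.Width = c(3.5, 3.0),
  Petal.Length = c(1.4, 1.4),
  Petal.Width = c(0.2, 0.2),
  stringsAsFactors = FALSE)

record_example <- wrapr::qchar_frame(
  "plant_part", "measurement", "value"      |
  "sepal"     , "width"      , Sepal.Width  |
  "sepal"     , "length"     , Sepal.Length |
  "petal"     , "width"      , Petal.Width  |
  "petal"     , "length"     , Petal.Length )

layout <- rowrecs_to_blocks_spec(
  record_example,
  controlTableKeys = c("plant_part", "measurement"),
  recordKeys = c("iris_id", "Species"))

# pipe form, as used in the text
via_pipe <- d %.>% layout

# assumed equivalent function-call form
via_call <- layout_by(layout, d)
```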

And it is easy to invert these transforms using the t() transpose/adjoint notation.

inv_layout <- t(layout)

print(inv_layout)
## {
##  block_record <- wrapr::qchar_frame(
##    "iris_id"  , "Species", "plant_part", "measurement", "value"      |
##      .        , .        , "sepal"     , "width"      , Sepal.Width  |
##      .        , .        , "sepal"     , "length"     , Sepal.Length |
##      .        , .        , "petal"     , "width"      , Petal.Width  |
##      .        , .        , "petal"     , "length"     , Petal.Length )
##  block_keys <- c('iris_id', 'Species', 'plant_part', 'measurement')
## 
##  # becomes
## 
##  row_record <- wrapr::qchar_frame(
##    "iris_id"  , "Species", "Sepal.Width", "Sepal.Length", "Petal.Width", "Petal.Length" |
##      .        , .        , Sepal.Width  , Sepal.Length  , Petal.Width  , Petal.Length   )
##  row_keys <- c('iris_id', 'Species')
## 
##  # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
## }

d_transformed %.>%
  inv_layout %.>%
  knitr::kable(.)
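| iris_id | Species | Sepal.Width | Sepal.Length | Petal.Width | Petal.Length |
|--------:|:--------|------------:|-------------:|------------:|-------------:|
|       1 | setosa  |         3.5 |          5.1 |         0.2 |          1.4 |
|       2 | setosa  |         3.0 |          4.9 |         0.2 |          1.4 |
|       3 | setosa  |         3.2 |          4.7 |         0.2 |          1.3 |
|       4 | setosa  |         3.1 |          4.6 |         0.2 |          1.5 |
|       5 | setosa  |         3.6 |          5.0 |         0.2 |          1.4 |
|       6 | setosa  |         3.9 |          5.4 |         0.4 |          1.7 |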
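One way to confirm that a layout and its transpose really undo each other is a round-trip comparison. The following is a sketch under the assumption that row and column order may differ after the round trip, so both are realigned before comparing; the two-row data frame is a reduced version of the example data.

```r
library("cdata")

d <- data.frame(
  iris_id = c(1L, 2L),
  Species = c("setosa", "setosa"),
  Sepal.Length = c(5.1, 4.9),
  Sepal.Width = c(3.5, 3.0),
  Petal.Length = c(1.4, 1.4),
  Petal.Width = c(0.2, 0.2),
  stringsAsFactors = FALSE)

record_example <- wrapr::qchar_frame(
  "plant_part", "measurement", "value"      |
  "sepal"     , "width"      , Sepal.Width  |
  "sepal"     , "length"     , Sepal.Length |
  "petal"     , "width"      , Petal.Width  |
  "petal"     , "length"     , Petal.Length )

layout <- rowrecs_to_blocks_spec(
  record_example,
  controlTableKeys = c("plant_part", "measurement"),
  recordKeys = c("iris_id", "Species"))
inv_layout <- t(layout)

# row records -> blocks -> row records
round_trip <- d %.>% layout %.>% inv_layout

# realign rows and columns with the original before comparing
round_trip <- as.data.frame(round_trip)
round_trip <- round_trip[order(round_trip$iris_id), colnames(d), drop = FALSE]
rownames(round_trip) <- NULL

isTRUE(all.equal(d, round_trip, check.attributes = FALSE))
```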

The layout specifications themselves are just simple lists with “pretty print methods” (the control table being simply an example record in the form of a data.frame).

unclass(layout)
## $controlTable
##   plant_part measurement        value
## 1      sepal       width  Sepal.Width
## 2      sepal      length Sepal.Length
## 3      petal       width  Petal.Width
## 4      petal      length Petal.Length
## 
## $recordKeys
## [1] "iris_id" "Species"
## 
## $controlTableKeys
## [1] "plant_part"  "measurement"
## 
## $checkNames
## [1] TRUE
## 
## $checkKeys
## [1] FALSE
## 
## $strict
## [1] FALSE
## 
## $allow_rqdatatable
## [1] TRUE
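Because a layout is a plain list underneath, its parts can be read with the usual list accessors. A short sketch, assuming the field names are those shown in the `unclass(layout)` output above:

```r
library("cdata")

record_example <- wrapr::qchar_frame(
  "plant_part", "measurement", "value"      |
  "sepal"     , "width"      , Sepal.Width  |
  "sepal"     , "length"     , Sepal.Length |
  "petal"     , "width"      , Petal.Width  |
  "petal"     , "length"     , Petal.Length )

layout <- rowrecs_to_blocks_spec(
  record_example,
  controlTableKeys = c("plant_part", "measurement"),
  recordKeys = c("iris_id", "Species"))

# which columns identify a record
record_keys <- layout$recordKeys

# which control table columns identify a row within a record
block_keys <- layout$controlTableKeys

# the example record itself, stored as a data.frame
control_table <- layout$controlTable
```

Reading the fields this way is handy for logging or for checking a specification programmatically before applying it.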

Notice that almost all of the time and space in using cdata goes into specifying how your data is structured now and how it is to be structured after the transform.

The main cdata interfaces are given by the following set of methods:

Some convenience functions include:

The package vignettes can be found in the “Articles” tab of the cdata documentation site.

The (older) recommended tutorial is: Fluid data reshaping with cdata. We also have an (older) short free cdata screencast (and another example can be found here).