tabshiftr

CRAN_Status_Badge Travis-CI Build Status AppVeyor build status Coverage Status Lifecycle:maturing

Overview

Data are stored in many different ways in tables or spreadsheets because no strict semantic or topographic standards for the organisation of tables are commonly accepted. In the R environment the tidy paradigm is a first step towards interoperability of data, in that it requires a certain arrangement of tables, where variables are recorded in columns and observations in rows (see https://tidyr.tidyverse.org/). Tables can be tidied (i.e., brought into a tidy arrangement) via packages such as tidyr, however, all functions that deal with reshaping tables to date require data that are already organised into topologically coherent, rectangular tables. This is often violated in practice, for example when data are scraped off of the internet.

tabshiftr fills this gap in the toolchain for reproducible data management via schema descriptions and a reorganise() function that is largely based on tidyr. We call data that are not topologically coherent disorganised (messy) data and they are characterised not only by “being messy” (as in opposed to tidy), but also by being arranged so that not all variables are part of the same “chunk” of data or even that some variables are only available by implication.

Installation

  1. Install the official version from CRAN:
install.packages("tabshiftr")

or the latest development version from github:

devtools::install_github("EhrmannS/tabshiftr")
  1. The vignette provides an instruction on how to set up schema descriptions and shows the table arrangements that can be reorganised in this version, and which schemas apply to them.

Examples

A disorganised table may look like the following table:

library(readr)
library(tabshiftr)
library(knitr)

input <- read_csv(file = paste0(system.file("test_datasets", package = "tabshiftr"), "/table13.csv"),
                  col_names = FALSE, col_types = cols(.default = "c"))
kable(input)
X1 X2 X3 X4 X5 X6 X7
commodities harvested production . . . .
unit 1 . . . . . .
soybean 1111 1112 year 1 . . .
maize 1121 1122 year 1 . . .
soybean 1211 1212 year 2 . . .
maize 1221 1222 year 2 . . .
. . . . . . .
commodities harvested production commodities harvested production .
unit 2 . . unit 3 . . .
soybean 2111 2112 soybean 3111 3112 year 1
maize 2121 2122 maize 3121 3122 year 1
soybean 2211 2212 soybean 3211 3212 year 2
maize 2221 2222 maize 3221 3222 year 2

This table contains two chunks of data at the rows 1 to 6 and 8 to 13 and they are further organised into three similar groups (so-called clusters). The clusters are somewhat systematic in that there are three variables in common, commodities, harvested and production. The variable year has values but no explicit name and is not available for each cluster, and the variable territories is not even stored in a column, but is implicit and noted in the second row of each cluster.

In tabshiftr we distinguish between identifying variables, variables that describe an observational unit, and measured variables, variables that provide some value for that observational unit. In the example, territories, year and commodities are identifying variables and harvested and production are measured variables, they give the harvested area and the production of the given commodity at a given year in a given territory.

The approach of tabshiftr is based on capturing the arrangement of tables in so-called schema description. Typically there is an input and an output schema, describing the arrangement of the input and output tables, respectively. The advantage of this procedure is that input and output tables exist explicitly and the schema maps the transformation of the data. As we want to end up with tidy tables the output schema is pre-determined by a tidy table of the included variables, but the input schema description needs to be provided by the user.

For the input schema we have to describe, according to a hierarchical set of rules, in which rows and columns of a table which information can be found (see the example below). First, we need to describe whether the table is organised in several clusters, and if so where they are located. In the example, the “origin” of the first cluster is at the first row from the top and the first cell from left (1|1), the second at 1|8 and the third cluster originates at 4|8. All clusters are 3 cells wide and 6 cells high. Moreover, we already recognised that each cluster represents one territorial unit, so we record the variable territories as cluster ID. A table should have some sort of header, i.e., one or more rows that describes which information a column contains and we need to register this header, in the example this is the first row of each cluster. When there are several clusters that are identical in arrangement (but not in values), it makes sense to describe them in terms of relative values, i.e., starting to count from the origin of the cluster instead of the spreadsheet.

Input tables may contain many more data/variables than what we are interested in, but the schema description contains only those variables we want to have in the output table. For each variable we need to define at least the type and either column(s), or column(s) and row(s). Some variables are distinct from the cluster outline, because they occur perhaps only once in the table, or are organised in a non-systematic way, such as the variable year in the example. Some variables are either absent from the table, or are available only implicitly, for example if several spreadsheets or files contain information for one level of an identifying variable, such as per commodity. This is not the case in the example, but if it were so, this variable could be specified by providing the value of the level in the variable. For measured variables we also need to provide the target unit and a transformation factor to that unit, the latter of which will be multiplied with the values of the measured variable. This results in the following schema description for our example:

library(tabshiftr)
schema <- makeSchema(schema = list(
  clusters =
    list(row = c(1, 8, 8), col = c(1, 1, 4), width = 3, height = 6, id = "territories"),
  header = list(row = 1, rel = TRUE),
  variables =
    list(territories =
           list(type = "id", row = 1, col = 1, rel = TRUE),
         year =
           list(type = "id", row = c(3:6), col = 4, dist = TRUE),
         commodities =
           list(type = "id", col = 1, rel = TRUE),
         harvested =
           list(type = "measured", unit = "ha", factor = 1, col = 2, rel = TRUE),
         production =
           list(type = "measured", unit = "t", factor = 1, col = 3, rel = TRUE))
))

Some further fields that have not been mentioned here are:

schema
#>   3 clusters
#>     origin: 1|1, 8|1, 8|4  (row|col)
#>     id    : territories
#> 
#>    variable      type       row   col   rel   dist 
#>   ------------- ---------- ----- ----- ----- ------  
#>    territories   id         1     1     T     F  
#>    year          id         3:6   4     F     T  
#>    commodities   id               1     T     F  
#>    harvested     measured         2     T     F  
#>    production    measured         3     T     F

Finally, the input table is reorganised simply by calling reorganise().

output <- reorganise(input = input, schema = schema)
kable(output)
territories year commodities harvested production
unit 1 year 1 maize 1121 1122
unit 1 year 1 soybean 1111 1112
unit 1 year 2 maize 1221 1222
unit 1 year 2 soybean 1211 1212
unit 2 year 1 maize 2121 2122
unit 2 year 1 soybean 2111 2112
unit 2 year 2 maize 2221 2222
unit 2 year 2 soybean 2211 2212
unit 3 year 1 maize 3121 3122
unit 3 year 1 soybean 3111 3112
unit 3 year 2 maize 3221 3222
unit 3 year 2 soybean 3211 3212

Contributions

Contributions to those points and discussions on where tabshiftr should go are highly welcome!

Acknowledgement

This work was supported by funding to Carsten Meyer through the Flexpool mechanism of the German Centre for Integrative Biodiversity Research (Div) (FZT-118, DFG).