Data are stored in many different ways in tables or spreadsheets because no strict semantic or topographic standards for the organisation of tables are commonly accepted. In the R environment the tidy paradigm is a first step towards interoperability of data, in that it requires a certain arrangement of tables, where variables are recorded in columns and observations in rows (see https://tidyr.tidyverse.org/). Tables can be tidied (i.e., brought into a tidy arrangement) via packages such as tidyr
, however, all functions that deal with reshaping tables to date require data that are already organised into topologically coherent, rectangular tables. This is often violated in practice, for example when data are scraped off of the internet.
tabshiftr
fills this gap in the toolchain for reproducible data management via schema
descriptions and a reorganise()
function that is largely based on tidyr
. We call data that are not topologically coherent disorganised (messy) data and they are characterised not only by “being messy” (as in opposed to tidy), but also by being arranged so that not all variables are part of the same “chunk” of data or even that some variables are only available by implication.
or the latest development version from github:
A disorganised table may look like the following table:
library(readr)
library(tabshiftr)
library(knitr)
input <- read_csv(file = paste0(system.file("test_datasets", package = "tabshiftr"), "/table13.csv"),
col_names = FALSE, col_types = cols(.default = "c"))
kable(input)
X1 | X2 | X3 | X4 | X5 | X6 | X7 |
---|---|---|---|---|---|---|
commodities | harvested | production | . | . | . | . |
unit 1 | . | . | . | . | . | . |
soybean | 1111 | 1112 | year 1 | . | . | . |
maize | 1121 | 1122 | year 1 | . | . | . |
soybean | 1211 | 1212 | year 2 | . | . | . |
maize | 1221 | 1222 | year 2 | . | . | . |
. | . | . | . | . | . | . |
commodities | harvested | production | commodities | harvested | production | . |
unit 2 | . | . | unit 3 | . | . | . |
soybean | 2111 | 2112 | soybean | 3111 | 3112 | year 1 |
maize | 2121 | 2122 | maize | 3121 | 3122 | year 1 |
soybean | 2211 | 2212 | soybean | 3211 | 3212 | year 2 |
maize | 2221 | 2222 | maize | 3221 | 3222 | year 2 |
This table contains two chunks of data at the rows 1 to 6 and 8 to 13 and they are further organised into three similar groups (so-called clusters). The clusters are somewhat systematic in that there are three variables in common, commodities
, harvested
and production
. The variable year
has values but no explicit name and is not available for each cluster, and the variable territories
is not even stored in a column, but is implicit and noted in the second row of each cluster.
In tabshiftr
we distinguish between identifying variables, variables that describe an observational unit, and measured variables, variables that provide some value for that observational unit. In the example, territories
, year
and commodities
are identifying variables and harvested
and production
are measured variables, they give the harvested area and the production of the given commodity at a given year in a given territory.
The approach of tabshiftr
is based on capturing the arrangement of tables in so-called schema description. Typically there is an input and an output schema, describing the arrangement of the input and output tables, respectively. The advantage of this procedure is that input and output tables exist explicitly and the schema maps the transformation of the data. As we want to end up with tidy tables the output schema is pre-determined by a tidy table of the included variables, but the input schema description needs to be provided by the user.
For the input schema we have to describe, according to a hierarchical set of rules, in which rows and columns of a table which information can be found (see the example below). First, we need to describe whether the table is organised in several clusters, and if so where they are located. In the example, the “origin” of the first cluster is at the first row from the top and the first cell from left (1|1
), the second at 1|8
and the third cluster originates at 4|8
. All clusters are 3 cells wide and 6 cells high. Moreover, we already recognised that each cluster represents one territorial unit, so we record the variable territories
as cluster ID. A table should have some sort of header, i.e., one or more rows that describes which information a column contains and we need to register this header, in the example this is the first row of each cluster. When there are several clusters that are identical in arrangement (but not in values), it makes sense to describe them in terms of rel
ative values, i.e., starting to count from the origin of the cluster instead of the spreadsheet.
Input tables may contain many more data/variables than what we are interested in, but the schema description contains only those variables we want to have in the output table. For each variable we need to define at least the type
and either col
umn(s), or column(s) and row
(s). Some variables are dist
inct from the cluster outline, because they occur perhaps only once in the table, or are organised in a non-systematic way, such as the variable year
in the example. Some variables are either absent from the table, or are available only implicitly, for example if several spreadsheets or files contain information for one level of an identifying variable, such as per commodity. This is not the case in the example, but if it were so, this variable could be specified by providing the value
of the level in the variable. For measured variables we also need to provide the target unit and a transformation factor to that unit, the latter of which will be multiplied with the values of the measured variable. This results in the following schema description for our example:
library(tabshiftr)
schema <- makeSchema(schema = list(
clusters =
list(row = c(1, 8, 8), col = c(1, 1, 4), width = 3, height = 6, id = "territories"),
header = list(row = 1, rel = TRUE),
variables =
list(territories =
list(type = "id", row = 1, col = 1, rel = TRUE),
year =
list(type = "id", row = c(3:6), col = 4, dist = TRUE),
commodities =
list(type = "id", col = 1, rel = TRUE),
harvested =
list(type = "measured", unit = "ha", factor = 1, col = 2, rel = TRUE),
production =
list(type = "measured", unit = "t", factor = 1, col = 3, rel = TRUE))
))
Some further fields that have not been mentioned here are:
split
, to separate an identifying variable via a regular expression from some columnkey
and value
, when the names of measured variables are provided in a column while the respective values are in a separate column. Here, key
is the column that contains the names and value
is the level of the measured variable that should be “pulled” out of key
.schema
#> 3 clusters
#> origin: 1|1, 8|1, 8|4 (row|col)
#> id : territories
#>
#> variable type row col rel dist
#> ------------- ---------- ----- ----- ----- ------
#> territories id 1 1 T F
#> year id 3:6 4 F T
#> commodities id 1 T F
#> harvested measured 2 T F
#> production measured 3 T F
Finally, the input table is reorganised simply by calling reorganise()
.
territories | year | commodities | harvested | production |
---|---|---|---|---|
unit 1 | year 1 | maize | 1121 | 1122 |
unit 1 | year 1 | soybean | 1111 | 1112 |
unit 1 | year 2 | maize | 1221 | 1222 |
unit 1 | year 2 | soybean | 1211 | 1212 |
unit 2 | year 1 | maize | 2121 | 2122 |
unit 2 | year 1 | soybean | 2111 | 2112 |
unit 2 | year 2 | maize | 2221 | 2222 |
unit 2 | year 2 | soybean | 2211 | 2212 |
unit 3 | year 1 | maize | 3121 | 3122 |
unit 3 | year 1 | soybean | 3111 | 3112 |
unit 3 | year 2 | maize | 3221 | 3222 |
unit 3 | year 2 | soybean | 3211 | 3212 |
setCluster()
, setHeader()
and setVariable()
, which could be piped into one another and finally into makeSchema()
.Contributions to those points and discussions on where tabshiftr should go are highly welcome!
This work was supported by funding to Carsten Meyer through the Flexpool mechanism of the German Centre for Integrative Biodiversity Research (Div) (FZT-118, DFG).