1 Introduction

Spreadsheets as places where data tables are recorded can be pretty messy. Ideally, data are organised into topologically coherent (rectangular) chunks of tidy data, where variables are recorded in columns and observations in rows (Wickham 2014). However, in spreadsheets that were scraped off the internet, all thinkable arrangements of the data may be encountered, culminating in several non-uniformly arranged tables that are placed non-systematically within one spreadsheet. Each of such individual chunks within a spreadsheet is called cluster and data that are not organised into topologically coherent chunks are called disorganised (messy) data in tabshiftr.

When it comes to the organisation of data in tables, variables can be distinguished into two kinds of variables:

Variables that identify the unit for which values have been measured (they are called identifying variables here).
Variables that have been measured and that consequently represent the target values of that measurement, be they continuous or categorical (they are called measured variables here).

These two variable types are the target variables in tabshiftr. The primary aim of reorganising messy tables lays in determining where those two kinds of variables are located in each cluster.

2 Setting up schema descriptions

To set up a schema description, the recommended strategy is the following:

Clarify which are the identifying variables and which are the measured variables and create a new entry for each of them in the schema. A variable is always a combination of name and values. Names are typically the column names and values are typically in the column below that name. However, in many disorganised tables variable names of measured variables are used as values of an identifying variable or measured variables are nested in identifying variables and values are thus not exclusively in the same column as their name.
Determine whether there are clusters and find the origin (top left cell) of each cluster (Tab. 8, Tab. 9 & Tab. 10). Follow the next steps for each cluster…
Determine which variable identifies clusters and provide that as cluster ID. It can also be the case that the data are organised into separate spreadsheets or files according to one of the variables and also those cases should be treated as if they were clusters, even if each spreadsheet/file contains a topologically coherent table. It may be that either an identifying variable, or a measured variable identifies clusters:
- in case it is an identifying variable, provide this variables’ name (and also register its metadata).
- in case it is a measured variable, provide simply "measured" as cluster ID.
Determine for each identifying variable the following:
- is the variable available at all? This is particularly important when the data are split up into sub-tables that are in different spreadsheets or files. Often the variable that splits up the data (and thus identifies the clusters) is not explicitly available in the spreadsheet anymore. In such a case, provide the variables value for this cluster in value (Tab. 5).
- all columns in which the variable names sit.
- in case the variable is in several columns, determine additionally the row in which its names sit (Tab. 3).
- whether the variable must be split off of another column (Tab. 4).
- whether the variable is distinct from the main table (Tab. 11).
Determine for each measured variable the following:
- all columns in which the values of the variable sit (Tab. 6).
- the unit and conversion factor.
- in case the variable is not tidy, one of the three following cases should apply:
  - in case the variable is nested in a wide identifying variable, determine in addition to the columns in which the values sit also the rows in which the variable name sits (Tab. 2 & Tab. 3).
  - in case the names of the variable are given as a value of an identifying variable (Tab. 7), give the column name as key, together with the respective name of the measured variable in values.
  - in case the name of the variable is the ID of clusters, specify key = "cluster" and in values the cluster number the variable refers to (Tab. 10).

3 Table types

From the point of view of tabshiftr there are four dimensions of how data can be disorganised. Several of these dimensions can co-occur in one table, which might make it tricky for some tables to find the correct schema ad-hoc.

Either the levels of an id variable, or the measured variables are organised into separate topologically coherent chunks (clusters) of data, instead of all variables in one clusters.
The names of measured variables are treated as if they were the levels of an id variable (they are listed), instead of each measured variable in its own column.
The levels of one or more id variable are spread over several columns, instead of all id variables in only one column.
Several id variables are stored in one column.

The following sections are organised according to those dimensions, first introducing the problem more in detail and then giving a description of some typical table arrangements, followed by a schema description that can be used for their reorganisation. This is certainly not an exhaustive list of possible table arrangements, but it covers certain basic building blocks of the most common operations and should be extensible enough to capture many mutations of the presented tables.

3.1 Spreadsheet contains one cluster

3.1.1 Separated measured variables

In case the measured variables are arranged into individual columns (Tab. 1), we have tidy data (Wickham 2014), which are already in the target arrangement. The variables in a tidy table may however, still need different names, units and transformation factors.

Table 1: A tidy table.
X1	X2	X3	X4	X5
territories	period	commodities	harvested	production
unit 1	year 1	soybean	1111	1112
unit 1	year 1	maize	1121	1122
unit 1	year 2	soybean	1211	1212
unit 1	year 2	maize	1221	1222
unit 2	year 1	soybean	2111	2112
unit 2	year 1	maize	2121	2122
unit 2	year 2	soybean	2211	2212
unit 2	year 2	maize	2221	2222

makeSchema(list(
  header = list(row = 1),
  variables =
    list(territories =
           list(type = "id", col = 1),
         year =
           list(type = "id", col = 2),
         commodities =
           list(type = "id", col = 3),
         harvested =
           list(type = "measured", unit = "ha", factor = 1, col = 4),
         production =
           list(type = "measured", unit = "t", factor = 1, col = 5))))

In case identifying variables are factors with a limited number of levels, those levels may be used as names of other variables, where they would be next to each other and thus "wide". The other variables (both identifying and measured variables) would then be "nested" in those wide identifying variables. In those cases we have to record the row and the specific columns in which the identifying variables names are found, because those values will be used to reconstruct the variable as "long" variable. Beware that in those case you need to change the default header specification.

Table 2: The values variables are nested within the identifying variable `commodities`.
X1	X2	X3	X4	X5	X6
territories	period	soybean		maize
		harvested	production	harvested	production
unit 1	year 1	1111	1112	1121	1122
unit 1	year 2	1211	1212	1221	1222
unit 2	year 1	2111	2112	2112	2122
unit 2	year 2	2211	2212	2121	2222

makeSchema(list(
  header = list(row = c(1, 2)),
  variables =
    list(territories =
           list(type = "id", col = 1),
         year =
           list(type = "id", col = 2),
         commodities =
           list(type = "id", row = 1, col = c(3, 5)),
         harvested =
           list(type = "measured", unit = "ha", factor = 1,
                row = 2, col = c(3, 5)),
         production =
           list(type = "measured", unit = "t", factor = 1,
                row = 2, col = c(4, 6)))))

In case several variables are nested within other variables, we have to specify for all nested or nesting variables in which rows they sit.

Table 3: The identifying variable `commodities` is nested in the identifying variable `period`. The target variable is spread across those nested columns.
X1	X2	X3	X4	X5	X6	X7	X8	X9
territories	year 1				year 2
	soybean		maize		soybean		maize
	harvested	production	harvested	production	harvested	production	harvested	production
unit 1	1111	1112	1121	1122	1211	1212	1221	1222
unit 2	2111	2211	2121	2221	2112	2212	2122	2222

makeSchema(list(
  header = list(row = c(1:3)),
  variables =
    list(territories =
           list(type = "id", col = 1),
         year =
           list(type = "id", row = 1, col = c(2, 6)),
         commodities =
           list(type = "id", row = 2, col = c(2, 4, 6, 8)),
         harvested =
           list(type = "measured", unit = "ha", factor = 1,
                row = 3, col = c(2, 4, 6, 8)),
         production =
           list(type = "measured", unit = "t", factor = 1,
                row = 3, col = c(3, 5, 7, 9)))))

3.1.2 Several variables in one column

Sometimes it may be the case that several variables are stored in the same column, for example when a territorial unit is given together with the year of observation (e.g. unit1, year1). In those cases, the resulting variables in the output schema need to be specified so that they point to that same column but extract information via a regular expression (find more information here). For example, .+?(?=_) gives everything up until the first _ and (?<=\\_).* everything after the _.

Table 4: Several variables are stored in the same column.
X1	X2	X3	X4
territories	unit	harvested	production
unit 1	year 1_soybean	1111	1112
unit 1	year 1_maize	1121	1122
unit 1	year 2_soybean	1211	1212
unit 1	year 2_maize	1221	1222
unit 2	year 1_soybean	2111	2112
unit 2	year 1_maize	2121	2122
unit 2	year 2_soybean	2211	2212
unit 2	year 2_maize	2221	2222

makeSchema(list(
  header = list(row = 1),
  variables =
    list(territories =
           list(type = "id", col = 1),
         year =
           list(type = "id", split = ".+?(?=_)", col = 2),
         commodities =
           list(type = "id", split = "(?<=\\_).*", col = 2),
         harvested =
           list(type = "measured", unit = "ha", factor = 1, col = 3),
         production =
           list(type = "measured", unit = "t", factor = 1, col = 4))))

3.1.3 Implicit variables

When data are split up into clusters, the identifying variable is often recorded not in the table as an explicit variable, but is only provided in the file or spreadsheet name. In those cases, we have to register the variable nevertheless, and provide the value of that variable in value = of the identifying variable.

Table 5: Several variables are stored in the same column.
X1	X2	X3	X4
some metadata, which does not tell us that this is in fact unit 1
commodities	harvested	production	period
soybean	1111	1112	year 1
maize	1121	1122	year 1
soybean	1211	1212	year 2
maize	1221	1222	year 2

makeSchema(list(
  clusters =
    list(row = 2, col = 1, id = "territories"),
  header = list(row = 1, rel = TRUE),
  variables =
    list(territories =
           list(type = "id", value = "unit 1"),
         year =
           list(type = "id", col = 4, rel = TRUE),
         commodities =
           list(type = "id", col = 1, rel = TRUE),
         harvested =
           list(type = "measured", unit = "ha", factor = 1,
                col = 2, rel = TRUE),
         production =
           list(type = "measured", unit = "t", factor = 1,
                col = 3, rel = TRUE))))

3.1.4 Listed measured variables

Some tables contain a column where the names of measured variables (harvested and production) are treated as if they were the values of an identifying variable (dimension), while the values are presented in only one column (values) (Tab. 6). To end up with tidy data in those cases, we need to extract the values associated with the measured variables. Thus, we define the measured variables and specify the key = in which the variable names sit, and the value = the variable name has, to extract that variable.

Table 6: The variable names of the measured variable are treated as if they were the values of an identifying variable.
X1	X2	X3	X4	X5
territories	period	commodities	dimension	values
unit 1	year 1	soybean	harvested	1111
unit 1	year 1	maize	harvested	1121
unit 1	year 1	soybean	production	1112
unit 1	year 1	maize	production	1122
unit 1	year 2	soybean	harvested	1211
unit 1	year 2	maize	harvested	1221
unit 1	year 2	soybean	production	1212
unit 1	year 2	maize	production	1222
unit 2	year 1	soybean	harvested	2111
unit 2	year 1	maize	harvested	2121
unit 2	year 1	soybean	production	2112
unit 2	year 1	maize	production	2122
unit 2	year 2	soybean	harvested	2211
unit 2	year 2	maize	harvested	2221
unit 2	year 2	soybean	production	2212
unit 2	year 2	maize	production	2222

makeSchema(list(
  header = list(row = 1),
  variables =
    list(territories =
           list(type = "id", col = 1),
         year =
           list(type = "id", col = 2),
         commodities =
           list(type = "id", col = 3),
         harvested =
           list(type = "measured", unit = "ha", factor = 1,
                col = 5, key = "dimension", value = "harvested"),
         production =
           list(type = "measured", unit = "t", factor = 1,
                col = 5, key = "dimension", value = "production"))))

Moreover, (several) identifying variables may be wide additionally and we have to proceed as mentioned above, by providing the row and columns of the wide identifying variables.

Table 7: The identifying variable `commodities` is treated as if it were the observed variable.
X1	X2	X3	X4	X5
territories	period	dimension	soybean	maize
unit 1	year 1	harvested	1111	1121
unit 1	year 1	production	1112	1122
unit 1	year 2	harvested	1211	1221
unit 1	year 2	production	1212	1222
unit 2	year 1	harvested	2111	2121
unit 2	year 1	production	2112	2122
unit 2	year 2	harvested	2211	2221
unit 2	year 2	production	2212	2222

 makeSchema(list(
   header = list(row = 1),
   variables =
     list(territories =
            list(type = "id", col = 1),
          year =
            list(type = "id", row = NULL, col = 2),
          commodities =
            list(type = "id", row = 1, col = c(4, 5)),
          harvested =
            list(type = "measured", unit = "ha", factor = 1,
                 col = c(4, 5), key = "dimension", value = "harvested"),
          production =
            list(type = "measured", unit = "t", factor = 1,
                 col = c(4, 5), key = "dimension", value = "production"))))

3.2 Spreadsheet contains several clusters

Clusters are often of the same arrangement within one spreadsheet, they can be repeated along rows (horizontally) or along columns (vertically). A table should be treated like a cluster also when the spreadsheet contains not only the table, but perhaps also text that may be scattered across the document and that does not allow the table to start at the spreadsheet origin in the topmost left cell.

To reorganise the data into tidy form, each cluster is "cut out", rearranged individually and appended to the end of an output table by reorganise().

3.2.1 Horizontal clusters

In case horizontal clusters are sitting right next to each other in the same origin row (Tab. 8), it is sufficient to provide the topmost row and all leftmost columns at which a new cluster starts. In case there is some arbitrary horizontal space between clusters, also the width (of each cluster) needs to be provided.

Table 8: Horizontal clusters of the identifying variable `period`.
X1	X2	X3	X4	X5	X6	X7
period	commodities	harvested	production	commodities	harvested	production
	unit 1			unit 2
year 1	soybean	1111	1112	soybean	2111	2112
year 1	maize	1121	1122	maize	2121	2122
year 2	soybean	1211	1212	soybean	2211	2212
year 2	maize	1221	1222	maize	2221	2222

makeSchema(list(
  clusters =
    list(row = 2, col = c(2, 5), id = "territories"),
  header = list(row = 1),
  variables =
    list(territories =
           list(type = "id", row = 2, col = c(2, 5)),
         year =
           list(type = "id", col = 1),
         commodities =
           list(type = "id", col = c(2, 5)),
         harvested =
           list(type = "measured", col = c(3, 6), unit = "ha", factor = 1),
         production =
           list(type = "measured", col = c(4, 7), unit = "t", factor = 1))))

3.2.2 Vertical clusters

For vertically arranged clusters (Tab. 9), just like for the horizontal case, the respective rows, columns (and heights) need to be provided.

Table 9: Vertical clusters of the identifying variable `period`.
X1	X2	X3	X4	X5
territories	period	commodities	harvested	production
unit 1
	year 1	soybean	1111	1112
	year 1	maize	1121	1122
	year 2	soybean	1211	1212
	year 2	maize	1221	1222
unit 2
	year 1	soybean	2111	2112
	year 1	maize	2121	2122
	year 2	soybean	2211	2212
	year 2	maize	2221	2222

makeSchema(list(
  clusters =
    list(row = c(2, 7), col = 2, height = 4, id = "territories"),
  header = list(row = 1),
  variables =
    list(territories =
           list(type = "id", row = c(2, 9), col = 1),
         year =
           list(type = "id", col = 1, rel = TRUE),
         commodities =
           list(type = "id", col = 2, rel = TRUE),
         harvested =
           list(type = "measured", col = 3, unit = "ha", factor = 1, rel = TRUE),
         production =
           list(type = "measured", col = 4, unit = "t", factor = 1, rel = TRUE))))

3.2.3 Messy clusters

In case several clusters are neither aligned along a row nor a column, and are all of differing size, the respective information need to be provided at the same index of the respective property. For example, three clusters, where the first cluster starts at (1,1) and is 3 by 4 cells in size, where the second clusters starts at (5,2) and is 5 by 5 cells in size, and so on, needs to be specified as below.

list(clusters = 
       list(top = c(1, 5, 1), left = c(1, 2, 5), 
            width = c(3, 5, 2), height = c(4, 5, 3), id = ...),
     header = list(row = 1),
     variables =
       list(territories = 
              list(type = "id", col = 2, rel = TRUE),
            period = 
              list(type = "id", col = 1, rel = TRUE),
            ...))

Additionally, given that at least the tables within each cluster are all arranged in the same way, the contained variables can be specified so that their row and column indices are given relative to the cluster position (rel = TRUE). If also that is not the case, the row and column values for each cluster need to be provided for the respective variables in the same way as for cluster positions.

3.2.4 Clusters of measured variables

The previous two types of clusters are clusters of identifying variables, but it may also be the case that the measured variables are split up into distinct clusters. Here, we need to specify first of all in clusters "id = "measured" to indicate that the measured variable is the cluster ID. Next, we need to set up the measured variables so that they contain "key = "cluster" and in value the number of the cluster this variable can be found in. Moreover, we provide the column(s) and all rows that contain the values of each measured variable. This is in contrast to nested measured variables, where the rows of the variable names need to be provided.

Table 10: Vertical clusters of the measured variables.
X1	X2	X3	X4	X5
	territories	period	commodities	values
harvested
	unit 1	year 1	soybean	1111
	unit 1	year 1	maize	1121
	unit 1	year 2	soybean	1211
	unit 1	year 2	maize	1221
	unit 2	year 1	soybean	2111
	unit 2	year 1	maize	2121
	unit 2	year 2	soybean	2211
	unit 2	year 2	maize	2221
production
	unit 1	year 1	soybean	1112
	unit 1	year 1	maize	1122
	unit 1	year 2	soybean	1212
	unit 1	year 2	maize	1222
	unit 2	year 1	soybean	2112
	unit 2	year 1	maize	2122
	unit 2	year 2	soybean	2212
	unit 2	year 2	maize	2222

makeSchema(list(
  clusters =
    list(row = c(3, 12), col = 2, height = 8, id = "measured"),
  header = list(row = 1),
  variables =
    list(territories =
           list(type = "id", col = 2),
         year =
           list(type = "id", col = 3),
         commodities =
           list(type = "id", col = 4),
         harvested =
           list(type = "measured", unit = "ha", factor = 1, col = 5,
                key = "cluster", value = 1),
         production =
           list(type = "measured", unit = "t", factor = 1, col = 5,
                key = "cluster", value = 2))))

3.2.5 Variables that are distinct from a cluster

When not all identifying variables can be provided relative to the cluster origin, for example because they are missing for some clusters, it makes more sense to define such a variable as a distinct variable. This is done by providing row and col as absolute values and setting dist = TRUE.

Table 11: Several clusters where one variable is not available for each cluster, but distinct of them.
X1	X2	X3	X4	X5	X6	X7
commodities	harvested	production
unit 1
soybean	1111	1112	year 1
maize	1121	1122	year 1
soybean	1211	1212	year 2
maize	1221	1222	year 2

commodities	harvested	production	commodities	harvested	production
unit 2			unit 3
soybean	2111	2112	soybean	3111	3112	year 1
maize	2121	2122	maize	3121	3122	year 1
soybean	2211	2212	soybean	3211	3212	year 2
maize	2221	2222	maize	3221	3222	year 2

makeSchema(list(
  clusters =
    list(row = c(1, 8, 8), col = c(1, 1, 4), width = 3, height = 6,
         id = "territories"),
  meta = list(del = NULL, dec = NULL, na = NULL, types = NULL),
  header = list(row = 1, rel = TRUE),
  variables =
    list(territories =
           list(type = "id", row = 1, col = 1, rel = TRUE),
         year =
           list(type = "id", row = c(3:6), col = 4, dist = TRUE),
         commodities =
           list(type = "id", col = 1, rel = TRUE),
         harvested =
           list(type = "measured", unit = "ha", factor = 1,
                col = 2, rel = TRUE),
         production =
           list(type = "measured", unit = "t", factor = 1,
                col = 3, rel = TRUE))))

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

tabshiftr - Reshape disorganised messy data