Data quality indicators with the validate package

Mark van der Loo and Edwin de Jonge

2019-12-16

We assume that the reader has gone through the first couple of sections of the introductory vignette.

In the validate package, an ‘indicator’ is a rule or function that takes a data set as input and outputs a number. Indicators are usually designed to be easily interpretable by domain experts and therefore depend strongly on the application. In ‘validate’, users are free to specify their own indicators. By specifying them separately from the programming workflow, they can be treated as first-class objects: indicator specifications can be maintained, version-controlled, and documented in separate files (just like validation rules).
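As a minimal sketch of keeping indicator definitions in a separate file, the snippet below writes a small set of indicators to a YAML file with export_yaml and reads them back through the .file argument of indicator. The indicator names and the temporary file location are chosen for illustration only; in practice the file would sit in a version-controlled project directory.

```r
library(validate)

# Illustrative indicators on the built-in 'women' data set.
i <- indicator(
    mh = mean(height)
  , mw = mean(weight)
)

# Write the definitions to a YAML file (temporary file used here as a stand-in).
rule_file <- tempfile(fileext = ".yaml")
export_yaml(i, file = rule_file)

# Read the definitions back into a new indicator object.
i2 <- indicator(.file = rule_file)
i2
```

The file written this way is plain text, so it can be reviewed and diffed like any other source file.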

Workflow

Here is a simple example of the workflow.

i <- indicator(
    mh  = mean(height)
  , mw  = mean(weight)
  , BMI = (weight/2.2046)/(height*0.0254)^2 )
ind <- confront(women, i)

In the first statement we define an indicator object storing indicator expressions. Next, we confront a dataset with these indicators. The result is an object of class indication. It prints as follows.

ind
## Object of class 'indication'
## Call:
##     confront(dat = women, x = i)
## 
## Confrontations: 3
## Warnings      : 0
## Errors        : 0

To study the results, the object can be summarized.

summary(ind)
##   name items      min      mean       max nNA error warning
## 1   mh     1  65.0000  65.00000  65.00000   0 FALSE   FALSE
## 2   mw     1 136.7333 136.73333 136.73333   0 FALSE   FALSE
## 3  BMI    15  22.0967  22.72691  24.03503   0 FALSE   FALSE
##                            expression
## 1                        mean(height)
## 2                        mean(weight)
## 3 (weight/2.2046)/(height * 0.0254)^2

Observe that the first two indicators (mh, mw) each result in a single value, while the third one (BMI) results in 15 values, one per record. The columns error and warning indicate whether calculation of the indicators was problematic.
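To make the difference between aggregate and record-level indicators concrete, the same three quantities can be recomputed in base R, without the validate package. This is only a cross-check of the summary above; the variable names are illustrative.

```r
# Recompute the three indicators directly on the built-in 'women' data set.
mh <- mean(women$height)
mw <- mean(women$weight)

# weight is in pounds and height in inches; convert to kg and m before
# computing the body mass index for every record.
BMI <- with(women, (weight / 2.2046) / (height * 0.0254)^2)

length(mh)   # one value for the whole data set
length(BMI)  # one value per record
range(BMI)   # compare with the min and max columns of summary(ind)
```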

A specific problem that may occur is when the result of an indicator is non-numeric.

jj <- indicator(mh = mean(height), a = {"A"})

Here, the second ‘indicator’ is an expression that always yields a constant (the character string "A"), which is not numeric.

cf <- confront(women, jj)
cf
## Object of class 'indication'
## Call:
##     confront(dat = women, x = jj)
## 
## Confrontations: 2
## Warnings      : 1
## Errors        : 0
warnings(cf)
## $a
## [1] "Expression did not evaluate to numeric or logical, returning NULL"
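In a script it can be useful to detect such problems programmatically instead of reading the printed output. The sketch below assumes, as the printed result above suggests, that warnings returns a list named after the indicators that raised a warning; the variable name problematic is illustrative.

```r
library(validate)

jj <- indicator(mh = mean(height), a = {"A"})
cf <- confront(women, jj)

# Collect the names of the indicators that raised a warning.
w <- warnings(cf)
problematic <- names(w)

# Report the offending indicators, if any.
if (length(problematic) > 0) {
  message("Indicators with warnings: ", paste(problematic, collapse = ", "))
}
```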

Getting the values

Values can be obtained with the values function, or by converting to a data.frame.
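As a brief sketch of the first option, the snippet below calls values on the confrontation result from the workflow section and inspects what comes back with str. Because the indicators differ in length (one value for mh and mw, 15 for BMI), the exact shape of the returned object is best checked by inspection rather than assumed.

```r
library(validate)

i <- indicator(
    mh  = mean(height)
  , mw  = mean(weight)
  , BMI = (weight/2.2046)/(height*0.0254)^2 )
ind <- confront(women, i)

# Extract the raw indicator values and look at their structure.
v <- values(ind)
str(v)
```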

We add a unique identifier (this is optional) to make it easier to connect results with the data.

women$id <- letters[1:15]

Compute indicators and convert to data.frame.

ind <- confront(women, i, key = "id")
(out <- as.data.frame(ind))
##      id name     value                          expression
## 1  <NA>   mh  65.00000                        mean(height)
## 2  <NA>   mw 136.73333                        mean(weight)
## 3     a  BMI  24.03503 (weight/2.2046)/(height * 0.0254)^2
## 4     b  BMI  23.63114 (weight/2.2046)/(height * 0.0254)^2
## 5     c  BMI  23.43589 (weight/2.2046)/(height * 0.0254)^2
## 6     d  BMI  23.24065 (weight/2.2046)/(height * 0.0254)^2
## 7     e  BMI  23.04570 (weight/2.2046)/(height * 0.0254)^2
## 8     f  BMI  22.85132 (weight/2.2046)/(height * 0.0254)^2
## 9     g  BMI  22.65775 (weight/2.2046)/(height * 0.0254)^2
## 10    h  BMI  22.46518 (weight/2.2046)/(height * 0.0254)^2
## 11    i  BMI  22.43519 (weight/2.2046)/(height * 0.0254)^2
## 12    j  BMI  22.24034 (weight/2.2046)/(height * 0.0254)^2
## 13    k  BMI  22.19922 (weight/2.2046)/(height * 0.0254)^2
## 14    l  BMI  22.15113 (weight/2.2046)/(height * 0.0254)^2
## 15    m  BMI  22.09670 (weight/2.2046)/(height * 0.0254)^2
## 16    n  BMI  22.17600 (weight/2.2046)/(height * 0.0254)^2
## 17    o  BMI  22.24240 (weight/2.2046)/(height * 0.0254)^2

Observe that there is no key for indicators mh and mw since these are constructed from multiple records.
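The missing key gives a simple way to separate aggregate indicators from record-level ones. The following sketch splits the data.frame on is.na(id) and joins the record-level BMI values back to the data. The object names dat, aggregates, per_record and bmi are illustrative, and a copy of women is used so that the example is self-contained.

```r
library(validate)

i <- indicator(
    mh  = mean(height)
  , mw  = mean(weight)
  , BMI = (weight/2.2046)/(height*0.0254)^2 )

# Work on a copy of the data with a unique identifier.
dat <- women
dat$id <- letters[1:15]
out <- as.data.frame(confront(dat, i, key = "id"))

# Rows without a key are aggregates; rows with a key belong to one record.
aggregates <- out[is.na(out$id), ]
per_record <- out[!is.na(out$id), ]

# Attach the record-level BMI values to the original records via the key.
bmi <- merge(dat, per_record[, c("id", "value")], by = "id")
names(bmi)[names(bmi) == "value"] <- "BMI"
head(bmi, 3)
```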

Indicators and data.frames

Indicators can be constructed from and coerced to data.frames. To define indicators this way, create a data.frame that has at least a character column called rule. All other columns are optional.

idf <- data.frame(
  rule = c("mean(height)","sd(height)")
  , label = c("average height", "std.dev height")
  , description = c("basic statistic","fancy statistic")
  , stringsAsFactors = FALSE  # keep 'rule' a character column in R < 4.0
)
i <- indicator(.data=idf)
i
## Object of class 'indicator' with 2 elements:
##  I1 [average height]: mean(height)
##  I2 [std.dev height]: sd(height)
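The default names I1 and I2 are not very informative. As a sketch, and assuming that an optional name column is picked up in the same way as label and description, the indicators can be given explicit names in the data.frame. The names mean_h and sd_h and the object names idf_named and i_named are made up for this example.

```r
library(validate)

idf_named <- data.frame(
    rule  = c("mean(height)", "sd(height)")
  , name  = c("mean_h", "sd_h")               # assumed optional column
  , label = c("average height", "std.dev height")
  , stringsAsFactors = FALSE
)
i_named <- indicator(.data = idf_named)

# Inspect the names and labels stored with the indicators.
names(i_named)
label(i_named)
```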

Now, confront with data and merge the results back with rule metadata.

quality <- as.data.frame(confront(women, i))
measures <- as.data.frame(i)
merge(quality, measures)
##   name     value   expression          label     description origin
## 1   I1 65.000000 mean(height) average height basic statistic       
## 2   I2  4.472136   sd(height) std.dev height fancy statistic       
##               created         rule
## 1 2019-12-16 15:51:22 mean(height)
## 2 2019-12-16 15:51:22   sd(height)
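In the merge above, name is the only column the two data.frames share, so it is used as the join key implicitly. A slightly more defensive sketch names the key explicitly and keeps only the metadata columns needed for a report; the object name report is illustrative.

```r
library(validate)

idf <- data.frame(
    rule = c("mean(height)", "sd(height)")
  , label = c("average height", "std.dev height")
  , description = c("basic statistic", "fancy statistic")
  , stringsAsFactors = FALSE
)
i <- indicator(.data = idf)

quality  <- as.data.frame(confront(women, i))
measures <- as.data.frame(i)

# Join on the indicator name explicitly and keep a compact set of columns.
report <- merge(
    quality
  , measures[, c("name", "label", "description")]
  , by = "name"
)
report[, c("name", "label", "value")]
```

Stating the key explicitly keeps the result stable if either data.frame later gains a column with the same name as one in the other.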