Introduction to Validate

A quick example

Here’s an example demonstrating the typical workflow. We’ll use the built-in women data set (average heights and weights for American women aged 30-39).

data(women)
summary(women)

##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0

Validating data is all about checking whether a data set meets presumptions or expectations you have about it, and the validate package makes it easy for you to define those expectations. Let’s do a quick check on variables in the women data set.

library(validate)
cf <- check_that(women, height > 0, weight > 0, height/weight > 0.5)
summary(cf)

##   name items passes fails nNA error warning          expression
## 1   V1    15     15     0   0 FALSE   FALSE          height > 0
## 2   V2    15     15     0   0 FALSE   FALSE          weight > 0
## 3   V3    15      2    13   0 FALSE   FALSE height/weight > 0.5

check_that returns an object containing all sorts of information on the validation results. The easiest way to check the results is with summary, which returns a data.frame with the following basic information:

How many data items were checked against each rule
How many items passed, failed or resulted in NA
Whether the check resulted in an error (could not be performed) or gave a warning
The expression that was actually evaluated to perform the check.

If you’re a fan of the pipe operator provided by the magrittr package, the above check can also be written as follows.

women %>% check_that(height > 0, weight > 0, height/weight > 0.5) %>% summary()

The same information can be summarized graphically.

barplot(cf,main="Checks on the women data set")
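Because summary returns an ordinary data.frame, the usual subsetting tools apply to it. The sketch below assumes the same three rules as above; it selects the rules with at least one failing item and cross-checks the count for the third rule directly in base R.

```r
library(validate)

cf <- check_that(women, height > 0, weight > 0, height/weight > 0.5)
s  <- summary(cf)

# Rules with at least one failing item
s[s$fails > 0, c("name", "passes", "fails", "expression")]

# Cross-check in base R: number of records with height/weight > 0.5
n_pass <- sum(women$height / women$weight > 0.5)
n_pass
```

The base R count should agree with the passes column reported for the third rule.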

Using reference data

For some checks it is convenient to compare the data under scrutiny with other data artifacts. Two common examples include:

Data is checked against an earlier version of the same dataset.
We wish to check the contents of a column against a code list, and we do not want to put the code list hard-coded into the rule set.

For this, we can use the ref option in confront. Here is how to compare columns from two data frames row-by-row. The user has to make sure that the rows of the data set under scrutiny (women) match row-wise with the reference data set (women1).

women1 <- women
rules <- validator(height == women_reference$height)
cf <- confront(women, rules, ref = list(women_reference = women1))
summary(cf)

##   name items passes fails nNA error warning
## 1   V1    15     15     0   0 FALSE   FALSE
##                              expression
## 1 height == women_reference[, "height"]
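When the two data sets are not already in the same row order, one way to align them is to reorder the reference data on a shared key before confronting. The following is a minimal sketch under the assumption that both tables carry an identifier column; the data and the column name id are hypothetical and used for illustration only.

```r
library(validate)

# Hypothetical data: both tables carry an 'id' column, but in different row order
dat <- data.frame(id = c("a", "b", "c"), height = c(60, 62, 64))
ref <- data.frame(id = c("c", "a", "b"), height = c(64, 60, 62))

# Reorder the reference rows so that row i of ref_aligned belongs to row i of dat
ref_aligned <- ref[match(dat$id, ref$id), ]

rules <- validator(height == reference$height)
cf <- confront(dat, rules, ref = list(reference = ref_aligned))
summary(cf)[, c("name", "items", "passes", "fails")]
```

Without the match step the comparison would pair unrelated records, so mismatches would reflect row order, not data quality.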

Here is how to make a code list available.

rules <- validator( fruit %in% codelist )
fruits <-  c("apple", "banana", "orange")
dat <- data.frame(fruit = c("apple","broccoli","orange","banana"))
cf <- confront(dat, rules, ref = list(codelist = fruits))
summary(cf)

##   name items passes fails nNA error warning           expression
## 1   V1     4      3     1   0 FALSE   FALSE fruit %vin% codelist
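To see which records are behind the single failure, the record-level results can be used to subset the data. This sketch repeats the code list example and relies on as.data.frame, whose value column holds one logical per checked record; with a single rule, those values are assumed to be in the same order as the rows of dat.

```r
library(validate)

rules  <- validator(fruit %in% codelist)
fruits <- c("apple", "banana", "orange")
dat    <- data.frame(fruit = c("apple", "broccoli", "orange", "banana"))
cf     <- confront(dat, rules, ref = list(codelist = fruits))

# One row per record for the single rule; 'value' is TRUE when the record passes
out <- as.data.frame(cf)

# Records whose fruit is not in the code list
failing <- dat[!out$value, , drop = FALSE]
failing
```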

Validator objects

Validator objects are used to store, investigate and manipulate rule sets.

v <- validator(height > 0, weight > 0, height/weight > 0)
v

## Object of class 'validator' with 3 elements:
##  V1: height > 0
##  V2: weight > 0
##  V3: height/weight > 0

The validator object has stored the rules and assigned names to them for future reference. To check this, we confront the data set with the validation rules we’ve just defined:

cf <- confront(women,v)
cf

## Object of class 'validation'
## Call:
##     confront(dat = women, x = v)
## 
## Confrontations: 3
## With fails    : 0
## Warnings      : 0
## Errors        : 0

The object cf contains the result of checking the data in women against the expectations in v. The fact that there are no warnings or errors means that each rule could be evaluated successfully (an error would occur, for example, if we misspelled height). Now let’s take a look at the actual results.

summary(cf)

##   name items passes fails nNA error warning        expression
## 1   V1    15     15     0   0 FALSE   FALSE        height > 0
## 2   V2    15     15     0   0 FALSE   FALSE        weight > 0
## 3   V3    15     15     0   0 FALSE   FALSE height/weight > 0

Now suppose we expect the BMI (weight divided by height squared) of each item to be below 23. The women data set records weight in pounds and height in inches, so we need to express the weight in kg and the height in meters. The equation for BMI becomes

\[ BMI = \frac{weight \times 0.45359}{(height \times 0.0254)^2} \]

Moreover, assume that we suspect that the average BMI is between 22 and 22.5. Let’s create another validator object that first computes the BMI and next tests whether the BMI values conform to our suspicion.

v <- validator(
  BMI := (weight*0.45359)/(height*0.0254)^2
  , height > 0
  , weight > 0
  , BMI < 23
  , mean(BMI) > 22 & mean(BMI) < 22.5
)
v

## Object of class 'validator' with 5 elements:
##  V1: `:=`(BMI, (weight * 0.45359)/(height * 0.0254)^2)
##  V2: height > 0
##  V3: weight > 0
##  V4: BMI < 23
##  V5: mean(BMI) > 22 & mean(BMI) < 22.5

Checking is as easy as before:

cf <- confront(women,v)
summary(cf)

##   name items passes fails nNA error warning
## 1   V2    15     15     0   0 FALSE   FALSE
## 2   V3    15     15     0   0 FALSE   FALSE
## 3   V4    15     10     5   0 FALSE   FALSE
## 4   V5     1      0     1   0 FALSE   FALSE
##                                                                                                expression
## 1                                                                                              height > 0
## 2                                                                                              weight > 0
## 3                                                             (weight * 0.45359)/(height * 0.0254)^2 < 23
## 4 mean((weight * 0.45359)/(height * 0.0254)^2) > 22 & mean((weight * 0.45359)/(height * 0.0254)^2) < 22.5

Observe that the validation expressions have been rewritten: wherever BMI was used, it was replaced with the computation defined before.
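The counts in the summary can be reproduced without validate, which is a useful sanity check on the unit conversion. The sketch below computes the BMI in base R with the same conversion factors as the rule set and counts the records on either side of the threshold.

```r
# BMI from the built-in women data (height in inches, weight in pounds)
bmi <- (women$weight * 0.45359) / (women$height * 0.0254)^2

n_below <- sum(bmi < 23)    # records satisfying BMI < 23
n_above <- sum(bmi >= 23)   # records violating it
avg_bmi <- mean(bmi)        # compared against the interval (22, 22.5) in rule V5

c(below = n_below, above = n_above)
avg_bmi
```

The average lies above 22.5, which is why the single item checked by rule V5 fails.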

Conversion from and to data.frames

Validator objects can be read from and converted to data.frames. To create a validator object, at least a character column named rule is necessary.

df <- data.frame(
  rule = c("height>0","weight>0","height/weight>0.5")
  , label = c("height positive","weight positive","ratio limit")
)
v <- validator(.data=df)
v

## Object of class 'validator' with 3 elements:
##  V1 [height positive]: height > 0
##  V2 [weight positive]: weight > 0
##  V3 [ratio limit]    : height/weight > 0.5

Now confront with the data and merge back with rule metadata.

cf <- confront(women, v)
quality <- as.data.frame(cf)
measure <- as.data.frame(v)
head( merge(quality, measure) )

##   name value expression           label description origin             created
## 1   V1  TRUE height > 0 height positive                    2019-12-16 15:51:22
## 2   V1  TRUE height > 0 height positive                    2019-12-16 15:51:22
## 3   V1  TRUE height > 0 height positive                    2019-12-16 15:51:22
## 4   V1  TRUE height > 0 height positive                    2019-12-16 15:51:22
## 5   V1  TRUE height > 0 height positive                    2019-12-16 15:51:22
## 6   V1  TRUE height > 0 height positive                    2019-12-16 15:51:22
##         language severity       rule
## 1 validate 0.9.3    error height > 0
## 2 validate 0.9.3    error height > 0
## 3 validate 0.9.3    error height > 0
## 4 validate 0.9.3    error height > 0
## 5 validate 0.9.3    error height > 0
## 6 validate 0.9.3    error height > 0

Or, merge with the summarized results. The result of summary is just a data.frame (tidy, isn’t it?).

merge(summary(cf),measure)

##   name items passes fails nNA error warning          expression           label
## 1   V1    15     15     0   0 FALSE   FALSE          height > 0 height positive
## 2   V2    15     15     0   0 FALSE   FALSE          weight > 0 weight positive
## 3   V3    15      2    13   0 FALSE   FALSE height/weight > 0.5     ratio limit
##   description origin             created       language severity
## 1                    2019-12-16 15:51:22 validate 0.9.3    error
## 2                    2019-12-16 15:51:22 validate 0.9.3    error
## 3                    2019-12-16 15:51:22 validate 0.9.3    error
##                  rule
## 1          height > 0
## 2          weight > 0
## 3 height/weight > 0.5
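Rule names and descriptions can travel in the same data.frame as the rules themselves. The sketch below assumes that validator picks up columns called name and description in addition to rule; the rule table itself is hypothetical.

```r
library(validate)

# Hypothetical rule table; the 'name' and 'description' columns are assumed
# to be read as rule metadata
rule_table <- data.frame(
  rule        = c("height > 0", "weight > 0"),
  name        = c("h_pos", "w_pos"),
  description = c("Height must be positive", "Weight must be positive"),
  stringsAsFactors = FALSE
)

v2 <- validator(.data = rule_table)
names(v2)
summary(confront(women, v2))[, c("name", "passes", "fails")]
```

Setting stringsAsFactors = FALSE keeps the rule column a character vector on older R versions, where data.frame converted strings to factors by default.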

Validation rule syntax

Conceptually, any R statement that will evaluate to a logical is considered a validating statement. The validate package checks this when the user defines a rule set, so for example calling validator( mean(height) ) will result in a warning since just computing mean(height) does not validate anything.

You will find a concise description of the syntax in the syntax help file.

?syntax

Examples of various types of rules can also be found here.

In short, you can use

Type checks: any function whose name starts with is. (for example is.numeric).
Binary comparisons: <, <=, ==, !=, >=, > and %in%.
Unary logical operators: !, all(), any().
Binary logical operators: &, &&, |, || and logical implication, e.g. if (staff > 0) staff.costs > 0.
Text search: grepl
Functional dependency: \(X\to Y + Z\) is represented by X ~ Y + Z.

There are some convenience functions.

Inspect the whole data set using ., e.g. validator( nrow(.) > 10).
Reuse a variable using :=, e.g. validator(m := mean(x), x < 2*m ).
Apply the same rule to multiple variables with var_group, e.g. validator(G:=var_group(x,y), G > 0).
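Several of these syntax elements can be combined in one rule set. The sketch below is an illustration on the women data; the thresholds are arbitrary choices for the example and carry no subject-matter meaning.

```r
library(validate)

v <- validator(
    is.numeric(height)               # type check
  , height >= 58                     # binary comparison
  , if (height > 70) weight > 150    # logical implication
  , nrow(.) > 10                     # whole-data-set check using .
  , m := mean(weight)                # reusable variable, not a rule by itself
  , weight < 2 * m                   # rule that reuses m
)

cf <- confront(women, v)
res <- summary(cf)
res[, c("name", "items", "passes", "fails", "error")]
```

Note that the items column differs between rules: record-wise rules are evaluated once per record, while checks such as is.numeric(height) and nrow(.) > 10 yield a single value.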

Confrontation objects

The outcome of confronting a validator object with a data set is an object of class confrontation. There are several ways to extract information from a confrontation object.

summary: summarize output; returns a data.frame
aggregate: aggregate validation in several ways
sort: aggregate and sort in several ways
values: Get the values in an array, or a list of arrays if rules have different output dimension structure
errors: Retrieve error messages caught during the confrontation
warnings: Retrieve warning messages caught during the confrontation.

By default aggregates are produced by rule.

cf <- check_that(women, height>0, weight>0,height/weight < 0.5)
aggregate(cf)

##    npass nfail nNA rel.pass rel.fail rel.NA
## V1    15     0   0      1.0      0.0      0
## V2    15     0   0      1.0      0.0      0
## V3    12     3   0      0.8      0.2      0

To aggregate by record, use by='record'.

head(aggregate(cf,by='record'))

##   npass nfail nNA  rel.pass  rel.fail rel.NA
## 1     2     1   0 0.6666667 0.3333333      0
## 2     2     1   0 0.6666667 0.3333333      0
## 3     2     1   0 0.6666667 0.3333333      0
## 4     3     0   0 1.0000000 0.0000000      0
## 5     3     0   0 1.0000000 0.0000000      0
## 6     3     0   0 1.0000000 0.0000000      0

Aggregated results can be automatically sorted, so records with the most violations or rules that are violated most sort higher.

# rules with most violations sorting first:
sort(cf)

##    npass nfail nNA rel.pass rel.fail rel.NA
## V3    12     3   0      0.8      0.2      0
## V1    15     0   0      1.0      0.0      0
## V2    15     0   0      1.0      0.0      0

Confrontation objects can be subset with the single bracket operator (as with vectors) to obtain a sub-object pertaining only to the selected rules.

summary(cf[c(1,3)])
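The values function gives access to the record-level outcomes behind these aggregates. As a sketch, assuming all rules yield one result per record (so that a single logical matrix is returned), the failing records for a rule can be looked up as follows.

```r
library(validate)

cf  <- check_that(women, height > 0, weight > 0, height/weight < 0.5)

# Logical matrix: one row per record, one column per rule
out <- values(cf)
dim(out)

# Records violating the third rule
violators <- women[!out[, 3], ]
violators
```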

Confrontation options

By default, all errors and warnings are caught when validation rules are confronted with data. This can be switched off by setting the raise option to "errors" or "all". The following example contains a specification error: hite should be height, so the rule errors on the women data.frame, which has no column named hite. The error is caught (it does not result in an R error) and shown in the summary.

v <- validator(hite > 0, weight>0)
summary(confront(women, v))

##   name items passes fails nNA error warning expression
## 1   V1     0      0     0   0  TRUE   FALSE   hite > 0
## 2   V2    15     15     0   0 FALSE   FALSE weight > 0

Setting raise to "all" results in an R error:

# this gives an error
confront(women, v, raise='all')

## Error in fun(...): object 'hite' not found
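The two behaviours can be combined in a script: inspect caught errors with errors, or let them propagate and handle them with base R's tryCatch. The following is a small sketch of both, using the same misspelled rule.

```r
library(validate)

v <- validator(hite > 0, weight > 0)

# Default: the error is caught and stored in the confrontation object
cf <- confront(women, v)
errors(cf)

# raise = "all": the error propagates, so wrap the call if the script must continue
msg <- tryCatch(
  {
    confront(women, v, raise = "all")
    NA_character_                      # reached only if no error was raised
  },
  error = function(e) conditionMessage(e)
)
msg
```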

Linear equalities form an important class of validation rules. To prevent equalities from being tested strictly, there is an option called lin.eq.eps (with default value \(10^{-8}\)) that allows one to add some slack to these tests. The slack is intended to prevent false negatives (unnecessary failures) caused by machine rounding. If you want to check whether a sum rule is satisfied to within one or two units of measurement, it is cleaner to define two inequalities for that.
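The effect of the slack can be seen with a classic rounding case. This sketch assumes the option can be passed directly to confront, like raise; the data sets and the variable names x, y, total, a and b are made up for the illustration.

```r
library(validate)

# 0.1 + 0.2 differs from 0.3 by a rounding error of about 5.6e-17
dat <- data.frame(x = 0.1 + 0.2, y = 0.3)
v   <- validator(x == y)

cf_default <- confront(dat, v)                   # default slack
cf_strict  <- confront(dat, v, lin.eq.eps = 0)   # no slack: strict equality

summary(cf_default)[, c("passes", "fails")]
summary(cf_strict)[, c("passes", "fails")]

# A sum rule that must hold to within one unit, written as two inequalities
dat2   <- data.frame(total = 100, a = 60, b = 39.5)
v_tol  <- validator(total - (a + b) <= 1, total - (a + b) >= -1)
cf_tol <- confront(dat2, v_tol)
summary(cf_tol)[, c("name", "passes", "fails")]
```

Writing the tolerance as explicit inequalities keeps the intended margin visible in the rule set, whereas lin.eq.eps is a technical setting meant only for rounding noise.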

Metadata and investigating validator objects

Validator objects store a set of rules, optionally with some metadata per rule. Currently, the following functions can be used to get or set metadata:

origin: Where was a rule defined?
names: The name per rule
created: When were the rules created?
label: Short description of the rule
description: Long description of the rule
meta: Set or get generic metadata

For example, names can be set from the command line when defining a validator object.

v <- validator(rat = height/weight > 0.5, htest=height>0, wtest=weight > 0)
names(v)

## [1] "rat"   "htest" "wtest"

Also try

names(v)[1] <- "ratio"
v

## Object of class 'validator' with 3 elements:
##  ratio: height/weight > 0.5
##  htest: height > 0
##  wtest: weight > 0
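Labels and descriptions can be set in the same way as names. The sketch below uses the label and description functions listed above as replacement functions; the label and description texts are made up for the example.

```r
library(validate)

v <- validator(ratio = height/weight > 0.5, htest = height > 0, wtest = weight > 0)

# Set all labels at once, and the description of the first rule only
label(v) <- c("ratio limit", "height positive", "weight positive")
description(v)[1] <- "Height in inches divided by weight in pounds must exceed 0.5"

label(v)
meta(v)[, c("name", "label", "description")]
```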

It is also possible to add generic key-value pairs as metadata. Getting and setting follows the usual recycling rules of R.

# add 'foo' to the first rule:
meta(v[1],"foo") <- 1
# Add 'bar' to all rules
meta(v,"bar") <- "baz"

Metadata can be made visible by selecting a single rule:

v[[1]]

## 
## Object of class rule.
##  expr       : height/weight > 0.5 
##  name       : ratio 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2019-12-16 15:51:22
##  meta       : language<chr>, severity<chr>, foo<nmr>, bar<chr>

Or by extracting it to a data.frame

meta(v)

##    name label description       origin             created       language
## 1 ratio                   command-line 2019-12-16 15:51:22 validate 0.9.3
## 2 htest                   command-line 2019-12-16 15:51:22 validate 0.9.3
## 3 wtest                   command-line 2019-12-16 15:51:22 validate 0.9.3
##   severity foo bar
## 1    error   1 baz
## 2    error  NA baz
## 3    error  NA baz

Some general information is obtained with summary:

summary(v)

##   block nvar rules linear
## 1     1    2     3      2

Here, some properties per block of rules are given. Two rules occur in the same block if they share a variable. In this case, all rules occur in the same block.

The number of rules can be requested with length.

length(v)
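To illustrate the block structure reported by summary, here is a sketch with a hypothetical rule set over variables x, y and z in which not all rules are connected. No data set is needed, since blocks are derived from the rules alone.

```r
library(validate)

# x appears in the first rule only, so that rule forms a block of its own;
# the second and third rule share y and therefore belong to the same block
w <- validator(x > 0, y > 0, y + z == 10)

blocks <- summary(w)
blocks
length(w)
```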

With variables, the variables occurring per rule, or over all the rules can be requested.

variables(v)

## [1] "height" "weight"

variables(v,as="matrix")

##        variable
## rule    height weight
##   ratio   TRUE   TRUE
##   htest   TRUE  FALSE
##   wtest  FALSE   TRUE

Validator objects can be subset as if they were lists, using the single and double bracket operators.

v[c(1,3)]

## Object of class 'validator' with 2 elements:
##  ratio: height/weight > 0.5
##  wtest: weight > 0
## Rules are evaluated using locally defined options

v[c('ratio','wtest')]

## Object of class 'validator' with 2 elements:
##  ratio: height/weight > 0.5
##  wtest: weight > 0
## Rules are evaluated using locally defined options

The double bracket can be used to inspect a single rule

v[[1]]

## 
## Object of class rule.
##  expr       : height/weight > 0.5 
##  name       : ratio 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2019-12-16 15:51:22
##  meta       : language<chr>, severity<chr>, foo<nmr>, bar<chr>

Introduction to Validate

Mark van der Loo and Edwin de Jonge

2019-12-16

Introduction

A quick example

Using reference data

Validator objects

Conversion from and to `data.frames`

Validation rule syntax

Confrontation objects

Confrontation options

Metadata and investigating validator objects

Validator objects and confrontation objects are reference objects
