The validate package is intended to make checking your data easy, maintainable, and reproducible.
There are a few terms related to the infrastructure offered by validate:
validator: an object representing a set of rules your data must satisfy
indicator: an object representing a set of numerical quality indicators
confrontation: an object representing the results of confronting data with rules or quality indicators
There is also a single activity, namely
confront: evaluate the validation rules or quality indicators in the context of one or more data sets.
This vignette demonstrates how to apply validation rules. A second vignette introduces quality indicator rules.
Here’s an example demonstrating the typical workflow. We’ll use the built-in women data set (average heights and weights for American women aged 30-39).
data(women)
summary(women)
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
Validating data is all about checking whether a data set meets presumptions or expectations you have about it, and the validate package makes it easy for you to define those expectations. Let’s do a quick check on variables in the women data set.
library(validate)
cf <- check_that(women, height > 0, weight > 0, height/weight > 0.5)
summary(cf)
## name items passes fails nNA error warning expression
## 1 V1 15 15 0 0 FALSE FALSE height > 0
## 2 V2 15 15 0 0 FALSE FALSE weight > 0
## 3 V3 15 2 13 0 FALSE FALSE height/weight > 0.5
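check_that returns an object containing all sorts of information on the validation results. The easiest way to check the results is with summary, which returns a data.frame with the following basic information: the number of items checked against each rule (items); the number of items that passed, failed, or resulted in NA (passes, fails, nNA); whether evaluating the rule raised an error or a warning (error, warning); and the expression that was evaluated (expression).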
If you’re a fan of the pipe operator provided by the magrittr package, the above statement can also be written as follows.
library(magrittr)
women %>% check_that(height > 0, weight > 0, height/weight > 0.5) %>% summary()
The same information can be summarized graphically.
barplot(cf,main="Checks on the women data set")
For some checks it is convenient to compare the data under scrutiny with other data artifacts. Two common examples are comparing columns against a reference data set and checking values against a code list.
For this, we can use the ref option in confront. Here is how to compare columns from two data frames row-by-row. The user has to make sure that the rows of the data set under scrutiny (women) match row-wise with the rows of the reference data set (women1).
women1 <- women
rules <- validator(height == women_reference$height)
cf <- confront(women, rules, ref = list(women_reference = women1))
summary(cf)
## name items passes fails nNA error warning
## 1 V1 15 15 0 0 FALSE FALSE
## expression
## 1 height == women_reference[, "height"]
Here is how to make a code list available.
rules <- validator( fruit %in% codelist )
fruits <- c("apple", "banana", "orange")
dat <- data.frame(fruit = c("apple","broccoli","orange","banana"))
cf <- confront(dat, rules, ref = list(codelist = fruits))
summary(cf)
## name items passes fails nNA error warning expression
## 1 V1 4 3 1 0 FALSE FALSE fruit %vin% codelist
Validator objects are used to store, investigate and manipulate rule sets.
v <- validator(height > 0, weight > 0, height/weight > 0)
v
## Object of class 'validator' with 3 elements:
## V1: height > 0
## V2: weight > 0
## V3: height/weight > 0
The validator object has stored the rules and assigned names to them for future reference. To check this, we confront the data set with the validation rules we’ve just defined:
cf <- confront(women,v)
cf
## Object of class 'validation'
## Call:
## confront(dat = women, x = v)
##
## Confrontations: 3
## With fails : 0
## Warnings : 0
## Errors : 0
The object cf contains the result of checking the data in women against the expectations in v. The fact that there are no warnings or errors means that each rule could indeed be evaluated successfully (an error would occur, for example, if we misspelled height). Now let’s take a look at the actual results.
summary(cf)
## name items passes fails nNA error warning expression
## 1 V1 15 15 0 0 FALSE FALSE height > 0
## 2 V2 15 15 0 0 FALSE FALSE weight > 0
## 3 V3 15 15 0 0 FALSE FALSE height/weight > 0
Now, suppose that we expect the BMI (weight divided by height squared) of each item to be below 23. The women data set records weight in pounds and height in inches, so we need to express the weight in kg and the height in meters, and the equation for BMI becomes \[ \mathrm{BMI} = \frac{\mathrm{weight}\times 0.45359}{(\mathrm{height}\times 0.0254)^2} \] Moreover, assume that we suspect that the average BMI is between 22 and 22.5. Let’s create another validator object that first computes the BMI and then tests whether the BMI values conform to our suspicion.
v <- validator(
BMI := (weight*0.45359)/(height*0.0254)^2
, height > 0
, weight > 0
, BMI < 23
, mean(BMI) > 22 & mean(BMI) < 22.5
)
v
## Object of class 'validator' with 5 elements:
## V1: `:=`(BMI, (weight * 0.45359)/(height * 0.0254)^2)
## V2: height > 0
## V3: weight > 0
## V4: BMI < 23
## V5: mean(BMI) > 22 & mean(BMI) < 22.5
Checking is as easy as before:
cf <- confront(women,v)
summary(cf)
## name items passes fails nNA error warning
## 1 V2 15 15 0 0 FALSE FALSE
## 2 V3 15 15 0 0 FALSE FALSE
## 3 V4 15 10 5 0 FALSE FALSE
## 4 V5 1 0 1 0 FALSE FALSE
## expression
## 1 height > 0
## 2 weight > 0
## 3 (weight * 0.45359)/(height * 0.0254)^2 < 23
## 4 mean((weight * 0.45359)/(height * 0.0254)^2) > 22 & mean((weight * 0.45359)/(height * 0.0254)^2) < 22.5
Observe that the expressions for validation have been rewritten: wherever BMI was used, it has been replaced with the computation defined before.
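The counts in the summary can be cross-checked by hand. The following is a minimal sketch in base R, without validate, that recomputes the BMI with the same conversion factors. The names bmi, n_pass and mean_ok are introduced here for illustration only.

```r
# Recompute the BMI outside validate to cross-check the summary.
# The women data set stores height in inches and weight in pounds.
data(women)
bmi <- (women$weight * 0.45359) / (women$height * 0.0254)^2

# Number of records that satisfy the record-wise rule BMI < 23
n_pass <- sum(bmi < 23)

# The aggregate rule: is the mean BMI strictly between 22 and 22.5?
mean_ok <- mean(bmi) > 22 & mean(bmi) < 22.5

print(n_pass)
print(mean_ok)
```

The record-wise rule yields one result per row, while the rule on mean(BMI) yields a single result for the whole data set, which is why the summary reports 15 items for the former and 1 item for the latter.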
data.frames
Validator objects can be read from and converted to data.frames. To create a validator object, at least a character column named rule is necessary.
df <- data.frame(
rule = c("height>0","weight>0","height/weight>0.5")
, label = c("height positive","weight positive","ratio limit")
)
v <- validator(.data=df)
v
## Object of class 'validator' with 3 elements:
## V1 [height positive]: height > 0
## V2 [weight positive]: weight > 0
## V3 [ratio limit] : height/weight > 0.5
Now confront with the data and merge back with rule metadata.
cf <- confront(women, v)
quality <- as.data.frame(cf)
measure <- as.data.frame(v)
head( merge(quality, measure) )
## name value expression label description origin created
## 1 V1 TRUE height > 0 height positive 2019-12-16 15:51:22
## 2 V1 TRUE height > 0 height positive 2019-12-16 15:51:22
## 3 V1 TRUE height > 0 height positive 2019-12-16 15:51:22
## 4 V1 TRUE height > 0 height positive 2019-12-16 15:51:22
## 5 V1 TRUE height > 0 height positive 2019-12-16 15:51:22
## 6 V1 TRUE height > 0 height positive 2019-12-16 15:51:22
## language severity rule
## 1 validate 0.9.3 error height > 0
## 2 validate 0.9.3 error height > 0
## 3 validate 0.9.3 error height > 0
## 4 validate 0.9.3 error height > 0
## 5 validate 0.9.3 error height > 0
## 6 validate 0.9.3 error height > 0
Or, merge with the summarized results. The result of summary is just a data.frame (tidy, isn’t it?).
merge(summary(cf),measure)
## name items passes fails nNA error warning expression label
## 1 V1 15 15 0 0 FALSE FALSE height > 0 height positive
## 2 V2 15 15 0 0 FALSE FALSE weight > 0 weight positive
## 3 V3 15 2 13 0 FALSE FALSE height/weight > 0.5 ratio limit
## description origin created language severity
## 1 2019-12-16 15:51:22 validate 0.9.3 error
## 2 2019-12-16 15:51:22 validate 0.9.3 error
## 3 2019-12-16 15:51:22 validate 0.9.3 error
## rule
## 1 height > 0
## 2 weight > 0
## 3 height/weight > 0.5
Conceptually, any R statement that evaluates to a logical is considered a validating statement. The validate package checks this when the user defines a rule set, so for example calling validator( mean(height) ) will result in a warning, since just computing mean(height) does not validate anything.
You will find a concise description of the syntax in the syntax help file.
?syntax
Examples of various types of rules can also be found here.
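In short, you can use
type checks: any function whose name starts with is.
binary comparisons: <, <=, ==, !=, >=, > and %in%
unary logical operators: !, all(), any()
binary logical operators: &, &&, |, || and logical implication, e.g. if (staff > 0) staff.costs > 0
pattern matching: grepl
functional dependencies, written as X ~ Y + Z
There are some convenience functions.
. refers to the data set as a whole, e.g. validator( nrow(.) > 10)
:= stores an intermediate result for reuse in later rules, e.g. validator(m := mean(x), x < 2*m )
var_group defines a group of variables to which a rule is applied, e.g. validator(G:=var_group(x,y), G > 0)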
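To make the syntax elements concrete, here is a small sketch that combines several of them in one rule set and applies it to the women data set. The particular rules and thresholds are made up for illustration and are not part of the original vignette.

```r
library(validate)
data(women)

# One rule per syntax element; thresholds are illustrative only.
rules <- validator(
    is.numeric(height)                  # type check
  , height >= 58                        # binary comparison
  , if (height > 70) weight > 150       # logical implication
  , nrow(.) >= 10                       # '.' refers to the whole data set
  , m := mean(weight)                   # intermediate result, not a rule itself
  , weight < 2 * m                      # uses the intermediate result
)

cf <- confront(women, rules)
res <- summary(cf)
print(res[, c("name", "items", "passes", "fails", "error")])
```

Rules that evaluate per record report as many items as there are rows, whereas rules such as the type check and the row count report a single item.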
The outcome of confronting a validator object with a data set is an object of class confrontation. There are several ways to extract information from a confrontation object.
summary: summarize output; returns a data.frame
aggregate: aggregate validation results in several ways
sort: aggregate and sort in several ways
values: get the values in an array, or a list of arrays if rules have different output dimension structures
errors: retrieve error messages caught during the confrontation
warnings: retrieve warning messages caught during the confrontation
By default aggregates are produced by rule.
cf <- check_that(women, height>0, weight>0,height/weight < 0.5)
aggregate(cf)
## npass nfail nNA rel.pass rel.fail rel.NA
## V1 15 0 0 1.0 0.0 0
## V2 15 0 0 1.0 0.0 0
## V3 12 3 0 0.8 0.2 0
To aggregate by record, use by='record':
head(aggregate(cf,by='record'))
## npass nfail nNA rel.pass rel.fail rel.NA
## 1 2 1 0 0.6666667 0.3333333 0
## 2 2 1 0 0.6666667 0.3333333 0
## 3 2 1 0 0.6666667 0.3333333 0
## 4 3 0 0 1.0000000 0.0000000 0
## 5 3 0 0 1.0000000 0.0000000 0
## 6 3 0 0 1.0000000 0.0000000 0
Aggregated results can be automatically sorted, so records with the most violations or rules that are violated most sort higher.
# rules with most violations sorting first:
sort(cf)
## npass nfail nNA rel.pass rel.fail rel.NA
## V3 12 3 0 0.8 0.2 0
## V1 15 0 0 1.0 0.0 0
## V2 15 0 0 1.0 0.0 0
Confrontation objects can be subsetted with single bracket operators (like vectors), to obtain a sub-object pertaining only to the selected rules.
summary(cf[c(1,3)])
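Aggregates tell you how many records fail, but not which ones. As a sketch of how the raw results might be used for that, the following extracts the outcomes with values and looks up the records that fail the third rule. It assumes that all rules produce one result per record, so that values returns a single logical matrix with one column per rule; the names out and failing are illustrative.

```r
library(validate)
data(women)

cf <- check_that(women, height > 0, weight > 0, height/weight < 0.5)

# One row per record, one column per rule (assumed: all rules are record-wise)
out <- values(cf)

# Row indices of records that fail the third rule
failing <- which(!out[, 3])

# Inspect the offending records
print(women[failing, ])
```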
By default, all errors and warnings are caught when validation rules are confronted with data. This can be switched off by setting the raise option to "errors" or "all". The following example contains a specification error: hite should be height, and therefore the rule errors on the women data.frame because it does not contain a column hite. The error is caught (not resulting in an R error) and shown in the summary:
v <- validator(hite > 0, weight>0)
summary(confront(women, v))
## name items passes fails nNA error warning expression
## 1 V1 0 0 0 0 TRUE FALSE hite > 0
## 2 V2 15 15 0 0 FALSE FALSE weight > 0
Setting raise to "all" results in an R error:
# this gives an error
confront(women, v, raise='all')
## Error in fun(...): object 'hite' not found
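The two modes can be combined in a script. The sketch below first inspects the caught message with errors, then shows one way to handle the raised error with base R's tryCatch. The variable names caught and msg are illustrative.

```r
library(validate)
data(women)

v <- validator(hite > 0, weight > 0)

# Default behaviour: the error is caught and stored in the result
cf <- confront(women, v)
caught <- errors(cf)      # messages for the rules that could not be evaluated
print(caught)

# With raise = "all" the error propagates, so handle it explicitly
msg <- tryCatch(
  {
    confront(women, v, raise = "all")
    NA_character_                      # only reached if no error was raised
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Catching by default is convenient for interactive exploration, while raising errors may be preferable in automated pipelines where a misspecified rule should stop the run.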
Linear equalities form an important class of validation rules. To prevent equalities from being tested strictly, there is an option called lin.eq.eps (with default value \(10^{-8}\)) that allows one to add some slack to these tests. The amount of slack is intended to prevent false negatives (unnecessary failures) caused by machine rounding. If you want to check whether a sum rule is satisfied to within one or two units of measurement, it is cleaner to define two inequalities for that.
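The reason for the slack can be seen in base R alone. The following sketch does not call validate. It only illustrates the idea of comparing with a small tolerance, and of expressing a coarser tolerance as two inequalities. The data in total and parts are hypothetical.

```r
# Machine rounding can make an exact equality fail
a <- 0.1 + 0.2
b <- 0.3
exact <- (a == b)                    # strict comparison of doubles

# Comparing with a small tolerance avoids the spurious failure
eps <- 1e-8
with_slack <- abs(a - b) <= eps

# A sum rule that should hold to within 2 units, written as two inequalities
total <- c(100, 101, 105)            # hypothetical reported totals
parts <- c(100, 100, 100)            # hypothetical sums of the parts
within_two <- (total - parts <= 2) & (total - parts >= -2)

print(exact)
print(with_slack)
print(within_two)
```

Writing the coarse tolerance as two explicit inequalities keeps the intended margin visible in the rule set instead of hiding it in a global option.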
Validator objects store a set of rules, optionally with some metadata per rule. Currently, the following functions can be used to get or set metadata:
origin: where was a rule defined?
names: the name per rule
created: when were the rules created?
label: short description of the rule
description: long description of the rule
meta: set or get generic metadata
For example, names can be set from the command line when defining a validator object.
v <- validator(rat = height/weight > 0.5, htest=height>0, wtest=weight > 0)
names(v)
## [1] "rat" "htest" "wtest"
Also try
names(v)[1] <- "ratio"
v
## Object of class 'validator' with 3 elements:
## ratio: height/weight > 0.5
## htest: height > 0
## wtest: weight > 0
It is also possible to add generic key-value pairs as metadata. Getting and setting follows the usual recycling rules of R.
# add 'foo' to the first rule:
meta(v[1],"foo") <- 1
# Add 'bar' to all rules
meta(v,"bar") <- "baz"
Metadata can be made visible by selecting a single rule:
v[[1]]
##
## Object of class rule.
## expr : height/weight > 0.5
## name : ratio
## label :
## description:
## origin : command-line
## created : 2019-12-16 15:51:22
## meta : language<chr>, severity<chr>, foo<nmr>, bar<chr>
Or by extracting it to a data.frame:
meta(v)
## name label description origin created language
## 1 ratio command-line 2019-12-16 15:51:22 validate 0.9.3
## 2 htest command-line 2019-12-16 15:51:22 validate 0.9.3
## 3 wtest command-line 2019-12-16 15:51:22 validate 0.9.3
## severity foo bar
## 1 error 1 baz
## 2 error NA baz
## 3 error NA baz
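Labels and descriptions can be set in the same way as names. The following is a small sketch that assumes the replacement forms of label and description accept one value per rule. The label and description texts are made up for illustration.

```r
library(validate)

v <- validator(rat = height/weight > 0.5, htest = height > 0)

# Assign a short label and a longer description to each rule
label(v) <- c("ratio limit", "height positive")
description(v) <- c(
  "Height-to-weight ratio must exceed 0.5",
  "Heights must be strictly positive"
)

labs <- label(v)
print(labs)
print(description(v))
```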
Some general information is obtained with summary:
summary(v)
## block nvar rules linear
## 1 1 2 3 2
Here, some properties per block of rules are given. Two rules occur in the same block if they share a variable. In this case, all rules occur in the same block.
The number of rules can be requested with length:
length(v)
With variables, the variables occurring per rule, or over all the rules, can be requested.
variables(v)
## [1] "height" "weight"
variables(v,as="matrix")
## variable
## rule height weight
## ratio TRUE TRUE
## htest TRUE FALSE
## wtest FALSE TRUE
Validator objects can be subsetted as if they were lists using the single and double bracket operators.
v[c(1,3)]
## Object of class 'validator' with 2 elements:
## ratio: height/weight > 0.5
## wtest: weight > 0
## Rules are evaluated using locally defined options
v[c('ratio','wtest')]
## Object of class 'validator' with 2 elements:
## ratio: height/weight > 0.5
## wtest: weight > 0
## Rules are evaluated using locally defined options
The double bracket can be used to inspect a single rule:
v[[1]]
##
## Object of class rule.
## expr : height/weight > 0.5
## name : ratio
## label :
## description:
## origin : command-line
## created : 2019-12-16 15:51:22
## meta : language<chr>, severity<chr>, foo<nmr>, bar<chr>
As simple as that. If you do
w <- v
for a validator object v, then w just points to the same physical object as v. To make an actual copy, you can select everything.
w <- v[]
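The difference matters as soon as one of the objects is modified. The sketch below renames a rule through the alias w and then compares the names of the original and of a copy made with empty brackets. Based on the description above, the change is expected to be visible through v but not through the copy. The object names w and u are illustrative.

```r
library(validate)

v <- validator(htest = height > 0, wtest = weight > 0)

w <- v        # alias: refers to the same underlying object
u <- v[]      # copy: selecting everything creates a new object

# Rename the first rule through the alias
names(w)[1] <- "renamed"

print(names(v))   # compare with the alias
print(names(u))   # compare with the copy
```

Making an explicit copy before modifying a rule set avoids accidentally changing a validator object that is still used elsewhere in a script.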