Labelled Data and the sjlabelled-Package

Daniel Lüdecke

2020-06-25

This package provides functions to read and write data between R and other statistical software packages like SPSS, SAS or Stata and to work with labelled data; this includes easy ways to get and set label attributes, to convert labelled vectors into factors (and vice versa), or to deal with multiple declared missing values etc.

This vignette gives an overview of functions to work with labelled data.

Labelled Data

Labelled data (or labelled vectors) is a common data structure in other statistical environments to store meta-information about variables, like variable names, value labels or multiple defined missing values.

Labelled data not only extends R’s capabilities to deal with proper value and variable labels, but also makes it possible to represent different types of missing values, as in other statistical software packages. Regular R missing values cannot distinguish between multiple declared missings the way ‘SPSS’ or ‘SAS’ can. However, the haven-package introduced tagged_na values, which can do this. Tagged NAs work exactly like regular R missing values except that they store one additional byte of information: a tag, which is usually a letter (“a” to “z”) but may also be a digit given as a character (“0” to “9”). This tag makes it possible to distinguish between different kinds of missing values.
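As a minimal sketch of tagged missing values, assuming the haven-package is installed (the vector x is only an illustration):

```r
library(haven)

# Two valid values followed by two differently tagged missing values
x <- c(1, 2, tagged_na("a"), tagged_na("z"))

is.na(x)              # tagged NAs count as missing, like regular NA
na_tag(x)             # tag of each element, NA where there is no tag
is_tagged_na(x, "a")  # TRUE only for missing values tagged "a"
```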

Functions of sjlabelled do not necessarily require vectors of class labelled or haven_labelled. The labelled class, implemented by the packages haven and labelled, may cause trouble with other packages, so it is only intended as an intermediate data structure that should be converted to common R classes. However, coercing a labelled vector to other classes (like factor or numeric) typically means that meta-information like value and variable label attributes is lost. In fact, there is no need to drop these attributes for vectors that are not of class labelled. Functions like lm() simply copy these attributes to the data that is included in the returned object. Packages like sjPlot use labelled data to annotate data visualizations. sjlabelled supports working with labelled data and offers functions to benefit from these features.

Note: Since version 2.0 of the haven-package, the labelled class attribute has been changed to haven_labelled to avoid conflicts with the Hmisc-package.

Labelled Data in haven and labelled

The labelled-package is intended to support labelled / haven_labelled metadata structures, thus the data structure of labelled vectors in haven and labelled is the same.

Labelled data in this format stores information about value labels, variable names and multiple defined missing values. However, variable names are only part of this information if data was imported with one of haven’s read-functions. Adding a variable label attribute is (at least up to version 1.0.0) not possible via the labelled()-constructor method.

A labelled vector can either be a numeric or character vector. Conversion to factors copies the value labels as factor levels, but drops the label attributes and missing information.
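The following sketch builds a labelled vector with haven and converts it to a factor. The values and labels are made up for illustration, and the exact printed output depends on the installed haven version.

```r
library(haven)

# Numeric vector with two labelled values and three labelled tagged NAs
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
print(x)
is.na(x)

# Convert to a factor and inspect what is left of the meta-information
f <- as_factor(x)
levels(f)
attr(f, "labels")
is.na(f)
```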

Labelled Data in sjlabelled

sjlabelled supports label attributes in haven-style (label and labels). You’re not restricted to the labelled class for vectors when working with sjlabelled and labelled data. Hence, you can have vectors of common R classes and still use information like variable or value labels.
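For illustration, the efc sample data shipped with sjlabelled (also used later in this vignette) contains plain numeric vectors that carry label attributes:

```r
library(sjlabelled)
data(efc)

# A common numeric vector, not of class labelled ...
class(efc$e16sex)

# ... that still carries "label" and "labels" attributes
str(efc$e16sex)
get_label(efc$e16sex)
get_labels(efc$e16sex)
```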

Value Labels

Getting value labels

The get_labels()-method is a generic method to return value labels of a vector or data frame.

You can prefix the value labels with the associated values or return them as named vector with the values argument.
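A short sketch of both forms of the values argument, using the variable e42dep from the efc sample data:

```r
library(sjlabelled)
data(efc)

# value labels only
get_labels(efc$e42dep)

# labels prefixed with their associated values
get_labels(efc$e42dep, values = "p")

# labels as a named vector, with the values as names
get_labels(efc$e42dep, values = "n")
```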

get_labels() also returns “labels” of factors, even if the factor has no label attributes.

To ensure that labels are only returned for vectors with label-attribute, use the attr.only argument.
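The difference can be sketched with a factor that has no label attributes (x is a made-up example):

```r
library(sjlabelled)

x <- factor(c("low", "high", "mid", "high", "low"))

# factor levels are returned as "labels"
get_labels(x)

# with attr.only = TRUE, nothing is returned for this factor
get_labels(x, attr.only = TRUE)
```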

If a vector has a label attribute, only these labels are returned. Non-labelled values are excluded from the output by default. However, you can add non-labelled values to the return value as well, using the non.labelled argument.
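A small sketch of this behaviour, assuming a hypothetical vector where only the lowest and highest values are labelled (the attribute is set with base R to keep the example explicit):

```r
library(sjlabelled)

x <- c(1, 2, 3, 4, 5)
attr(x, "labels") <- c("low" = 1, "high" = 5)

# labels of labelled values only
get_labels(x)

# also include the values that have no label
get_labels(x, non.labelled = TRUE)
```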

Tagged missing values can also be included in the output by setting drop.na = FALSE.
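As a sketch, assuming a labelled vector with tagged NAs created with haven (values and labels are illustrative):

```r
library(haven)
library(sjlabelled)

x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)

# labels of tagged NAs are dropped by default
get_labels(x)

# keep them, returned as a named vector
get_labels(x, values = "n", drop.na = FALSE)
```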

Getting labelled values

The get_values() method returns the values for labelled values (i.e. values that have an associated label). We still use the vector x from the above examples.

With drop.na = TRUE you can omit those values from the return value that are defined as missing.
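A minimal sketch; the tagged vector is rebuilt here so that the snippet runs on its own:

```r
library(haven)
library(sjlabelled)

x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)

# all labelled values, including tagged NAs
get_values(x)

# labelled values without those defined as missing
get_values(x, drop.na = TRUE)
```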

Setting value labels

With set_labels() you can add label attributes to any vector.
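For instance, with a made-up vector of four distinct values and one label per value:

```r
library(sjlabelled)

x <- c(1, 2, 3, 4, 2, 1, 4, 3)
x <- set_labels(x, labels = c("very low", "low", "mid", "hi"))

str(x)
get_labels(x, values = "p")
```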

If more labels than values are given, only as many labels are used as there are values.

However, you can force the use of all labels, even for values that are not in the vector, by setting force.labels = TRUE.
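Both cases can be sketched with a vector that has two distinct values but is given three labels (expect a message or warning in the first call, since a label is left over):

```r
library(sjlabelled)

x <- c(2, 2, 3, 3, 2)

# default: surplus labels are not used
a <- set_labels(x, labels = c("a", "b", "c"))
get_labels(a, values = "p")

# force.labels = TRUE keeps all three labels
b <- set_labels(x, labels = c("a", "b", "c"), force.labels = TRUE)
get_labels(b, values = "p")
```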

For vectors with more unique values than labels, additional labels for non-labelled values are added.

Set force.values = FALSE to add only those labels that have been passed as argument.
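A sketch of both settings, using a hypothetical vector with four distinct values and only three labels:

```r
library(sjlabelled)

x <- c(1, 2, 3, 2, 4, NA)

# default: the unlabelled value gets an additional label
a <- set_labels(x, labels = c("yes", "maybe", "no"))
get_labels(a, values = "p")

# force.values = FALSE: only the three given labels are set
b <- set_labels(x, labels = c("yes", "maybe", "no"), force.values = FALSE)
get_labels(b, values = "p")
```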

To add explicit labels for values (without adding more labels than wanted and without dropping labels for values that do not appear in the vector), use a named vector of labels as argument. The arguments force.values and force.labels are ignored when using named vectors.
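As a sketch, assuming the haven-style convention in which the names of the vector are the labels and its elements are the values they belong to (the labels themselves are invented):

```r
library(sjlabelled)

x <- c(1, 2, 3, 2, 4, 5)

# label only values 1, 4 and 5, plus value 9, which does not occur in x
x <- set_labels(
  x,
  labels = c("strongly agree" = 1, "totally disagree" = 4,
             "refused" = 5, "missing" = 9)
)

get_labels(x, values = "p")
```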

If you want to set different value labels for each variable of a complete data frame, provide the labels as a list. For each variable in the data frame, provide a list element with value labels as character vector. Note that the length of the list must be equal to the number of variables (columns) in the data frame.
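A small sketch with a three-column data frame and one character vector of labels per column (data and labels are illustrative):

```r
library(sjlabelled)

tmp <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(1, 2, 3))

# one list element per column, in column order
labels <- list(
  c("one", "two", "three"),
  c("eins", "zwei", "drei"),
  c("un", "dos", "tres")
)

tmp <- set_labels(tmp, labels = labels)
str(tmp)
```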

You can use set_labels() within a pipe-workflow with dplyr.
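One possible pipe-workflow, assuming dplyr is installed; the data frame is made up, and the same labels are applied to all selected columns:

```r
library(dplyr)
library(sjlabelled)

tmp <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(1, 2, 3))

# select two columns, then label their values in the same pipe
out <- tmp %>%
  select(a, b) %>%
  set_labels(labels = c("one", "two", "three"))

str(out)
```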

Variable Labels

Getting variable labels

The get_label()-method returns the variable label of a vector or all variable labels from a data frame.
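For example, with variables from the efc sample data:

```r
library(sjlabelled)
data(efc)

# variable label of a single vector
get_label(efc$e42dep)

# variable labels of several columns of a data frame
get_label(efc[, c("e42dep", "e16sex", "c84cop3")])
```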

If a vector has no variable label, NULL is returned. However, get_label() also allows returning a default value instead of NULL (via the def.value argument) in case the vector has no label attribute. This is useful in combination with deparse(substitute()) in function calls, so that, for instance, the name of the vector can be used as default value if no variable labels are present.
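A sketch of this pattern; testit() is a hypothetical helper written only for this example:

```r
library(sjlabelled)

dummy <- c(1, 2, 3)

# no label attribute, so NULL is returned
get_label(dummy)

# fall back to the name of the object that was passed in
testit <- function(x) get_label(x, def.value = deparse(substitute(x)))
testit(dummy)
```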

If you want human-readable labels, you can use the case-argument, which will pass the labels to a string parser in the snakecase-package.
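A sketch using the iris data, which has no variable labels, so the column names act as default values. It assumes the snakecase-package is installed; "parsed" is one of the case options that package provides.

```r
library(sjlabelled)
data(iris)

# column names are returned as default values
get_label(iris, def.value = colnames(iris))

# the same defaults, passed through the snakecase string parser
get_label(iris, def.value = colnames(iris), case = "parsed")
```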

Setting variable labels

The set_label() function adds the variable label attribute to a vector. You can either return a new vector, or label an existing vector.
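Both variants, sketched with a made-up vector:

```r
library(sjlabelled)

x <- c(1, 2, 3, 4, 2, 1)

# return a new, labelled vector
x <- set_label(x, label = "Dummy-variable")
str(x)

# label an existing vector with the replacement form
set_label(x) <- "Another Dummy-variable"
get_label(x)
```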

set_label() can also set variable labels for a data frame. In this case, the variable attributes get an additional name attribute with the vector’s name. This makes it easier to see which label belongs to which vector.
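For a data frame, one label per column can be passed as a character vector (data and labels are illustrative):

```r
library(sjlabelled)

dat <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
dat <- set_label(dat, label = c("Variable A", "Variable B", "Variable C"))

str(dat)
get_label(dat)
```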

An alternative to set_label() is var_labels(), which also works within pipe-workflows. var_labels() takes named arguments: each name must match a column name of the input, and each value is the variable label to set for that column.
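A minimal sketch without a pipe; the column names and labels are invented:

```r
library(sjlabelled)

dat <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# each argument name refers to a column, each value is its label
dat <- var_labels(dat, a = "My first variable", b = "My second variable")

get_label(dat)
```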

Missing Values

Defining missing values

set_na() converts values of a vector or of multiple vectors in a data frame into NAs. With as.tag = TRUE, set_na() creates tagged NA values, which means that these missing values get an information tag and a value label (which is, by default, the former value that was converted to NA). You can either return a new vector/data frame, or set NAs in an existing vector/data frame.
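A sketch with a made-up numeric vector in which the values 5 and 8 are declared as missing:

```r
library(sjlabelled)

x <- c(1, 2, 5, 8, 3, 5, 8, 4)

# return a new vector where 5 and 8 become tagged NAs
x <- set_na(x, na = c(5, 8), as.tag = TRUE)

x
is.na(x)
```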

Getting missing values

The get_na() function returns all tagged NA values. We still use the vector x from the previous example.

To see the tags of the NA values, use the as.tag argument.
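For illustration, the tagged vector is rebuilt here so that the snippet runs on its own:

```r
library(sjlabelled)

x <- c(1, 2, 5, 8, 3, 5, 8, 4)
x <- set_na(x, na = c(5, 8), as.tag = TRUE)

# tagged NA values and their labels
get_na(x)

# show the tags instead
get_na(x, as.tag = TRUE)
```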

Replacing specific NA with values

While set_na() allows you to replace values with (tagged) NA’s, replace_na() (from package sjmisc) allows you to replace either all NA values of a vector or specific tagged NA values with a non-NA value.

library(sjmisc) # for replace_na()
data(efc)
str(efc$c84cop3)
#>  num [1:908] 2 3 1 3 1 3 4 2 3 1 ...
#>  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#>  - attr(*, "labels")= Named num [1:4] 1 2 3 4
#>   ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"

efc$c84cop3 <- set_na(efc$c84cop3, na = c(2, 3), as.tag = TRUE)
get_na(efc$c84cop3, as.tag = TRUE)
#> Sometimes     Often 
#>   "NA(2)"   "NA(3)"

# this replaces all NA's with the value 2
dummy <- replace_na(efc$c84cop3, value = 2)

# labels of former tagged NA's are preserved
get_labels(dummy, drop.na = FALSE, values = "p")
#> [1] "[1] Never"         "[4] Always"        "[NA(2)] Sometimes"
#> [4] "[NA(3)] Often"
get_na(dummy, as.tag = TRUE)
#> Sometimes     Often 
#>   "NA(2)"   "NA(3)"

# No more NA values
frq(dummy)
#> 
#> does caregiving cause difficulties in your relationship with your friends? (x) <numeric>
#> # total N=908  valid N=908  mean=1.55  sd=0.77
#> 
#> Value |  Label |   N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#>     1 |  Never | 516 | 56.83 |   56.83 |  56.83
#>     2 |      2 | 340 | 37.44 |   37.44 |  94.27
#>     4 | Always |  52 |  5.73 |    5.73 | 100.00
#>  <NA> |   <NA> |   0 |  0.00 |    <NA> |   <NA>


# In this example, the tagged NA(2) is replaced with value 2
# the new value label for value 2 is "restored NA"
dummy <- replace_na(efc$c84cop3, value = 2, na.label = "restored NA", tagged.na = "2")

# Only one tagged NA remains
get_labels(dummy, drop.na = FALSE, values = "p")
#> [1] "[1] Never"       "[2] restored NA" "[4] Always"      "[NA(3)] Often"
get_na(dummy, as.tag = TRUE)
#>   Often 
#> "NA(3)"

# Some NA values remain
frq(dummy)
#> 
#> does caregiving cause difficulties in your relationship with your friends? (x) <numeric>
#> # total N=908  valid N=820  mean=1.50  sd=0.79
#> 
#> Value |       Label |   N | Raw % | Valid % | Cum. %
#> ----------------------------------------------------
#>     1 |       Never | 516 | 56.83 |   62.93 |  62.93
#>     2 | restored NA | 252 | 27.75 |   30.73 |  93.66
#>     4 |      Always |  52 |  5.73 |    6.34 | 100.00
#>  <NA> |        <NA> |  88 |  9.69 |    <NA> |   <NA>

Replacing value labels

With replace_labels(), you can replace (change) value labels of labelled values. This can also be used to change the labels of tagged missing values. Make sure to know the missing tag, which can be accessed via get_na().
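A sketch of relabelling a tagged missing value, assuming the haven-style convention of label = value and a hypothetical vector x; the tag "2" is the one that set_na() assigns to the former value 2:

```r
library(haven)
library(sjlabelled)

x <- c(1, 2, 3, 4, 2, 3, 1)
x <- set_labels(x, labels = c("Never", "Sometimes", "Often", "Always"))

# declare 2 and 3 as tagged missing values and look up their tags
x <- set_na(x, na = c(2, 3), as.tag = TRUE)
get_na(x, as.tag = TRUE)

# replace the label of the missing value tagged "2"
x <- replace_labels(x, labels = c("new NA label" = tagged_na("2")))
get_labels(x, drop.na = FALSE, values = "p")
```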