fauxnaif

Alexander Rossell Hayes

2020-08-02

Getting started

To demonstrate the basic functionality of fauxnaif, let’s first load the package and an example dataset.

library(fauxnaif)
fauxnaif::faux_census
#> # A tibble: 20 x 6
#>    state  gender               age race            income religion              
#>    <chr>  <chr>              <dbl> <chr>            <dbl> <chr>                 
#>  1 CA     female                80 Native American 2.80e4 Christian             
#>  2 NY     Woman                 89 Latino          1.49e5 Spiritual not religio~
#>  3 CA     Female                48 White           4.79e5 Catholic              
#>  4 TX     Male                  63 latinx          8.50e4 christian             
#>  5 PA     Male                  47 asian           4.19e4 Baptist               
#>  6 TX     Gender is a socia~    57 Race is a soci~ 1.00e7 Religion is the opiat~
#>  7 Canada Male                  49 white           1.49e5 methodist             
#>  8 TX     Female                50 White           9.88e4 Lutheran              
#>  9 NY     f                    557 white           9.07e4 Agnostic              
#> 10 WA     F                     33 White           4.50e4 Jewish                
#> 11 TX     Male                  30 White           1.27e5 none                  
#> 12 OH     Non-binary            42 Caucasian       2.16e4 Roman Catholic        
#> 13 NC     Female                22 African Americ~ 7.42e4 atheist               
#> 14 LA     Male                   2 White           6.10e4 Christian             
#> 15 LA     Female                28 Black           2.00e4 Not religious         
#> 16 CA     male                  34 Asian American  7.74e4 Christian             
#> 17 TN     M                     64 white           1.00e7 Nothing               
#> 18 FL     Female                68 white           4.71e4 None                  
#> 19 OH     Male                  39 black           2.38e4 baptist               
#> 20 NH     male                  73 Hispanic        3.32e4 Christian

We can see the example dataset in full above. The data is a small section of census-like information. This dataset needs a lot of cleaning. Other tools like dplyr and tidyr would likely be needed to really analyze this data, but we’ll focus on the aspects that can be handled by fauxnaif.

The most basic case

First, let’s look at the simplest issue in this dataset: income.

faux_census$income
#>  [1]   28000  148800  479000   85000   41900 9999999  149000   98800   90750
#> [10]   45010  127000   21600   74200   61000   20000   77400 9999999   47100
#> [19]   23800   33200

Printing the vector of incomes, one value stands out: while most respondents have incomes in the tens to hundreds of thousands, two respondents have incomes of 9999999. It’s common for datasets you receive from other sources to use an unrealistically high value (often a string of 9s) to indicate NA. We can clean this using na_if_in().

na_if_in(faux_census$income, 9999999)
#>  [1]  28000 148800 479000  85000  41900     NA 149000  98800  90750  45010
#> [11] 127000  21600  74200  61000  20000  77400     NA  47100  23800  33200

The resulting vector has NAs in place of those strings of 9s.

As an alternative, we can use the magrittr pipe (%>%) to pass an input into na_if_in():

faux_census$income %>% na_if_in(9999999)
#>  [1]  28000 148800 479000  85000  41900     NA 149000  98800  90750  45010
#> [11] 127000  21600  74200  61000  20000  77400     NA  47100  23800  33200

This produces the same result.

This task could also have been completed using the na_if() function included in the dplyr package. However, moving forward we will use more advanced functionality of fauxnaif.
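As a rough illustration of what this kind of replacement amounts to, the same result can be sketched in base R with logical indexing. The income vector is copied from the output above. This is only an equivalent for this simple case, not a description of how fauxnaif is implemented.

```r
# Incomes copied from faux_census$income as printed above
income <- c(28000, 148800, 479000, 85000, 41900, 9999999, 149000, 98800,
            90750, 45010, 127000, 21600, 74200, 61000, 20000, 77400,
            9999999, 47100, 23800, 33200)

# Replace the sentinel value with NA using logical indexing
income_clean <- income
income_clean[income_clean == 9999999] <- NA

# Positions that were replaced
which(is.na(income_clean))
#> [1]  6 17
```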

Replacing multiple values

Let’s now examine the age variable:

faux_census$age
#>  [1]  80  89  48  63  47  57  49  50 557  33  30  42  22   2  28  34  64  68  39
#> [20]  73

In this case, we see two improbable values: 557 and 2 (assuming this is a survey of adults). Using dplyr, this would have to be addressed using two steps:

faux_census$age %>% dplyr::na_if(557) %>% dplyr::na_if(2)
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73

But using fauxnaif we can simplify this to a single step:

faux_census$age %>% na_if_in(557, 2)
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
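For comparison, replacing several values at once can be sketched in base R with %in%. The age vector is copied from the output above, and bad_values is an illustrative name, not something fauxnaif provides.

```r
# Ages copied from faux_census$age as printed above
age <- c(80, 89, 48, 63, 47, 57, 49, 50, 557, 33, 30, 42, 22, 2, 28, 34,
         64, 68, 39, 73)

# Values we have decided are implausible (illustrative name)
bad_values <- c(557, 2)

# Replace every element that matches any of the bad values
age_clean <- age
age_clean[age_clean %in% bad_values] <- NA
age_clean
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
```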

Specifying values to keep rather than values to discard

In the above example, we were able to examine our dataset and select the values that were unrealistic. In real-life analyses, we often can’t look at each observation one by one to find unrealistic values, but we usually do know the range of realistic values. Using na_if_not(), we can specify which values are realistic and discard those that are not.

Returning to the age variable, let’s replace values with NA if they are not between 18 (the minimum age we expect to enter the survey) and 122 (the world record for the oldest person).

faux_census$age %>% na_if_not(18:122)
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73

This has the same effect as specifying the unrealistic values directly, but no longer requires you to directly examine each observation.
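The "keep these, discard the rest" idea can be sketched in base R by negating the membership test. The helper keep_only() below is hypothetical and written only for illustration. It is not part of fauxnaif.

```r
# Ages copied from faux_census$age as printed above
age <- c(80, 89, 48, 63, 47, 57, 49, 50, 557, 33, 30, 42, 22, 2, 28, 34,
         64, 68, 39, 73)

# Hypothetical helper: keep allowed values, set everything else to NA
keep_only <- function(x, allowed) {
  x[!(x %in% allowed)] <- NA
  x
}

keep_only(age, 18:122)
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
```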

Replacing values using formulas

Another way to approach this problem is to use a formula to specify the range of acceptable values. This is particularly useful when dealing with non-integer values, which never match the integer sequence generated by the colon operator (:):

23 %in% 18:122
#> [1] TRUE

but

23.5 %in% 18:122
#> [1] FALSE

Formulas in fauxnaif are based on the formula syntax used in rlang and purrr. They are introduced with a tilde (~) and indicate each observation with a dot (.).

To clean the age variable, we will need two formulas. One will replace values less than 18 and another will replace values greater than 122:

faux_census$age %>% na_if_in(~ . < 18, ~ . > 122)
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
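The two conditions in these formulas correspond to ordinary vectorized comparisons. The base R sketch below uses a made-up vector that includes non-integer ages. It shows why comparisons work where an integer sequence does not.

```r
# Illustrative ages, including non-integer values (not from faux_census)
ages <- c(17.9, 18, 23.5, 122, 122.5)

# Membership in an integer sequence rejects valid non-integer values
ages %in% 18:122
#> [1] FALSE  TRUE FALSE  TRUE FALSE

# Relational comparisons handle any numeric value
out_of_range <- ages < 18 | ages > 122
ages[out_of_range] <- NA
ages
#> [1]    NA  18.0  23.5 122.0    NA
```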

Using relational operators from other packages

If you really want to get this down to a single argument, you can use more advanced relational operators provided by packages like intrval, inops, or invctr.

For example, intrval’s closed interval operator (%[]%) allows you to check if a value is between two values, even if it is not an integer:

library(intrval)

23.5 %[]% c(18, 122)
#> [1] TRUE

With this, we can clean the age variable using only one formula argument:

faux_census$age %>% na_if_not(~ . %[]% c(18, 122))
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73

or

faux_census$age %>% na_if_in(~ . %)(% c(18, 122))
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
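When writing the "discard" form of an interval check, the treatment of the endpoints matters. The base R sketch below uses a hypothetical helper, in_closed(), which is not taken from intrval. It shows that the exact complement of a closed interval uses strict inequalities, so the boundary values 18 and 122 are kept.

```r
# Hypothetical helper (not from intrval): TRUE when lo <= x <= hi
in_closed <- function(x, lo, hi) x >= lo & x <= hi

x <- c(17.5, 18, 23.5, 122, 122.5)
in_closed(x, 18, 122)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE

# The exact complement uses strict inequalities, so 18 and 122 are kept
outside <- x < 18 | x > 122
identical(outside, !in_closed(x, 18, 122))
#> [1] TRUE
```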

Using formulas for non-numeric variables

Formulas are not only useful when dealing with numeric variables. While it’s straightforward to use relational operators to specify replacements in numeric variables, we can also use more complex formulas to handle other data types.

Let’s take a look at the religion variable:

faux_census$religion
#>  [1] "Christian"                           
#>  [2] "Spiritual not religious"             
#>  [3] "Catholic"                            
#>  [4] "christian"                           
#>  [5] "Baptist"                             
#>  [6] "Religion is the opiate of the people"
#>  [7] "methodist"                           
#>  [8] "Lutheran"                            
#>  [9] "Agnostic"                            
#> [10] "Jewish"                              
#> [11] "none"                                
#> [12] "Roman Catholic"                      
#> [13] "atheist"                             
#> [14] "Christian"                           
#> [15] "Not religious"                       
#> [16] "Christian"                           
#> [17] "Nothing"                             
#> [18] "None"                                
#> [19] "baptist"                             
#> [20] "Christian"

While there are a few things we might want to clean in this variable, one clear issue is the respondent who did not answer the question but instead used the space to give an opinion: “Religion is the opiate of the people”.

We could use the most basic form of na_if_in() to simply remove this answer:

faux_census$religion %>% na_if_in("Religion is the opiate of the people")
#>  [1] "Christian"               "Spiritual not religious"
#>  [3] "Catholic"                "christian"              
#>  [5] "Baptist"                 NA                       
#>  [7] "methodist"               "Lutheran"               
#>  [9] "Agnostic"                "Jewish"                 
#> [11] "none"                    "Roman Catholic"         
#> [13] "atheist"                 "Christian"              
#> [15] "Not religious"           "Christian"              
#> [17] "Nothing"                 "None"                   
#> [19] "baptist"                 "Christian"

But in a larger analysis, we may prefer to have a simple rule for excluding answers. Perhaps we decide that answers longer than 25 characters are unlikely to be genuine. In that case, we can use a formula operating on the number of characters (nchar(.)) in a response:

faux_census$religion %>% na_if_in(~ nchar(.) > 25)
#>  [1] "Christian"               "Spiritual not religious"
#>  [3] "Catholic"                "christian"              
#>  [5] "Baptist"                 NA                       
#>  [7] "methodist"               "Lutheran"               
#>  [9] "Agnostic"                "Jewish"                 
#> [11] "none"                    "Roman Catholic"         
#> [13] "atheist"                 "Christian"              
#> [15] "Not religious"           "Christian"              
#> [17] "Nothing"                 "None"                   
#> [19] "baptist"                 "Christian"
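To see how the cutoff behaves, it can help to inspect the character counts directly. The base R sketch below uses four of the responses printed above. Note that "Spiritual not religious" has 23 characters, so a threshold of 25 keeps it by only a small margin.

```r
# A few responses copied from faux_census$religion as printed above
religion <- c("Christian", "Spiritual not religious",
              "Religion is the opiate of the people", "Roman Catholic")

# Number of characters in each response
nchar(religion)
#> [1]  9 23 36 14

# Replace responses longer than 25 characters with NA
religion_clean <- religion
religion_clean[nchar(religion_clean) > 25] <- NA
which(is.na(religion_clean))
#> [1] 3
```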

Replacing values using functions

Finally, there are cases when we can use a simple function to replace values.

Returning to the income variable, we know that NA is indicated using the largest value. Rather than specifying it directly, we can simply tell fauxnaif to replace the variable’s maximum value:

faux_census$income %>% na_if_in(max)
#>  [1]  28000 148800 479000  85000  41900     NA 149000  98800  90750  45010
#> [11] 127000  21600  74200  61000  20000  77400     NA  47100  23800  33200

We can do the same with the age variable, where both the lowest and highest values are unrealistic:

faux_census$age %>% na_if_in(min, max)
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73

But beware! If no respondent had given an unrealistic answer, replacing the minimum or maximum values could result in the loss of real data! It is often better to use more complicated formulas than simpler functions for your replacements.
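The risk is easy to reproduce. In the base R sketch below, the ages are made up and all of them are plausible. Replacing the minimum and maximum throws away two real observations, while a range-based rule leaves the data untouched.

```r
# Illustrative ages with no unrealistic values (not from faux_census)
clean_age <- c(22, 28, 30, 33, 34)

# Replacing the extremes discards real data
by_extremes <- clean_age
by_extremes[by_extremes %in% c(min(clean_age), max(clean_age))] <- NA
by_extremes
#> [1] NA 28 30 33 NA

# A range-based rule removes nothing here
by_range <- clean_age
by_range[by_range < 18 | by_range > 122] <- NA
by_range
#> [1] 22 28 30 33 34
```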

Replacing values in data frames

Often in data analysis, we prefer to work within a single data frame rather than operate on individual vectors. fauxnaif is built to handle this use case.

A simple solution is to use na_if_in() or na_if_not() within dplyr’s mutate() function.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:fauxnaif':
#> 
#>     na_if
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

faux_census %>% mutate(income = na_if_in(income, 9999999))
#> # A tibble: 20 x 6
#>    state  gender               age race            income religion              
#>    <chr>  <chr>              <dbl> <chr>            <dbl> <chr>                 
#>  1 CA     female                80 Native American  28000 Christian             
#>  2 NY     Woman                 89 Latino          148800 Spiritual not religio~
#>  3 CA     Female                48 White           479000 Catholic              
#>  4 TX     Male                  63 latinx           85000 christian             
#>  5 PA     Male                  47 asian            41900 Baptist               
#>  6 TX     Gender is a socia~    57 Race is a soci~     NA Religion is the opiat~
#>  7 Canada Male                  49 white           149000 methodist             
#>  8 TX     Female                50 White            98800 Lutheran              
#>  9 NY     f                    557 white            90750 Agnostic              
#> 10 WA     F                     33 White            45010 Jewish                
#> 11 TX     Male                  30 White           127000 none                  
#> 12 OH     Non-binary            42 Caucasian        21600 Roman Catholic        
#> 13 NC     Female                22 African Americ~  74200 atheist               
#> 14 LA     Male                   2 White            61000 Christian             
#> 15 LA     Female                28 Black            20000 Not religious         
#> 16 CA     male                  34 Asian American   77400 Christian             
#> 17 TN     M                     64 white               NA Nothing               
#> 18 FL     Female                68 white            47100 None                  
#> 19 OH     Male                  39 black            23800 baptist               
#> 20 NH     male                  73 Hispanic         33200 Christian

Replacing values in multiple columns

Sometimes, the same replacement function can be used in multiple columns. Here, the respondent who didn’t give a real answer to the religion question seemed to do the same with the gender and race questions. You can specify multiple columns using dplyr’s across() if you would like to make replacements based on the same criteria:

faux_census %>%
  mutate(across(c(religion, gender, race), na_if_in, ~ nchar(.) > 25))
#> # A tibble: 20 x 6
#>    state  gender       age race              income religion               
#>    <chr>  <chr>      <dbl> <chr>              <dbl> <chr>                  
#>  1 CA     female        80 Native American    28000 Christian              
#>  2 NY     Woman         89 Latino            148800 Spiritual not religious
#>  3 CA     Female        48 White             479000 Catholic               
#>  4 TX     Male          63 latinx             85000 christian              
#>  5 PA     Male          47 asian              41900 Baptist                
#>  6 TX     <NA>          57 <NA>             9999999 <NA>                   
#>  7 Canada Male          49 white             149000 methodist              
#>  8 TX     Female        50 White              98800 Lutheran               
#>  9 NY     f            557 white              90750 Agnostic               
#> 10 WA     F             33 White              45010 Jewish                 
#> 11 TX     Male          30 White             127000 none                   
#> 12 OH     Non-binary    42 Caucasian          21600 Roman Catholic         
#> 13 NC     Female        22 African American   74200 atheist                
#> 14 LA     Male           2 White              61000 Christian              
#> 15 LA     Female        28 Black              20000 Not religious          
#> 16 CA     male          34 Asian American     77400 Christian              
#> 17 TN     M             64 white            9999999 Nothing                
#> 18 FL     Female        68 white              47100 None                   
#> 19 OH     Male          39 black              23800 baptist                
#> 20 NH     male          73 Hispanic           33200 Christian
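The idea of applying one rule to several named columns can also be sketched in base R with lapply(). The data frame below is a hypothetical toy example, because the long gender and race answers are truncated in the printed tibble. Only the religion response is copied from the output above.

```r
# Hypothetical toy data (only the long religion answer comes from faux_census)
survey <- data.frame(
  gender   = c("Female", "This is a long comment rather than an answer", "Male"),
  race     = c("White", "Another long comment rather than an answer", "Black"),
  religion = c("Catholic", "Religion is the opiate of the people", "None"),
  income   = c(479000, 9999999, 23800),
  stringsAsFactors = FALSE
)

# Apply the same length rule to each selected column
text_cols <- c("gender", "race", "religion")
survey[text_cols] <- lapply(survey[text_cols], function(col) {
  col[nchar(col) > 25] <- NA
  col
})

# Row 2 is now NA in the three text columns; income is untouched
survey[2, ]
```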

Replacing values using a predicate function

Rather than specifying columns manually, we can also select columns using a predicate function with dplyr’s where().

For example, we may want to replace any value containing a string of 9s ("999") in every numeric column:

faux_census %>% mutate(across(where(is.numeric), na_if_in, ~ grepl("999", .)))
#> # A tibble: 20 x 6
#>    state  gender               age race            income religion              
#>    <chr>  <chr>              <dbl> <chr>            <dbl> <chr>                 
#>  1 CA     female                80 Native American  28000 Christian             
#>  2 NY     Woman                 89 Latino          148800 Spiritual not religio~
#>  3 CA     Female                48 White           479000 Catholic              
#>  4 TX     Male                  63 latinx           85000 christian             
#>  5 PA     Male                  47 asian            41900 Baptist               
#>  6 TX     Gender is a socia~    57 Race is a soci~     NA Religion is the opiat~
#>  7 Canada Male                  49 white           149000 methodist             
#>  8 TX     Female                50 White            98800 Lutheran              
#>  9 NY     f                    557 white            90750 Agnostic              
#> 10 WA     F                     33 White            45010 Jewish                
#> 11 TX     Male                  30 White           127000 none                  
#> 12 OH     Non-binary            42 Caucasian        21600 Roman Catholic        
#> 13 NC     Female                22 African Americ~  74200 atheist               
#> 14 LA     Male                   2 White            61000 Christian             
#> 15 LA     Female                28 Black            20000 Not religious         
#> 16 CA     male                  34 Asian American   77400 Christian             
#> 17 TN     M                     64 white               NA Nothing               
#> 18 FL     Female                68 white            47100 None                  
#> 19 OH     Male                  39 black            23800 baptist               
#> 20 NH     male                  73 Hispanic         33200 Christian
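One caveat with this kind of pattern is that grepl() converts numbers to text and matches substrings, so a plausible value that happens to contain "999" is also flagged. The base R sketch below uses hypothetical toy data to show this, along with a stricter anchored pattern as one possible alternative.

```r
# Hypothetical toy data (not from faux_census)
survey <- data.frame(
  state  = c("CA", "TX", "OH"),
  age    = c(48, 57, 999),
  income = c(479000, 9999999, 199900),
  stringsAsFactors = FALSE
)

# A substring match also flags the plausible income 199900
grepl("999", survey$income)
#> [1] FALSE  TRUE  TRUE

# Select numeric columns, then replace values made up only of 9s (three or more)
numeric_cols <- vapply(survey, is.numeric, logical(1))
survey[numeric_cols] <- lapply(survey[numeric_cols], function(col) {
  col[grepl("^9{3,}$", col)] <- NA
  col
})

survey$income
#> [1] 479000     NA 199900
```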

Replacing values in all columns

While the replacement of responses longer than 25 characters was intended for three specific columns, no variable contains a legitimate answer longer than 25 characters. In this case, rather than specifying the variables of interest, we can simply use dplyr’s everything() to make the replacement in all columns:

faux_census %>% mutate(across(everything(), na_if_in, ~ nchar(.) > 25))
#> # A tibble: 20 x 6
#>    state  gender       age race              income religion               
#>    <chr>  <chr>      <dbl> <chr>              <dbl> <chr>                  
#>  1 CA     female        80 Native American    28000 Christian              
#>  2 NY     Woman         89 Latino            148800 Spiritual not religious
#>  3 CA     Female        48 White             479000 Catholic               
#>  4 TX     Male          63 latinx             85000 christian              
#>  5 PA     Male          47 asian              41900 Baptist                
#>  6 TX     <NA>          57 <NA>             9999999 <NA>                   
#>  7 Canada Male          49 white             149000 methodist              
#>  8 TX     Female        50 White              98800 Lutheran               
#>  9 NY     f            557 white              90750 Agnostic               
#> 10 WA     F             33 White              45010 Jewish                 
#> 11 TX     Male          30 White             127000 none                   
#> 12 OH     Non-binary    42 Caucasian          21600 Roman Catholic         
#> 13 NC     Female        22 African American   74200 atheist                
#> 14 LA     Male           2 White              61000 Christian              
#> 15 LA     Female        28 Black              20000 Not religious          
#> 16 CA     male          34 Asian American     77400 Christian              
#> 17 TN     M             64 white            9999999 Nothing                
#> 18 FL     Female        68 white              47100 None                   
#> 19 OH     Male          39 black              23800 baptist                
#> 20 NH     male          73 Hispanic           33200 Christian

Putting it all together

In a data analysis pipeline, we can combine several steps to produce a usable dataset. Combining our range check for age, our check for strings of 9s in numeric variables, and our check for long responses in all variables, we can produce much cleaner data:

faux_census %>%
  mutate(
    age = na_if_not(age, 18:122),
    across(where(is.numeric), na_if_in, ~ grepl("999", .)),
    across(everything(), na_if_in, ~ nchar(.) > 25)
  )
#> # A tibble: 20 x 6
#>    state  gender       age race             income religion               
#>    <chr>  <chr>      <dbl> <chr>             <dbl> <chr>                  
#>  1 CA     female        80 Native American   28000 Christian              
#>  2 NY     Woman         89 Latino           148800 Spiritual not religious
#>  3 CA     Female        48 White            479000 Catholic               
#>  4 TX     Male          63 latinx            85000 christian              
#>  5 PA     Male          47 asian             41900 Baptist                
#>  6 TX     <NA>          57 <NA>                 NA <NA>                   
#>  7 Canada Male          49 white            149000 methodist              
#>  8 TX     Female        50 White             98800 Lutheran               
#>  9 NY     f             NA white             90750 Agnostic               
#> 10 WA     F             33 White             45010 Jewish                 
#> 11 TX     Male          30 White            127000 none                   
#> 12 OH     Non-binary    42 Caucasian         21600 Roman Catholic         
#> 13 NC     Female        22 African American  74200 atheist                
#> 14 LA     Male          NA White             61000 Christian              
#> 15 LA     Female        28 Black             20000 Not religious          
#> 16 CA     male          34 Asian American    77400 Christian              
#> 17 TN     M             64 white                NA Nothing                
#> 18 FL     Female        68 white             47100 None                   
#> 19 OH     Male          39 black             23800 baptist                
#> 20 NH     male          73 Hispanic          33200 Christian