An introduction to hablar

The mission of hablar is for you to get non-astonishing results! That means that functions return what you expect. R has some unintuitive quirks that both beginners and experienced programmers fail to spot. The sections below walk through some of the first such quirks that hablar addresses.

hablar follows the syntax of the tidyverse and works seamlessly with dplyr and tidyselect.

Missing values that astonish you

A common issue in R is how it treats missing values (i.e. NA). Sometimes NA in your data frame means that there are missing values in the sense that you need to estimate them or replace them with other values. But often it is not a problem! Often NA means that there is no value, and there should not be one. hablar provides useful functions that handle NA intuitively. Let’s take a simple example:

#> # A tibble: 3 x 3
#>   name    graduation_date   age
#>   <chr>   <date>          <int>
#> 1 Fredrik 2016-06-15         21
#> 2 Maria   NA                 16
#> 3 Astrid  2014-06-15         23
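
The code that builds this data frame is not shown. A minimal sketch of how a tibble like the one printed above could be created, assuming the tibble package is installed and that the object is called df, as in the chunks that follow:

```r
library(tibble)

# Three people; Maria has no graduation date because she has not graduated yet
df <- tibble(
  name            = c("Fredrik", "Maria", "Astrid"),
  graduation_date = as.Date(c("2016-06-15", NA, "2014-06-15")),
  age             = c(21L, 16L, 23L)   # integers, matching the <int> column above
)

df
```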

Change `min()` to `min_()`

The graduation_date is missing for Maria. In this case it is not because the value is unknown. It is because she has not graduated yet: she is younger than Fredrik and Astrid. If we want the earliest graduation date among the three observations, a naive min() returns NA. But with min_() from hablar we get the minimum of the values that are not missing. See:

df %>% 
  mutate(min_baseR = min(graduation_date),
         min_hablar = min_(graduation_date))
#> # A tibble: 3 x 5
#>   name    graduation_date   age min_baseR  min_hablar
#>   <chr>   <date>          <int> <date>     <date>    
#> 1 Fredrik 2016-06-15         21 NA         2014-06-15
#> 2 Maria   NA                 16 NA         2014-06-15
#> 3 Astrid  2014-06-15         23 NA         2014-06-15

The hablar package provides the same functionality for

max_()
mean_()
median_()
sd_()
first_()

… and more. For documentation, type help(min_), or see vignette("s") for an in-depth description.
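
For comparison, the behaviour that astonishes can be reproduced in base R alone. The sketch below only illustrates how NA propagates through min() and how na.rm = TRUE avoids it. It is not hablar's implementation, and the variable names are made up for the illustration:

```r
# Same dates as in the example table
graduation_date <- as.Date(c("2016-06-15", NA, "2014-06-15"))

# Base R propagates the missing value by default
naive <- min(graduation_date)

# Dropping NA explicitly gives the earliest known date
earliest <- min(graduation_date, na.rm = TRUE)

print(naive)
print(earliest)
```

With the underscore functions you do not have to remember to pass na.rm = TRUE in every call.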

Change type in a snap - safely

In hablar the function convert() provides a robust, readable and dynamic way to change the type of a column.

mtcars %>% 
  convert(int(cyl, am),
          num(disp:drat))
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> # … with 28 more rows

The chunk above converts the columns cyl and am to integers, and the columns disp through drat to numeric. If a column is a factor, convert() always converts it to character before any further conversion.
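
Why the detour through character matters can be seen in base R. The sketch below uses a made-up factor and shows the pitfall that going via character avoids. It does not show hablar's internals:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

# Direct conversion returns the internal level codes, not the labels
codes <- as.numeric(f)

# Going through character first recovers the values you meant
values <- as.numeric(as.character(f))

print(codes)   # 1 2 1
print(values)  # 10 20 10
```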

Fix all your types in the same function

With convert() and tidyselect you can easily change the type of a wide range of columns.

mtcars %>% 
  convert(
    chr(last_col()),       # Last column to character
    int(1:2),              # First two columns to integer
    fct(hp, wt),           # hp and wt to factors
    dte(vs),               # vs to date (if you really want)
    num(contains("car"))   # car as in carb to numeric
  )           
#> # A tibble: 32 x 11
#>     mpg   cyl  disp hp     drat wt     qsec vs            am  gear  carb
#>   <int> <int> <dbl> <fct> <dbl> <fct> <dbl> <date>     <dbl> <dbl> <dbl>
#> 1    21     6   160 110    3.9  2.62   16.5 1970-01-01     1     4     4
#> 2    21     6   160 110    3.9  2.875  17.0 1970-01-01     1     4     4
#> 3    22     4   108 93     3.85 2.32   18.6 1970-01-02     1     4     1
#> 4    21     6   258 110    3.08 3.215  19.4 1970-01-02     0     3     1
#> # … with 28 more rows

For more information, see help(hablar) or vignette("convert").

Find the problem

When cleaning data you spend a lot of time understanding it. Sometimes you get more rows than you expected from a left_join(). Or you did not know that a certain column contained missing values (NA) or irrational values like Inf or NaN.

In hablar the find_* functions speed up your search for the problem. To find duplicated rows you simply call df %>% find_duplicates(). You can also find duplicates in specific columns, which can be useful before joins.

# Create df with duplicates
df <- mtcars %>% 
  bind_rows(mtcars %>% slice(1, 5, 9))

# Return rows with duplicates in cyl and am
df %>% 
  find_duplicates(cyl, am)
#> # A tibble: 35 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> # … with 31 more rows

There are also find functions for other cases. For example, find_na() returns the rows with missing values in the given columns.

starwars %>% 
  find_na(height)
#> # A tibble: 6 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
#> 1 Arve…     NA    NA brown      fair       brown             NA male   <NA>     
#> 2 Finn      NA    NA black      dark       dark              NA male   <NA>     
#> 3 Rey       NA    NA brown      light      hazel             NA female <NA>     
#> 4 Poe …     NA    NA brown      light      brown             NA male   <NA>     
#> # … with 2 more rows, and 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

If you would rather have a Boolean value, check_duplicates(), for example, returns TRUE if the data frame contains duplicates and FALSE otherwise.
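
The difference between finding the offending rows and merely checking for them can be sketched in base R. This is an illustration of the idea on a small made-up data frame, not how hablar implements find_duplicates() or check_duplicates():

```r
# Small data frame in which rows 2 and 3 are identical
df <- data.frame(
  id    = c(1, 2, 2, 3),
  value = c("a", "b", "b", "c")
)

# "Check": a single TRUE/FALSE answer
has_duplicates <- anyDuplicated(df) > 0

# "Find": every row that belongs to a duplicated group, including the first copy
in_dup_group   <- duplicated(df) | duplicated(df, fromLast = TRUE)
duplicate_rows <- df[in_dup_group, ]

# The same check restricted to a key column, e.g. before a join
key_has_duplicates <- anyDuplicated(df["id"]) > 0

print(has_duplicates)
print(duplicate_rows)
```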

…apply the solution

Let’s say that we have found that a problem is caused by missing values in the column height, and we want to replace all missing values with the integer 100. hablar comes with additional ways of doing if-or-else.

starwars %>% 
  find_na(height) %>% 
  mutate(height = if_na(height, 100L))
#> # A tibble: 6 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
#> 1 Arve…    100    NA brown      fair       brown             NA male   <NA>     
#> 2 Finn     100    NA black      dark       dark              NA male   <NA>     
#> 3 Rey      100    NA brown      light      hazel             NA female <NA>     
#> 4 Poe …    100    NA brown      light      brown             NA male   <NA>     
#> # … with 2 more rows, and 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

In the chunk above we replaced all missing heights with the integer 100. hablar also contains the self-explanatory:

if_zero() and zero_if()
if_inf() and inf_if()
if_nan() and nan_if()

which work in the same way as the example above.
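
For readers who want to see what such a replacement amounts to, here is a rough base R equivalent of the if_na() call on a made-up vector. It is a sketch of the idea only, not hablar's implementation:

```r
# Heights with two missing values, stored as integers
height <- c(172L, NA, 96L, NA)

# Replace every NA with the integer 100; the vector stays an integer vector
filled <- replace(height, is.na(height), 100L)

print(filled)
print(typeof(filled))
```

Using 100L rather than 100 keeps the column an integer, which mirrors the 100L in the chunk above.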

Introducing a third way to if or else

The generic function if_else_() provides the same strictness as if_else() in dplyr but adds some flexibility. In dplyr you need to specify which type NA should have. In if_else_() you can write:

starwars %>% 
  mutate(skin_color = if_else_(hair_color == "brown", NA, hair_color))
#> # A tibble: 87 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
#> 1 Luke…    172    77 blond      blond      blue            19   male   Tatooine 
#> 2 C-3PO    167    75 <NA>       <NA>       yellow         112   <NA>   Tatooine 
#> 3 R2-D2     96    32 <NA>       <NA>       red             33   <NA>   Naboo    
#> 4 Dart…    202   136 none       none       yellow          41.9 male   Tatooine 
#> # … with 83 more rows, and 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

In if_else() from dplyr you would have had to specify NA_character_.
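
As a point of comparison, this is roughly what the typed-NA version looks like with dplyr's own if_else(), here on a made-up character vector rather than on starwars. How strict if_else() is about a plain NA may depend on the dplyr version installed, so treat that part as an assumption. The typed NA_character_ form is the one the text above describes:

```r
library(dplyr)

# Made-up hair colours, including a missing one
hair_color <- c("brown", "blond", NA, "brown")

# The missing value is given explicitly as a character NA
result <- if_else(hair_color == "brown", NA_character_, hair_color)

print(result)
```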