The mission of hablar is for you to get non-astonishing results! That means that functions return what you expect. R has some unintuitive quirks that both beginners and experienced programmers fail to identify. Some of the first weird features of R that hablar solves:
Missing values (NA) and irrational values (Inf, NaN) are dominant: a single one is enough to take over the whole result. For example, in R sum(c(1, 2, NA)) is NA and not 3. In hablar the addition of an underscore, sum_(c(1, 2, NA)), returns 3, as is often expected.
Factors (categorical variables) that are converted to numeric return the number of the category rather than the value. In hablar the convert() function always changes the type of the values.
Finding duplicates and rows with NA can be cumbersome. The functions find_duplicates() and find_na() make it easy to find where the data frame needs to be fixed. Once the issues are found, the utility replacement functions, e.g. if_else_(), if_na() and zero_if(), easily fix many of the most common problems you face.
hablar follows the syntax API of tidyverse and works seamlessly with dplyr and tidyselect.
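The two quirks described above can be reproduced in base R alone. The sketch below is only an illustration of the behaviour hablar is meant to smooth over; it uses no hablar functions, and the variable names are made up for this example.

```r
# A single NA takes over the whole sum unless it is removed explicitly.
x <- c(1, 2, NA)
sum_base <- sum(x)                 # NA
sum_narm <- sum(x, na.rm = TRUE)   # 3

# Converting a factor straight to numeric yields the category codes,
# not the values printed on screen.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # 1 2 1
values <- as.numeric(as.character(f))   # 10 20 10
```

Going through character first, as in the last line, is the usual base R workaround for the factor quirk.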
A common issue in R is how R treats missing values (i.e. NA). Sometimes NA in your data frame means that there are missing values in the sense that you need to estimate or replace them. But often it is not a problem! Often NA means that there is no value, and that there should not be one. hablar provides useful functions that handle NA intuitively. Let’s take a simple example:
#> # A tibble: 3 x 3
#> name graduation_date age
#> <chr> <date> <int>
#> 1 Fredrik 2016-06-15 21
#> 2 Maria NA 16
#> 3 Astrid 2014-06-15 23
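The code that builds df is not shown here. A minimal base R sketch that reproduces the same three rows might look like the following; it creates a plain data.frame rather than a tibble, which is an assumption made to keep the example free of extra dependencies.

```r
# Rebuild the example data shown above (values copied from the printed table).
df <- data.frame(
  name = c("Fredrik", "Maria", "Astrid"),
  graduation_date = as.Date(c("2016-06-15", NA, "2014-06-15")),
  age = c(21L, 16L, 23L),
  stringsAsFactors = FALSE
)
```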
min() to min_()
The graduation_date is missing for Maria. In this case it is not because we do not know it. It is because she has not graduated yet; she is younger than Fredrik and Astrid. If we would like to know the first graduation date of the three observations, a naive min() in R gives us NA. But with min_() from hablar we get the minimum value that is not missing. See:
df %>%
  mutate(min_baseR = min(graduation_date),
         min_hablar = min_(graduation_date))
#> # A tibble: 3 x 5
#> name graduation_date age min_baseR min_hablar
#> <chr> <date> <int> <date> <date>
#> 1 Fredrik 2016-06-15 21 NA 2014-06-15
#> 2 Maria NA 16 NA 2014-06-15
#> 3 Astrid 2014-06-15 23 NA 2014-06-15
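For comparison, the same result can be reached in base R by asking min() to drop missing values. This is a sketch of the idea only and is not claimed to be how min_() is implemented.

```r
graduation_date <- as.Date(c("2016-06-15", NA, "2014-06-15"))

# The naive call propagates the missing value.
naive_min <- min(graduation_date)                        # NA

# Dropping NA first gives the earliest known graduation date.
first_graduation <- min(graduation_date, na.rm = TRUE)   # 2014-06-15
```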
The hablar package provides the same functionality for
max_()
mean_()
median_()
sd_()
first_()
… and more. For more documentation type help(min_) or vignette("s") for an in-depth description.
In hablar the function convert provides a robust, readable and dynamic way to change the type of a column.
mtcars %>%
  convert(int(cyl, am),
          num(disp:drat))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> # … with 28 more rows
The above chunk converts the columns cyl and am to integers, and the columns disp through drat to numeric. If a column is of type factor, convert always turns it into character before any further conversion.
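To make clear what the chunk above does, here is a rough base R equivalent of the same two conversions. It is a sketch under the assumption that "disp through drat" means every column positioned from disp to drat; the helper variables are hypothetical and not part of hablar.

```r
converted <- mtcars

# cyl and am to integer
int_cols <- c("cyl", "am")
converted[int_cols] <- lapply(converted[int_cols], as.integer)

# every column from disp through drat to numeric (double)
first <- match("disp", names(mtcars))
last  <- match("drat", names(mtcars))
range_cols <- names(mtcars)[first:last]
converted[range_cols] <- lapply(converted[range_cols], as.numeric)
```

The convert() call expresses the same intent in one statement and keeps the column selection readable.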
With convert and tidyselect you can easily change the type of a wide range of columns.
mtcars %>%
  convert(
    chr(last_col()),      # Last column to character
    int(1:2),             # First two columns to integer
    fct(hp, wt),          # hp and wt to factors
    dte(vs),              # vs to date (if you really want)
    num(contains("car"))  # car as in carb to numeric
  )
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <int> <int> <dbl> <fct> <dbl> <fct> <dbl> <date> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 1970-01-01 1 4 4
#> 2 21 6 160 110 3.9 2.875 17.0 1970-01-01 1 4 4
#> 3 22 4 108 93 3.85 2.32 18.6 1970-01-02 1 4 1
#> 4 21 6 258 110 3.08 3.215 19.4 1970-01-02 0 3 1
#> # … with 28 more rows
For more information, see help(hablar) or vignette("convert").
When cleaning data you spend a lot of time understanding your data. Sometimes you get more rows than you expected when doing a left_join(). Or you did not know that a certain column contained missing values (NA) or irrational values like Inf or NaN.
In hablar the find_* functions speed up your search for the problem. To find duplicated rows you simply run df %>% find_duplicates(). You can also find duplicates in specific columns, which can be useful before joins.
# Create df with duplicates
df <- mtcars %>%
  bind_rows(mtcars %>% slice(1, 5, 9))

# Return rows with duplicates in cyl and am
df %>%
  find_duplicates(cyl, am)
#> # A tibble: 35 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> # … with 31 more rows
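The idea behind this search can be sketched in base R: flag every row whose combination of cyl and am occurs more than once, looking from both ends so that the first occurrence is kept as well. This is an illustration of the concept and not hablar's implementation.

```r
# Same data as above: mtcars plus three repeated rows.
df <- rbind(mtcars, mtcars[c(1, 5, 9), ])

# A row is a duplicate if its (cyl, am) pair appears anywhere else.
key <- df[c("cyl", "am")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dups <- df[is_dup, ]
```

Every combination of cyl and am appears more than once in mtcars, which is why all 35 rows come back.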
There are also find functions for other cases. For example, find_na() returns rows with missing values.
starwars %>%
  find_na(height)
#> # A tibble: 6 x 13
#> name height mass hair_color skin_color eye_color birth_year gender homeworld
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Arve… NA NA brown fair brown NA male <NA>
#> 2 Finn NA NA black dark dark NA male <NA>
#> 3 Rey NA NA brown light hazel NA female <NA>
#> 4 Poe … NA NA brown light brown NA male <NA>
#> # … with 2 more rows, and 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>
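In base R the same filter amounts to subsetting on is.na(). The sketch below uses a tiny hand-made data frame (the name people is hypothetical, with a few values taken from the starwars output in this document) so that it runs without dplyr.

```r
people <- data.frame(
  name = c("Luke", "Finn", "Rey"),
  height = c(172L, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows where height is missing.
missing_height <- people[is.na(people$height), ]
```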
If you would rather have a Boolean value, then e.g. check_duplicates() returns TRUE if the data frame contains duplicates, and FALSE otherwise.
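Base R offers comparable yes/no checks, which may help to picture what such a Boolean answer looks like. The data frame orders below is a hypothetical example, and the base functions are stand-ins and not the hablar API.

```r
orders <- data.frame(
  id = c(1L, 2L, 2L),
  item = c("a", "b", "b"),
  stringsAsFactors = FALSE
)

# TRUE if at least one row is an exact repeat of an earlier row.
has_duplicates <- anyDuplicated(orders) > 0

# TRUE if any cell in the data frame is NA.
has_missing <- anyNA(orders)
```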
Let’s say that we have found that a problem is caused by missing values in the column height and we want to replace all missing values with the integer 100. hablar comes with additional ways of doing if-or-else.
starwars %>%
  find_na(height) %>%
  mutate(height = if_na(height, 100L))
#> # A tibble: 6 x 13
#> name height mass hair_color skin_color eye_color birth_year gender homeworld
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Arve… 100 NA brown fair brown NA male <NA>
#> 2 Finn 100 NA black dark dark NA male <NA>
#> 3 Rey 100 NA brown light hazel NA female <NA>
#> 4 Poe … 100 NA brown light brown NA male <NA>
#> # … with 2 more rows, and 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>
In the chunk above we successfully replaced all missing heights with the integer 100. hablar also contains the self-explanatory:
if_zero() and zero_if()
if_inf() and inf_if()
if_nan() and nan_if()
which work in the same way as the examples above.
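The replacement idea behind the if_*() direction can be sketched in base R with replace() and the matching predicate. This only illustrates the pattern; whether hablar treats NaN as NA in if_na() is not stated here, so the sketch simply keeps the two apart as an assumption.

```r
x <- c(1, NA, Inf, NaN, 0)

# Replace each kind of special value with 100, one kind at a time.
na_filled   <- replace(x, is.na(x) & !is.nan(x), 100)  # only the true NA
inf_filled  <- replace(x, is.infinite(x), 100)
nan_filled  <- replace(x, is.nan(x), 100)
zero_filled <- replace(x, !is.na(x) & x == 0, 100)     # guard against NA in the comparison
```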
The generic function if_else_() provides the same rigidity as if_else() in dplyr but adds some flexibility. In dplyr you need to specify which type NA should have. In if_else_() you can write:
starwars %>%
  mutate(skin_color = if_else_(hair_color == "brown", NA, hair_color))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender homeworld
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke… 172 77 blond blond blue 19 male Tatooine
#> 2 C-3PO 167 75 <NA> <NA> yellow 112 <NA> Tatooine
#> 3 R2-D2 96 32 <NA> <NA> red 33 <NA> Naboo
#> 4 Dart… 202 136 none none yellow 41.9 male Tatooine
#> # … with 83 more rows, and 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>
In if_else() from dplyr you would have had to specify NA_character_.
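The reason a type has to be named at all is that a bare NA is a logical value in R, while each vector type has its own typed missing constant. A small base R sketch, using base ifelse() as a stand-in and a made-up hair_color vector:

```r
class(NA)             # "logical"
class(NA_character_)  # "character"

hair_color <- c("blond", "brown", NA)

# With an explicitly typed NA the result is guaranteed to stay character.
masked <- ifelse(hair_color == "brown", NA_character_, hair_color)
```

A type-strict if-else refuses to mix a logical NA with a character column, which is the friction that if_else_() is described as removing.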