Univariate analysis of continuous and categorical variables

The first step in data exploration usually consists of univariate, descriptive analysis of all variables of interest. Tidycomm offers two basic functions to quickly output relevant statistics: describe() for continuous variables and tab_frequencies() for categorical variables.

For demonstration purposes, we will use sample data from the Worlds of Journalism 2012-16 study, which is included in tidycomm as WoJ.

Describe continuous variables

describe() outputs several measures of central tendency and variability for all variables named in the function call:

library(tidycomm)

WoJ %>%
  describe(autonomy_selection, autonomy_emphasis, work_experience)
#> # A tibble: 3 x 13
#>   Variable     N Missing     M     SD   Min   Q25   Mdn   Q75   Max Range
#>   <chr>    <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonom~  1200       3  3.88  0.803     1     4     4     4     5     4
#> 2 autonom~  1200       5  4.08  0.793     1     4     4     5     5     4
#> 3 work_ex~  1200      13 17.8  10.9       1     8    17    25    53    52
#> # ... with 2 more variables: Skewness <dbl>, Kurtosis <dbl>
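
To make the column names concrete, here is a minimal base R sketch of what the first ten columns measure. It is not tidycomm's implementation. The vector x is hypothetical toy data, N is assumed to count all cases including missing ones (as the output above suggests), and the quartiles use R's default quantile() method, which may differ from what describe() uses internally. Skewness and Kurtosis are left out.

```r
# Hypothetical toy ratings on a 1-5 scale with one missing value
x <- c(1, 4, 4, 5, NA, 3, 4, 5)

# unname() drops the "25%" / "75%" labels that quantile() attaches
stats <- data.frame(
  N       = length(x),                 # all cases, missing included (assumption)
  Missing = sum(is.na(x)),
  M       = mean(x, na.rm = TRUE),
  SD      = sd(x, na.rm = TRUE),
  Min     = min(x, na.rm = TRUE),
  Q25     = unname(quantile(x, 0.25, na.rm = TRUE)),
  Mdn     = median(x, na.rm = TRUE),
  Q75     = unname(quantile(x, 0.75, na.rm = TRUE)),
  Max     = max(x, na.rm = TRUE)
)
stats$Range <- stats$Max - stats$Min
print(stats)
```

All statistics except N and Missing are computed on the non-missing values only.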

If no variables are passed to describe(), all numeric variables in the data are described:

WoJ %>% 
  describe()
#> # A tibble: 11 x 13
#>    Variable     N Missing     M     SD   Min   Q25   Mdn   Q75   Max Range
#>    <chr>    <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 autonom~  1200       3  3.88  0.803     1  4        4     4     5     4
#>  2 autonom~  1200       5  4.08  0.793     1  4        4     5     5     4
#>  3 ethics_1  1200       0  1.63  0.892     1  1        1     2     5     4
#>  4 ethics_2  1200       0  3.21  1.26      1  2        4     4     5     4
#>  5 ethics_3  1200       0  2.39  1.13      1  2        2     3     5     4
#>  6 ethics_4  1200       0  2.58  1.25      1  1.75     2     4     5     4
#>  7 work_ex~  1200      13 17.8  10.9       1  8       17    25    53    52
#>  8 trust_p~  1200       0  3.05  0.811     1  3        3     4     5     4
#>  9 trust_g~  1200       0  2.82  0.854     1  2        3     3     5     4
#> 10 trust_p~  1200       0  2.42  0.736     1  2        2     3     4     3
#> 11 trust_p~  1200       0  2.52  0.712     1  2        3     3     4     3
#> # ... with 2 more variables: Skewness <dbl>, Kurtosis <dbl>
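
Which columns count as numeric can be checked in base R. The sketch below uses a hypothetical toy data frame, and is.numeric() is an assumption about the selection rule rather than a statement about how describe() picks its columns.

```r
# Hypothetical toy data: two numeric columns, one factor, one character
df <- data.frame(
  autonomy = c(4, 5, 3),
  years    = c(10L, 2L, 25L),
  country  = factor(c("UK", "Austria", "UK")),
  title    = c("Editor", "Reporter", "Editor"),
  stringsAsFactors = FALSE
)

# is.numeric() is TRUE for double and integer columns, FALSE for factors
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
print(numeric_cols)
```

This fits the output above, where factor and character columns such as country and employment do not appear.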

Data can be grouped before describing, which returns one row per group and variable:

WoJ %>% 
  dplyr::group_by(country) %>% 
  describe(autonomy_emphasis, autonomy_selection)
#> # A tibble: 10 x 14
#> # Groups:   country [5]
#>    country Variable     N Missing     M    SD   Min   Q25   Mdn   Q75   Max
#>    <fct>   <chr>    <int>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Austria autonom~   207       2  4.19 0.614     2     4     4     5     5
#>  2 Denmark autonom~   376       1  3.90 0.856     1     4     4     4     5
#>  3 Germany autonom~   173       1  4.34 0.818     1     4     5     5     5
#>  4 Switze~ autonom~   233       0  4.07 0.694     1     4     4     4     5
#>  5 UK      autonom~   211       1  4.08 0.838     2     4     4     5     5
#>  6 Austria autonom~   207       0  3.92 0.637     2     4     4     4     5
#>  7 Denmark autonom~   376       0  3.76 0.892     1     3     4     4     5
#>  8 Germany autonom~   173       1  3.97 0.881     1     3     4     5     5
#>  9 Switze~ autonom~   233       0  3.92 0.628     1     4     4     4     5
#> 10 UK      autonom~   211       2  3.91 0.867     1     3     4     5     5
#> # ... with 3 more variables: Range <dbl>, Skewness <dbl>, Kurtosis <dbl>
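
The idea behind grouped description can be sketched in base R: split the variable by group, then summarise each piece separately. The data frame below is hypothetical toy data, and only three of the statistics are reproduced.

```r
# Hypothetical toy data: ratings from two countries, one value missing
df <- data.frame(
  country  = c("Austria", "Austria", "Austria", "UK", "UK", "UK"),
  autonomy = c(4, 5, NA, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Split the ratings by country, then summarise each piece on its own
by_country <- split(df$autonomy, df$country)
grouped <- data.frame(
  country = names(by_country),
  N       = vapply(by_country, length, integer(1)),
  Missing = vapply(by_country, function(v) sum(is.na(v)), integer(1)),
  M       = vapply(by_country, mean, numeric(1), na.rm = TRUE),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(grouped)
```

Each group gets its own N and Missing, which is why the N column in the grouped output above varies by country.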

Tabulate frequencies of categorical variables

tab_frequencies() outputs absolute and relative frequencies of all unique values of one or more categorical variables:

WoJ %>% 
  tab_frequencies(employment)
#> # A tibble: 3 x 5
#>   employment     n percent cum_n cum_percent
#>   <chr>      <int>   <dbl> <int>       <dbl>
#> 1 Freelancer   172   0.143   172       0.143
#> 2 Full-time    902   0.752  1074       0.895
#> 3 Part-time    126   0.105  1200       1
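
The columns of this table can be reproduced by hand. The sketch below rebuilds the employment variable from the counts printed above and derives the other columns in base R. It illustrates the arithmetic and does not claim to mirror tidycomm's internals.

```r
# Rebuild the employment column from the counts printed above
employment <- rep(c("Freelancer", "Full-time", "Part-time"),
                  times = c(172, 902, 126))

counts <- table(employment)  # absolute frequencies, sorted by value
freq <- data.frame(
  employment = names(counts),
  n          = as.integer(counts),
  percent    = as.numeric(counts) / sum(counts),  # proportion between 0 and 1
  stringsAsFactors = FALSE
)
freq$cum_n       <- cumsum(freq$n)        # running total of n
freq$cum_percent <- cumsum(freq$percent)  # running total of percent
print(freq)
```

Note that percent is a proportion between 0 and 1, not a percentage between 0 and 100.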

Passing more than one variable will compute relative frequencies based on all combinations of unique values:

WoJ %>% 
  tab_frequencies(employment, country)
#> # A tibble: 15 x 6
#>    employment country         n percent cum_n cum_percent
#>    <chr>      <fct>       <int>   <dbl> <int>       <dbl>
#>  1 Freelancer Austria        16 0.0133     16      0.0133
#>  2 Freelancer Denmark        85 0.0708    101      0.0842
#>  3 Freelancer Germany        29 0.0242    130      0.108 
#>  4 Freelancer Switzerland    10 0.00833   140      0.117 
#>  5 Freelancer UK             32 0.0267    172      0.143 
#>  6 Full-time  Austria       165 0.138     337      0.281 
#>  7 Full-time  Denmark       275 0.229     612      0.51  
#>  8 Full-time  Germany       139 0.116     751      0.626 
#>  9 Full-time  Switzerland   154 0.128     905      0.754 
#> 10 Full-time  UK            169 0.141    1074      0.895 
#> 11 Part-time  Austria        26 0.0217   1100      0.917 
#> 12 Part-time  Denmark        16 0.0133   1116      0.93  
#> 13 Part-time  Germany         5 0.00417  1121      0.934 
#> 14 Part-time  Switzerland    69 0.0575   1190      0.992 
#> 15 Part-time  UK             10 0.00833  1200      1

You can also group your data before tabulating, which yields within-group relative frequencies:

WoJ %>% 
  dplyr::group_by(country) %>% 
  tab_frequencies(employment)
#> # A tibble: 15 x 6
#> # Groups:   country [5]
#>    employment country         n percent cum_n cum_percent
#>    <chr>      <fct>       <int>   <dbl> <int>       <dbl>
#>  1 Freelancer Austria        16  0.0773    16      0.0773
#>  2 Full-time  Austria       165  0.797    181      0.874 
#>  3 Part-time  Austria        26  0.126    207      1     
#>  4 Freelancer Denmark        85  0.226     85      0.226 
#>  5 Full-time  Denmark       275  0.731    360      0.957 
#>  6 Part-time  Denmark        16  0.0426   376      1     
#>  7 Freelancer Germany        29  0.168     29      0.168 
#>  8 Full-time  Germany       139  0.803    168      0.971 
#>  9 Part-time  Germany         5  0.0289   173      1     
#> 10 Freelancer Switzerland    10  0.0429    10      0.0429
#> 11 Full-time  Switzerland   154  0.661    164      0.704 
#> 12 Part-time  Switzerland    69  0.296    233      1     
#> 13 Freelancer UK             32  0.152     32      0.152 
#> 14 Full-time  UK            169  0.801    201      0.953 
#> 15 Part-time  UK             10  0.0474   211      1

(Compare the columns percent, cum_n and cum_percent with the output above.)
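
The difference between the two tables comes down to the denominator. The base R sketch below uses the counts from the tables above and computes both versions of percent side by side. The column names percent_overall and percent_within are made up for this illustration.

```r
# Counts copied from the tables above (employment within country)
counts <- data.frame(
  country    = rep(c("Austria", "Denmark", "Germany", "Switzerland", "UK"),
                   each = 3),
  employment = rep(c("Freelancer", "Full-time", "Part-time"), times = 5),
  n          = c(16, 165, 26, 85, 275, 16, 29, 139, 5, 10, 154, 69, 32, 169, 10),
  stringsAsFactors = FALSE
)

# Ungrouped: each count is divided by the grand total
counts$percent_overall <- counts$n / sum(counts$n)

# Grouped: each count is divided by its own country's total;
# ave() applies sum() within each country and repeats the result per row
counts$percent_within <- counts$n / ave(counts$n, counts$country, FUN = sum)

print(counts)
```

The overall proportions sum to 1 across all 15 rows, whereas the within-group proportions sum to 1 inside each country.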

2019-09-22