Univariate analysis of continuous and categorical variables

2019-09-22

The first step in data exploration usually consists of univariate, descriptive analysis of all variables of interest. Tidycomm offers two basic functions to quickly output relevant statistics: describe() for continuous variables and tab_frequencies() for categorical variables.

For demonstration purposes, we will use WoJ, the sample data from the Worlds of Journalism 2012-16 study included in tidycomm.

WoJ
#> # A tibble: 1,200 x 15
#>    country reach employment temp_contract autonomy_select~ autonomy_emphas~
#>    <fct>   <fct> <chr>      <fct>                    <dbl>            <dbl>
#>  1 Germany Nati~ Full-time  Permanent                    5                4
#>  2 Germany Nati~ Full-time  Permanent                    3                4
#>  3 Switze~ Regi~ Full-time  Permanent                    4                4
#>  4 Switze~ Local Part-time  Permanent                    4                5
#>  5 Austria Nati~ Part-time  Permanent                    4                4
#>  6 Switze~ Local Freelancer <NA>                         4                4
#>  7 Germany Local Full-time  Permanent                    4                4
#>  8 Denmark Nati~ Full-time  Permanent                    3                3
#>  9 Switze~ Local Full-time  Permanent                    5                5
#> 10 Denmark Nati~ Full-time  Permanent                    2                4
#> # ... with 1,190 more rows, and 9 more variables: ethics_1 <dbl>,
#> #   ethics_2 <dbl>, ethics_3 <dbl>, ethics_4 <dbl>, work_experience <dbl>,
#> #   trust_parliament <dbl>, trust_government <dbl>, trust_parties <dbl>,
#> #   trust_politicians <dbl>

Describe continuous variables

describe() outputs several measures of central tendency and variability for all variables named in the function call:

WoJ %>% 
  describe(autonomy_selection, autonomy_emphasis, work_experience)
#> # A tibble: 3 x 13
#>   Variable     N Missing     M     SD   Min   Q25   Mdn   Q75   Max Range
#>   <chr>    <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonom~  1200       3  3.88  0.803     1     4     4     4     5     4
#> 2 autonom~  1200       5  4.08  0.793     1     4     4     5     5     4
#> 3 work_ex~  1200      13 17.8  10.9       1     8    17    25    53    52
#> # ... with 2 more variables: Skewness <dbl>, Kurtosis <dbl>
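
To make the column names more concrete, here is a minimal base R sketch of how most of these statistics could be computed for a single variable. The vector x is hypothetical toy data, not taken from WoJ, and the code is not tidycomm's implementation. In particular, counting N over all values (including missing ones) is an assumption based on the output above, where N equals the number of rows. Skewness and kurtosis are left out.

```r
# Hypothetical ratings on a 1-5 scale with one missing value (not WoJ data)
x <- c(5, 3, 4, 4, NA, 4, 3, 5, 2)

stats <- c(
  N       = length(x),                                # all values (assumption, see above)
  Missing = sum(is.na(x)),                            # number of NA values
  M       = mean(x, na.rm = TRUE),                    # arithmetic mean
  SD      = sd(x, na.rm = TRUE),                      # sample standard deviation
  Min     = min(x, na.rm = TRUE),
  Q25     = unname(quantile(x, 0.25, na.rm = TRUE)),  # first quartile
  Mdn     = median(x, na.rm = TRUE),
  Q75     = unname(quantile(x, 0.75, na.rm = TRUE)),  # third quartile
  Max     = max(x, na.rm = TRUE),
  Range   = diff(range(x, na.rm = TRUE))              # Max minus Min
)
stats
```

All statistics other than N and Missing are computed on the non-missing values only.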

If no variables are passed to describe(), all numeric variables in the data are described:

WoJ %>% 
  describe()
#> # A tibble: 11 x 13
#>    Variable     N Missing     M     SD   Min   Q25   Mdn   Q75   Max Range
#>    <chr>    <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 autonom~  1200       3  3.88  0.803     1  4        4     4     5     4
#>  2 autonom~  1200       5  4.08  0.793     1  4        4     5     5     4
#>  3 ethics_1  1200       0  1.63  0.892     1  1        1     2     5     4
#>  4 ethics_2  1200       0  3.21  1.26      1  2        4     4     5     4
#>  5 ethics_3  1200       0  2.39  1.13      1  2        2     3     5     4
#>  6 ethics_4  1200       0  2.58  1.25      1  1.75     2     4     5     4
#>  7 work_ex~  1200      13 17.8  10.9       1  8       17    25    53    52
#>  8 trust_p~  1200       0  3.05  0.811     1  3        3     4     5     4
#>  9 trust_g~  1200       0  2.82  0.854     1  2        3     3     5     4
#> 10 trust_p~  1200       0  2.42  0.736     1  2        2     3     4     3
#> 11 trust_p~  1200       0  2.52  0.712     1  2        3     3     4     3
#> # ... with 2 more variables: Skewness <dbl>, Kurtosis <dbl>

Data can be grouped before describing:

WoJ %>% 
  dplyr::group_by(country) %>% 
  describe(autonomy_emphasis, autonomy_selection)
#> # A tibble: 10 x 14
#> # Groups:   country [5]
#>    country Variable     N Missing     M    SD   Min   Q25   Mdn   Q75   Max
#>    <fct>   <chr>    <int>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Austria autonom~   207       2  4.19 0.614     2     4     4     5     5
#>  2 Denmark autonom~   376       1  3.90 0.856     1     4     4     4     5
#>  3 Germany autonom~   173       1  4.34 0.818     1     4     5     5     5
#>  4 Switze~ autonom~   233       0  4.07 0.694     1     4     4     4     5
#>  5 UK      autonom~   211       1  4.08 0.838     2     4     4     5     5
#>  6 Austria autonom~   207       0  3.92 0.637     2     4     4     4     5
#>  7 Denmark autonom~   376       0  3.76 0.892     1     3     4     4     5
#>  8 Germany autonom~   173       1  3.97 0.881     1     3     4     5     5
#>  9 Switze~ autonom~   233       0  3.92 0.628     1     4     4     4     5
#> 10 UK      autonom~   211       2  3.91 0.867     1     3     4     5     5
#> # ... with 3 more variables: Range <dbl>, Skewness <dbl>, Kurtosis <dbl>
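
Conceptually, grouping means the data are split by the grouping variable and each piece is summarised separately. The following base R sketch illustrates the idea under stated assumptions: the data frame toy and its columns are hypothetical, and the code is not how tidycomm or dplyr work internally.

```r
# Hypothetical toy data: one rating per respondent plus a grouping variable
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  rating  = c(4, 5, NA, 2, 3, 4)
)

# Split the ratings by country, then summarise each piece on its own
by_country <- sapply(split(toy$rating, toy$country), function(x) {
  c(
    N       = length(x),              # rows in the group, including NA
    Missing = sum(is.na(x)),
    M       = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE)
  )
})

# One column per group, one row per statistic
by_country
```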

Tabulate frequencies of categorical variables

tab_frequencies() outputs absolute and relative frequencies of all unique values of one or more categorical variables:

WoJ %>% 
  tab_frequencies(employment)
#> # A tibble: 3 x 5
#>   employment     n percent cum_n cum_percent
#>   <chr>      <int>   <dbl> <int>       <dbl>
#> 1 Freelancer   172   0.143   172       0.143
#> 2 Full-time    902   0.752  1074       0.895
#> 3 Part-time    126   0.105  1200       1
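
The relationship between the columns can be checked by hand. The sketch below rebuilds the table from the three counts shown above using base R only. It is an illustration of the arithmetic, not tidycomm's implementation, and the object names n and freq are chosen here for the example.

```r
# Counts copied from the tab_frequencies(employment) output above
n <- c("Freelancer" = 172, "Full-time" = 902, "Part-time" = 126)

freq <- data.frame(
  employment  = names(n),
  n           = unname(n),
  percent     = unname(n / sum(n)),          # share of all 1,200 cases
  cum_n       = unname(cumsum(n)),           # running total of counts
  cum_percent = unname(cumsum(n) / sum(n))   # running total of shares
)
freq
```

Note that percent and cum_percent are proportions between 0 and 1, not percentages between 0 and 100.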

Passing more than one variable will compute relative frequencies based on all combinations of unique values:

WoJ %>% 
  tab_frequencies(employment, country)
#> # A tibble: 15 x 6
#>    employment country         n percent cum_n cum_percent
#>    <chr>      <fct>       <int>   <dbl> <int>       <dbl>
#>  1 Freelancer Austria        16 0.0133     16      0.0133
#>  2 Freelancer Denmark        85 0.0708    101      0.0842
#>  3 Freelancer Germany        29 0.0242    130      0.108 
#>  4 Freelancer Switzerland    10 0.00833   140      0.117 
#>  5 Freelancer UK             32 0.0267    172      0.143 
#>  6 Full-time  Austria       165 0.138     337      0.281 
#>  7 Full-time  Denmark       275 0.229     612      0.51  
#>  8 Full-time  Germany       139 0.116     751      0.626 
#>  9 Full-time  Switzerland   154 0.128     905      0.754 
#> 10 Full-time  UK            169 0.141    1074      0.895 
#> 11 Part-time  Austria        26 0.0217   1100      0.917 
#> 12 Part-time  Denmark        16 0.0133   1116      0.93  
#> 13 Part-time  Germany         5 0.00417  1121      0.934 
#> 14 Part-time  Switzerland    69 0.0575   1190      0.992 
#> 15 Part-time  UK             10 0.00833  1200      1

You can also group your data before tabulating. This yields within-group relative frequencies, so percent, cum_n and cum_percent are computed separately for each group:

WoJ %>% 
  dplyr::group_by(country) %>% 
  tab_frequencies(employment)
#> # A tibble: 15 x 6
#> # Groups:   country [5]
#>    employment country         n percent cum_n cum_percent
#>    <chr>      <fct>       <int>   <dbl> <int>       <dbl>
#>  1 Freelancer Austria        16  0.0773    16      0.0773
#>  2 Full-time  Austria       165  0.797    181      0.874 
#>  3 Part-time  Austria        26  0.126    207      1     
#>  4 Freelancer Denmark        85  0.226     85      0.226 
#>  5 Full-time  Denmark       275  0.731    360      0.957 
#>  6 Part-time  Denmark        16  0.0426   376      1     
#>  7 Freelancer Germany        29  0.168     29      0.168 
#>  8 Full-time  Germany       139  0.803    168      0.971 
#>  9 Part-time  Germany         5  0.0289   173      1     
#> 10 Freelancer Switzerland    10  0.0429    10      0.0429
#> 11 Full-time  Switzerland   154  0.661    164      0.704 
#> 12 Part-time  Switzerland    69  0.296    233      1     
#> 13 Freelancer UK             32  0.152     32      0.152 
#> 14 Full-time  UK            169  0.801    201      0.953 
#> 15 Part-time  UK             10  0.0474   211      1

(Compare the columns percent, cum_n and cum_percent with the output above.)
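
The difference between the two outputs comes down to the denominator. As a rough illustration in base R (not tidycomm's implementation), the counts from the tables above can be arranged in a matrix and divided either by the grand total or by each country's total. The object names counts, overall and within are chosen here for the example.

```r
# Counts copied from the tab_frequencies() outputs above
counts <- matrix(
  c( 16,  85,  29,  10,  32,
    165, 275, 139, 154, 169,
     26,  16,   5,  69,  10),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    employment = c("Freelancer", "Full-time", "Part-time"),
    country    = c("Austria", "Denmark", "Germany", "Switzerland", "UK")
  )
)

# Ungrouped: every cell is divided by the grand total (1,200)
overall <- prop.table(counts)

# Grouped by country: every cell is divided by its country's total
within <- prop.table(counts, margin = 2)

round(overall["Freelancer", "Austria"], 4)  # 16 / 1200
round(within["Freelancer", "Austria"], 4)   # 16 / 207
```

In the grouped version each country's shares sum to 1, which is why cum_percent reaches 1 at the last row of every group.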