The first step in data exploration usually consists of univariate, descriptive analysis of all variables of interest. Tidycomm offers two basic functions to quickly output relevant statistics:
describe() for continuous variables
tab_frequencies() for categorical variables
For demonstration purposes, we will use sample data from the Worlds of Journalism 2012-16 study included in tidycomm.
WoJ
#> # A tibble: 1,200 x 15
#> country reach employment temp_contract autonomy_select~ autonomy_emphas~
#> <fct> <fct> <chr> <fct> <dbl> <dbl>
#> 1 Germany Nati~ Full-time Permanent 5 4
#> 2 Germany Nati~ Full-time Permanent 3 4
#> 3 Switze~ Regi~ Full-time Permanent 4 4
#> 4 Switze~ Local Part-time Permanent 4 5
#> 5 Austria Nati~ Part-time Permanent 4 4
#> 6 Switze~ Local Freelancer <NA> 4 4
#> 7 Germany Local Full-time Permanent 4 4
#> 8 Denmark Nati~ Full-time Permanent 3 3
#> 9 Switze~ Local Full-time Permanent 5 5
#> 10 Denmark Nati~ Full-time Permanent 2 4
#> # ... with 1,190 more rows, and 9 more variables: ethics_1 <dbl>,
#> # ethics_2 <dbl>, ethics_3 <dbl>, ethics_4 <dbl>, work_experience <dbl>,
#> # trust_parliament <dbl>, trust_government <dbl>, trust_parties <dbl>,
#> # trust_politicians <dbl>
describe() outputs several measures of central tendency and variability for all variables named in the function call:
WoJ %>%
  describe(autonomy_selection, autonomy_emphasis, work_experience)
#> # A tibble: 3 x 13
#> Variable N Missing M SD Min Q25 Mdn Q75 Max Range
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonom~ 1200 3 3.88 0.803 1 4 4 4 5 4
#> 2 autonom~ 1200 5 4.08 0.793 1 4 4 5 5 4
#> 3 work_ex~ 1200 13 17.8 10.9 1 8 17 25 53 52
#> # ... with 2 more variables: Skewness <dbl>, Kurtosis <dbl>
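To make the columns above easier to read, here is a minimal base R sketch of the same kind of summary for a single vector. It is not tidycomm's implementation: the helper name describe_sketch and the example vector are hypothetical, N is assumed to count all cases including missing ones (as the output above suggests, where N is 1200 throughout), and the quantiles use R's default method, which may differ from what describe() uses. Skewness and kurtosis are left out.

```r
# Hypothetical example vector; not taken from the WoJ data
x <- c(2, 4, 4, 5, NA, 3, 4)

# Illustrative helper (not part of tidycomm): basic descriptives for one vector
describe_sketch <- function(x) {
  valid <- x[!is.na(x)]  # statistics are computed on non-missing values only
  q <- quantile(valid, probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(
    N = length(x),            # all cases, including missing ones (assumption)
    Missing = sum(is.na(x)),  # number of NA values
    M = mean(valid),
    SD = sd(valid),
    Min = min(valid),
    Q25 = q[1],
    Mdn = q[2],
    Q75 = q[3],
    Max = max(valid),
    Range = max(valid) - min(valid)
  )
}

res <- describe_sketch(x)
res
```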
If no variables are passed to describe(), all numeric variables in the data are described:
WoJ %>%
  describe()
#> # A tibble: 11 x 13
#> Variable N Missing M SD Min Q25 Mdn Q75 Max Range
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonom~ 1200 3 3.88 0.803 1 4 4 4 5 4
#> 2 autonom~ 1200 5 4.08 0.793 1 4 4 5 5 4
#> 3 ethics_1 1200 0 1.63 0.892 1 1 1 2 5 4
#> 4 ethics_2 1200 0 3.21 1.26 1 2 4 4 5 4
#> 5 ethics_3 1200 0 2.39 1.13 1 2 2 3 5 4
#> 6 ethics_4 1200 0 2.58 1.25 1 1.75 2 4 5 4
#> 7 work_ex~ 1200 13 17.8 10.9 1 8 17 25 53 52
#> 8 trust_p~ 1200 0 3.05 0.811 1 3 3 4 5 4
#> 9 trust_g~ 1200 0 2.82 0.854 1 2 3 3 5 4
#> 10 trust_p~ 1200 0 2.42 0.736 1 2 2 3 4 3
#> 11 trust_p~ 1200 0 2.52 0.712 1 2 3 3 4 3
#> # ... with 2 more variables: Skewness <dbl>, Kurtosis <dbl>
Data can be grouped before describing:
WoJ %>%
  dplyr::group_by(country) %>%
  describe(autonomy_emphasis, autonomy_selection)
#> # A tibble: 10 x 14
#> # Groups: country [5]
#> country Variable N Missing M SD Min Q25 Mdn Q75 Max
#> <fct> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria autonom~ 207 2 4.19 0.614 2 4 4 5 5
#> 2 Denmark autonom~ 376 1 3.90 0.856 1 4 4 4 5
#> 3 Germany autonom~ 173 1 4.34 0.818 1 4 5 5 5
#> 4 Switze~ autonom~ 233 0 4.07 0.694 1 4 4 4 5
#> 5 UK autonom~ 211 1 4.08 0.838 2 4 4 5 5
#> 6 Austria autonom~ 207 0 3.92 0.637 2 4 4 4 5
#> 7 Denmark autonom~ 376 0 3.76 0.892 1 3 4 4 5
#> 8 Germany autonom~ 173 1 3.97 0.881 1 3 4 5 5
#> 9 Switze~ autonom~ 233 0 3.92 0.628 1 4 4 4 5
#> 10 UK autonom~ 211 2 3.91 0.867 1 3 4 5 5
#> # ... with 3 more variables: Range <dbl>, Skewness <dbl>, Kurtosis <dbl>
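Conceptually, grouping splits the data by the grouping variable and summarises each piece on its own. The following base R sketch illustrates that idea on made-up toy data. The data frame toy and its values are hypothetical, and this is not how tidycomm or dplyr implement grouping internally.

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  score = c(4, 5, NA, 2, 3)
)

# Split the score column by country, then summarise each piece separately
by_country <- lapply(split(toy$score, toy$country), function(x) {
  c(N = length(x), Missing = sum(is.na(x)), M = mean(x, na.rm = TRUE))
})

# Stack the per-group summaries into one matrix (one row per country)
grouped <- do.call(rbind, by_country)
grouped
```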
tab_frequencies() outputs absolute and relative frequencies of all unique values of one or more categorical variables:
WoJ %>%
  tab_frequencies(employment)
#> # A tibble: 3 x 5
#> employment n percent cum_n cum_percent
#> <chr> <int> <dbl> <int> <dbl>
#> 1 Freelancer 172 0.143 172 0.143
#> 2 Full-time 902 0.752 1074 0.895
#> 3 Part-time 126 0.105 1200 1
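The columns in this output follow directly from the counts: percent is each count divided by the total, and the cum_ columns are running sums. A short base R sketch, using the three counts shown above, reproduces them. This is only an illustration of the arithmetic, not tidycomm's own code.

```r
# Counts copied from the tab_frequencies(employment) output above
n <- c("Freelancer" = 172, "Full-time" = 902, "Part-time" = 126)

freq <- data.frame(
  employment = names(n),
  n = unname(n),
  percent = unname(n / sum(n)),             # share of all 1,200 cases
  cum_n = unname(cumsum(n)),                # running total of counts
  cum_percent = unname(cumsum(n) / sum(n))  # running total of shares
)
freq
```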
Passing more than one variable will compute relative frequencies based on all combinations of unique values:
WoJ %>%
  tab_frequencies(employment, country)
#> # A tibble: 15 x 6
#> employment country n percent cum_n cum_percent
#> <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Freelancer Austria 16 0.0133 16 0.0133
#> 2 Freelancer Denmark 85 0.0708 101 0.0842
#> 3 Freelancer Germany 29 0.0242 130 0.108
#> 4 Freelancer Switzerland 10 0.00833 140 0.117
#> 5 Freelancer UK 32 0.0267 172 0.143
#> 6 Full-time Austria 165 0.138 337 0.281
#> 7 Full-time Denmark 275 0.229 612 0.51
#> 8 Full-time Germany 139 0.116 751 0.626
#> 9 Full-time Switzerland 154 0.128 905 0.754
#> 10 Full-time UK 169 0.141 1074 0.895
#> 11 Part-time Austria 26 0.0217 1100 0.917
#> 12 Part-time Denmark 16 0.0133 1116 0.93
#> 13 Part-time Germany 5 0.00417 1121 0.934
#> 14 Part-time Switzerland 69 0.0575 1190 0.992
#> 15 Part-time UK 10 0.00833 1200 1
You can also group your data before tabulating, which yields within-group relative frequencies:
WoJ %>%
  dplyr::group_by(country) %>%
  tab_frequencies(employment)
#> # A tibble: 15 x 6
#> # Groups: country [5]
#> employment country n percent cum_n cum_percent
#> <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Freelancer Austria 16 0.0773 16 0.0773
#> 2 Full-time Austria 165 0.797 181 0.874
#> 3 Part-time Austria 26 0.126 207 1
#> 4 Freelancer Denmark 85 0.226 85 0.226
#> 5 Full-time Denmark 275 0.731 360 0.957
#> 6 Part-time Denmark 16 0.0426 376 1
#> 7 Freelancer Germany 29 0.168 29 0.168
#> 8 Full-time Germany 139 0.803 168 0.971
#> 9 Part-time Germany 5 0.0289 173 1
#> 10 Freelancer Switzerland 10 0.0429 10 0.0429
#> 11 Full-time Switzerland 154 0.661 164 0.704
#> 12 Part-time Switzerland 69 0.296 233 1
#> 13 Freelancer UK 32 0.152 32 0.152
#> 14 Full-time UK 169 0.801 201 0.953
#> 15 Part-time UK 10 0.0474 211 1
(Compare the columns percent, cum_n, and cum_percent with the output above.)
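The difference between the two outputs comes down to the denominator. A base R sketch using the Austria and Germany counts from the outputs above shows this: without grouping, each count is divided by the full sample size; with grouping, it is divided by its own country's total, and the cumulative count restarts for each country. The column names percent_overall, percent_within, and cum_n_within are chosen here for illustration and do not appear in tidycomm's output.

```r
# Counts copied from the outputs above (Austria and Germany only)
counts <- data.frame(
  country = rep(c("Austria", "Germany"), each = 3),
  employment = rep(c("Freelancer", "Full-time", "Part-time"), times = 2),
  n = c(16, 165, 26, 29, 139, 5)
)

# Ungrouped: every count is divided by the full sample size (1,200 cases)
counts$percent_overall <- counts$n / 1200

# Grouped: every count is divided by the total of its own country
counts$percent_within <- counts$n / ave(counts$n, counts$country, FUN = sum)

# Cumulative counts restart within each country
counts$cum_n_within <- ave(counts$n, counts$country, FUN = cumsum)
counts
```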