Basic usage

Motivation

One of the things that Stata has that RStudio lacks is the variable explorer. This extremely useful especially if you’re working with datasets with a large number of variables with hard to remember names, but descriptive labels. In Stata you just search in the variable explorer and then click on the variable to get its name into the console.

Stata’s variable explorer

As an example of a dataset like this, consider the Quality of Government standard dataset. Here’s the 2018 version of the cross-section data:

qog <- rio::import("http://www.qogdata.pol.gu.se/dataarchive/qog_std_cs_jan18.dta")

It has 194 observations (different countries) and 1882 variables.

The variables have names like wdi_acelu, bci_bci, eu_eco2gdpeurhab, gle_cgdpc etc. Not exactly things you want to remember.

Working with this in Stata is relatively easy because you just search in the variable explorer for things like “sanitation”, “corruption”, “GDP”, etc. and you find the variable names.

Unfortunately, RStudio doesn’t have a variable explorer panel. But you can improvise something like the following:

data.frame(Description = sjlabelled::get_label(qog)) %>% DT::datatable()

BAM! We just made a variable explorer! If you run this code in the console it opens the DT::datatable in the RStudio’s Viewer pane, which is pretty much replicating the Stata experience (except that it is read-only).

But we can do better! Why not include additional information, like the number of missing observations, summary statistics, or an overview of the values of each variable?

Introducing `vars_explore`

Full usage

vars_explore(qog)

This will create a searchable variable explorer, and calculate summary statistics for each variable:

The table is searchable, and you can, furthermore, arrange it, say, based on which variable has least missing values. For instance, search for “GDP per capita” and see which variable provides most complete information.

If you click on the + next to a variable, you will get both the summary statistics, and unique values and, if present, the value labels. The option value.labels.width limits how many characters to include in the “Value labels” and “Values” columns. Default is 500.

If you set minimal = TRUE, only “Variable”, “Description”, “Obs.”, and “Missing” will be shown (and none the summary stats will be calculated).

Creating summary statistics tables

By default, vars_explore only shows the summary in RStudio’s Viewer Pane, and doesn’t return anything. But you can change this by setting viewer = FALSE and silent = FALSE. This speeds it us massively because it’s the DT::datatable() that takes most of the time, and allows you to build summary stats (e.g. for a paper):

vdem_summary <- qog %>% 
  select(starts_with("vdem_")) %>%
  vars_explore(viewer = FALSE, silent = FALSE) %>% 
  select(-Values, -`Value labels`)

knitr::kable(vdem_summary)

Variable	Description	Type	Obs.	Missing	Mean	Median	Std.Dev.	Min	Max
vdem_corr	Political corruption index	numeric	168	26	0.51	0.57	0.29	0.010031	0.968504
vdem_delibdem	Deliberative democracy index	numeric	168	26	0.41	0.39	0.26	0.002548	0.912120
vdem_dl_delib	Deliberative component index	numeric	168	26	0.66	0.74	0.25	0.020510	0.989109
vdem_edcomp_thick	Electoral component index	numeric	168	26	0.62	0.65	0.24	0.004832	0.939636
vdem_egal	Egalitarian component index	numeric	168	26	0.62	0.63	0.20	0.086303	0.969009
vdem_egaldem	Egalitarian democracy index	numeric	168	26	0.42	0.36	0.23	0.043189	0.869084
vdem_elvotbuy	Election vote buying	numeric	162	32	2.09	1.82	1.03	0.211176	3.982266
vdem_exbribe	Executive bribery and corrupt exchanges	numeric	168	26	1.93	1.79	1.06	0.097352	3.938878
vdem_excrptps	Public sector corrupt exchanges	numeric	168	26	1.82	1.69	1.01	0.113801	3.937302
vdem_execorr	Executive corruption index	numeric	168	26	0.48	0.52	0.30	0.010515	0.974874
vdem_exembez	Executive embezzlement and theft	numeric	168	26	2.13	2.11	1.11	0.091897	3.924967
vdem_exthftps	Public sector theft	numeric	168	26	0.06	-0.13	1.49	-3.137724	3.399789
vdem_gcrrpt	Legislature corrupt activities	numeric	168	26	1.85	1.67	0.89	0.299263	3.894069
vdem_gender	Women political empowerment index	numeric	168	26	0.74	0.79	0.17	0.274245	0.956486
vdem_jucorrdc	Judicial corruption decision	numeric	168	26	2.26	2.25	0.85	0.475427	3.825114
vdem_libdem	Liberal democracy index	numeric	168	26	0.43	0.40	0.26	0.015430	0.886297
vdem_liberal	Liberal component index	numeric	168	26	0.62	0.68	0.25	0.037652	0.975519
vdem_mecorrpt	Media corrupt	numeric	168	26	2.54	2.62	0.97	0.079457	3.966262
vdem_partip	Participatory component index	numeric	168	26	0.47	0.52	0.16	0.042004	0.711231
vdem_partipdem	Participatory democracy index	numeric	168	26	0.34	0.34	0.19	0.014923	0.672246
vdem_polyarchy	Electoral democracy index	numeric	168	26	0.55	0.57	0.24	0.026478	0.924651
vdem_pubcorr	Public sector corruption index	numeric	168	26	0.50	0.52	0.30	0.005353	0.981405

You can also opt to use View() instead of the Viewer Pane, which works much faster than DT::datatable(), although, given how RStudio works, this makes hard to see both the variable explorer and your script at the same time.

Alternatives

The best alternative to vars_explore is the vtable package. The downsides of vtable are

it doesn’t provide a searchable table in the Viewer Pane, and
it puts all summary stats in a single text column. This makes it hard to sort based on, say, the number of missing values.

You can, however, use vtable to generate a dataframe that can be opened with View(), just like you can with vars_explore(silent = FALSE, viewer = FALSE). Unfortunately, what makes vtable faster is precisely its limitations, as the most time consuming part is loading up the DT::datatable, rather than calculating anything. vtable works fast because it creates a simple HTML file, but that is not searchable in the Viewer Pane.

Another alternative is sjPlot::view_df, which provides summary stats in individual columns, but it is very slow. Also, like vtable, it doesn’t provide a searchable table in the Viewer Pane.