Data Retrieval and Reuse

This vignette shows how to download data from Dataverse using the dataverse package. We’ll focus on a Dataverse repository that contains supplemental files for Jamie Monogan’s book Political Analysis Using R, which is stored at Harvard University’s IQSS Dataverse Network:

Monogan, Jamie, 2015, “Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems”, doi:10.7910/DVN/ARKOTI, Harvard Dataverse, V1, UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==

This study is persistently retrievable by a “Digital Object Identifier (DOI)”: https://doi.org/10.7910/DVN/ARKOTI and the citation above (taken from the Dataverse page) includes a “Universal Numeric Fingerprint (UNF)”: UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==, which provides a versioned, multi-file hash for the entire study, which contains 32 files.

If you don’t already know what datasets and files you want to use from Dataverse, see the “Data Search” vignette for guidance on data search and discovery.

Retrieving Dataset and File Metadata

We will download these files and examine them directly in R using the dataverse package. To begin, we need to loading the package and using the get_dataset() function to retrieve some basic metadata about the dataset:

library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
(dataset <- get_dataset("doi:10.7910/DVN/ARKOTI"))

## Dataset (75170): 
## Version: 1.0, RELEASED
## Release Date: 2015-07-07T02:57:02Z
## License: CC0
## 17 Files:
##                           label version      id                  contentType
## 1                  alpl2013.tab       2 2692294    text/tab-separated-values
## 2                   BPchap7.tab       2 2692295    text/tab-separated-values
## 3                   chapter01.R       2 2692202 text/plain; charset=US-ASCII
## 4                   chapter02.R       2 2692206 text/plain; charset=US-ASCII
## 5                   chapter03.R       2 2692210 text/plain; charset=US-ASCII
## 6                   chapter04.R       2 2692204 text/plain; charset=US-ASCII
## 7                   chapter05.R       2 2692205 text/plain; charset=US-ASCII
## 8                   chapter06.R       2 2692212 text/plain; charset=US-ASCII
## 9                   chapter07.R       2 2692209 text/plain; charset=US-ASCII
## 10                  chapter08.R       2 2692208 text/plain; charset=US-ASCII
## 11                  chapter09.R       2 2692211 text/plain; charset=US-ASCII
## 12                  chapter10.R       1 2692203 text/plain; charset=US-ASCII
## 13                  chapter11.R       1 2692207 text/plain; charset=US-ASCII
## 14 comprehensiveJapanEnergy.tab       2 2692296    text/tab-separated-values
## 15         constructionData.tab       2 2692293    text/tab-separated-values
## 16             drugCoverage.csv       1 2692233 text/plain; charset=US-ASCII
## 17         hanmerKalkanANES.tab       2 2692290    text/tab-separated-values
## 18                 hmnrghts.tab       2 2692298    text/tab-separated-values
## 19                 hmnrghts.txt       1 2692238                   text/plain
## 20                   levant.tab       2 2692289    text/tab-separated-values
## 21                       LL.csv       1 2692228 text/plain; charset=US-ASCII
## 22                 moneyDem.tab       2 2692292    text/tab-separated-values
## 23            owsiakJOP2013.tab       2 2692297    text/tab-separated-values
## 24                PESenergy.csv       1 2692230 text/plain; charset=US-ASCII
## 25                  pts1994.csv       1 2692229 text/plain; charset=US-ASCII
## 26                  pts1995.csv       1 2692231 text/plain; charset=US-ASCII
## 27                 sen113kh.ord       1 2692239 text/plain; charset=US-ASCII
## 28                SinghEJPR.tab       2 2692299    text/tab-separated-values
## 29                 SinghJTP.tab       2 2692288    text/tab-separated-values
## 30                 stdSingh.tab       2 2692291    text/tab-separated-values
## 31                       UN.csv       1 2692232 text/plain; charset=US-ASCII
## 32                  war1800.tab       2 2692300    text/tab-separated-values

The output prints some basic metadata and then the str() of the files data frame returned by the call. This lists all of the files in the dataset along with a considerable amount of metadata about each. We can see a quick glance at these files using:

dataset$files[c("filename", "contentType")]

This shows that there are indeed 32 files, a mix of .R code files and tab- and comma-separated data files.

You can also retrieve more extensive metadata using dataset_metadata():

str(dataset_metadata("doi:10.7910/DVN/ARKOTI"), 1)

## List of 2
##  $ displayName: chr "Citation Metadata"
##  $ fields     :'data.frame': 7 obs. of  4 variables:

We’ll focus here on the code and data files for Chapter 2 from the book.

Retrieving Files

Let’s start by grabbing the code using get_file() (note that this always returns a raw vector):

code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
writeBin(code3, "chapter03.R")

Now we’ll get the corresponding data and save it locally. For this code we need two data files:

writeBin(get_file("constructionData.tab", "doi:10.7910/DVN/ARKOTI"), 
         "constructionData.dta")
writeBin(get_file("PESenergy.csv", "doi:10.7910/DVN/ARKOTI"), 
         "PESenergy.csv")

To confirm that the data look the way we want, we can also (perhaps alternatively) load it directly into R:

constructionData <- foreign::read.dta("constructionData.dta")
str(constructionData)
PESenergy <- utils::read.table("PESenergy.csv")
str(PESenergy)

## 'data.frame':    50 obs. of  55 variables:
##  $ year                      : int  1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
##  $ stno                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ totalreg                  : int  329 500 314 963 2106 643 634 239 1996 880 ...
##  $ totalhealth               : int  300 424 263 834 1859 554 501 204 1640 732 ...
##  $ raneyfolded97             : num  0.58 0.69 0.85 0.63 0.5 ...
##  $ healthagenda97            : int  49 180 137 220 1409 153 324 40 408 157 ...
##  $ predictedtotalig          : num  51.8 99 81.8 111.2 224.1 ...
##  $ supplytotalhealth         : int  1168 6991 4666 9194 70014 8847 7845 1438 35363 13471 ...
##  $ totalhealthsupplysq       : int  136 4887 2177 8453 490196 7827 6154 207 125054 18147 ...
##  $ partratetotalhealth       : num  2.48 1.09 1.09 1.4 0.35 ...
##  $ ighealthcare              : int  29 76 51 129 247 89 133 35 356 148 ...
##  $ supplydirectpatientcare   : int  1137 6687 4458 8785 66960 8320 7439 1365 33793 12760 ...
##  $ dpcsupplysq               : int  129 4472 1987 7718 448364 6922 5534 186 114197 16282 ...
##  $ partratedpc               : num  1.14 0.51 0.43 0.68 0.17 ...
##  $ igdpcare                  : int  13 34 19 60 112 40 67 12 212 74 ...
##  $ supplypharmprod           : int  0 174 78 229 2288 340 202 36 962 360 ...
##  $ pharmsupplysq             : int  0 30276 6084 52441 5234944 115600 40804 1296 925444 129600 ...
##  $ partratepharmprod         : num  0 10.34 19.23 5.24 2.05 ...
##  $ igpharmprod               : int  4 18 15 12 47 23 22 12 46 32 ...
##  $ supplybusiness            : int  0 51 28 93 315 55 36 14 317 78 ...
##  $ businesssupplysq          : int  0 2601 784 8649 99225 3025 1296 196 100489 6084 ...
##  $ partratebusness           : num  0 1.96 14.29 15.05 6.03 ...
##  $ igbusiness                : int  2 1 4 14 19 5 4 2 25 6 ...
##  $ supplygovt                : int  14 26 80 23 70 71 105 2 67 176 ...
##  $ govsupplysq               : num  0.02 0.07 0.64 0.05 0.49 ...
##  $ partrategov               : num  0 38.5 2.5 30.4 10 ...
##  $ iggovt                    : int  0 10 2 7 7 1 8 0 12 2 ...
##  $ supplyadvocacy            : int  16 37 14 57 344 54 51 18 206 76 ...
##  $ advossq                   : int  256 1369 196 3249 118336 2916 2601 324 42436 5776 ...
##  $ partrateadvo              : num  31.25 16.22 28.57 31.58 8.72 ...
##  $ ig97advoc                 : int  5 6 4 18 30 7 9 4 26 17 ...
##  $ rnmedschools              : int  1 16 8 7 37 7 12 3 18 21 ...
##  $ rnmedschoolssq            : int  1 256 64 49 1369 49 144 9 324 441 ...
##  $ rnmedschoolpartrate       : num  100 0 12.5 28.57 5.41 ...
##  $ rnmedschooligs            : int  1 0 1 2 2 0 1 0 6 1 ...
##  $ healthprofessionals       : int  12890 128980 82140 122760 749620 111550 121110 22740 471270 215670 ...
##  $ healthprofessionalssquared: int  16615 1663584 674698 1507002 56193014 1244340 1466763 51711 22209541 4651355 ...
##  $ partrateprofessionals     : num  0.03 0.01 0.01 0.01 0 ...
##  $ ighealthprofessionals     : int  4 7 6 16 30 13 22 5 29 16 ...
##  $ predictdpcpartrate        : num  1.175 0.915 1.016 0.826 0.348 ...
##  $ predictdpcig              : num  23.1 49.7 39.4 58.8 103.5 ...
##  $ predictprofpartrate       : num  0.02475 0.01383 0.01788 0.01434 0.00579 ...
##  $ predictprofig             : num  7.59 12.58 10.69 12.34 22.47 ...
##  $ predictmedschoolparttrate : num  17.39 8.08 12.3 12.95 5.02 ...
##  $ predictmedschoolig        : num  0.355 1.269 0.774 0.713 2.65 ...
##  $ predictadvopartrate       : num  31.9 26.4 32.5 21.6 13 ...
##  $ predictadvoig             : num  5.96 7.98 5.76 9.83 28.53 ...
##  $ predictbuspartrate        : num  25.78 18.08 21.33 13.1 7.27 ...
##  $ predictbusig              : num  2.58 7.96 5.66 11.66 20.04 ...
##  $ predictpharmpartrate      : num  21.38 15.22 18.52 13.44 4.14 ...
##  $ predictpharmig            : num  11.3 18.1 14.4 20.1 45.1 ...
##  $ predictgovpartrate        : num  14.41 12.61 5.84 13.03 6.93 ...
##  $ predictgovig              : num  2.06 2.43 3.78 2.35 3.57 ...
##  $ predicttotalpartrate      : num  2.41 1.823 2.047 1.623 0.752 ...
##  $ predicttotalig            : num  54.2 99.2 81.9 114.8 228.3 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr " 1 Jun 2013 16:59"
##  - attr(*, "formats")= chr  "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
##  - attr(*, "types")= int  252 251 252 252 254 252 254 253 253 254 ...
##  - attr(*, "val.labels")= chr  "" "" "" "" ...
##  - attr(*, "var.labels")= chr  "Year" "StNo." "97 TotalReg" "97Total-Health" ...
##  - attr(*, "version")= int 12
## 'data.frame':    181 obs. of  1 variable:
##  $ V1: Factor w/ 181 levels "Apr-69,5,3.4,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,39.2",..: 31 62 47 107 1 122 92 77 16 167 ...

In addition to visual inspection, we can compare the UNF signatures for each dataset against what is reported by Dataverse to confirm that we received the correct files:

library("UNF")
unf(constructionData)
unf(PESenergy)
dataset$files[c("label", "UNF")]

## UNF6:+4pc5114xS0ryr1sSvdX6g== 
## UNF6:TD7TEMZyrX4iGTlTsUKQDg== 
##                           label                            UNF
## 1                  alpl2013.tab UNF:6:d9ZNXvmiPfiunSAiXRpVfg==
## 2                   BPchap7.tab UNF:6:B3/HJbnzktaX5eEJA2ItiA==
## 3                   chapter01.R                           <NA>
## 4                   chapter02.R                           <NA>
## 5                   chapter03.R                           <NA>
## 6                   chapter04.R                           <NA>
## 7                   chapter05.R                           <NA>
## 8                   chapter06.R                           <NA>
## 9                   chapter07.R                           <NA>
## 10                  chapter08.R                           <NA>
## 11                  chapter09.R                           <NA>
## 12                  chapter10.R                           <NA>
## 13                  chapter11.R                           <NA>
## 14 comprehensiveJapanEnergy.tab UNF:6:Vhb3oZb9m4Nk9N7s6UAHGg==
## 15         constructionData.tab UNF:6:+4pc5114xS0ryr1sSvdX6g==
## 16             drugCoverage.csv                           <NA>
## 17         hanmerKalkanANES.tab UNF:6:lrQrhDAXFc8lSRP9muJslw==
## 18                 hmnrghts.tab UNF:6:uEg24jBA2ht0P4WeNLjI+w==
## 19                 hmnrghts.txt                           <NA>
## 20                   levant.tab UNF:6:zlgG7+JXsIZYvS383eQOvA==
## 21                       LL.csv                           <NA>
## 22                 moneyDem.tab UNF:6:7M/QM5i6IM/VUM94UJjJUQ==
## 23            owsiakJOP2013.tab UNF:6:0ZEvCFuUQms2zYD57hmwNQ==
## 24                PESenergy.csv                           <NA>
## 25                  pts1994.csv                           <NA>
## 26                  pts1995.csv                           <NA>
## 27                 sen113kh.ord                           <NA>
## 28                SinghEJPR.tab UNF:6:iDGp9dXOl4SiR+rCBWo8Tw==
## 29                 SinghJTP.tab UNF:6:lDCyZ7YQF5O++SRsxh2kGA==
## 30                 stdSingh.tab UNF:6:A5gwtn5q/ewkTMpcQEQ73w==
## 31                       UN.csv                           <NA>
## 32                  war1800.tab UNF:6:jJ++mepKcv9JbJOOPLMf2Q==

Reusing Files and Reproducing Analysis

To reproduce the analysis, we can simply run the code file either as a system() call or directly in R using source() (note this particular file begins with an rm() call so you may want to run it in a clean R session):

system("Rscript chapter03.R")
source("chapter03.R")

Any well-produced set of analysis reproduction files, like this one, will easily run without error once the data and code are in-hand.

To archive your own reproducible analyses using Dataverse, see the “Archiving Data” vignette.

Data Retrieval and Reuse

2017-06-15

Retrieving Dataset and File Metadata

Retrieving Files

Reusing Files and Reproducing Analysis