This vignette shows how to download data from Dataverse using the dataverse package. We’ll focus on a Dataverse repository that contains supplemental files for Jamie Monogan’s book Political Analysis Using R, which is stored at Harvard University’s IQSS Dataverse Network:
Monogan, Jamie, 2015, “Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems”, doi:10.7910/DVN/ARKOTI, Harvard Dataverse, V1, UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==
This study is persistently retrievable by a “Digital Object Identifier (DOI)”: https://doi.org/10.7910/DVN/ARKOTI and the citation above (taken from the Dataverse page) includes a “Universal Numeric Fingerprint (UNF)”: UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==
, which provides a versioned, multi-file hash for the entire study, which contains 32 files.
If you don’t already know what datasets and files you want to use from Dataverse, see the “Data Search” vignette for guidance on data search and discovery.
We will download these files and examine them directly in R using the dataverse package. To begin, we need to loading the package and using the get_dataset()
function to retrieve some basic metadata about the dataset:
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
(dataset <- get_dataset("doi:10.7910/DVN/ARKOTI"))
## Dataset (75170):
## Version: 1.0, RELEASED
## Release Date: 2015-07-07T02:57:02Z
## License: CC0
## 17 Files:
## label version id contentType
## 1 alpl2013.tab 2 2692294 text/tab-separated-values
## 2 BPchap7.tab 2 2692295 text/tab-separated-values
## 3 chapter01.R 2 2692202 text/plain; charset=US-ASCII
## 4 chapter02.R 2 2692206 text/plain; charset=US-ASCII
## 5 chapter03.R 2 2692210 text/plain; charset=US-ASCII
## 6 chapter04.R 2 2692204 text/plain; charset=US-ASCII
## 7 chapter05.R 2 2692205 text/plain; charset=US-ASCII
## 8 chapter06.R 2 2692212 text/plain; charset=US-ASCII
## 9 chapter07.R 2 2692209 text/plain; charset=US-ASCII
## 10 chapter08.R 2 2692208 text/plain; charset=US-ASCII
## 11 chapter09.R 2 2692211 text/plain; charset=US-ASCII
## 12 chapter10.R 1 2692203 text/plain; charset=US-ASCII
## 13 chapter11.R 1 2692207 text/plain; charset=US-ASCII
## 14 comprehensiveJapanEnergy.tab 2 2692296 text/tab-separated-values
## 15 constructionData.tab 2 2692293 text/tab-separated-values
## 16 drugCoverage.csv 1 2692233 text/plain; charset=US-ASCII
## 17 hanmerKalkanANES.tab 2 2692290 text/tab-separated-values
## 18 hmnrghts.tab 2 2692298 text/tab-separated-values
## 19 hmnrghts.txt 1 2692238 text/plain
## 20 levant.tab 2 2692289 text/tab-separated-values
## 21 LL.csv 1 2692228 text/plain; charset=US-ASCII
## 22 moneyDem.tab 2 2692292 text/tab-separated-values
## 23 owsiakJOP2013.tab 2 2692297 text/tab-separated-values
## 24 PESenergy.csv 1 2692230 text/plain; charset=US-ASCII
## 25 pts1994.csv 1 2692229 text/plain; charset=US-ASCII
## 26 pts1995.csv 1 2692231 text/plain; charset=US-ASCII
## 27 sen113kh.ord 1 2692239 text/plain; charset=US-ASCII
## 28 SinghEJPR.tab 2 2692299 text/tab-separated-values
## 29 SinghJTP.tab 2 2692288 text/tab-separated-values
## 30 stdSingh.tab 2 2692291 text/tab-separated-values
## 31 UN.csv 1 2692232 text/plain; charset=US-ASCII
## 32 war1800.tab 2 2692300 text/tab-separated-values
The output prints some basic metadata and then the str()
of the files
data frame returned by the call. This lists all of the files in the dataset along with a considerable amount of metadata about each. We can see a quick glance at these files using:
dataset$files[c("filename", "contentType")]
This shows that there are indeed 32 files, a mix of .R code files and tab- and comma-separated data files.
You can also retrieve more extensive metadata using dataset_metadata()
:
str(dataset_metadata("doi:10.7910/DVN/ARKOTI"), 1)
## List of 2
## $ displayName: chr "Citation Metadata"
## $ fields :'data.frame': 7 obs. of 4 variables:
We’ll focus here on the code and data files for Chapter 2 from the book.
Let’s start by grabbing the code using get_file()
(note that this always returns a raw vector):
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
writeBin(code3, "chapter03.R")
Now we’ll get the corresponding data and save it locally. For this code we need two data files:
writeBin(get_file("constructionData.tab", "doi:10.7910/DVN/ARKOTI"),
"constructionData.dta")
writeBin(get_file("PESenergy.csv", "doi:10.7910/DVN/ARKOTI"),
"PESenergy.csv")
To confirm that the data look the way we want, we can also (perhaps alternatively) load it directly into R:
constructionData <- foreign::read.dta("constructionData.dta")
str(constructionData)
PESenergy <- utils::read.table("PESenergy.csv")
str(PESenergy)
## 'data.frame': 50 obs. of 55 variables:
## $ year : int 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
## $ stno : int 1 2 3 4 5 6 7 8 9 10 ...
## $ totalreg : int 329 500 314 963 2106 643 634 239 1996 880 ...
## $ totalhealth : int 300 424 263 834 1859 554 501 204 1640 732 ...
## $ raneyfolded97 : num 0.58 0.69 0.85 0.63 0.5 ...
## $ healthagenda97 : int 49 180 137 220 1409 153 324 40 408 157 ...
## $ predictedtotalig : num 51.8 99 81.8 111.2 224.1 ...
## $ supplytotalhealth : int 1168 6991 4666 9194 70014 8847 7845 1438 35363 13471 ...
## $ totalhealthsupplysq : int 136 4887 2177 8453 490196 7827 6154 207 125054 18147 ...
## $ partratetotalhealth : num 2.48 1.09 1.09 1.4 0.35 ...
## $ ighealthcare : int 29 76 51 129 247 89 133 35 356 148 ...
## $ supplydirectpatientcare : int 1137 6687 4458 8785 66960 8320 7439 1365 33793 12760 ...
## $ dpcsupplysq : int 129 4472 1987 7718 448364 6922 5534 186 114197 16282 ...
## $ partratedpc : num 1.14 0.51 0.43 0.68 0.17 ...
## $ igdpcare : int 13 34 19 60 112 40 67 12 212 74 ...
## $ supplypharmprod : int 0 174 78 229 2288 340 202 36 962 360 ...
## $ pharmsupplysq : int 0 30276 6084 52441 5234944 115600 40804 1296 925444 129600 ...
## $ partratepharmprod : num 0 10.34 19.23 5.24 2.05 ...
## $ igpharmprod : int 4 18 15 12 47 23 22 12 46 32 ...
## $ supplybusiness : int 0 51 28 93 315 55 36 14 317 78 ...
## $ businesssupplysq : int 0 2601 784 8649 99225 3025 1296 196 100489 6084 ...
## $ partratebusness : num 0 1.96 14.29 15.05 6.03 ...
## $ igbusiness : int 2 1 4 14 19 5 4 2 25 6 ...
## $ supplygovt : int 14 26 80 23 70 71 105 2 67 176 ...
## $ govsupplysq : num 0.02 0.07 0.64 0.05 0.49 ...
## $ partrategov : num 0 38.5 2.5 30.4 10 ...
## $ iggovt : int 0 10 2 7 7 1 8 0 12 2 ...
## $ supplyadvocacy : int 16 37 14 57 344 54 51 18 206 76 ...
## $ advossq : int 256 1369 196 3249 118336 2916 2601 324 42436 5776 ...
## $ partrateadvo : num 31.25 16.22 28.57 31.58 8.72 ...
## $ ig97advoc : int 5 6 4 18 30 7 9 4 26 17 ...
## $ rnmedschools : int 1 16 8 7 37 7 12 3 18 21 ...
## $ rnmedschoolssq : int 1 256 64 49 1369 49 144 9 324 441 ...
## $ rnmedschoolpartrate : num 100 0 12.5 28.57 5.41 ...
## $ rnmedschooligs : int 1 0 1 2 2 0 1 0 6 1 ...
## $ healthprofessionals : int 12890 128980 82140 122760 749620 111550 121110 22740 471270 215670 ...
## $ healthprofessionalssquared: int 16615 1663584 674698 1507002 56193014 1244340 1466763 51711 22209541 4651355 ...
## $ partrateprofessionals : num 0.03 0.01 0.01 0.01 0 ...
## $ ighealthprofessionals : int 4 7 6 16 30 13 22 5 29 16 ...
## $ predictdpcpartrate : num 1.175 0.915 1.016 0.826 0.348 ...
## $ predictdpcig : num 23.1 49.7 39.4 58.8 103.5 ...
## $ predictprofpartrate : num 0.02475 0.01383 0.01788 0.01434 0.00579 ...
## $ predictprofig : num 7.59 12.58 10.69 12.34 22.47 ...
## $ predictmedschoolparttrate : num 17.39 8.08 12.3 12.95 5.02 ...
## $ predictmedschoolig : num 0.355 1.269 0.774 0.713 2.65 ...
## $ predictadvopartrate : num 31.9 26.4 32.5 21.6 13 ...
## $ predictadvoig : num 5.96 7.98 5.76 9.83 28.53 ...
## $ predictbuspartrate : num 25.78 18.08 21.33 13.1 7.27 ...
## $ predictbusig : num 2.58 7.96 5.66 11.66 20.04 ...
## $ predictpharmpartrate : num 21.38 15.22 18.52 13.44 4.14 ...
## $ predictpharmig : num 11.3 18.1 14.4 20.1 45.1 ...
## $ predictgovpartrate : num 14.41 12.61 5.84 13.03 6.93 ...
## $ predictgovig : num 2.06 2.43 3.78 2.35 3.57 ...
## $ predicttotalpartrate : num 2.41 1.823 2.047 1.623 0.752 ...
## $ predicttotalig : num 54.2 99.2 81.9 114.8 228.3 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr " 1 Jun 2013 16:59"
## - attr(*, "formats")= chr "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
## - attr(*, "types")= int 252 251 252 252 254 252 254 253 253 254 ...
## - attr(*, "val.labels")= chr "" "" "" "" ...
## - attr(*, "var.labels")= chr "Year" "StNo." "97 TotalReg" "97Total-Health" ...
## - attr(*, "version")= int 12
## 'data.frame': 181 obs. of 1 variable:
## $ V1: Factor w/ 181 levels "Apr-69,5,3.4,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,39.2",..: 31 62 47 107 1 122 92 77 16 167 ...
In addition to visual inspection, we can compare the UNF signatures for each dataset against what is reported by Dataverse to confirm that we received the correct files:
library("UNF")
unf(constructionData)
unf(PESenergy)
dataset$files[c("label", "UNF")]
## UNF6:+4pc5114xS0ryr1sSvdX6g==
## UNF6:TD7TEMZyrX4iGTlTsUKQDg==
## label UNF
## 1 alpl2013.tab UNF:6:d9ZNXvmiPfiunSAiXRpVfg==
## 2 BPchap7.tab UNF:6:B3/HJbnzktaX5eEJA2ItiA==
## 3 chapter01.R <NA>
## 4 chapter02.R <NA>
## 5 chapter03.R <NA>
## 6 chapter04.R <NA>
## 7 chapter05.R <NA>
## 8 chapter06.R <NA>
## 9 chapter07.R <NA>
## 10 chapter08.R <NA>
## 11 chapter09.R <NA>
## 12 chapter10.R <NA>
## 13 chapter11.R <NA>
## 14 comprehensiveJapanEnergy.tab UNF:6:Vhb3oZb9m4Nk9N7s6UAHGg==
## 15 constructionData.tab UNF:6:+4pc5114xS0ryr1sSvdX6g==
## 16 drugCoverage.csv <NA>
## 17 hanmerKalkanANES.tab UNF:6:lrQrhDAXFc8lSRP9muJslw==
## 18 hmnrghts.tab UNF:6:uEg24jBA2ht0P4WeNLjI+w==
## 19 hmnrghts.txt <NA>
## 20 levant.tab UNF:6:zlgG7+JXsIZYvS383eQOvA==
## 21 LL.csv <NA>
## 22 moneyDem.tab UNF:6:7M/QM5i6IM/VUM94UJjJUQ==
## 23 owsiakJOP2013.tab UNF:6:0ZEvCFuUQms2zYD57hmwNQ==
## 24 PESenergy.csv <NA>
## 25 pts1994.csv <NA>
## 26 pts1995.csv <NA>
## 27 sen113kh.ord <NA>
## 28 SinghEJPR.tab UNF:6:iDGp9dXOl4SiR+rCBWo8Tw==
## 29 SinghJTP.tab UNF:6:lDCyZ7YQF5O++SRsxh2kGA==
## 30 stdSingh.tab UNF:6:A5gwtn5q/ewkTMpcQEQ73w==
## 31 UN.csv <NA>
## 32 war1800.tab UNF:6:jJ++mepKcv9JbJOOPLMf2Q==
To reproduce the analysis, we can simply run the code file either as a system()
call or directly in R using source()
(note this particular file begins with an rm()
call so you may want to run it in a clean R session):
system("Rscript chapter03.R")
source("chapter03.R")
Any well-produced set of analysis reproduction files, like this one, will easily run without error once the data and code are in-hand.
To archive your own reproducible analyses using Dataverse, see the “Archiving Data” vignette.