Getting Started

Discovering

You can discover datasets with pin_find(), which by default searches for data inside CRAN packages. The places where pins can find or store resources are referred to as ‘boards’. Several boards are available, but most of them require configuration, so we will leave those for later.

As a quick example, let’s search for resources that may contain ‘boston housing’:

library(pins)
pin_find("boston housing")

# A tibble: 8 x 4
  name                  description                                            type  board  
  <chr>                 <chr>                                                  <chr> <chr>  
1 A3/housing            Boston Housing Prices from A3 package.                 table packag…
2 BiDAG/Boston          Boston housing data from BiDAG package.                table packag…
3 GSE/boston            Boston Housing Data from GSE package.                  table packag…
4 KernelKnn/Boston      Boston Housing Data (Regression) from KernelKnn packa… table packag…
5 mlbench/BostonHousing Boston Housing Data from mlbench package.              table packag…
6 pdp/boston            Boston Housing Data from pdp package.                  table packag…
7 spData/boston         Corrected Boston Housing Data from spData package.     table packag…
8 spikeslab/housingI    Boston Housing Interaction Data from spikeslab packag… table packag…
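
Since the search result is a regular data frame, it can be narrowed down further with ordinary R tools. The sketch below is only an illustration: it rebuilds three of the rows shown above by hand (the `results` and `corrected` names are made up for this example) so that it runs without calling pin_find().

```r
# A few rows copied from the search results shown above; in practice this
# data frame would come straight from pin_find("boston housing")
results <- data.frame(
  name = c("mlbench/BostonHousing", "pdp/boston", "spData/boston"),
  description = c(
    "Boston Housing Data from mlbench package.",
    "Boston Housing Data from pdp package.",
    "Corrected Boston Housing Data from spData package."
  ),
  type = "table",
  stringsAsFactors = FALSE
)

# Keep only the entries whose description mentions a corrected dataset
corrected <- results[grepl("corrected", results$description, ignore.case = TRUE), ]
corrected$name
```

The same subsetting should work unchanged on the tibble that pin_find() returns, because it only relies on the name and description columns.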

The BSDA package also contains a Housing dataset, which you can retrieve using pin_get() as follows:

pin_get("BSDA/Housing")

# A tibble: 74 x 3
   city       year   price
   <chr>      <fct>  <int>
 1 Albany     1984   52400
 2 Anaheim    1984  134900
 3 Atlanta    1984   64600
 4 Baltimore  1984   65200
 5 Birmingham 1984   66600
 6 Boston     1984  102000
 7 Chicago    1984   77500
 8 Cincinnati 1984   59600
 9 Cleveland  1984   65600
10 Columbus   1984   60400
# … with 64 more rows

Most datasets in CRAN contain rectangular data, which pins knows to load as a data frame. Other boards might contain non-rectangular datasets, which pin_get() also supports. More on this later, but first, let’s introduce caching.

Caching

Let’s suppose that the Housing dataset is not exactly what we are looking for. We can search online for ‘home prices’ and find that catalog.data.gov lists a more suitable FHFA House Price Indexes dataset. Instead of giving users explicit instructions to download the CSV file, we can use pin() to cache this dataset locally:

pin("http://www.fhfa.gov/datatools/downloads/documents/hpi/hpi_master.csv")

[1] "/Users/username/Library/Caches/pins/local/HPI_master/HPI_master.csv"

Notice that pin() returns a path to a local CSV file, which you are free to load with your favorite package.

library(readr)
pin("http://www.fhfa.gov/datatools/downloads/documents/hpi/hpi_master.csv") %>%
  read_csv(col_types = cols())

# A tibble: 108,826 x 10
   hpi_type hpi_flavor frequency level place_name place_id    yr period index_nsa
   <chr>    <chr>      <chr>     <chr> <chr>      <chr>    <dbl>  <dbl>     <dbl>
 1 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      1      100 
 2 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      2      101.
 3 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      3      101.
 4 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      4      102.
 5 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      5      102.
 6 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      6      103.
 7 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      7      103.
 8 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      8      103.
 9 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      9      103.
10 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991     10      103.
# … with 108,816 more rows, and 1 more variable: index_sa <dbl>

The pins package tries to be smart about downloading files only when they have changed. You can skip the details of how this works, but you should know that you can set download = TRUE to force pins to download the file again even if it appears unchanged. Specifically, pins relies on HTTP headers like cache-control and ETag to avoid downloading files when they have not changed or when the cache has not expired.
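
To make the caching idea more concrete, here is a minimal sketch of the kind of decision such a cache has to make. It is not the code pins uses internally. The `needs_download()` helper and the fields of `cached` (`etag`, `max_age`, `fetched_at`) are hypothetical names chosen for this illustration, and the logic is a simplified reading of how cache-control max-age and ETag values are commonly used.

```r
# Hypothetical helper, not part of pins: decide whether a cached file
# should be fetched again, given metadata saved at the last download.
needs_download <- function(cached, remote_etag = NULL,
                           now = Sys.time(), force = FALSE) {
  # force = TRUE plays the role that download = TRUE plays in pin()
  if (force || is.null(cached)) return(TRUE)

  # Still fresh according to the cache-control max-age: skip the download
  age <- as.numeric(difftime(now, cached$fetched_at, units = "secs"))
  if (!is.null(cached$max_age) && age < cached$max_age) return(FALSE)

  # Cache expired: download only if the server reports a different ETag
  if (!is.null(remote_etag) && !is.null(cached$etag)) {
    return(!identical(remote_etag, cached$etag))
  }

  # Nothing to compare against, so download to be safe
  TRUE
}

# Metadata as it might have been recorded after a first download
cached <- list(
  etag = "abc",
  max_age = 60,
  fetched_at = as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
)

# 30 seconds later the cache has not expired
needs_download(cached, now = cached$fetched_at + 30)

# 120 seconds later the cache has expired, but the ETag still matches
needs_download(cached, remote_etag = "abc", now = cached$fetched_at + 120)
```

In this sketch a file is fetched again only when the caller forces it, when nothing is cached yet, or when the cache has expired and the ETag no longer matches.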

Notice that pin() assigned a name automatically, HPI_master in the previous example. However, you can choose your own name and be explicit about retrieving a pin with pin_get():

pin("http://www.fhfa.gov/datatools/downloads/documents/hpi/hpi_master.csv",
    name = "home_price_indexes")

pin_get("home_price_indexes") %>%
  read_csv(col_types = cols())

# A tibble: 108,826 x 10
   hpi_type hpi_flavor frequency level place_name place_id    yr period index_nsa
   <chr>    <chr>      <chr>     <chr> <chr>      <chr>    <dbl>  <dbl>     <dbl>
 1 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      1      100 
 2 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      2      101.
 3 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      3      101.
 4 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      4      102.
 5 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      5      102.
 6 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      6      103.
 7 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      7      103.
 8 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      8      103.
 9 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991      9      103.
10 traditi… purchase-… monthly   USA … East Nort… DV_ENC    1991     10      103.
# … with 108,816 more rows, and 1 more variable: index_sa <dbl>

You can enable download progress with the pins.progress option and print additional caching information using the pins.verbose option:

# print verbose upload and download progress
options(pins.progress = TRUE)

# print verbose pins info
options(pins.verbose = TRUE)
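
If you only want this extra output for part of a script, one possible pattern is to keep the previous option values and restore them afterwards. This relies only on base R's options(), which returns the old values; the `old` variable name is just for illustration.

```r
# options() returns the previous values invisibly, so they can be restored
old <- options(pins.progress = TRUE, pins.verbose = TRUE)

# Both options are now enabled
getOption("pins.progress")
getOption("pins.verbose")

# ... calls to pin() or pin_get() would go here ...

# Restore whatever was set before, including unset options
options(old)
```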

Sharing

After performing a data analysis, you might want to share your dataset with others, which you can achieve using pin(data, board = "<board-name>").

There are multiple boards available. One of them is the “local” board, which pins uses by default. A “local” board can help you share pins with other R sessions and tools using a well-known cache folder on your local computer, as defined by the rappdirs package. Notice that this board is available by default:

board_list()

[1] "local"    "packages"

You can also name your boards using the ‘name’ parameter. When a name is not specified, the pins package simply names your board after the kind of board you are using, ‘local’ in the previous example.

The following example stores a simple data analysis over home prices as ‘home_price_analysis’ in the ‘local’ board.

pin_get("home_price_indexes") %>%
  read_csv(col_types = cols()) %>%
  dplyr::group_by(yr) %>%
  dplyr::count() %>%
  pin("home_price_analysis")

# A tibble: 45 x 2
      yr     n
   <dbl> <int>
 1  1975   274
 2  1976   383
 3  1977   524
 4  1978   689
 5  1979   812
 6  1980   894
 7  1981   886
 8  1982   898
 9  1983  1005
10  1984  1088
# … with 35 more rows
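
For a round trip that does not need a network connection, the same idea can be tried with a built-in dataset. This is a sketch that assumes the version of pins described in this article: the name "mtcars_example" is made up for the illustration, and pin_remove() is used on the assumption that it deletes a pin by name from the given board.

```r
library(pins)

# Store a built-in data frame under an explicit name in the "local" board
pin(mtcars, name = "mtcars_example", board = "local")

# Retrieve it by name, as another R session on this computer could
cars <- pin_get("mtcars_example", board = "local")
nrow(cars)

# Remove the example pin once it is no longer needed
pin_remove("mtcars_example", board = "local")
```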

The local board allows you to share pins with other R sessions or even with Python sessions. To share with other people or across different computers, consider using the github, rsconnect or kaggle boards; these boards are introduced in the Understanding Boards article.

Before we get to that, the Using Pins with RStudio article presents a few enhancements available when using pins in RStudio.