Recording queries with {dittodb} for travelling

Mauricio Vargas

2020-07-28

Scope

This example consists of mocking the connection to a real PostgreSQL server that contains a database version of the {nycflights13} dataset (among other databases). See the {nycflights13} vignette for more information about this database.

This example is for you if you have ever wondered how to use the scripts you run at the office when you are at home or travelling, or how to continue developing those scripts when you don't have an internet connection.

Many of us have to use databases that are only accessible from a local network. The package {dittodb} provides with_mock_db(), which wraps your code and makes it possible to run it outside the office (or even with no internet access at all!).

Recording queries

Suppose we are asked to analyze the flights data to find the planes that have been delayed by at least three hours.

One would find all the flights that have been delayed by over three hours, and then grab only the distinct tail numbers. The one consideration is to filter out flights with a missing tail number, or else those will all be treated as a single plane.

We could run the following code to get that data with a direct connection to the database (i.e. at the office):

library(dplyr)
library(dbplyr)

con_psql <- DBI::dbConnect(
  RPostgreSQL::PostgreSQL(),
  dbname = "travelling",
  host = "127.0.0.1",
  user = "m.ciccone"
)

tbl(con_psql, "flights") %>%
  filter(!is.na(tailnum)) %>%
  filter(arr_delay >= 180) %>% 
  select(tailnum) %>%
  distinct()

However, this won’t work if we can’t connect to our database server: postgres.server is an alias for an IP address only accessible from the local network at our office, so we couldn’t run this code and get a result anywhere else. But what if we wanted to continue working on this analysis on the train home?

Important: This example uses phony authentication. Never write your passwords in scripts; use your .Rprofile, an environment variable, or some other more secure method instead.
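For instance, a minimal sketch of the environment-variable approach (the variable name PSQL_PASSWORD is illustrative; set it in ~/.Renviron or your shell):

con_psql <- DBI::dbConnect(
  RPostgreSQL::PostgreSQL(),
  dbname = "travelling",
  host = "postgres.server",
  user = "m.ciccone",
  # read the password from an environment variable instead of hard-coding it
  password = Sys.getenv("PSQL_PASSWORD")
)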

One option would be saving a CSV or TXT file of the data manually, and then manually reading it into our R session. But this has a number of drawbacks: we have to mentally keep track of which file holds each query's result, save it to the right file, read it into the right place, etc. We also have to maintain a separate system or code path for reading in the saved files. {dittodb} takes care of all of this for us in the background, allowing us to record the results of the necessary queries and play them back when those same queries are called without a connection to the database.
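For contrast, here is a sketch of that manual workflow (the file and object names are illustrative). Note the second code path that has to mirror the database query:

# at the office: run the query, then save the result by hand
delayed_planes <- tbl(con_psql, "flights") %>%
  filter(!is.na(tailnum)) %>%
  filter(arr_delay >= 180) %>%
  select(tailnum) %>%
  distinct() %>%
  collect()

write.csv(delayed_planes, "delayed_planes.csv", row.names = FALSE)

# on the train: a separate code path that reads the file instead
delayed_planes <- read.csv("delayed_planes.csv", stringsAsFactors = FALSE)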

While we are able to connect to the database (i.e. when we are at the office), we can save the results returned by queries by calling start_db_capturing() before the connection and the code that executes the queries, and stop_db_capturing() at the end:

library(dittodb)

start_db_capturing()

con_psql <- DBI::dbConnect(
  RPostgreSQL::PostgreSQL(),
  dbname = "travelling",
  host = "postgres.server",
  user = "m.ciccone"
)

flights_delayed <- tbl(con_psql, "flights") %>%
  filter(!is.na(tailnum)) %>%
  filter(arr_delay >= 180) %>% 
  select(tailnum) %>%
  distinct() %>%
  collect()

flights_delayed

DBI::dbDisconnect(con_psql)

stop_db_capturing()
## # A tibble: 1,883 x 1
##    tailnum
##    <chr>  
##  1 N912XJ 
##  2 N645JB 
##  3 N904WN 
##  4 N3BWAA 
##  5 N3CJAA 
##  6 N14972 
##  7 N667UA 
##  8 N998AT 
##  9 N521JB 
## 10 N16559 
## # … with 1,873 more rows
## [1] TRUE
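Behind the scenes, {dittodb} saves each recorded response as a fixture file under its mock path, and reads the fixture back when the same query is run under with_mock_db(). A sketch of inspecting that location, assuming db_mock_paths() behaves like httptest's .mockPaths() (the "./mocks" directory is illustrative):

# list the directories where {dittodb} records and looks up fixtures
db_mock_paths()

# add a project-specific fixture directory to the search path
db_mock_paths("./mocks")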

Reproducing query results

If we succeeded in capturing one or more queries, we can replicate the results while connected to a different network, or even with no internet access at all. Note that the dbname must match the one used while recording, since {dittodb} uses it to locate the saved fixtures; no real connection to the host is made under with_mock_db():

with_mock_db({
  con_psql <- DBI::dbConnect(
    RPostgreSQL::PostgreSQL(),
    dbname = "travelling",
    host = "127.0.0.1",
    user = "m.ciccone"
  )

  flights_delayed_from_mock <- tbl(con_psql, "flights") %>%
    filter(!is.na(tailnum)) %>%
    filter(arr_delay >= 180) %>% 
    select(tailnum) %>%
    distinct() %>%
    collect()
  
  flights_delayed_from_mock
})
## # A tibble: 1,883 x 1
##    tailnum
##    <chr>  
##  1 N912XJ 
##  2 N645JB 
##  3 N904WN 
##  4 N3BWAA 
##  5 N3CJAA 
##  6 N14972 
##  7 N667UA 
##  8 N998AT 
##  9 N521JB 
## 10 N16559 
## # … with 1,873 more rows
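As a quick sanity check (assuming flights_delayed from the recording session is still in scope), the replayed result should match the live one:

# the mocked replay should reproduce the recorded result exactly;
# we expect this to be TRUE
identical(flights_delayed, flights_delayed_from_mock)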

One thing to note is that when using dbplyr, we need to be careful to wrap the entire interaction with the database in with_mock_db() if we are taking advantage of dbplyr’s lazy evaluation (the default), and to use collect() to return the results when we want them recorded. Because dbplyr waits until the last possible second to request the data, if there is no collect() call (or a call that will implicitly send the query), no query is executed, and {dittodb} won’t be able to record the response.
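To illustrate the pitfall, here is a sketch (not from the original analysis) of what not to do: the pipeline below is only built, never executed, inside the block, so there is no query for {dittodb} to record or replay:

with_mock_db({
  con_psql <- DBI::dbConnect(
    RPostgreSQL::PostgreSQL(),
    dbname = "travelling",
    host = "127.0.0.1",
    user = "m.ciccone"
  )

  # this only builds the SQL; without collect(), the full query is
  # never sent, so {dittodb} has no response to play back
  lazy_delays <- tbl(con_psql, "flights") %>%
    filter(arr_delay >= 180)
})

# too late: collecting out here won't work, because mocking is no
# longer active and the connection was never a real one
# collect(lazy_delays)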