Using etl

Ben Baumer

2020-06-01

etl is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.

Using etl

Instantiate an etl object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars database is built into etl.

library(etl)
cars <- etl("mtcars")
## No database was specified so I created one for you at:
## /tmp/RtmpCrveYY/file52e2163267e8.sqlite3
class(cars)
## [1] "etl_mtcars"           "etl"                  "src_SQLiteConnection"
## [4] "src_dbi"              "src_sql"              "src"

Pay careful attention to where the SQLite database is stored. The default location is a temporary directory, but you will want to move this to a more secure location if you want this storage to be persistent. See file.copy() for examples on how to move a file.

Connect to a local or remote database

etl works with a local or remote database to store your data. Every etl object extends a dplyr::src_dbi object. If, as in the example above, you do not specify a SQL source, a local RSQLite database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi.

Note: If you want to use a database other than a local RSQLite, you must create the mtcars database and have permission to write to it first!

# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")

# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)

At the heart of etl are three functions: etl_extract(), etl_transform(), and etl_load().

Extract

The first step is to acquire data from an online source.

cars %>%
  etl_extract()
## Extracting raw data...

This creates a local store of raw data.

Transform

These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).

cars %>%
  etl_transform()

Load

Populate the SQL database with the transformed data.

cars %>%
  etl_load()
## Loading 12 file(s) into the database...

Do it all at once

To populate the whole database from scratch, use etl_create.

cars %>%
  etl_create()
## Initializing DB using SQL script init.sqlite
## Extracting raw data...
## Loading 12 file(s) into the database...

You can also update an existing database without re-initializing, but watch out for primary key collisions.

cars %>%
  etl_update()

Step-by-step

Under the hood, there are three functions that etl_update chains together:

getS3method("etl_update", "default")
## function (obj, ...) 
## {
##     obj <- obj %>% etl_extract(...) %>% etl_transform(...) %>% 
##         etl_load(...)
##     invisible(obj)
## }
## <bytecode: 0x55bf4502f260>
## <environment: namespace:etl>

etl_create is simply a call to etl_update that forces the SQL database to be written from scratch.

getS3method("etl_create", "default")
## function (obj, ...) 
## {
##     obj <- obj %>% etl_init(...) %>% etl_update(...) %>% etl_cleanup(...)
##     invisible(obj)
## }
## <bytecode: 0x55bf425cf968>
## <environment: namespace:etl>

Do Your Analysis

Now that your database is populated, you can work with it as a src data table just like any other dplyr source.

cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarise(N = n(), mean_mpg = mean(mpg))
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.30.1 [/tmp/RtmpCrveYY/file52e2163267e8.sqlite3]
##     cyl     N mean_mpg
##   <int> <int>    <dbl>
## 1     4    11     26.7
## 2     6     7     19.7
## 3     8    14     15.1

Extending etl

Create your own ETL packages

Suppose you want to create your own ETL package called pkgname. All you have to do is write a package that requires etl, and then you have to write one S3 methods:

etl_extract.etl_pkgname()

You may also wish to write

etl_transform.etl_pkgname()
etl_load.etl_pkgname()

All of these functions must take and return an object of class etl_pkgname that inherits from etl. Please see the “Extending etl” vignette for more information.

Use other ETL packages

Packages that use the etl framework are available on CRAN and/or GitHub:

tools::dependsOnPkgs("etl")
## [1] "airlines"  "macleish"  "retro"     "statcastr"