A dm object can be created from any database that has a {DBI} backend (see the list of available backends). When a dm object is created, it can import all the tables in the database, only those in the active schema, or a limited set. For some RDBMS, such as Postgres and SQL Server, primary and foreign keys are also imported and do not have to be added manually afterwards.
There is a financial dataset that contains loan data along with relevant account information and transactions. We chose this dataset because the relationships between the loans, accounts, and transactions tables are a good representation of databases that record real-world business transactions. The repository uses a MariaDB server, for which {dm} does not currently import primary or foreign keys, so we will need to add them ourselves.
The dataset is available on a database server that is publicly accessible without registration. Connection details will vary for your own database; see vignette("DBI", package = "DBI") for further information.
library(RMariaDB)
my_db <- dbConnect(
  MariaDB(),
  username = "guest",
  password = "relational",
  dbname = "Financial_ijs",
  host = "relational.fit.cvut.cz"
)
Creating a dm object takes a single call to dm_from_src() with the DBI connection object as its argument.
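For example, using the connection created above (the printout that follows is its result):

library(dm)

my_dm <- dm_from_src(my_db)
my_dm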
#> ── Table source ───────────────────────────────────────────────────────────
#> src: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total)
#> Columns: 57
#> Primary keys: 0
#> Foreign keys: 0
The components of the my_dm object are lazy tables powered by {dbplyr}. {dbplyr} translates the {dplyr} grammar of data manipulation into queries the database server understands. Lazy tables defer downloading of table data until results are required for printing or local processing.
A dm can also be constructed from individual tables or views. This is useful when you want to work with a subset of a database’s tables, perhaps from different schemas. Below we use the tbl() function from {dplyr} to extract two tables from the financial database and then create our dm by passing the tables in as arguments. Note that all of the tables have to come from the same source.
dbListTables(my_db)
#> [1] "accounts" "cards" "clients" "disps" "districts" "loans"
#> [7] "orders" "tkeys" "trans"
library(dbplyr)
loans <- tbl(my_db, "loans")
accounts <- tbl(my_db, "accounts")
my_manual_dm <- dm(loans, accounts)
my_manual_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`
#> Columns: 11
#> Primary keys: 0
#> Foreign keys: 0
Primary keys and foreign keys are how relational database tables are linked with each other. A primary key is a column that has a unique value for each row within a table. A foreign key is a column containing the primary key of a row in another table. Foreign keys act as cross-references between tables: they specify the relationships that give us the relational database. For more information on keys and a crash course on databases, see vignette("howto-dm-theory").
The model diagram provided by our test database loosely illustrates the intended relationships between tables. In the diagram we can see that the loans table should be linked to the accounts table. Below we create those links in 3 steps:
1. Add a primary key id to the accounts table
2. Add a primary key id to the loans table
3. Add a foreign key account_id to the loans table referencing the accounts table
Then we assign colors to the tables and draw the structure of the dm.
Note that when the foreign key is created, the primary key in the referenced table does not need to be specified, but it must already be defined. And, as mentioned above, primary and foreign key constraints on the database are currently only imported for Postgres and SQL Server databases, and only when dm_from_src() is used; for other databases the keys need to be defined manually.
my_dm_keys <-
  my_manual_dm %>%
  dm_add_pk(accounts, id) %>%
  dm_add_pk(loans, id) %>%
  dm_add_fk(loans, account_id, accounts) %>%
  dm_set_colors(green = loans, orange = accounts)

my_dm_keys %>%
  dm_draw()
Once you have instantiated a dm object, you can continue to add tables to it. For tables from the original source of the dm, use dm_add_tbl(), as sketched below.
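Here we pull the trans table from the same connection with tbl() and add it; the printout that follows is the result of this (unstored) call:

trans <- tbl(my_db, "trans")

my_dm_keys %>%
  dm_add_tbl(trans)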
#> ── Table source ───────────────────────────────────────────────────────────
#> src: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`, `trans`
#> Columns: 21
#> Primary keys: 2
#> Foreign keys: 1
For tables from other sources or from the local environment, dplyr::copy_to() is used. copy_to() is discussed later in this article.
Like other R objects, a dm is immutable and all operations performed on it are transient unless stored in a new variable.
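For example, printing my_dm_keys after the dm_add_tbl() call above shows that the original object still contains only the two tables:

my_dm_keys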
#> ── Table source ───────────────────────────────────────────────────────────
#> src: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`
#> Columns: 11
#> Primary keys: 2
#> Foreign keys: 1
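To keep the additional table, store the result in a new variable (the name my_dm_trans below is ours):

my_dm_trans <-
  my_dm_keys %>%
  dm_add_tbl(trans)

my_dm_trans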
#> ── Table source ───────────────────────────────────────────────────────────
#> src: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`, `trans`
#> Columns: 21
#> Primary keys: 2
#> Foreign keys: 1
And, like {dbplyr}, results are never written to a database unless explicitly requested.
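For example, flattening the loans table joins it to accounts lazily on the server; nothing is written back:

my_dm_keys %>%
  dm_flatten_to_tbl(loans)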
#> # Source: lazy query [?? x 10]
#> # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> id account_id loans.date amount duration payments status district_id
#> <int> <int> <date> <dbl> <int> <dbl> <chr> <int>
#> 1 4959 2 1994-01-05 80952 24 3373 A 1
#> 2 4961 19 1996-04-29 30276 12 2523 B 21
#> 3 4962 25 1997-12-08 30276 12 2523 A 68
#> 4 4967 37 1998-10-14 318480 60 5308 D 20
#> 5 4968 38 1998-04-19 110736 48 2307 C 19
#> 6 4973 67 1996-05-02 165960 24 6915 A 16
#> 7 4986 97 1997-08-10 102876 12 8573 A 74
#> 8 4988 103 1997-12-06 265320 36 7370 D 44
#> 9 4989 105 1998-12-05 352704 48 7348 C 21
#> 10 4990 110 1997-09-08 162576 36 4516 C 36
#> # … with more rows, and 2 more variables: frequency <chr>,
#> # accounts.date <date>
my_dm_keys %>%
  dm_flatten_to_tbl(loans) %>%
  sql_render()
#> Renamed columns:
#> * date -> loans.date, accounts.date
#> <SQL> SELECT `LHS`.`id` AS `id`, `LHS`.`account_id` AS `account_id`, `LHS`.`loans.date` AS `loans.date`, `LHS`.`amount` AS `amount`, `LHS`.`duration` AS `duration`, `LHS`.`payments` AS `payments`, `LHS`.`status` AS `status`, `RHS`.`district_id` AS `district_id`, `RHS`.`frequency` AS `frequency`, `RHS`.`accounts.date` AS `accounts.date`
#> FROM (SELECT `id`, `account_id`, `date` AS `loans.date`, `amount`, `duration`, `payments`, `status`
#> FROM `loans`) `LHS`
#> LEFT JOIN (SELECT `id`, `district_id`, `frequency`, `date` AS `accounts.date`
#> FROM `accounts`) `RHS`
#> ON (`LHS`.`account_id` = `RHS`.`id`)
As the dm is a collection of tables, if we wish to perform operations on an individual table, we set it as the context for those operations using dm_zoom_to(). See vignette("tech-dm-zoom") for more detail on zooming.
dm operations are transient unless persistence is explicitly requested. To make our chain of manipulations on the selected table permanent, we assign the result of dm_insert_zoomed() to a new object, my_dm_total. This is a new dm object, derived from my_dm_keys, with a new lazy table total_loans linked to the accounts table. The subsequent section describes how to materialize the results on the database.
my_dm_total <-
  my_dm_keys %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  dm_insert_zoomed("total_loans")

my_dm_total$total_loans
#> # Source: lazy query [?? x 2]
#> # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> account_id total_amount
#> <int> <dbl>
#> 1 2 80952
#> 2 19 30276
#> 3 25 30276
#> 4 37 318480
#> 5 38 110736
#> 6 67 165960
#> 7 97 102876
#> 8 103 265320
#> 9 105 352704
#> 10 110 162576
#> # … with more rows
After adding columns to link tables or calculating summaries of the data, we might want these changes to persist in the database.
To force a dm to execute the SQL queries generated by the operations performed on it, we call the dplyr::compute() method on the dm object. compute() forces the computation of a query; in this case, the query or queries created by the dm to represent all operations that have been performed but not yet evaluated. The results of compute() are stored in temporary tables on the database server and will be deleted when your session ends. If you want results to persist across sessions in permanent tables, compute() must be called with the argument temporary = FALSE and a table name for the name argument. See the compute() documentation for more details.
Calling compute() requires write access; otherwise an error is returned. Our example database server, relational.fit.cvut.cz, does not allow write access. As a workaround to demonstrate the use of compute(), we will use dm_financial_sqlite(), a convenience function that copies the dm from the remote RDBMS to a local SQLite database we can write to.
my_dm_sqlite <- dm_financial_sqlite()
my_dm_total <-
  my_dm_sqlite %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  dm_insert_zoomed("total_loans")
Two {dplyr} verbs have been implemented for dm objects. compute(), as mentioned above, materializes all tables into new (temporary or persistent) tables.
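A minimal sketch of how the printout below might be produced (the intermediate name my_dm_computed is ours):

my_dm_computed <-
  my_dm_total %>%
  compute()

my_dm_computed$total_loans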
#> # Source: table<dbplyr_049> [?? x 2]
#> # Database: sqlite 3.30.1 []
#> account_id total_amount
#> <int> <dbl>
#> 1 2 80952
#> 2 19 30276
#> 3 25 30276
#> 4 37 318480
#> 5 38 110736
#> 6 67 165960
#> 7 97 102876
#> 8 103 265320
#> 9 105 352704
#> 10 110 162576
#> # … with more rows
collect() downloads all tables to local data frames.
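Again a sketch, using the same dm as above (my_dm_local is our name):

my_dm_local <-
  my_dm_total %>%
  collect()

my_dm_local$total_loans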
#> # A tibble: 682 x 2
#> account_id total_amount
#> <int> <dbl>
#> 1 2 80952
#> 2 19 30276
#> 3 25 30276
#> 4 37 318480
#> 5 38 110736
#> 6 67 165960
#> 7 97 102876
#> 8 103 265320
#> 9 105 352704
#> 10 110 162576
#> # … with 672 more rows
There is a third verb, {dbplyr}'s collapse(), that has not yet been implemented for dm objects: it forces generation of the SQL query instead of its computation (#304).
compute() also works for a zoomed dm. Calling it during a chain of operations will execute all relevant SQL queries up to that point; operations after the compute() will use the generated results, as in the sketch below.
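A hypothetical chain (the filter threshold and the table name big_loans_by_status are ours; it assumes a writable source such as my_dm_sqlite):

my_dm_sqlite %>%
  dm_zoom_to(loans) %>%
  filter(amount > 100000) %>%
  compute() %>%                      # materializes the filtered rows on the database
  group_by(status) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  dm_insert_zoomed("big_loans_by_status")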
To deploy a dm, we must copy it to an RDBMS. This is done using copy_dm_to(), which is used behind the scenes by our dm_financial_sqlite() function, the code for which appears below.
dm_financial_sqlite
#> Memoised Function:
#> function ()
#> {
#> stopifnot(rlang::is_installed("RSQLite"))
#> my_dm <- dm_financial() %>% dm_select_tbl(-trans)
#> sqlite_db <- DBI::dbConnect(RSQLite::SQLite())
#> sqlite_dm <- copy_dm_to(sqlite_db, my_dm, temporary = FALSE)
#> sqlite_dm
#> }
#> <bytecode: 0x55f922b35860>
#> <environment: namespace:dm>
As demonstrated, copy_dm_to() takes as arguments a {DBI} connection (the destination), the dm we wish to deploy, and here the temporary argument set to FALSE, because this is a deployment and we want it to be permanent. The dm is copied to the DBI connection, and a new dm with that connection as its source is returned.
The default behavior is to create temporary tables and, where possible (currently Postgres and SQL Server), to set any key constraints.
If you need to add local data frames to an existing database, the shortest path is to use copy_to(). It takes the same arguments as copy_dm_to(), except that the second argument expects a data frame instead of a dm.
The example below estimates a linear model from attributes in the loans and districts tables, inserts the residuals into the database, and links them to the loans table.
loans_df <-
  my_dm_sqlite %>%
  dm_squash_to_tbl(loans) %>%
  select(id, amount, duration, A3) %>%
  collect()
#> Renamed columns:
#> * date -> loans.date, accounts.date
model <- lm(amount ~ duration + A3, data = loans_df)
loans_residuals <- tibble::tibble(
  id = loans_df$id,
  resid = unname(residuals(model))
)
my_dm_sqlite_resid <-
  copy_to(my_dm_sqlite, loans_residuals, temporary = FALSE) %>%
  dm_add_pk(loans_residuals, id) %>%
  dm_add_fk(loans_residuals, id, loans)

my_dm_sqlite_resid %>%
  dm_draw()
#> # Source: table<loans_residuals_2020_07_29_23_34_47_1> [?? x 2]
#> # Database: sqlite 3.30.1 []
#> id resid
#> <int> <dbl>
#> 1 4959 -31912.
#> 2 4961 -27336.
#> 3 4962 -30699.
#> 4 4967 63621.
#> 5 4968 -94811.
#> 6 4973 59036.
#> 7 4986 41901.
#> 8 4988 123392.
#> 9 4989 147157.
#> 10 4990 33377.
#> # … with more rows
In this tutorial we have demonstrated how to use {dm} to work with existing databases and how to construct your own dm from individual tables, including specifying key constraints, and then deploy it to an RDBMS. For an overview of all the methods available in {dm}, please see the reference documentation.