Getting started with dm

Creating a dm object

dm objects can be created from individual tables or loaded directly from a relational data model on an RDBMS (relational database management system).

For this demonstration we’re going work with a model hosted on a public server. The first thing we need is a connection to the RDBMS hosting the data.

library(RMariaDB)

fin_db <- dbConnect(
  MariaDB(),
  username = 'guest',
  password = 'relational',
  dbname = 'Financial_ijs',
  host = 'relational.fit.cvut.cz'
)

We create a dm object from an RDBMS using dm_from_src(), passing in the connection object we just created as the first argument.

library(dm)

fin_dm <- dm_from_src(fin_db)
fin_dm

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total)
#> Columns: 57
#> Primary keys: 0
#> Foreign keys: 0

The dm object interrogates the RDBMS for table and column information and, where implemented, primary and foreign keys. Currently, primary and foreign keys are only available from Postgres and SQL Server.

Selecting tables

The dm object can be accessed like a named list of tables:

names(fin_dm)
#> [1] "accounts"  "cards"     "clients"   "disps"     "districts" "loans"    
#> [7] "orders"    "tkeys"     "trans"
fin_dm$loans

#> # Source:   table<`loans`> [?? x 7]
#> # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#>       id account_id date       amount duration payments status
#>    <int>      <int> <date>      <dbl>    <int>    <dbl> <chr> 
#>  1  4959          2 1994-01-05  80952       24     3373 A     
#>  2  4961         19 1996-04-29  30276       12     2523 B     
#>  3  4962         25 1997-12-08  30276       12     2523 A     
#>  4  4967         37 1998-10-14 318480       60     5308 D     
#>  5  4968         38 1998-04-19 110736       48     2307 C     
#>  6  4973         67 1996-05-02 165960       24     6915 A     
#>  7  4986         97 1997-08-10 102876       12     8573 A     
#>  8  4988        103 1997-12-06 265320       36     7370 D     
#>  9  4989        105 1998-12-05 352704       48     7348 C     
#> 10  4990        110 1997-09-08 162576       36     4516 C     
#> # … with more rows

dplyr::count(fin_dm$trans)

#> # Source:   lazy query [?? x 1]
#> # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#>         n
#>   <int64>
#> 1 1056320

At the same time, most dm functions are pipe-friendly and support tidy evaluation. We can use [ or the dm_select_tbl() verb to derive a smaller dm with the loans, accounts, districts and trans tables:

fin_dm_small <- fin_dm[c("loans", "accounts", "districts", "trans")]
fin_dm_small <-
  fin_dm %>%
  dm_select_tbl(loans, accounts, districts, trans)

Linking tables by adding keys

In many cases, dm_from_src() already returns a dm with all keys set. If not, dm allows us to define primary and foreign keys ourselves.

In our data model, id columns uniquely identify records in the accounts and loans table, a primary key is added with dm_add_pk(). Each loan is linked to one account via the account_id column in the loans table, the relationship is established with dm_add_fk().

fin_dm_keys <-
  fin_dm_small %>%
  dm_add_pk(accounts, id) %>%
  dm_add_pk(loans, id) %>%
  dm_add_fk(loans, account_id, accounts) %>%
  dm_add_pk(trans, id) %>%
  dm_add_fk(trans, account_id, accounts) %>%
  dm_add_pk(districts, id) %>%
  dm_add_fk(accounts, district_id, districts)

Visualizing a data model

Having a diagram of the data model is the quickest way to verify we’re on the right track. We can display a visual summary of the dm at any time. The default is to display the table name, any defined keys and their links to other tables.

Visualizing the dm in its current state we see the keys we have created and how they link the tables together. Color guides the eye.

fin_dm_keys %>%
  dm_set_colors(green = c(loans, accounts), darkblue = trans, grey = districts) %>%
  dm_draw()

Accessing a data model as a table

If we want to perform modeling or analysis on this relational model we need to transform it into a tabular format that R functions can work with. dm_squash_to_tbl() will automatically follow foreign keys across tables to gather all the available columns into a single table.

fin_dm_keys %>%
  dm_squash_to_tbl(loans)
#> Renamed columns:
#> * date -> loans.date, accounts.date

#> # Source:   lazy query [?? x 25]
#> # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#>       id account_id loans.date amount duration payments status district_id
#>    <int>      <int> <date>      <dbl>    <int>    <dbl> <chr>        <int>
#>  1  4959          2 1994-01-05  80952       24     3373 A                1
#>  2  4961         19 1996-04-29  30276       12     2523 B               21
#>  3  4962         25 1997-12-08  30276       12     2523 A               68
#>  4  4967         37 1998-10-14 318480       60     5308 D               20
#>  5  4968         38 1998-04-19 110736       48     2307 C               19
#>  6  4973         67 1996-05-02 165960       24     6915 A               16
#>  7  4986         97 1997-08-10 102876       12     8573 A               74
#>  8  4988        103 1997-12-06 265320       36     7370 D               44
#>  9  4989        105 1998-12-05 352704       48     7348 C               21
#> 10  4990        110 1997-09-08 162576       36     4516 C               36
#> # … with more rows, and 17 more variables: frequency <chr>,
#> #   accounts.date <date>, A2 <chr>, A3 <chr>, A4 <int>, A5 <int>,
#> #   A6 <int>, A7 <int>, A8 <int>, A9 <int>, A10 <dbl>, A11 <int>,
#> #   A12 <dbl>, A13 <dbl>, A14 <int>, A15 <int>, A16 <int>

Apart from the rows printed above, no data has been fetched from the database. Use select() to reduce the number of columns fetched, and collect() to retrieve the entire result for local processing.

loans_df <-
  fin_dm_keys %>%
  dm_squash_to_tbl(loans) %>%
  select(id, amount, duration, A3) %>%
  collect()
#> Renamed columns:
#> * date -> loans.date, accounts.date

model <- lm(amount ~ duration + A3, data = loans_df)

model
#> 
#> Call:
#> lm(formula = amount ~ duration + A3, data = loans_df)
#> 
#> Coefficients:
#>       (Intercept)           duration  A3central Bohemia  
#>             14241               4109              -4044  
#>    A3east Bohemia    A3north Bohemia    A3north Moravia  
#>            -20248             -32977              -2577  
#>   A3south Bohemia    A3south Moravia     A3west Bohemia  
#>             -5940             -16507             -32616

Operations on table data within a dm

We don’t need to take the extra step of exporting the data to work with it. Through the dm object we have complete access to dplyr’s data manipulation verbs. These operate on the data within individual tables.

To work with a particular table we use dm_zoom_to() to set the context to our chosen table. Then we can perform any of the operations we want.

fin_dm_total <-
  fin_dm_keys %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  dm_insert_zoomed("total_loans")

fin_dm_total$total_loans

#> # Source:   lazy query [?? x 2]
#> # Database: mysql [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#>    account_id total_amount
#>         <int>        <dbl>
#>  1          2        80952
#>  2         19        30276
#>  3         25        30276
#>  4         37       318480
#>  5         38       110736
#>  6         67       165960
#>  7         97       102876
#>  8        103       265320
#>  9        105       352704
#> 10        110       162576
#> # … with more rows

Note that in the above example we use dm_insert_zoomed() to add the results as a new table to our data model. This table is temporary and will be deleted when our session ends. If you want to make permanent changes to your data model on an RDBMS please see the “Persisting results” section in vignette("howto-dm-db").

Checking constraints

It’s always smart to check that your data model follows its specifications. When building our own model or changing existing models by adding tables or keys, it is even more important that the new model is validated.

dm_examine_constrains() checks all primary and foreign keys and reports if they violate their expected constraints.

fin_dm_total %>%
  dm_examine_constraints()

#> ℹ All constraints satisfied.

For more on constraint checking, including cardinality, finding candidate columns for keys, and normalization, see vignette("tech-dm-low-level").

Next Steps

Now that you have been introduced to the basic operation of dm, the next step is to learn more about the dm methods that your particular use case requires.

Is your data in an RDBMS? Then move on to vignette("howto-dm-db") for a more detailed looking at working with an existing relational data model.

If your data is in data frames, then you want to read vignette("howto-dm-df") next.

If you feel you need to know more about relational data models in order to get the most out of dm, check out vignette("howto-dm-theory").

If you’re familiar with relational data models but want to know how to work with them in dm, then any of vignette("tech-dm-join"), vignette("tech-dm-filter"), or vignette("tech-dm-zoom") is a good next step.

Standing on the shoulders of giants

The {dm} package follows the tidyverse principles:

dm objects are immutable (your data will never be overwritten in place)
most functions used on dm objects are pipeable (i.e., return new dm or table objects)
tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!

The {polyply} package has a similar intent with a slightly different interface.

The {data.cube} package has quite the same intent using array-like interface.

Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.

The {tidygraph} package stores a network as two related tables of nodes and edges, compatible with {dplyr} workflows.

In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.