dm allows you to create your own relational data models from local data frames. Once your data model is complete, you can deploy it a range of DBMSs using dm.
Creating a dm object from data frames
The example data set that we will be using is available through the nycflights13 package. The five tables that we are working with contain information about all flights that departed from the airports of New York to other destinations in the United States in 2013:
flights represents the trips taken by planes
airlines includes
the names of transport organizations (name)
their abbreviated codes (carrier)
airports indicates the ports of departure (origin) and of destination (dest)
weather contains meteorological information at each hour
planes describes characteristics of the aircraft
Once we’ve loaded {nycflights13}, the aforementioned tables are all in our work environment, ready to be accessed.
library(nycflights13)airports
#> # A tibble: 1,458 x 8
#> faa name lat lon alt tz dst tzone
#> <chr><chr><dbl><dbl><dbl><dbl><chr><chr>
#> 1 04G Lansdowne Airport 41.1 -80.61044 -5 A America/New_…
#> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_…
#> 6 0A9 Elizabethton Municip… 36.4 -82.21593 -5 A America/New_…
#> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_…
#> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_…
#> 9 0P2 Shoestring Aviation … 39.8 -76.61000 -5 U America/New_…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_…
#> # … with 1,448 more rows
Your own data will probably not be available as an R package. Whatever format it is in, you will need to be able to load it as data frames into your R session. If the data is too large, consider using dm to connect to the database instead. See vignette("howto-dm-db") for details on using dm with databases.
Adding Tables
Our first step will be to tell dm which tables we want to work with and how they are connected. For that we can use dm(), passing in the table names as arguments.
The as_dm() function is an alternative that works if you already have a list of tables.
A dm is a list
dm objects behave like lists with a user- and console-friendly print format. In fact, using a dm as a nicer list for organizing your data frames in your environment is an easy first step towards using dm and its data modeling functionality.
Member referencing, by subscript and by name, and list and slice manipulation functions work just as you would expect on a dm object.
Even though we now have a dm object that contains all our data, we have not specified how our five tables are connected. To do this we need to define primary keys and foreign keys on the tables.
Primary keys and foreign keys are how relational database tables are linked with each other. A primary key is a column that has a unique value for each row within a table. A foreign key is a column containing the primary key for a row in another table.1 Foreign keys act as cross references between tables. They specify the relationships that gives us the relational database. For more information on keys and a crash course on databases, see vignette("howto-dm-theory").
Primary Keys
dm offers dm_enum_pk_candidates() to identify viable primary keys, and dm_add_pk() to add them.
Note that we demonstrate both named and positional arguments above.
Foreign Keys
To define how our tables are related, we use dm_add_fk() to add foreign keys. In calling dm_add_fk() the table argument is the table that needs a foreign key to link it to a second table. ref_table is the table to be linked to and it needs a primary key already defined for it.
#> # A tibble: 19 x 3
#> columns candidate why
#> <keys><lgl><chr>
#> 1 carrier TRUE ""
#> 2 tailnum FALSE "334264 entries (99.3%) of `flights$tailnum` not…
#> 3 dest FALSE "336776 entries (100%) of `flights$dest` not in …
#> 4 origin FALSE "336776 entries (100%) of `flights$origin` not i…
#> 5 air_time FALSE "Can't join on `x$value` x `y$value` because of …
#> 6 arr_delay FALSE "Can't join on `x$value` x `y$value` because of …
#> 7 arr_time FALSE "Can't join on `x$value` x `y$value` because of …
#> 8 day FALSE "Can't join on `x$value` x `y$value` because of …
#> 9 dep_delay FALSE "Can't join on `x$value` x `y$value` because of …
#> 10 dep_time FALSE "Can't join on `x$value` x `y$value` because of …
#> 11 distance FALSE "Can't join on `x$value` x `y$value` because of …
#> 12 flight FALSE "Can't join on `x$value` x `y$value` because of …
#> 13 hour FALSE "Can't join on `x$value` x `y$value` because of …
#> 14 minute FALSE "Can't join on `x$value` x `y$value` because of …
#> 15 month FALSE "Can't join on `x$value` x `y$value` because of …
#> 16 sched_arr_t… FALSE "Can't join on `x$value` x `y$value` because of …
#> 17 sched_dep_t… FALSE "Can't join on `x$value` x `y$value` because of …
#> 18 time_hour FALSE "Can't join on `x$value` x `y$value` because of …
#> 19 year FALSE "Can't join on `x$value` x `y$value` because of …
Having chosen a column from the successful candidates provided by dm_enum_fk_candidates(), we use the dm_add_fk() function to establish the foreign key linking the tables. In the second call to dm_add_fk() we complete the process for the flights and airlines tables that we started above. The carrier column in the airlines table will be added as the foreign key in flights.
Having created the required primary and foreign keys to link all the tables together, we now have a relational data model we can work with.
Visualization
Visualizing a data model is a quick and easy way to verify that we have successfully created the model we were aiming for. We can use dm_draw() at any stage of the process to generate a visual representation of the tables and the links between them: