fhircrackr: Handling HL7® FHIR® resources in R

2020-07-22

Introduction

fhircrackr is a package designed to help analyzing HL7 FHIR1 resources.

FHIR stands for Fast Healthcare Interoperability Resources and is a standard describing data formats and elements (known as “resources”) as well as an application programming interface (API) for exchanging electronic health records. The standard was created by the Health Level Seven International (HL7) health-care standards organization. For more information on the FHIR standard, visit https://www.hl7.org/fhir/.

While FHIR is a very useful standard to describe and exchange medical data in an interoperable way, it is not at all useful for statistical analyses of data. This is due to the fact that FHIR data is stored in many nested and interlinked resources instead of matrix-like structures.

Thus, to be able to do statistical analyses a tool is needed that allows converting these nested resources into data frames. This process of tabulating FHIR resources is not trivial, as the unpredictable degree of nesting and connectedness of the resources makes generic solutions to this problem not feasible.

We therefore implemented a package that makes it possible to download FHIR resources from a server into R and to tabulate these resources into (multiple) data frames.

The package is still under development. The CRAN version of the package contains all functions that are already stable, for more recent (but potentially unstable) developments, the development version of the package can be downloaded from GitHub using devtools::install_github("POLAR-fhiR/fhircrackr").

Prerequisites

The complexity of the problem requires a couple of prerequisites both regarding knowledge and access to data. We will shortly list the preconditions to using the fhircrackr package here:

  1. First of all, you need the endpoint of the FHIR server you want to access. If you don’t have your own FHIR server, you can use one of the publicly available servers, such as https://hapi.fhir.org/baseR4 or http://fhir.hl7.de:8080/baseDstu3. The endpoint of a FHIR server is often referred to as [base].

  2. To download resources from the server, you should be familiar with FHIR search requests. FHIR search allows you to download sets of resources that match very specific requirements. As the focus of this package is dealing with FHIR resources in R rather than the intricacies of FHIR search, we will mostly use simple examples of FHIR search requests. Most of them will have the form [base]/[type]?parameter(s), where [type] refers to the type of resource you are looking for and parameter(s) characterize specific properties those resources should have. https://hapi.fhir.org/baseR4/Patient?gender=female for example downloads all Patient resources from the FHIR server at https://hapi.fhir.org/baseR4/ that represent female patients.

  3. In the first step, fhircrackr downloads the resources in xml format into R. To specify which elements from the FHIR resources you want in your data frame, you should have at least some familiarity with XPath expressions. A good tutorial on XPath expressions can be found here.

In the following we’ll go through a typical workflow with fhircrackr step by step.

Download and flatten FHIR resources from a server

Example 1: Download Patient resources

We will start with a very simple example and use fhir_search() to download Patient resources from a publicly available HAPI server after we’ve loaded the package with library(fhircrackr):

The minimum information fhir_search() requires is a string containing the full FHIR search request in the argument request. In general, a FHIR search request returns a bundle of the resources you requested. If there are a lot of resources matching your request, the search result isn’t returned in one big bundle but distributed over several of them. If the argument max_bundles is set to its default Inf, fhir_search() will return all available bundles, meaning all resources matching your request. If you set it to 2 as in the example above, the download will stop after the first two bundles. Note that in this case, the result may not contain all the resources from the server matching your request.

If you want to connect to a FHIR server that uses basic authentication, you can supply the arguments username and password.

Because endpoints can sometimes be hard to reach, fhir_search() will start five attempts to connect to the endpoint before it gives up. With the arguments max_attempts and delay_between_attempts you can control this number as well the time interval between attempts.

As you can see in the next block of code, fhir_search() returns a list of xml objects where each list element represents one bundle of resources, so a list of two xml objects in our case:

If for some reason you cannot connect to a FHIR server at the moment but want to explore the following functions anyway, the package provides an example list of bundles containing Patient and MedicationStatement resources. See ?medication_bundles for how to use it.

Now we know that inside these xml objects there is the patient data somewhere. To get it out, we will use fhir_crack(). The most important argument fhir_crack() takes is bundles, the list of bundles that is returned by fhir_search(). The second important argument is design, an object that tells the function which data to extract from the bundle. fhir_crack() returns a list of data.frames.

In general, design has to be a named list containing one element per data frame that will be created. The element names of design are going to be the names of the resulting data frames. It usually makes sense to create one data frame per type of resource. Because we have just downloaded resources of the type Patient, the design here would be a list of length 1.

The elements of design are lists themselves. They can be of length 1 or length 2, depending on the level of precision in extracting the attributes. There are three levels of precision in extracting the data for our data frame with fhir_crack():

1. Extract all available attributes

If we want to extract all available attributes, the list describing the data frames inside design is a list of length 1, containing only an Xpath expression to the resource type we want to extract:

Note that this can easily become a rather wide and sparse data frame. This is due to the fact that every attribute appearing in at least one of the resources will be turned into a variable (i.e. column), even if none of the other resources contain this attribute. For those resources, the value on that attribute will be set to NA. Depending on the variability of the resources, the resulting data frame can contain a lot of NA values. If a resource has multiple entries for an attribute, these are pasted together using the string provided in the argument sep as a separator. The column names in this option are automatically generated by pasting together the path to the respective attribute, e.g. name.given.value

2. Extract all attributes at certain levels

We can extract all attributes that are found on certain levels of the resource, if we specify this level in an XPath expression and provide it as the second element of the list describing the data.frame:

"./*" for example tells fhir_crack() to extract all attributes that are located (exactly) one level below the Patient’s level given by the XPath expression //Patient. In this case, the column names are still automatically generated. Combinations of XPath expressions with conditions are also allowed.

3. Extract specific attributes

If we know exactly which attributes we want to extract, we can specify them in a named list that we provide as the second element of the list describing the data.frame:

This option will usually return the most tidy and clear data frames, because you have full control over the extracted columns including their name in the resulting data frame. You should always extract the resource id, because this is used to link to other resources you might also extract. If you are not sure which attributes are available or where they are located in the resource, it can be helpful to start by extracting all available attributes. Then you can get an overview over the available attributes and their location and continue by doing a second, more targeted extraction to get your final data frame.

Of course the previous example is using just one resource type. If you are interested in several types of resources, design will have more elements and the result will be a list of several data frames.

The abstract form design should therefore have is:

Example 2: Download MedicationStatement and corresponding Patient resources

In reality your FHIR search requests are probably going to be slightly more complex than just asking for Patient resources. Consider the following example where we want to download MedicationStatements referring to a certain medication we specify with its snomed code and also the Patient resources these MedicationStatements are linked to.

When the FHIR search request gets longer, it can be helpful to build up the request piece by piece like this:

Then we can download the resources:

And convert them into data frames, one for the MedicationStatements and one for the Patients:

design <- list(

    MedicationStatement = list(

        "//MedicationStatement",

        list(
            MS.ID              = "id",
            STATUS.TEXT        = "text/status",
            STATUS             = "status",
            MEDICATION.SYSTEM  = "medicationCodeableConcept/coding/system",
            MEDICATION.CODE    = "medicationCodeableConcept/coding/code",
            MEDICATION.DISPLAY = "medicationCodeableConcept/coding/display",
            DOSAGE             = "dosage/text",
            PATIENT            = "subject/reference",
            LAST.UPDATE        = "meta/lastUpdated"
        )
    ),

    Patients = list(

        "//Patient",
        "./*"
    )
)


list_of_tables <- fhir_crack(medication_bundles, design, verbose = 0)

head(list_of_tables$MedicationStatement)
#>   MS.ID STATUS.TEXT STATUS     MEDICATION.SYSTEM MEDICATION.CODE
#> 1 30233   generated active http://snomed.info/ct       429374003
#> 2 42012   generated active http://snomed.info/ct       429374003
#> 3 42091   generated active http://snomed.info/ct       429374003
#> 4 45646   generated active http://snomed.info/ct       429374003
#> 5 45724   generated active http://snomed.info/ct       429374003
#> 6 45802   generated active http://snomed.info/ct       429374003
#>   MEDICATION.DISPLAY           DOSAGE       PATIENT
#> 1   simvastatin 40mg 1 tab once daily Patient/30163
#> 2   simvastatin 40mg 1 tab once daily Patient/41945
#> 3   simvastatin 40mg 1 tab once daily Patient/42024
#> 4   simvastatin 40mg 1 tab once daily Patient/45579
#> 5   simvastatin 40mg 1 tab once daily Patient/45657
#> 6   simvastatin 40mg 1 tab once daily Patient/45735
#>                     LAST.UPDATE
#> 1 2019-09-26T14:34:44.543+00:00
#> 2 2019-10-09T20:12:49.778+00:00
#> 3 2019-10-09T22:44:05.728+00:00
#> 4 2019-10-11T16:17:42.365+00:00
#> 5 2019-10-11T16:30:24.411+00:00
#> 6 2019-10-11T16:32:05.206+00:00

head(list_of_tables$Patients)
#>      id gender  birthDate
#> 1 60096   male 2019-11-13
#> 2 49443 female 1970-10-19
#> 3 46213 female 2019-10-11
#> 4 45735   male 1970-10-11
#> 5 42024 female 1979-10-09
#> 6 58504   male 2019-11-08

As you can see, the result now contains two data frames, one for Patient resources and one for MedicationStatement resources.

Example 3: Multiple entries

A particularly complicated problem in flattening FHIR resources is caused by the fact that there can be multiple entries to an attribute. The profile according to which your FHIR resources have been built defines how often a particular attribute can appear in a resource. This is called the cardinality of the attribute. For example the Patient resource defined here can have zero or one birthdates but arbitrarily many addresses. In general, fhir_crack() will paste multiple entries for the same attribute together in the data frame, using the separator provided by the sep argument. In most cases this will work just fine, but there are some special cases that require a little more attention.

Let’s have a look at the following example, where we have a bundle containing just three Patient resources:

This bundle contains three Patient resources. The first resource has just one entry for the address attribute. The second Patient resource has two entries containing the same elements for the address attribute. The third Patient resource has a rather messy address attribute, with three entries containing different elements and also two entries for the id attribute.

Let’s see what happens if we extract all attributes:

As you can see, multiple entries for the same attribute (address and id) are pasted together. This works fine for Patient 2, but for Patient 3 you can see a problem with the number of entries that are displayed. The original Patient resource had three (incomplete) address entries, but because the first two of them use complementary elements (use and city vs. type and country), the resulting pasted entries look like there had just been two entries for the address attribute.

You can counter this problem with the add_indices argument and customize the appearance of the indices with brackets:

Now the indices display the entry the value belongs to. That way you can see that Patient resource 3 had three entries for the attribute address and you can also see which attributes belong to which entry.

Of course this is a very specific case that only occurs if your resources have multiple entries with complementary elements. In the vast majority of cases multiple entries in one resource will have the same structure, thus making numbering of those entries superfluous.

Process data frames with multiple entries

Melt data frames with multiple entries

If the data frame produced by fhir_crack() contains multiple entries, you’ll probably want to divide these entries into distinct observations at some point. This is where fhir_melt() comes into play. fhir_melt() takes an indexed data frame with multiple entries in one or several columns and spreads (aka melts) these entries over several rows:

The new variable resource_identifier maps which rows in the created data frame belong to which row (usually equivalent to one resource) in the original data frame. brackets and sep should be given the same character vectors that have been used to build the indices in fhir_melt(). columns is a character vector with the names of the variables you want to melt. You can provide more than one column here but it makes sense to only have variables from the same repeating attribute together in one call to fhir_melt():

If the names of the variables in your data frame have been generated automatically with fhir_crack() you can find all variable names belonging to the same attribute with fhir_common_columns():

With the argument all_columns you can control whether the resulting data frame contains only the molten columns or all columns of the original data frame:

Values on the other variables will just repeat in the newly created rows.

If you try to melt several variables that don’t belong to the same attribute in one call to fhir_melt(), this will cause problems, because the different attributes won’t be combined correctly:

Instead, melt the attributes one after another:

cols <- fhir_common_columns(df2$Patients, "address")

molten_1 <- fhir_melt(df2$Patients, columns = cols, brackets = c("[","]"), 
                      sep=" ", all_columns = TRUE)
molten_1
#>                   id address.use address.city address.type address.country
#> 1             [1]id1     [1]home [1]Amsterdam  [1]physical  [1]Netherlands
#> 11            [1]id2     [1]home      [1]Rome  [1]physical        [1]Italy
#> 2             [1]id2     [1]work [1]Stockholm    [1]postal       [1]Sweden
#> 12 [1]id3.1 [2]id3.2     [1]home    [1]Berlin         <NA>            <NA>
#> 3  [1]id3.1 [2]id3.2     [1]work    [1]London    [1]postal      [1]England
#> 21 [1]id3.1 [2]id3.2        <NA>         <NA>    [1]postal       [1]France
#>        birthDate resource_identifier
#> 1  [1]1992-02-06                   1
#> 11 [1]1980-05-23                   2
#> 2  [1]1980-05-23                   2
#> 12 [1]1974-12-25                   3
#> 3  [1]1974-12-25                   3
#> 21 [1]1974-12-25                   3

molten_2 <- fhir_melt(molten_1, columns = "id", brackets = c("[","]"), 
                      sep=" ", all_columns = TRUE)
molten_2
#>         id address.use address.city address.type address.country     birthDate
#> 1    []id1     [1]home [1]Amsterdam  [1]physical  [1]Netherlands [1]1992-02-06
#> 11   []id2     [1]home      [1]Rome  [1]physical        [1]Italy [1]1980-05-23
#> 12   []id2     [1]work [1]Stockholm    [1]postal       [1]Sweden [1]1980-05-23
#> 13 []id3.1     [1]home    [1]Berlin         <NA>            <NA> [1]1974-12-25
#> 2  []id3.2     [1]home    [1]Berlin         <NA>            <NA> [1]1974-12-25
#> 14 []id3.1     [1]work    [1]London    [1]postal      [1]England [1]1974-12-25
#> 21 []id3.2     [1]work    [1]London    [1]postal      [1]England [1]1974-12-25
#> 15 []id3.1        <NA>         <NA>    [1]postal       [1]France [1]1974-12-25
#> 22 []id3.2        <NA>         <NA>    [1]postal       [1]France [1]1974-12-25
#>    resource_identifier
#> 1                    1
#> 11                   2
#> 12                   3
#> 13                   4
#> 2                    4
#> 14                   5
#> 21                   5
#> 15                   6
#> 22                   6

This will give you the appropriate cross product of all multiple entries.

Remove indices

Once you have sorted out the multiple entries, you might want to get rid of the indices in your data.frame. This can be achieved using fhir_rm_indices():

Again, brackets and sep should be given the same character vector that was used for fhir_crack() and fhir_melt()respectively.

Save and load downloaded bundles

Since fhir_crack() discards of all the data not specified in design, it makes sense to store the original search result for reproducibility and in case you realize later on that you need elements from the resources that you haven’t extracted at first.

There are two ways of saving the FHIR bundles you downloaded: Either you save them as R objects, or you write them to an xml file.

Save and load bundles as R objects

If you want to save the list of downloaded bundles as an .rda or .RData file, you cannot just R’s save()or save_image() on it, because this will break the external pointers in the xml objects representing your bundles. Instead, you have to serialize the bundles before saving and unserialize them after loading. For single xml objects the package xml2 proved serialization functions. For convenience, however, fhircrackr provides the functions fhir_serialize() and fhir_unserialize() that can be used directly on the list of bundles returned by fhir_search():

If you load this bundle again, you have to unserialize it before you can work with it:

After unserialization, the pointers are restored and you can continue to work with the bundles. Note that the example bundle medication_bundles that is provided with the fhircrackr package is also provided in its serialized form and has to be unserialized as described on its help page.

Save and load bundles as xml files

If you want to store the bundles in xml files instead of R objects, you can use the functions fhir_save() and fhir_load(). fhir_save() takes a list of bundles in form of xml objects (as returned by fhir_search()) and writes them into the directory specified in the argument directory. Each bundle is saved as a separate xml-file. If the folder defined in directory doesn’t exist, it is created in the current working directory.

To read bundles saved with fhir_save() back into R, you can use fhir_load():

fhir_load() takes the name of the directory (or path to it) as its only argument. All xml-files in this directory will be read into R and returned as a list of bundles in xml format just as returned by fhir_search().

Download capability statement

The capability statement documents a set of capabilities (behaviors) of a FHIR Server for a particular version of FHIR. You can download this statement using the function fhir_capability_statement():

cap <- fhir_capability_statement("http://hapi.fhir.org/baseR4/", verbose = 0)

fhir_capability_statement() takes a FHIR server endpoint and returns a list of data frames containing all information from the capability statement of this server.

Further Options

Extract data below resource level

While we recommend extracting exactly one data frame per resource, it is technically possible to choose a different level per data frame:

The above example shows that instead of the MedicationStatement resource, we can choose the MedicationCodeableConcept as the root level for our extraction. This can be useful to get a quick and relatively clean overview over the types of codes used on this level of the resource. It is however important to note that this mode of extraction makes it impossible to recognize if each row belongs to one resource or if several of these rows came from the same resource. This of course also means that you cannot link this information to data from other resources because this extraction mode discards of that information.

Acknowledgements

This work was carried out by the SMITH consortium and the cross-consortium use case POLAR_MI; both are part of the German Initiative for Medical Informatics and funded by the German Federal Ministry of Education and Research (BMBF), grant no. 01ZZ1803A , 01ZZ1803C and 01ZZ1910A.


  1. FHIR is the registered trademark of HL7 and is used with the permission of HL7. Use of the FHIR trademark does not constitute endorsement of this product by HL7