riskCommunicator package manuscript vignette

Jessica Grembi, Elizabeth Rogawski McQuade

2020-06-22

Introduction to riskCommunicator

The riskCommunicator package facilitates the estimation of common epidemiological effect measures that are relevant to public health, but that are often not trivial to obtain from common regression models, like logistic regression. In particular, riskCommunicator estimates risk and rate differences, in addition to risk and rate ratios. The package estimates these effects using g-computation with the appropriate parametric model depending on the outcome (logistic regression for binary outcomes, Poisson regression for rate or count outcomes, and linear regression for continuous outcomes). Therefore, the package can handle binary, rate, count, and continuous outcomes and allows for dichotomous, categorical (>2 categories), or continuous exposure variables. Additional features include estimation of effects stratified by subgroup and adjustment of standard errors for clustering. Confidence intervals are constructed by bootstrap at the individual or cluster level, as appropriate.

This package operationalizes g-computation, which has not been widely adopted due to computational complexity, in an easy-to-use implementation tool to increase the reporting of more interpretable epidemiological results. To make the package accessible to a broad range of health researchers, our goal was to design a function that was as straightforward as the standard logistic regression functions in R (e.g. glm) and that would require little to no expertise in causal inference methods or advanced coding.

Getting started

Installation

The riskCommunicator R package is available as a source package through GitHub. Installation using the devtools package:

library(devtools)
devtools::install_github("jgrembi/riskCommunicator")

Load packages:

library(riskCommunicator)
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.1     ✓ purrr   0.3.4
#> ✓ tibble  3.0.1     ✓ dplyr   1.0.0
#> ✓ tidyr   1.1.0     ✓ stringr 1.4.0
#> ✓ readr   1.3.1     ✓ forcats 0.5.0
#> ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(printr)
#> Registered S3 method overwritten by 'printr':
#>   method                from     
#>   knit_print.data.frame rmarkdown

Description of main package function

The gComp function is the main function in the riskCommunicator package and allows you to estimate a variety of effects depending on your outcome and exposure of interest. The function is coded as follows:

?gComp
gComp R Documentation

Estimate difference and ratio effects with 95% confidence intervals.

Usage

gComp(
  data,
  outcome.type = c("binary", "count", "rate", "continuous"),
  formula = NULL,
  Y = NULL,
  X = NULL,
  Z = NULL,
  subgroup = NULL,
  offset = NULL,
  rate.multiplier = 1,
  exposure.scalar = 1,
  R = 200,
  clusterID = NULL,
  parallel = "no",
  ncpus = getOption("boot.ncpus", 1L)
)

Arguments

data

(Required) A data.frame containing variables for Y, X, and Z or with variables matching the model variables specified in a user-supplied formula. Data set should also contain variables for the optional subgroup and offset, if they are specified.

outcome.type

(Required) Character argument to describe the outcome type. Acceptable responses, and the corresponding error distribution and link function used in the glm, include:

binary

(Default) A binomial distribution with link = ‘logit’ is used.

count

A Poisson distribution with link = ‘log’ is used.

rate

A Poisson distribution with link = ‘log’ is used; ideal for events/person-time outcomes.

continuous

A gaussian distribution with link = ‘identity’ is used.

formula

(Optional) Default NULL. An object of class “formula” (or one that can be coerced to that class) which provides the the complete model formula, similar to the formula for the glm function in R (e.g. ‘Y ~ X + Z1 + Z2 + Z3’). Can be supplied as a character or formula object. If no formula is provided, Y and X must be provided.

Y

(Optional) Default NULL. Character argument which specifies the outcome variable. Can optionally provide a formula instead of Y and X variables.

X

(Optional) Default NULL. Character argument which specifies the exposure variable (or treatment group assignment), which can be binary, categorical, or continuous. This variable can be supplied as a factor variable (for binary or categorical exposures) or a continuous variable. For binary/categorical exposures, X should be supplied as a factor with the lowest level set to the desired referent. Numeric variables are accepted, but will be centered (see Note). Character variables are not accepted and will throw an error. Can optionally provide a formula instead of Y and X variables.

Z

(Optional) Default NULL. List or single character vector which specifies the names of covariates or other variables to adjust for in the glm function. All variables should either be factors, continuous, or coded 0/1 (i.e. not character variables). Does not allow interaction terms.

subgroup

(Optional) Default NULL. Character argument that indicates subgroups for stratified analysis. Effects will be reported for each category of the subgroup variable. Variable will be automatically converted to a factor if not already.

offset

(Optional, only applicable for rate outcomes) Default NULL. Character argument which specifies the person-time denominator for rate outcomes to be included as an offset in the Poisson regression model. Numeric variable should be on the linear scale; function will take natural log before including in the model.

rate.multiplier

(Optional, only applicable for rate outcomes) Default 1. Numeric value to multiply to the rate-based effect measures. This option facilitates reporting effects with interpretable person-time denominators. For example, if the person-time variable (offset) is in days, a multiplier of 365*100 would result in estimates of rate differences per 100 person-years.

exposure.scalar

(Optional, only applicable for continuous exposure) Default 1. Numeric value to scale effects with a continuous exposure. This option facilitates reporting effects for an interpretable contrast (i.e.  magnitude of difference) within the continuous exposure. For example, if the continuous exposure is age in years, a multiplier of 10 would result in estimates per 10-year increase in age rather than per a 1-year increase in age.

R

(Optional) Default 200. The number of data resamples to be conducted to produce the bootstrap confidence interval of the estimate.

clusterID

(Optional) Default NULL. Character argument which specifies the variable name for the unique identifier for clusters. This option specifies that clustering should be accounted for in the calculation of confidence intervals. The clusterID will be used as the level for resampling in the bootstrap procedure.

parallel

(Optional) Default “no.” The type of parallel operation to be used. Available options (besides the default of no parallel processing) include multicore" (not available for Windows) or “snow.” This argument is passed directly to boot. See note below about setting seeds and parallel computing.

ncpus

(Optional, only used if parallel is set to “multicore” or “snow”) Default 1. Integer argument for the number of CPUs available for parallel processing/ number of parallel operations to be used. This argument is passed directly to boot

Framingham Heart Study

We’ll demonstrate how to use the package with data from the Framingham Heart Study. The following information is from the official Framingham study documentation (https://biolincc.nhlbi.nih.gov/teaching/):

"The Framingham Heart Study is a long term prospective study of the etiology of cardiovascular disease among a population of free living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects. The study began in 1948 and 5,209 subjects were initially enrolled in the study. Participants have been examined biennially since the inception of the study and all subjects are continuously followed through regular surveillance for cardiovascular outcomes. Clinic examination data has included cardiovascular disease risk factors and markers of disease such as blood pressure, blood chemistry, lung function, smoking history, health behaviors, ECG tracings, Echocardiography, and medication use. Through regular surveillance of area hospitals, participant contact, and death certificates, the Framingham Heart Study reviews and adjudicates events for the occurrence of Angina Pectoris, Myocardial Infarction, Heart Failure, and Cerebrovascular disease.

data(cvdd)

cvdd is a subset of the data collected as part of the Framingham study from 4,240 participants who conducted a baseline exam and were free of prevalent coronary heart disease when they entered the study. Participant clinic data was collected during three examination periods, approximately 6 years apart, from roughly 1956 to 1968. Each participant was followed for a total of 24 years for the outcome of the following events: Angina Pectoris, Myocardial Infarction, Atherothrombotic Infarction or Cerebral Hemorrhage (Stroke) or death.

NOTE: This is a “teaching” dataset. Specific methods were employed to ensure an anonymous dataset that protects patient confidentiality; therefore, this dataset is inappropriate for publication purposes." The use of these data for the purposes of this package were approved on 11Mar2019 (request #7161) by NIH/NHLBI.

Binary outcome example

Research question: what is the effect of having diabetes at the beginning of the study on the 24-year risk of cardiovascular disease or death due to any cause?

Here, we will estimate the risk difference, risk ratio, odds ratio, and number needed to treat, adjusting for patient’s age, sex, body mass index (BMI), smoking status (current smoker or not), and prevalence of hypertension (if they are hypertensive or not at baseline). Logistic regression is used as the underlying parametric model for g-computation.

## Specify the regression formula
cvdd.formula <- cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP

## For reproducibility, we should always set the seed since the g-computation uses random resampling of the data to calculate confidence intervals and random sampling of the distribution when predicting outcomes
set.seed(1298)

## Call the gComp function
binary.res <- gComp(data = cvdd, formula = cvdd.formula, outcome.type = "binary", R = 200)

binary.res
#> Formula: 
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP 
#> 
#> Parameter estimates: 
#>                             DIABETES1_v._DIABETES0 Estimate (95% CI)
#> Risk Difference                                 0.300 (0.226, 0.376)
#> Risk Ratio                                      1.552 (1.417, 1.714)
#> Odds Ratio                                      4.550 (2.784, 8.863)
#> Number needed to treat/harm                                    3.330

The result obtained from the gComp function is an object of class gComp which is a list containing the summary results, results.df, n, R, boot.result, contrast, family, formula, predicted.outcome, and glm.result (see ?gComp or help(gComp) for a more detailed explanation of each item in the list).

class(binary.res)
#> [1] "gComp" "list"
# The names of the different items in the list 
names(binary.res)
#>  [1] "summary"           "results.df"        "n"                
#>  [4] "R"                 "boot.result"       "contrast"         
#>  [7] "family"            "formula"           "predicted.outcome"
#> [10] "glm.result"

# Sample size of the original data:
binary.res$n 
#> [1] 4240

# Contrast being compared in the analysis:
binary.res$contrast
#> [1] "DIABETES1 v. DIABETES0"

Rate outcome example

Research question: what is the effect of having diabetes at the beginning of the study on the rate of cardiovascular disease or death due to any cause?

Here, we will estimate the rate difference and rate ratio, adjusting for patient’s age, sex, body mass index (BMI), smoking status (current smoker or not), and prevalence of hypertension (if they are hypertensive or not at baseline). We have included timeout as the offset and a rate.multiplier of 365.25*100 so that the estimates are returned with units of 100 person-years. Poisson regression is used as the underlying parametric model for g-computation.

#modify the dataset to change the variable cvd_dth from a factor to a numeric variable since the outcome for Poisson regression must be numeric
cvdd.t <- cvdd %>%
  dplyr::mutate(cvd_dth = as.numeric(as.character(cvd_dth)),
                timeout = as.numeric(timeout))

set.seed(6534)

rate.res <- gComp(data = cvdd.t, Y = "cvd_dth", X = "DIABETES", Z = c("AGE", "SEX", "BMI", "CURSMOKE", "PREVHYP"), outcome.type = "rate", rate.multiplier = 365.25*100, offset = "timeout", R = 200)

rate.res
#> Formula: 
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP 
#> 
#> Parameter estimates: 
#>                           DIABETES1_v._DIABETES0 Estimate (95% CI)
#> Incidence Rate Difference                     2.189 (1.498, 3.103)
#> Incidence Rate Ratio                          1.913 (1.615, 2.324)

Rate outcome with subgroups example

Research question: what is the effect of having diabetes at the beginning of the study on the rate of cardiovascular disease or death due to any cause, stratified by sex?

Here, we will estimate the same effects above, but in subgroups defined by sex.

#modify the dataset to change the variable cvd_dth from a factor to a numeric variable since the outcome for Poisson regression must be numeric

set.seed(6534)

rate.res.subgroup <- gComp(data = cvdd.t, Y = "cvd_dth", X = "DIABETES", Z = c("AGE", "SEX", "BMI", "CURSMOKE", "PREVHYP"), subgroup = "SEX", outcome.type = "rate", rate.multiplier = 365.25*100, offset = "timeout", R = 200)

rate.res.subgroup
#> Formula: 
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP + DIABETES:SEX 
#> 
#> Parameter estimates: 
#>                           DIABETES1_v._DIABETES0_SEX0 Estimate (95% CI)
#> Incidence Rate Difference                          2.488 (1.295, 3.964)
#> Incidence Rate Ratio                               1.794 (1.418, 2.281)
#>                           DIABETES1_v._DIABETES0_SEX1 Estimate (95% CI)
#> Incidence Rate Difference                          1.918 (1.056, 3.147)
#> Incidence Rate Ratio                               2.044 (1.569, 2.698)

Checking model fit

To ensure that the parameter estimates from each bootstrap iteration are normally distributed, we can also look at the histogram and Q-Q plots of bootstrapped estimates by calling:

plot(binary.res)

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] printr_0.1             forcats_0.5.0          stringr_1.4.0         
#>  [4] dplyr_1.0.0            purrr_0.3.4            readr_1.3.1           
#>  [7] tidyr_1.1.0            tibble_3.0.1           ggplot2_3.3.1         
#> [10] tidyverse_1.3.0        riskCommunicator_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0 xfun_0.14        haven_2.3.1      lattice_0.20-41 
#>  [5] colorspace_1.4-1 vctrs_0.3.1      generics_0.0.2   htmltools_0.4.0 
#>  [9] yaml_2.2.1       blob_1.2.1       rlang_0.4.6      pillar_1.4.4    
#> [13] withr_2.2.0      glue_1.4.1       DBI_1.1.0        dbplyr_1.4.4    
#> [17] modelr_0.1.8     readxl_1.3.1     lifecycle_0.2.0  munsell_0.5.0   
#> [21] gtable_0.3.0     cellranger_1.1.0 rvest_0.3.5      evaluate_0.14   
#> [25] labeling_0.3     knitr_1.28       fansi_0.4.1      broom_0.5.6     
#> [29] Rcpp_1.0.4.6     backports_1.1.7  scales_1.1.1     jsonlite_1.6.1  
#> [33] farver_2.0.3     fs_1.4.1         gridExtra_2.3    hms_0.5.3       
#> [37] digest_0.6.25    stringi_1.4.6    grid_4.0.0       cli_2.0.2       
#> [41] tools_4.0.0      magrittr_1.5     crayon_1.3.4     pkgconfig_2.0.3 
#> [45] ellipsis_0.3.1   xml2_1.3.2       reprex_0.3.0     lubridate_1.7.9 
#> [49] assertthat_0.2.1 rmarkdown_2.2    httr_1.4.1       rstudioapi_0.11 
#> [53] R6_2.4.1         boot_1.3-25      nlme_3.1-148     compiler_4.0.0