Regression Examples

Josie Athens

2020-07-07

1 Introduction

The aim of this vignette is to illustrate the use and functionality of the glm_coef function. glm_coef can be used to display model coefficients with confidence intervals and p-values. Its main advantages and limitations are listed below (a minimal usage sketch follows the list):

  1. Recognises the main models used in epidemiology/public health.
  2. Automatically back-transforms estimates and confidence intervals, when the model requires it.
  3. Can use robust standard errors for the calculation of confidence intervals.
    • Naive (model-based) standard errors are used by default.
    • Robust standard errors are available only for the following classes of objects (models): gee, glm and survreg.
  4. Can display nice labels for the names of the parameters.
  5. Returns a data frame that can be modified and/or exported as tables for publications (with further editing).
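
As a minimal, self-contained sketch (using mtcars as a stand-in dataset and labels chosen purely for illustration):

library(pubh)
# Stand-in logistic model; glm_coef() back-transforms the estimates to odds ratios
fit <- glm(am ~ wt, family = binomial, data = mtcars)
glm_coef(fit, labels = c("Constant", "Weight (1000 lbs)"))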

We start by loading relevant packages and setting the theme for the plots (as suggested in the Template of this package):

rm(list = ls())
library(car)
library(broom)
library(tidyverse)
library(ggfortify)
library(mosaic)
library(huxtable)
library(jtools)
library(latex2exp)
library(pubh)
library(sjlabelled)
library(sjPlot)
library(sjmisc)

theme_set(sjPlot::theme_sjplot2(base_size = 10))
theme_update(legend.position = "top")
# options('huxtable.knit_print_df' = FALSE)
options('huxtable.knit_print_df_theme' = theme_article)
options('huxtable.autoformat_number_format' = list(numeric = "%5.2f"))
knitr::opts_chunk$set(collapse = TRUE, comment = NA)

2 Multiple Linear Regression

For continuous outcomes there is no need to exponentiate the results unless the outcome was fitted on the log scale. In our first example we want to estimate the effect of smoking and race on the birth weight of babies.
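
Had birth weight been modelled on the log scale, exponentiation would be needed and the back-transformed estimates would be ratios of geometric means; a brief sketch for contrast (a hypothetical model, not used later in this vignette):

# Hypothetical log-scale fit: exponentiated coefficients are ratios of geometric means
data(birthwt, package = "MASS")
log_model <- lm(log(bwt) ~ factor(smoke), data = birthwt)
exp(cbind(Ratio = coef(log_model), confint(log_model)))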

We can generate factors and assign labels in the same pipe stream:

data(birthwt, package = "MASS")
birthwt <- birthwt %>%
  mutate(
    smoke = factor(smoke, labels = c("Non-smoker", "Smoker")),
    race = factor(race, labels = c("White", "African American", "Other"))
    ) %>%
  var_labels(
    bwt = 'Birth weight (g)',
    smoke = 'Smoking status',
    race = 'Race'
    )

It is good to start with some basic descriptive statistics, so we can compare birth weight between groups.

birthwt %>%
  group_by(race, smoke) %>%
  summarise(
    n = n(),
    Mean = mean(bwt, na.rm = TRUE),
    SD = sd(bwt, na.rm = TRUE),
    Median = median(bwt, na.rm = TRUE),
    CV = rel_dis(bwt)
  ) 
`summarise()` regrouping output by 'race' (override with `.groups` argument)
# A tibble: 6 x 7
# Groups:   race [3]
  race             smoke          n  Mean    SD Median    CV
  <fct>            <fct>      <int> <dbl> <dbl>  <dbl> <dbl>
1 White            Non-smoker    44 3429.  710.  3593  0.207
2 White            Smoker        52 2827.  626.  2776. 0.222
3 African American Non-smoker    16 2854.  621.  2920  0.218
4 African American Smoker        10 2504   637.  2381  0.254
5 Other            Non-smoker    55 2816.  709.  2807  0.252
6 Other            Smoker        12 2757.  810.  3146. 0.294

From the previous table, the group with the lowest birth weight was babies born to African American smokers, and the group with the highest birth weight was babies born to White non-smokers.

Another way to compare the means between the groups is with gen_bst_df, which estimates means with corresponding bootstrapped CIs.

birthwt %>%
  gen_bst_df(bwt ~ race|smoke)
Birth weight (g)   LowerCI   UpperCI   Race               Smoking status
         3428.75   3195.35   3634.14   White              Non-smoker
         2826.85   2654.22   3004.10   White              Smoker
         2854.50   2570.71   3152.76   African American   Non-smoker
         2504.00   2094.42   2870.57   African American   Smoker
         2815.78   2634.94   3003.48   Other              Non-smoker
         2757.17   2275.97   3137.51   Other              Smoker

An alternative to tabular analysis is graphical analysis. For this comparison, box-plots would be the natural choice, but in the health sciences it is more common to see bar charts with error bars.

birthwt %>%
  bar_error(bwt ~ race, fill = ~ smoke) %>%
  axis_labs() %>%
  gf_labs(fill = "Smoking status:")
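
For comparison, a box-plot version of the same display can be sketched with gf_boxplot (from ggformula, loaded via mosaic):

birthwt %>%
  gf_boxplot(bwt ~ race, fill = ~ smoke, alpha = 0.7) %>%
  axis_labs() %>%
  gf_labs(fill = "Smoking status:")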

We fit a linear model.

model_norm <- lm(bwt ~ smoke + race, data = birthwt)

Note: Model diagnostics are not discussed in this vignette.
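
As a quick pointer, ggfortify (loaded above) can sketch the standard residual diagnostics in one line:

model_norm %>% autoplot()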

Traditional output from the model:

model_norm %>% Anova() %>% tidy()
term        sumsq         df       statistic   p.value
smoke       7322574.73    1.00     15.46       0.00
race        8712354.03    2.00     9.20        0.00
Residuals   87631355.83   185.00
model_norm %>% tidy()
term                   estimate   std.error   statistic   p.value
(Intercept)            3334.95    91.78       36.34       0.00
smokeSmoker            -428.73    109.04      -3.93       0.00
raceAfrican American   -450.36    153.12      -2.94       0.00
raceOther              -452.88    116.48      -3.89       0.00

Table of coefficients:

model_norm %>% 
  glm_coef(labels = model_labels(model_norm))
Parameter                Coefficient                   Pr(>|t|)
Constant                 3334.95 (3153.89, 3516.01)    < 0.001
Smoking status: Smoker   -428.73 (-643.86, -213.6)     < 0.001
Race: African American   -450.36 (-752.45, -148.27)    0.004
Race: Other              -452.88 (-682.67, -223.08)    < 0.001

Note: Compare results using robust standard errors.

model_norm %>%
  glm_coef(se_rob = TRUE, labels = model_labels(model_norm))
Parameter                Coefficient                   Pr(>|t|)
Constant                 3334.95 (3144.36, 3525.53)    < 0.001
Smoking status: Smoker   -428.73 (-652.88, -204.58)    < 0.001
Race: African American   -450.36 (-734.09, -166.63)    0.002
Race: Other              -452.88 (-701.4, -204.35)     < 0.001

The function glance from the broom package allows us to take a quick look at statistics related to the model.

model_norm %>% glance()
r.squared   adj.r.squared   sigma    statistic   p.value   df   logLik     AIC       BIC       deviance      df.residual
0.12        0.11            688.25   8.68        0.00      4    -1501.11   3012.22   3028.43   87631355.83   185

To construct the effect plot, we can use plot_model from the sjPlot package. The advantage of plot_model is that it recognises labelled data and uses that information to annotate the plots.

model_norm %>%
  plot_model("pred", terms = ~race|smoke, dot.size = 1.5, title = "")

When the explanatory variables are categorical, another option is emmip from the emmeans package. We can include CIs in emmip, but because the estimates are connected by lines the resulting plots look messier, so I recommend emmip mainly for looking at the trace.

emmip(model_norm, smoke ~ race) %>%
  gf_labs(y = get_label(birthwt$bwt), x = "", col = "Smoking status")
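
For reference, a sketch of the same plot with confidence intervals (via the CIs argument of emmip):

emmip(model_norm, smoke ~ race, CIs = TRUE) %>%
  gf_labs(y = get_label(birthwt$bwt), x = "", col = "Smoking status")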

3 Logistic Regression

For logistic regression we are interested in odds ratios. We will look at the effect of fibre intake on the development of coronary heart disease.

data(diet, package = "Epi")
diet <- diet %>%
  mutate(
    chd = factor(chd, labels = c("No CHD", "CHD"))
  ) %>%
  var_labels(
    chd = "Coronary Heart Disease",
    fibre = "Fibre intake (10 g/day)"
    )

We start with descriptive statistics:

diet %>% estat(~ fibre|chd)
                          Coronary Heart Disease       N   Min.   Max.   Mean   Median   SD     CV
Fibre intake (10 g/day)   No CHD                   288.00   0.60   5.35   1.75   1.69    0.58   0.33
                          CHD                       45.00   0.76   2.43   1.49   1.51    0.40   0.27

It is standard to plot the dependent variable on the \(y\)-axis, so in this case we can use horizontal box-plots.

diet %>%
  gf_boxploth(chd ~ fibre, fill = "indianred3", alpha = 0.7) %>%
  axis_labs()

We fit a linear logistic model:

model_binom <- glm(chd ~ fibre, data = diet, family = binomial)

model_binom %>%
  glm_coef(labels = model_labels(model_binom))
Parameter                 Odds ratio           Pr(>|z|)
Constant                  0.95 (0.29, 3.11)    0.94
Fibre intake (10 g/day)   0.33 (0.15, 0.69)    0.00

Effect plot:

model_binom %>%
  plot_model("pred", terms = "fibre [all]", title = "")

3.1 Matched Case-Control Studies: Conditional Logistic Regression

We will look at a matched case-control study on the effect of oestrogen use and history of gall bladder disease on the development of endometrial cancer.

data(bdendo, package = "Epi") 
bdendo <- bdendo %>%
  mutate(
    cancer = factor(d, labels = c('Control', 'Case')),
    gall = factor(gall, labels = c("No GBD", "GBD")),
    est = factor(est, labels = c("No oestrogen", "Oestrogen"))
  ) %>%
  var_labels(
    cancer = 'Endometrial cancer',
    gall = 'Gall bladder disease',
    est = 'Oestrogen'
  )

We start with descriptive statistics:

bdendo %>%
  mutate(
    cancer = relevel(cancer, ref = "Case"),
    est = relevel(est, ref = "Oestrogen"),
    gall = relevel(gall, ref = "GBD")
  ) %>%
  copy_labels(bdendo) %>%
  cross_tab(cancer ~ est + gall) %>%
  theme_article()
                       Endometrial cancer
                       Case          Control       Total
                       (N=63)        (N=252)       (N=315)
Oestrogen
- Oestrogen            56 (88.9%)    127 (50.4%)   183 (58.1%)
- No oestrogen         7 (11.1%)     125 (49.6%)   132 (41.9%)
Gall bladder disease
- GBD                  17 (27.0%)    24 ( 9.5%)    41 (13.0%)
- No GBD               46 (73.0%)    228 (90.5%)   274 (87.0%)

We fit the conditional logistic model:

library(survival)
model_clogit <- clogit(cancer == 'Case'  ~ est * gall + strata(set), data = bdendo)

model_clogit %>%
  glm_coef(labels = c("Oestrogen/No oestrogen", "GBD/No GBD",
                      "Oestrogen:GBD Interaction")) 
Parameter                   Odds ratio             Pr(>|z|)
Oestrogen/No oestrogen      14.88 (4.49, 49.36)    < 0.001
GBD/No GBD                  18.07 (3.2, 102.01)    0.001
Oestrogen:GBD Interaction   0.13 (0.02, 0.9)       0.039

We create the data frame needed to construct the effect plot:

require(ggeffects)
Loading required package: ggeffects
bdendo_pred <- ggemmeans(model_clogit, terms = c('gall', 'est'))

Effect plot:

bdendo_pred %>%
  gf_pointrange(predicted + conf.low + conf.high ~ x|group, col = ~ x) %>%
  gf_labs(y = "P(cancer)", x = "", col = get_label(bdendo$gall))

3.2 Ordinal Logistic Regression

We use data on housing satisfaction.

library(ordinal)

Attaching package: 'ordinal'
The following object is masked from 'package:dplyr':

    slice
data(housing, package = "MASS")
housing <- housing %>%
  var_labels(
    Sat = "Satisfaction",
    Infl = "Perceived influence",
    Type = "Type of rental",
    Cont = "Contact"
    )

We fit the ordinal logistic model:

model_clm <- clm(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
model_clm %>%
  glm_coef(labels = model_labels(model_clm, intercept = FALSE))
Parameter                     Ordinal OR           Pr(>|Z|)
Perceived influence: Low      0.61 (0.48, 0.78)    < 0.001
Perceived influence: Medium   2 (1.56, 2.55)       < 0.001
Contact: Low                  1.76 (1.44, 2.16)    < 0.001
Perceived influence: High     3.63 (2.83, 4.66)    < 0.001
Contact: High                 0.56 (0.45, 0.71)    < 0.001
Type of rental: Apartment     0.69 (0.51, 0.94)    0.018
Type of rental: Atrium        0.34 (0.25, 0.45)    < 0.001
Type of rental: Terrace       1.43 (1.19, 1.73)    < 0.001

Effect plots:

model_clm %>%
  plot_model(type = "pred", terms = c("Infl", "Cont"), 
           dot.size = 1, title = "") %>%
  gf_theme(axis.text.x = element_text(angle = 45, hjust = 1))

model_clm %>%
  plot_model(type = "pred", terms = c("Infl", "Type"), 
           dot.size = 1, title = "") %>%
  gf_theme(axis.text.x = element_text(angle = 45, hjust = 1))

emmip(model_clm, Infl ~ Type |Cont) %>%
  gf_labs(x = "Type of rental", col = "Perceived influence")

Note: In the previous table, parameter estimates and confidence intervals for Perceived influence and Type of rental were not adjusted for multiple comparisons.
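
A possible way to obtain adjusted comparisons would be the multiple function used in section 4.2, assuming it accepts clm objects (it is built on emmeans, which supports them); a sketch:

multiple(model_clm, ~ Infl)$df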

4 Poisson Regression

For Poisson regression we are interested in incidence rate ratios. We will look at the effect of sex, ethnicity and age group on the number of absent days from school in a year.

data(quine, package = "MASS")
levels(quine$Eth) <- c("Aboriginal", "White")
levels(quine$Sex) <- c("Female", "Male")
quine <- quine %>%
  var_labels(
    Days = "Number of absent days",
    Eth = "Ethnicity",
    Age = "Age group"
    )

Descriptive statistics:

quine %>%
  group_by(Eth, Sex, Age) %>%
  summarise(
    n = n(),
    Mean = mean(Days, na.rm = TRUE),
    SD = sd(Days, na.rm = TRUE),
    Median = median(Days, na.rm = TRUE),
    CV = rel_dis(Days)
  ) 
`summarise()` regrouping output by 'Eth', 'Sex' (override with `.groups` argument)
# A tibble: 16 x 8
# Groups:   Eth, Sex [4]
   Eth        Sex    Age       n  Mean    SD Median    CV
   <fct>      <fct>  <fct> <int> <dbl> <dbl>  <dbl> <dbl>
 1 Aboriginal Female F0        5 17.6  17.4      11 0.987
 2 Aboriginal Female F1       15 18.9  16.3      13 0.865
 3 Aboriginal Female F2        9 32.6  27.3      20 0.839
 4 Aboriginal Female F3        9 14.6  14.9      10 1.02 
 5 Aboriginal Male   F0        8 11.5   7.23     12 0.629
 6 Aboriginal Male   F1        5  9.6   4.51      7 0.469
 7 Aboriginal Male   F2       11 30.9  17.8      32 0.576
 8 Aboriginal Male   F3        7 27.1  10.4      28 0.382
 9 White      Female F0        5 19.8   9.68     20 0.489
10 White      Female F1       17  7.76  6.48      6 0.834
11 White      Female F2       10  5.7   4.97      4 0.872
12 White      Female F3       10 13.5  11.5      12 0.851
13 White      Male   F0        9 13.6  20.9       7 1.54 
14 White      Male   F1        9  5.56  5.39      5 0.970
15 White      Male   F2       10 15.2  12.9      12 0.848
16 White      Male   F3        7 27.3  22.9      27 0.840

We start by fitting a standard Poisson linear regression model:

model_pois <- glm(Days ~ Eth + Sex + Age, family = poisson, data = quine)

model_pois %>%
  glm_coef(labels = model_labels(model_pois), se_rob = TRUE)
Parameter          Rate ratio              Pr(>|z|)
Constant           17.66 (11.08, 28.16)    < 0.001
Ethnicity: White   0.59 (0.43, 0.81)       0.001
Sex: Male          1.11 (0.81, 1.52)       0.51
Age group: F1      0.8 (0.48, 1.32)        0.38
Age group: F2      1.42 (0.87, 2.31)       0.16
Age group: F3      1.35 (0.81, 2.24)       0.255
model_pois %>% glance()
null.deviance   df.null   logLik     AIC       BIC       deviance   df.residual
2073.53         145       -1165.49   2342.98   2360.88   1742.50    140

4.1 Negative-binomial

The Poisson model assumes that the mean is equal to the variance. If that is the case, the deviance should be close to the residual degrees of freedom (see the output from glance above). In other words, the following ratio should be close to 1:

deviance(model_pois) / df.residual(model_pois)
[1] 12.44646

Thus, we have over-dispersion. One option is to use a negative binomial distribution.
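
Another common option (not pursued here) is a quasi-Poisson model, which keeps the Poisson mean structure but inflates the standard errors by the estimated dispersion; a brief sketch:

model_quasi <- glm(Days ~ Eth + Sex + Age, family = quasipoisson, data = quine)
summary(model_quasi)$dispersion  # estimated dispersion parameter (well above 1)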

library(MASS)

Attaching package: 'MASS'
The following objects are masked _by_ '.GlobalEnv':

    birthwt, housing, quine
The following object is masked from 'package:dplyr':

    select
model_negbin <- glm.nb(Days ~ Eth + Sex + Age, data = quine)

model_negbin %>%
  glm_coef(labels = model_labels(model_negbin), se_rob = TRUE) 
Parameter          Rate ratio              Pr(>|z|)
Constant           20.24 (12.72, 32.21)    < 0.001
Ethnicity: White   0.57 (0.41, 0.78)       < 0.001
Sex: Male          1.07 (0.78, 1.45)       0.688
Age group: F1      0.69 (0.42, 1.16)       0.16
Age group: F2      1.2 (0.72, 2)           0.492
Age group: F3      1.29 (0.74, 2.23)       0.369
model_negbin %>% glance()
null.deviance   df.null   logLik    AIC       BIC       deviance   df.residual
192.24          145       -547.83   1109.65   1130.54   167.86     140

Notice that age group is a factor with more than two levels and is significant:

model_negbin %>% Anova()
Term   LR Chisq   Df     Pr(>Chisq)
Eth    12.66      1.00   0.00
Sex    0.15       1.00   0.70
Age    9.48       3.00   0.02

Thus, we want to report confidence intervals and \(p\)-values adjusted for multiple comparisons.

Effect plot:

model_negbin %>%
  plot_model(type = "pred", terms = c("Age", "Eth"), 
           dot.size = 1.5, title = "") 

emmip(model_negbin, Eth ~ Age|Sex) %>%
  gf_labs(y = "Number of absent days", x = "Age group", col = "Ethnicity")

4.2 Adjusting CIs and p-values for multiple comparisons

We adjust for multiple comparisons:

multiple(model_negbin, ~ Age|Eth)$df 
contrast   Eth          ratio   SE     z.ratio   p.value   lower.CL   upper.CL
F1 / F0    Aboriginal   0.69    0.16   -1.57     0.40      0.38       1.26
F2 / F0    Aboriginal   1.20    0.28   0.77      0.86      0.66       2.17
F2 / F1    Aboriginal   1.73    0.35   2.65      0.04      1.02       2.92
F3 / F0    Aboriginal   1.29    0.31   1.04      0.72      0.69       2.40
F3 / F1    Aboriginal   1.86    0.40   2.89      0.02      1.07       3.21
F3 / F2    Aboriginal   1.08    0.23   0.34      0.99      0.62       1.88
F1 / F0    White        0.69    0.16   -1.57     0.40      0.38       1.26
F2 / F0    White        1.20    0.28   0.77      0.86      0.66       2.17
F2 / F1    White        1.73    0.35   2.65      0.04      1.02       2.92
F3 / F0    White        1.29    0.31   1.04      0.72      0.69       2.40
F3 / F1    White        1.86    0.40   2.89      0.02      1.07       3.21
F3 / F2    White        1.08    0.23   0.34      0.99      0.62       1.88

We can see the comparison graphically with:

multiple(model_negbin, ~ Age|Eth)$fig_ci %>%
  gf_labs(x = "IRR")

5 Survival Analysis

We will use an example on the effect of thiotepa versus placebo on the development of bladder cancer.

data(bladder)
bladder <- bladder %>%
  mutate(times = stop,
         rx = factor(rx, labels=c("Placebo", "Thiotepa"))
         ) %>%
  var_labels(times = "Survival time",
             rx = "Treatment")

5.1 Parametric method

model_surv <- survreg(Surv(times, event) ~ rx, data = bladder)

Using robust standard errors:

model_surv %>%
  glm_coef(labels = c("Treatment: Thiotepa/Placebo", "Scale"), se_rob = TRUE)
Parameter                     Survival time ratio   Pr(>|z|)
Treatment: Thiotepa/Placebo   1.64 (0.89, 3.04)     0.116
Scale                         1 (0.85, 1.18)        0.992

In this example the scale parameter is not statistically different from one, meaning the hazard is constant over time and, thus, we can use the exponential distribution:

model_exp <- survreg(Surv(times, event) ~ rx, data = bladder, dist = "exponential")
model_exp %>%
  glm_coef(labels = c("Treatment: Thiotepa/Placebo"), se_rob = TRUE)
Parameter                     Survival time ratio   Pr(>|z|)
Treatment: Thiotepa/Placebo   1.64 (0.85, 3.16)     0.139

Interpretation: Patients receiving Thiotepa survive, on average, 64% longer than those in the Placebo group.
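
The survival-time ratio is simply the exponentiated regression coefficient, which can be checked directly (a sketch):

exp(coef(model_exp))  # exponentiated coefficients; rxThiotepa gives the survival-time ratio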

Using naive standard errors:

model_exp %>%
  glm_coef(labels = c("Treatment: Thiotepa/Placebo"))
Parameter                     Survival time ratio   Pr(>|z|)
Treatment: Thiotepa/Placebo   1.64 (1.11, 2.41)     0.012
model_exp %>%
  plot_model(type = "pred", terms = ~ rx, dot.size = 1.5, title = "") %>%
  gf_labs(y = "Survival time", x = "Treatment", title = "")

5.2 Cox proportional hazards regression

model_cox <-  coxph(Surv(times, event) ~ rx, data = bladder)
model_cox %>%
  glm_coef(labels = c("Treatment: Thiotepa/Placebo"))
Parameter                     Hazard ratio          Pr(>|z|)
Treatment: Thiotepa/Placebo   0.64 (0.44, 0.94)     0.02

Interpretation: The hazard for patients receiving Thiotepa is 36% lower (hazard ratio 0.64) than for those in the Placebo group.
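
This can be checked against the exponentiated coefficient and its confidence interval (a sketch):

exp(coef(model_cox))     # hazard ratio for Thiotepa vs Placebo
exp(confint(model_cox))  # 95% CI on the hazard-ratio scale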

model_cox %>%
  plot_model(type = "pred", terms = ~ rx, dot.size = 1.5, 
           title = "") %>%
  gf_labs(x = "Treatment", title = "")

6 Mixed Linear Regression Models

6.1 Continuous outcomes

We look at the effect of sex and age on the distance from the pituitary to the pterygomaxillary fissure (mm).

library(nlme)

Attaching package: 'nlme'
The following objects are masked from 'package:ordinal':

    VarCorr, ranef
The following object is masked from 'package:dplyr':

    collapse
data(Orthodont)
Orthodont <- Orthodont %>%
  var_labels(
    distance = "Pituitary distance (mm)",
    age = "Age (years)"
    )

We fit the model:

model_lme <- lme(distance ~ Sex * I(age - mean(age, na.rm = TRUE)), random = ~ 1|Subject, 
                 method = "ML", data = Orthodont)
model_lme %>%
  glm_coef(labels = c(
    "Constant", 
    "Sex: female-male", 
    "Age (years)",
    "Sex:Age interaction"
    )) 
Parameter             Coefficient              Pr(>|t|)
Constant              24.97 (24.03, 25.9)      < 0.001
Sex: female-male      -2.32 (-3.78, -0.86)     0.005
Age (years)           0.78 (0.63, 0.94)        < 0.001
Sex:Age interaction   -0.3 (-0.54, -0.07)      0.015

Effect plot:

model_lme %>%
  plot_model("pred", terms = age ~ Sex, 
           show.data = TRUE, jitter = 0.1, dot.size = 1.5) %>%
  gf_labs(y = get_label(Orthodont$distance), x = "Age (years)", title = "")