The `gim` package integrates summary data reported in the literature (external studies) with your own raw data (internal study). In this short tutorial, I will use examples to show various applications of `gim`. To use `gim`, you will have to find out what models were fitted for those external summary data, while it is up to you to choose the model to be fitted on your own internal data. Keep in mind that the parameters of the model fitted on the internal data are the parameters of interest in a `gim` analysis.
The `gim` package supports flexible ways to specify the model fitted on the internal data and the models of the external studies. We use `formula` to specify a linear or logistic regression model for the internal raw data, and consider this model the true underlying model. This is important because `gim` calculates the standard errors of its estimates under this model. Also, only the estimates of the coefficients in this model are of interest. Currently, `formula` accepts arithmetic expressions, e.g., `log(x)`, and other operators such as `*` for interactions. For example, `y ~ as.factor(x1) * I(x2 > 0) - 1` is a valid formula for `gim`.
Three options are supported for `family`: `gaussian` for linear regression; `binomial` for logistic regression fitted on a random sample; and `"case-control"` for logistic regression fitted on case-control data. Note that the case/control ratio can vary among case-control datasets. It is always beneficial to use `"case-control"` over `binomial`, even if your data on a binary outcome are indeed a random sample from the population.
Your individual-level internal data should be passed through `data`. Note that every variable in `formula` must be present in `data`. `gim` discards incomplete samples from the analysis.
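The discarding of incomplete samples can be thought of as complete-case filtering; a minimal sketch of the behavior (this is an assumption about the semantics, not `gim`'s own code):

```r
# Sketch of the behavior described above: any row with an NA among the
# model variables is dropped before fitting, as with complete.cases().
d <- data.frame(y = c(1, 0, 1), x = c(2.5, NA, 3.1))
d_complete <- d[complete.cases(d), ]
nrow(d_complete)  # 2 (the row with NA is discarded)
```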
`gim` uses `model`, a list of lists, to organize summary information from external studies. Please see `?gim` for more details on `model`. A rule of thumb is that if a set of coefficients is estimated from one fitted model, then they should be packed together as one element of `model`. Two fitted external models could use the same or partially overlapping datasets; nevertheless, estimates from these two models should be stored as two separate elements of `model`, as two models were indeed fitted. Please refer to the examples below.
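As a concrete sketch of this structure (the outcome `y` and covariates `x1`, `x2` here are hypothetical, not from the examples below), a `model` list holding two external models looks like:

```r
# Hypothetical sketch of the `model` structure: one list element per fitted
# external model, each holding that model's formula and a data frame `info`
# with the reported variable names (var) and coefficient estimates (bet).
model <- list(
  list(formula = 'y ~ x1',
       info = data.frame(var = 'x1', bet = 0.25, stringsAsFactors = FALSE)),
  list(formula = 'y ~ x2',
       info = data.frame(var = 'x2', bet = -0.10, stringsAsFactors = FALSE))
)
length(model)  # 2, one element per fitted external model
```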
External studies may or may not have shared data when their models were fitted. If summary data from `k` external models are stored in `model`, then `nsample` should be a `k` by `k` matrix whose `(i, j)` element is the number of samples shared between the datasets used to fit models `i` and `j`. If two models were fitted on independent datasets, the corresponding elements of `nsample` should be zero. The diagonal elements are the numbers of samples used in fitting each of the external models. When analyzing case-control data, we have to specify `ncase` and `nctrl` instead, and leave `nsample` as `NULL`, the default value.
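For instance, here is a hypothetical `nsample` for three external models, where models 1 and 2 were fitted on the same 1000 samples and model 3 on an independent set of 500 (the numbers are made up purely for illustration):

```r
# Hypothetical shared-sample matrix: models 1 and 2 share all 1000 samples,
# model 3 uses 500 independent samples. Diagonal = per-model sample sizes.
nsample <- matrix(c(1000, 1000,   0,
                    1000, 1000,   0,
                       0,    0, 500),
                  nrow = 3, byrow = TRUE)
```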
`gim` is able to handle the scenario in which the internal and external studies are conducted on different populations. In that case, an extra reference set is required to make valid inference on the model parameters; otherwise, as shown in our paper, the estimates could be biased and their standard errors underestimated. The reference set `ref` is a data frame containing the variables used in `formula` and `model[[i]]$formula`. Unused variables in `ref` are discarded automatically. `gim` does not assume an outcome variable is present in `ref`; that is what a reference means. Such a reference set is used to estimate the distribution of covariates in the external population.
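A minimal sketch of such a reference set, assuming the covariates are `sex`, `snp`, and `age` as in the simulated example below; note that there is no outcome column:

```r
# Sketch of a reference set: covariates drawn from the external population,
# with no outcome variable. Unused columns would be discarded by gim anyway.
set.seed(42)
ref <- data.frame(sex = sample(c('F', 'M'), 500, replace = TRUE),
                  snp = rbinom(500, 2, 0.3),
                  age = runif(500, 20, 60),
                  stringsAsFactors = FALSE)
```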
It can be tricky to determine whether `data` or `ref` contains ALL necessary variables when invoking `gim`. If you encounter an error message when calling `gim`, try

glm(formula, family, data)

and

for(m in model){
  glm(m$formula, family, data)
}

to find out the full list of those variables. You may also add an artificial outcome to `ref` and use it in place of `data` in the code above. If you still encounter an error message when running this code, then very likely some variables are missing from `data` or `ref`.
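As a self-contained sketch of that check (the toy `ref` and `model` here are made up, and `y0` is an arbitrary artificial outcome whose values are irrelevant):

```r
# Sketch: verify that a reference set contains every covariate needed by the
# external model formulas. Only a "variable not found" error matters here;
# the fitted coefficients are meaningless by construction.
set.seed(1)
ref <- data.frame(sex = sample(c('F', 'M'), 20, replace = TRUE),
                  age = runif(20, 20, 60))
model <- list(list(formula = 'cancer ~ sex'),
              list(formula = 'cancer ~ I(age > 50)'))
ref$y0 <- rbinom(nrow(ref), 1, 0.5)             # artificial outcome
for (m in model) {
  f <- update(as.formula(m$formula), y0 ~ .)    # swap in the fake outcome
  glm(f, family = "binomial", data = ref)       # errors if a variable is absent
}
```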
`?gim` provides a good example showing how to use `gim`. Here we give some more specific examples that could be of interest in practice.
Many research articles in epidemiology provide a table in which two-way contingency tables are listed for the outcome versus several population characteristics. For example, a paper about cancer may give the following table (here we simulate the underlying raw data so that the numbers can be reproduced):
N <- 800
set.seed(0)
sex <- sample(c('F', 'M'), N, TRUE, c(.4, .6))
snp <- rbinom(N, 2, c(.4, .3, .3))
age <- runif(N, 20, 60)
pr <- plogis(.1 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(N, 1, pr)
ext <- data.frame(cancer, sex, snp, age, stringsAsFactors = FALSE)
| | | Case | Control |
|---|---|---|---|
| Sex | M | 261 | 222 |
| | F | 216 | 101 |
| Age | < 30 | 127 | 92 |
| | [30,50] | 214 | 159 |
| | > 50 | 136 | 72 |
| SNP | 0 | 241 | 133 |
| | 1 or 2 | 236 | 190 |
Note that the raw dataset `ext` cannot be obtained from the literature. Instead, one can only calculate the univariate odds ratio of sex from the numbers in the table. For age, the odds ratios correspond to the indicators `I(age < 30)` and `I(age > 50)`. For the SNP, the best we can obtain is the odds ratio under a dominant model. Statistically, these are equivalent to fitting univariate models on `ext` as follows
m1 <- glm(cancer ~ sex, data = ext, family = "binomial")
m2 <- glm(cancer ~ I(age < 30) + I(age > 50), data = ext, family = "binomial")
m3 <- glm(cancer ~ I(snp == 0), data = ext, family = "binomial")
Sometimes even the contingency tables are inaccessible, and only the odds ratios are mentioned in the text of an article. We consider such a more complicated scenario in this example.
summary(m2)$coef
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.2970718 0.1047006 2.8373472 0.004549012
#> I(age < 30)TRUE 0.0253267 0.1723537 0.1469461 0.883174576
#> I(age > 50)TRUE 0.3389170 0.1794548 1.8885921 0.058946501
We can see that only `I(age > 50)` shows a significant association with cancer, so researchers may report only its log odds ratio, 0.34, in their article. We therefore illustrate `gim` by ignoring the estimated intercepts of all three models above. We also assume that only the log odds ratio of `I(age > 50)` is given to `gim`, while that of `I(age < 30)` is not. We create `model` as
model1 <- list(formula = 'cancer ~sex',
               info = data.frame(var = names(coef(m1))[-1],
                                 bet = coef(m1)[-1], stringsAsFactors = FALSE))
model2 <- list(formula = 'cancer~I(age< 30) + I(age > 50)',
               info = data.frame(var = names(coef(m2))[3],
                                 bet = coef(m2)[3], stringsAsFactors = FALSE))
model3 <- list(formula = 'cancer ~ I(snp == 0)',
               info = data.frame(var = names(coef(m3))[-1],
                                 bet = coef(m3)[-1], stringsAsFactors = FALSE))
model <- list(model1, model2, model3)
model2
#> $formula
#> [1] "cancer~I(age< 30) + I(age > 50)"
#>
#> $info
#> var bet
#> I(age > 50)TRUE I(age > 50)TRUE 0.338917
For each external model, `gim` needs to know the model used to fit the external data (`model[[i]]$formula`) as well as the coefficient estimates (`model[[i]]$info`). In this example, some coefficients (e.g., the intercepts and `I(age < 30)`) were estimated but are not given to `gim`. This is a sweet spot of `gim`: it works quite well with partial summary information. Since all three external models are fitted on the same dataset, we specify `nsample` for the three fitted models as
nsample <- matrix(N, 3, 3)
If we instead treat the external studies as case-control studies, which is what we recommend, we can specify
ncase <- matrix(sum(cancer), 3, 3)
nctrl <- matrix(sum(!cancer), 3, 3)
and leave `nsample` as `NULL`.
Now we collect samples from an internal study. Note that we modify the model by using a different intercept, 0.3, while keeping all other coefficients the same as before, because we want case-control data with a different case/control ratio.
n <- 300
set.seed(1)
sex <- sample(c('F', 'M'), n, TRUE, c(.4, .6))
snp <- rbinom(n, 2, c(.4, .3, .3))
age <- runif(n, 20, 60)
pr <- plogis(.3 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(n, 1, pr)
smoking <- sample(c(TRUE, FALSE), n, TRUE, c(.3, .7))
int <- data.frame(cancer, sex, snp, age, smoking, stringsAsFactors = FALSE)
In this internal study, we also collect a binary variable `smoking`, although it does not affect cancer risk in our setting. Now we can integrate the internal data `int` with the external `model`. Let's assume that, based on some prior knowledge, we believe a single age cutoff at 40 should be used in the underlying model instead of the 30 and 50 used in the external studies, and that an additive model for the SNP is of more interest. We also want to investigate a potential effect of smoking status. So we assume the underlying true model is `cancer ~ I(sex == "F") + I(age > 40) + snp + smoking`, and invoke `gim` as follows
library(gim)
fit1 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking,
            "case-control", int, model,
            ncase = ncase, nctrl = nctrl)
summary(fit1)
#> Call:
#> gim.default(formula = cancer ~ I(sex == "F") + I(age > 40) +
#> snp + smoking, family = "case-control", data = int, model = model,
#> ncase = ncase, nctrl = nctrl)
#>
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.23729 0.15403 -1.5405 0.1234325
#> I(sex == "F")TRUE 0.55741 0.14218 3.9204 8.839e-05 ***
#> I(age > 40)TRUE 0.67326 0.20335 3.3109 0.0009301 ***
#> snp -0.31230 0.10497 -2.9752 0.0029278 **
#> smokingTRUE -0.15390 0.27265 -0.5645 0.5724365
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One can compare this result with the one from `glm` fitted on `int` alone, which yields larger standard errors for its estimates.
fit0 <- glm(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking,
            "binomial", int)
summary(fit0)$coef
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.4469066 0.2417339 1.8487546 0.064493257
#> I(sex == "F")TRUE 0.4094388 0.2636631 1.5528865 0.120450258
#> I(age > 40)TRUE 0.6681712 0.2526914 2.6442180 0.008187991
#> snp -0.2303803 0.1784850 -1.2907543 0.196788903
#> smokingTRUE -0.1550470 0.2727146 -0.5685323 0.569673562
The following table compares models fitted on internal and external datasets:
| | Internal | External model 1 | External model 2 | External model 3 |
|---|---|---|---|---|
| Sex | `I(sex == 'F')` | `I(sex == 'M')` | | |
| Age | `I(age > 40)` | | `I(age < 30)`, `I(age > 50)` | |
| SNP | `snp` | | | `I(snp == 0)` |
| Smoking | `smoking` | | | |
In this example, we used the feature of `gim` that supports flexible formulas. One can, however, create the variables explicitly instead of relying on this feature. For example, we can add columns to `int` as follows; exactly the same result is returned.
int$sex_F <- ifelse(int$sex == "F", 1, 0)
int$age_gt_40 <- ifelse(int$age > 40, 1, 0)
fit2 <- gim(cancer ~ sex_F + age_gt_40 + snp + smoking,
            "case-control", int, model,
            ncase = ncase, nctrl = nctrl)
summary(fit2)
#> Call:
#> gim.default(formula = cancer ~ sex_F + age_gt_40 + snp + smoking,
#> family = "case-control", data = int, model = model, ncase = ncase,
#> nctrl = nctrl)
#>
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.23729 0.15403 -1.5405 0.1234325
#> sex_F 0.55741 0.14218 3.9204 8.839e-05 ***
#> age_gt_40 0.67326 0.20335 3.3109 0.0009301 ***
#> snp -0.31230 0.10497 -2.9752 0.0029278 **
#> smokingTRUE -0.15390 0.27265 -0.5645 0.5724365
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The list `model` may look complicated. For example,
model[[1]]
#> $formula
#> [1] "cancer ~sex"
#>
#> $info
#> var bet
#> sexM sexM -0.5983149
So where does `sexM` come from? `gim` parses `model[[1]]$formula` and automatically creates a dummy variable for `I(sex == "M")`. R names such a dummy variable `sexM`, and in this example we put this name into `model[[1]]$info$var`. As long as `sex` can be found in `int`, `gim` is able to define the dummy variable `sexM` itself and move everything forward smoothly. We can, however, invoke `gim` in a more straightforward way. For example, we add a column `dummy_sex` to `int`
int$dummy_sex <- ifelse(int$sex == "M", 1, 0)
model1 <- list(formula = 'cancer ~ dummy_sex',
               info = data.frame(var = 'dummy_sex',
                                 bet = coef(m1)[-1], stringsAsFactors = FALSE))
model1
#> $formula
#> [1] "cancer ~ dummy_sex"
#>
#> $info
#> var bet
#> sexM dummy_sex -0.5983149
model <- list(model1, model2, model3)
fit3 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking,
            "case-control", int, model,
            ncase = ncase, nctrl = nctrl)
summary(fit3)
#> Call:
#> gim.default(formula = cancer ~ I(sex == "F") + I(age > 40) +
#> snp + smoking, family = "case-control", data = int, model = model,
#> ncase = ncase, nctrl = nctrl)
#>
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.23729 0.15403 -1.5405 0.1234325
#> I(sex == "F")TRUE 0.55741 0.14218 3.9204 8.839e-05 ***
#> I(age > 40)TRUE 0.67326 0.20335 3.3109 0.0009301 ***
#> snp -0.31230 0.10497 -2.9752 0.0029278 **
#> smokingTRUE -0.15390 0.27265 -0.5645 0.5724365
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result is still the same. Note that the row names in `model[[1]]$info` do not matter.
This example illustrates several features of `gim`:

- flexible `formula` specification for the internal model
- use of partial summary information from external models
- specification of shared sample sizes through `nsample`, or through `ncase` and `nctrl` for case-control data
`gim` also provides the generic accessor functions `coef`, `confint`, and `vcov` for extracting information from its returned object.
coef(fit1)
#> (Intercept) I(sex == "F")TRUE I(age > 40)TRUE snp
#> -0.2372865 0.5574127 0.6732635 -0.3123015
#> smokingTRUE
#> -0.1539044
confint(fit1, level = 0.9)
#> 5 % 95 %
#> (Intercept) -0.4906427 0.01606971
#> I(sex == "F")TRUE 0.3235451 0.79128019
#> I(age > 40)TRUE 0.3387818 1.00774520
#> snp -0.4849581 -0.13964487
#> smokingTRUE -0.6023813 0.29457241
all(diag(vcov(fit1))^.5 == summary(fit1)$coef[, 2])
#> [1] TRUE