This vignette covers the intricacies of transformations and link functions in emmeans.
Consider the same example with the pigs dataset that is used in many of these vignettes:
pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
This model has two factors, source and percent (coerced to a factor), as predictors, and log-transformed conc as the response. Here we obtain the EMMs for source, examine its structure, and finally produce a summary, including a test against a null value of log(35):
pigs.emm.s <- emmeans(pigs.lm, "source")
str(pigs.emm.s)
## 'emmGrid' object with variables:
## source = fish, soy, skim
## Transformation: "log"
summary(pigs.emm.s, infer = TRUE, null = log(35))
## source emmean SE df lower.CL upper.CL null t.ratio p.value
## fish 3.39 0.0367 23 3.32 3.47 3.56 -4.385 0.0002
## soy 3.67 0.0374 23 3.59 3.74 3.56 2.988 0.0066
## skim 3.80 0.0394 23 3.72 3.88 3.56 6.130 <.0001
##
## Results are averaged over the levels of: percent
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
Now suppose that we want the EMMs expressed on the same scale as conc. This can be done by adding type = "response" to the summary() call:
summary(pigs.emm.s, infer = TRUE, null = log(35), type = "response")
## source response SE df lower.CL upper.CL null t.ratio p.value
## fish 29.8 1.09 23 27.6 32.1 35 -4.385 0.0002
## soy 39.1 1.47 23 36.2 42.3 35 2.988 0.0066
## skim 44.6 1.75 23 41.1 48.3 35 6.130 <.0001
##
## Results are averaged over the levels of: percent
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
## Tests are performed on the log scale
Note: Looking ahead, this output is compared later in this vignette with a bias-adjusted version.
Dealing with transformations in emmeans is somewhat complex, due to the large number of possibilities. But the key is understanding what happens, when. These results come from a sequence of steps. Here is what happens (and doesn’t happen) at each step:
1. The reference grid is constructed for the log(conc) model. The fact that a log transformation is used is recorded, but nothing else is done with that information.
2. The predictions on the reference grid are averaged over the four percent levels, for each source, to obtain the EMMs for source – still on the log(conc) scale.
3. The standard errors and confidence intervals for these EMMs are computed – still on the log(conc) scale.
4. Only when summarizing with type = "response" are the results back-transformed: the EMMs and the confidence-interval endpoints are back-transformed to the conc scale; the tests and P values are left as computed on the log(conc) scale; and the SEs are converted to the conc scale using the delta method. These SEs were not used in constructing the tests and confidence intervals.
This choice of timing is based on the idea that the model is right. In particular, the fact that the response is transformed suggests that the transformed scale is the best scale to be working with. In addition, the model specifies that the effects of source and percent are linear on the transformed scale; inasmuch as marginal averaging to obtain EMMs is a linear operation, that averaging is best done on the transformed scale. For those two good reasons, back-transforming to the response scale is delayed until the very end by default.
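To make step 4 concrete, here is a small sketch that back-transforms the fish EMM by hand, using the rounded log-scale values shown above. (The delta method gives SE(exp(U)) ≈ exp(estimate) × SE(U); small discrepancies are due to rounding.)
est <- 3.39; se <- 0.0367  # log-scale values from the first summary above
exp(est)       # ~29.8, the response-scale estimate
exp(est) * se  # ~1.09, the delta-method SE on the conc scale
exp(est + c(-1, 1) * qt(0.975, 23) * se)  # ~(27.6, 32.1), the back-transformed CI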
As well-advised as it is, some users may not want the default timing of things. The tool for changing when back-transformation is performed is the regrid() function – which, with default settings of its arguments, back-transforms an emmGrid object and adjusts everything in it appropriately. For example:
str(regrid(pigs.emm.s))
## 'emmGrid' object with variables:
## source = fish, soy, skim
summary(regrid(pigs.emm.s), infer = TRUE, null = 35)
## source response SE df lower.CL upper.CL null t.ratio p.value
## fish 29.8 1.09 23 27.5 32.1 35 -4.758 0.0001
## soy 39.1 1.47 23 36.1 42.2 35 2.827 0.0096
## skim 44.6 1.75 23 40.9 48.2 35 5.446 <.0001
##
## Results are averaged over the levels of: percent
## Confidence level used: 0.95
Notice that the structure no longer includes the transformation. That’s because it is no longer relevant; the reference grid is on the conc scale, and how we got there is now forgotten. Compare this summary() result with the preceding one: the estimates and standard errors are identical, but the confidence intervals and tests differ, because this time they are computed directly on the conc scale using the delta-method SEs.
Understood, right? But think carefully about how these EMMs were obtained. They are back-transformed from pigs.emm.s, in which the marginal averaging was done on the log scale. If we want to back-transform before doing the averaging, we need to call regrid() after the reference grid is constructed but before the averaging takes place:
pigs.rg <- ref_grid(pigs.lm)
pigs.remm.s <- emmeans(regrid(pigs.rg), "source")
summary(pigs.remm.s, infer = TRUE, null = 35)
## source response SE df lower.CL upper.CL null t.ratio p.value
## fish 30.0 1.10 23 27.7 32.2 35 -4.585 0.0001
## soy 39.4 1.49 23 36.3 42.5 35 2.927 0.0076
## skim 44.8 1.79 23 41.1 48.5 35 5.486 <.0001
##
## Results are averaged over the levels of: percent
## Confidence level used: 0.95
These results all differ from either of the previous two summaries – again, because the averaging is done on the conc scale rather than the log(conc) scale.
Note: For those who want to routinely back-transform before averaging, the transform argument in ref_grid() simplifies this. The first two steps above could have been done more easily as follows:
pigs.remm.s <- emmeans(pigs.lm, "source", transform = "response")
But don’t get transform and type confused. The transform argument is passed to regrid() after the reference grid is constructed, whereas the type argument is simply remembered and used by summary(). So a similar-looking call:
emmeans(pigs.lm, "source", type = "response")
will compute the results we have seen for pigs.emm.s – back-transformed after averaging on the log scale.
Remember again: When it comes to transformations, timing is everything.
Exactly the same ideas we have presented for response transformations apply to generalized linear models having non-identity link functions. As far as emmeans is concerned, there is no difference at all.
To illustrate, consider the neuralgia dataset provided in the package. These data come from an experiment, reported in a SAS technical report, comparing different treatments for neuralgia. The patient’s sex is an additional factor, and age is a covariate. The response is Pain, a binary variable indicating whether or not the patient reports neuralgia pain after treatment. The model suggested in the SAS report is equivalent to the following one, which we use to obtain estimated probabilities of experiencing pain:
neuralgia.glm <- glm(Pain ~ Treatment * Sex + Age, family = binomial(), data = neuralgia)
neuralgia.emm <- emmeans(neuralgia.glm, "Treatment", type = "response")
## NOTE: Results may be misleading due to involvement in interactions
neuralgia.emm
## Treatment prob SE df asymp.LCL asymp.UCL
## A 0.211 0.1109 Inf 0.0675 0.497
## B 0.121 0.0835 Inf 0.0285 0.391
## P 0.866 0.0883 Inf 0.5927 0.966
##
## Results are averaged over the levels of: Sex
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
(The note about the interaction is discussed shortly.) Note that the averaging over Sex is done on the logit scale, before the results are back-transformed for the summary. We may use pairs() to compare these estimates; note that logits are logs of odds, so this is another instance where log differences are back-transformed – in this case, to odds ratios:
pairs(neuralgia.emm, reverse = TRUE)
## contrast odds.ratio SE df z.ratio p.value
## B / A 0.513 0.515 Inf -0.665 0.7837
## P / A 24.234 25.142 Inf 3.073 0.0060
## P / B 47.213 57.242 Inf 3.179 0.0042
##
## Results are averaged over the levels of: Sex
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log odds ratio scale
So there is evidence of considerably more pain being reported with placebo (treatment P) than with either of the other two treatments. The estimated odds of pain with B are about half those with A, but this finding is not statistically significant. (The odds that this is a made-up dataset seem quite high, but that finding is strictly this author’s impression.)
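As a numerical check (a sketch using the model fitted above), each odds ratio is just the antilog of the corresponding difference of logit-scale EMMs:
logit.emm <- emmeans(neuralgia.glm, "Treatment")  # left on the logit scale
exp(summary(pairs(logit.emm, reverse = TRUE))$estimate)  # ~0.513, 24.234, 47.213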
Observe that there is a note in the output for neuralgia.emm that the results may be misleading. It is important to take this seriously: if two factors interact, marginal averages of predictions may not reflect what is happening at any level of the factors being averaged over. To find out, look at an interaction plot of the fitted model:
emmip(neuralgia.glm, Sex ~ Treatment)
There is no practical difference between females and males in the patterns of response to Treatment; so I think most people would be quite comfortable with the marginal results that are reported earlier.
It is possible to have a generalized linear model with a non-identity link and a response transformation. Here is an example, using the built-in warpbreaks dataset:
warp.glm <- glm(sqrt(breaks) ~ wool*tension, family = Gamma, data = warpbreaks)
ref_grid(warp.glm)
## 'emmGrid' object with variables:
## wool = A, B
## tension = L, M, H
## Transformation: "inverse"
## Additional response transformation: "sqrt"
The canonical link for a gamma model is the reciprocal (or inverse), and there is the square-root response transformation besides. If we choose type = "response" in summarizing, we undo both transformations:
emmeans(warp.glm, ~ tension | wool, type = "response")
## wool = A:
## tension response SE df asymp.LCL asymp.UCL
## L 42.9 5.24 Inf 33.2 53.7
## M 23.3 2.85 Inf 18.0 29.2
## H 23.6 2.88 Inf 18.3 29.6
##
## wool = B:
## tension response SE df asymp.LCL asymp.UCL
## L 27.4 3.35 Inf 21.3 34.4
## M 28.1 3.43 Inf 21.8 35.2
## H 18.5 2.26 Inf 14.3 23.2
##
## Confidence level used: 0.95
## Intervals are back-transformed from the sqrt scale
What happened here is that the linear predictor was first back-transformed from the link (inverse) scale; then the results were squared to back-transform the rest of the way. It is also possible to undo the link, but not the response transformation:
emmeans(warp.glm, ~ tension | wool, type = "unlink")
## wool = A:
## tension response SE df asymp.LCL asymp.UCL
## L 6.55 0.400 Inf 5.85 7.44
## M 4.83 0.295 Inf 4.31 5.48
## H 4.86 0.297 Inf 4.34 5.52
##
## wool = B:
## tension response SE df asymp.LCL asymp.UCL
## L 5.24 0.320 Inf 4.68 5.95
## M 5.30 0.324 Inf 4.73 6.02
## H 4.30 0.263 Inf 3.84 4.89
##
## Confidence level used: 0.95
## Intervals are back-transformed from the inverse scale
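A quick numeric check of this ordering (a sketch): squaring the "unlink" estimates recovers the fully back-transformed responses shown earlier.
summary(emmeans(warp.glm, ~ tension | wool, type = "unlink"))$response^2
# e.g., 6.55^2 = 42.9 for wool A, tension L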
It is not possible to undo the response transformation and leave the link in place, because the response was transformed first and the link was applied afterward in the model; we have to undo them in reverse order for the results to make sense.
One may also use "unlink" as a transform argument in regrid() or through ref_grid().
The make.tran() function provides several special transformations and sets things up so they can be handled in emmeans with relative ease. (See help("make.tran", "emmeans") for descriptions of what is available.) make.tran() works much like stats::make.link() in that it returns a list of functions linkfun(), linkinv(), etc. that serve in managing results on a transformed scale. The difference is that most transformations with make.tran() require additional arguments.
To use this capability in emmeans(), it is convenient to first obtain the make.tran() result, and then use it as the enclosing environment for fitting the model, with linkfun as the transformation. For example, suppose the response variable is a percentage and we want to use the response transformation \(\sin^{-1}\sqrt{y/100}\). Then proceed like this:
# Schematic example: lmer() is from the lme4 package, and mydata is hypothetical
tran <- make.tran("asin.sqrt", 100)
my.model <- with(tran,
    lmer(linkfun(percent) ~ treatment + (1|Block), data = mydata))
Subsequent calls to ref_grid(), emmeans(), regrid(), etc. will then be able to access the transformation information correctly.
The help page for make.tran() has an example like this using a Box-Cox transformation.
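For instance, a Box-Cox transformation can be set up the same way. A hypothetical sketch, assuming lambda = 0.5 and a data frame mydata with response yield:
bc.tran <- make.tran("boxcox", 0.5)
bc.model <- with(bc.tran, lm(linkfun(yield) ~ treatment, data = mydata))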
It is not at all uncommon to fit a model using statements like the following:
mydata <- transform(mydata, logy.5 = log(yield + 0.5))
my.model <- lmer(logy.5 ~ treatment + (1|Block), data = mydata)
In this case, there is no way for ref_grid() to figure out that a response transformation was used. What can be done is to update the reference grid with the required information:
my.rg <- update(ref_grid(my.model), tran = make.tran("genlog", .5))
Subsequently, use my.rg in place of my.model in any emmeans() analyses, and the transformation information will be there.
For standard transformations (those in stats::make.link()), just give the name of the transformation; e.g.,
model.rg <- update(ref_grid(model), tran = "sqrt")
As can be seen in the initial pigs.lm example in this vignette, certain straightforward response transformations such as log, sqrt, etc. are automatically detected when emmeans() (really, ref_grid()) is called on the model object. In fact, scaling and shifting are supported too; so the preceding example with my.model could have been done more easily by specifying the transformation directly in the model formula:
my.better.model <- lmer(log(yield + 0.5) ~ treatment + (1|Block), data = mydata)
The transformation would be auto-detected, saving you the trouble of adding it later. Similarly, a response transformation of 2 * sqrt(y + 1) would be correctly auto-detected. A model with a linearly transformed response, e.g. 4*(y - 1), would not be auto-detected, but 4*I(y + -1) would be interpreted as 4*identity(y + -1). Parsing is such that the response expression must be of the form mult * fcn(resp + const); the operators - and / are not recognized.
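Here is a sketch of those parsing rules in action, with hypothetical data (a data frame dat with positive response y and predictor x):
dat <- data.frame(y = runif(20, 1, 10), x = rnorm(20))
ref_grid(lm(2 * sqrt(y + 1) ~ x, data = dat))  # detected: mult * fcn(resp + const)
ref_grid(lm(4 * I(y + -1) ~ x, data = dat))    # detected, read as 4 * identity(y + -1)
# a response of 4 * (y - 1) would NOT be detected, since '-' is not recognized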
The regrid() function makes it possible to fake a log transformation of the response. Why would you want to do this? So that you can make comparisons using ratios instead of differences.
Consider the pigs example once again, but suppose we had fitted a model with a square-root transformation instead of a log:
pigroot.lm <- lm(sqrt(conc) ~ source + factor(percent), data = pigs)
piglog.emm.s <- regrid(emmeans(pigroot.lm, "source"), transform = "log")
confint(piglog.emm.s, type = "response")
## source response SE df lower.CL upper.CL
## fish 29.8 1.32 23 27.2 32.7
## soy 39.2 1.54 23 36.2 42.6
## skim 45.0 1.74 23 41.5 48.7
##
## Results are averaged over the levels of: percent
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
pairs(piglog.emm.s, type = "response")
## contrast ratio SE df t.ratio p.value
## fish / soy 0.760 0.0454 23 -4.591 0.0004
## fish / skim 0.663 0.0391 23 -6.965 <.0001
## soy / skim 0.872 0.0469 23 -2.548 0.0457
##
## Results are averaged over the levels of: percent
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log scale
These results are not identical to, but are very similar to, the back-transformed confidence intervals above for the EMMs and the pairwise ratios in the “comparisons” vignette, where the fitted model actually used a log response.
It is possible to fake transformations other than the log. Just use the same method; e.g.,
regrid(emm, transform = "probit")
would re-grid the existing emm to the probit scale. Note that any estimates in emm outside the interval \((0,1)\) will be flagged as non-estimable.
The section on standardized responses gives an example of reverse-engineering a standardized response transformation in this way.
In some disciplines, it is common to fit a model to a standardized response variable. R’s base function scale() makes this easy to do; but it is important to notice that scale(y) is more complicated than, say, sqrt(y), because scale(y) requires all the values of y in order to determine the centering and scaling parameters. The ref_grid() function (called by emmeans() and others) tries to detect the scaling parameters. To illustrate:
fiber.lm <- lm(scale(strength) ~ machine * scale(diameter), data = fiber)
emmeans(fiber.lm, "machine") # on the standardized scale
## machine emmean SE df lower.CL upper.CL
## A 0.00444 0.156 9 -0.349 0.358
## B 0.28145 0.172 9 -0.109 0.672
## C -0.33473 0.194 9 -0.774 0.105
##
## Results are given on the scale(40.2, 4.97) (not the response) scale.
## Confidence level used: 0.95
emmeans(fiber.lm, "machine", type = "response") # strength scale
## machine response SE df lower.CL upper.CL
## A 40.2 0.777 9 38.5 42.0
## B 41.6 0.858 9 39.7 43.5
## C 38.5 0.966 9 36.3 40.7
##
## Confidence level used: 0.95
## Intervals are back-transformed from the scale(40.2, 4.97) scale
More interesting (and complex) is what happens with emtrends(). Without anything fancy added, we have
emtrends(fiber.lm, "machine", var = "diameter")
## machine diameter.trend SE df lower.CL upper.CL
## A 0.222 0.0389 9 0.1339 0.310
## B 0.172 0.0450 9 0.0705 0.274
## C 0.174 0.0418 9 0.0791 0.268
##
## Confidence level used: 0.95
These slopes are (change in scale(strength)) / (change in diameter); that is, we didn’t do anything to undo the response transformation, but the trend is based on exactly the variable specified, diameter. To get (change in strength) / (change in diameter), we need to undo the response transformation, and that is done via the transform argument (which invokes regrid() after the reference grid is constructed):
emtrends(fiber.lm, "machine", var = "diameter", transform = "response")
## machine diameter.trend SE df lower.CL upper.CL
## A 1.104 0.194 9 0.666 1.54
## B 0.857 0.224 9 0.351 1.36
## C 0.864 0.208 9 0.394 1.33
##
## Confidence level used: 0.95
What if we want slopes for (change in scale(strength)) / (change in scale(diameter))? This can be done, but it is necessary to manually specify the scaling parameters for diameter:
with(fiber, c(mean = mean(diameter), sd = sd(diameter)))
## mean sd
## 24.133333 4.323799
emtrends(fiber.lm, "machine", var = "scale(diameter, 24.133, 4.324)")
## machine scale(diameter, 24.133, 4.324).trend SE df lower.CL upper.CL
## A 0.960 0.168 9 0.579 1.34
## B 0.745 0.195 9 0.305 1.19
## C 0.751 0.181 9 0.342 1.16
##
## Confidence level used: 0.95
This result is the one most directly related to the regression coefficients:
coef(fiber.lm)[4:6]
## scale(diameter) machineB:scale(diameter) machineC:scale(diameter)
## 0.9598846 -0.2148202 -0.2086880
There is a fourth possibility, (change in strength) / (change in scale(diameter)), that I leave to the reader.
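As a hint, here is a sketch of one route, combining the two devices already shown (the standardized covariate and the transform argument):
emtrends(fiber.lm, "machine", var = "scale(diameter, 24.133, 4.324)",
    transform = "response")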
Auto-detection of standardized responses is a bit tricky and doesn’t always succeed. If it fails, a message is displayed and the transformation is ignored. In cases where it doesn’t work, we need to explicitly specify the transformation using make.tran(). The methods are exactly as shown earlier in this vignette, so we show the code but not the results for a hypothetical example.
One method is to fit the model and then add the transformation information later. In this example, some.fcn is a model-fitting function that, for some reason, doesn’t allow the scaling information to be detected:
mod <- some.fcn(scale(RT) ~ group + (1|subject), data = mydata)
emmeans(mod, "group", type = "response",
tran = make.tran("scale", y = mydata$RT))
The other, equivalent, method is to create the transformation object first and use it in fitting the model:
mod <- with(make.tran("scale", y = mydata$RT),
some.fcn(linkfun(RT) ~ group + (1|subject), data = mydata))
emmeans(mod, "group", type = "response")
An interesting twist on all this is the reverse situation: suppose we fitted the model without the standardized response, but we want to know what the results would be if we had standardized. Here we reverse-engineer the fiber.lm example above:
fib.lm <- lm(strength ~ machine * diameter, data = fiber)
# On raw scale:
emmeans(fib.lm, "machine")
## machine emmean SE df lower.CL upper.CL
## A 40.2 0.777 9 38.5 42.0
## B 41.6 0.858 9 39.7 43.5
## C 38.5 0.966 9 36.3 40.7
##
## Confidence level used: 0.95
# On standardized scale:
tran <- make.tran("scale", y = fiber$strength)
emmeans(fib.lm, "machine", transform = tran)
## machine emmean SE df lower.CL upper.CL
## A 0.00444 0.156 9 -0.349 0.358
## B 0.28145 0.172 9 -0.109 0.672
## C -0.33473 0.194 9 -0.774 0.105
##
## Results are given on the scale(40.2, 4.97) (not the response) scale.
## Confidence level used: 0.95
In the latter call, the transform argument causes regrid() to be called after the reference grid is constructed.
So far, we have discussed ideas related to back-transforming results as a simple way of expressing results on the same scale as the response. In particular, means obtained in this way are known as generalized means; for example, a log transformation of the response is associated with geometric means. When the goal is simply to make inferences about which means are less than which other means, and a response transformation is used, it is often acceptable to present estimates and comparisons of these generalized means. However, sometimes it is important to report results that actually do reflect expected values of the untransformed response. An example is a financial study, where the response is in some monetary unit. It may be convenient to use a response transformation for modeling purposes, but ultimately we may want to make financial projections in those same units.
In such settings, we need to make a bias adjustment when we back-transform, because any nonlinear transformation biases the expected values of statistical quantities. More specifically, suppose that we have a response \(Y\) and the transformed response is \(U\). To back-transform, we use \(Y = h(U)\); a Taylor approximation gives \(Y \approx h(\eta) + h'(\eta)(U-\eta) + \frac12 h''(\eta)(U-\eta)^2\), so that \(E(Y) \approx h(\eta) + \frac12 h''(\eta)\,Var(U)\). This shows that the needed bias adjustment is approximately \(\frac12 h''(\eta)\sigma^2\), where \(\sigma\) is the error SD in the model for \(U\): the larger \(\sigma\) is, the greater the needed adjustment. This second-order bias adjustment is what is currently used in the emmeans package when bias adjustment is requested. There are better or exact adjustments for certain cases, and future updates may incorporate some of those.
Let us compare the estimates in the overview after we apply a bias adjustment. First, note that an estimate of the residual SD is available via the sigma() function:
sigma(pigs.lm)
## [1] 0.115128
This estimate is used by default. The bias-adjusted EMMs for the sources are:
summary(pigs.emm.s, type = "response", bias.adjust = TRUE)
## source response SE df lower.CL upper.CL
## fish 30.0 1.10 23 27.8 32.4
## soy 39.4 1.48 23 36.5 42.6
## skim 44.9 1.77 23 41.3 48.7
##
## Results are averaged over the levels of: percent
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
## Bias adjustment applied based on sigma = 0.11513
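As a check on the formula given earlier: for the log transformation, h(u) = exp(u) = h''(u), so the adjustment amounts to multiplying each back-transformed EMM by (1 + sigma^2/2). A minimal sketch, using objects defined above:
sig <- sigma(pigs.lm)
exp(summary(pigs.emm.s)$emmean) * (1 + sig^2 / 2)  # ~30.0, 39.4, 44.9, as above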
These estimates (and also their SEs) are slightly larger than we had without bias adjustment. They are estimates of the arithmetic mean responses, rather than the geometric means shown in the overview. Had the value of sigma been larger, the adjustment would have been greater. You can experiment with this by adding a sigma = argument to the above call.
At this point, it is important to point out that the above discussion focuses on response transformations, as opposed to link functions used in generalized linear models (GLMs). In an ordinary GLM, no bias adjustment is needed, nor is it appropriate, because the link function is just used to define a nonlinear relationship between the actual response mean \(\eta\) and the linear predictor. That is, the back-transformed parameter is already the mean.
To illustrate this, consider the InsectSprays data in the datasets package. The response variable is a count, and there is one treatment factor: the spray that is used. Let us model the count as a Poisson variable with (by default) a log link, and obtain the EMMs with and without a bias adjustment:
ismod <- glm(count ~ spray, data = InsectSprays, family = poisson())
emmeans(ismod, "spray", type = "response", bias.adj = FALSE)
## spray rate SE df asymp.LCL asymp.UCL
## A 14.50 1.099 Inf 12.50 16.82
## B 15.33 1.130 Inf 13.27 17.72
## C 2.08 0.417 Inf 1.41 3.08
## D 4.92 0.640 Inf 3.81 6.35
## E 3.50 0.540 Inf 2.59 4.74
## F 16.67 1.179 Inf 14.51 19.14
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
emmeans(ismod, "spray", type = "response", bias.adj = TRUE)
## spray rate SE df asymp.LCL asymp.UCL
## A 25.30 1.918 Inf 21.81 29.35
## B 26.76 1.972 Inf 23.16 30.91
## C 3.64 0.727 Inf 2.46 5.38
## D 8.58 1.117 Inf 6.65 11.07
## E 6.11 0.942 Inf 4.51 8.26
## F 29.08 2.056 Inf 25.32 33.41
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
## Bias adjustment applied based on sigma = 1.2206
These are substantially different! Which is right? Well, due to the simple structure of this dataset, the estimates should be well in line with the simple observed mean counts:
with(InsectSprays, tapply(count, spray, mean))
## A B C D E F
## 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
This illustrates that it is the non-bias-adjusted results that are appropriate. Again, the point is that a GLM does not have an additive error term; the model is already formulated in terms of the mean, not some generalized mean. Users must be very careful with this! There is no way to automatically do the right thing.
Note that, in a generalized linear mixed model, including generalized estimating equations and such, there are additive random components involved, and then bias adjustment becomes appropriate.
Consider an example adapted from the help page for lme4::cbpp. Contagious bovine pleuropneumonia (CBPP) is a disease in African cattle, and the dataset contains data on the incidence of CBPP in several herds of cattle over four time periods. We will fit a mixed model that accounts for herd variations as well as overdispersion (variation larger than expected with a simple binomial model):
require(lme4)
cbpp <- transform(cbpp, unit = 1:nrow(cbpp))
cbpp.glmer <- glmer(cbind(incidence, size - incidence) ~ period +
(1 | herd) + (1 | unit),
family = binomial, data = cbpp)
emm <- emmeans(cbpp.glmer, "period")
summary(emm, type = "response")
## period prob SE df asymp.LCL asymp.UCL
## 1 0.1824 0.0442 Inf 0.1109 0.2852
## 2 0.0614 0.0230 Inf 0.0290 0.1252
## 3 0.0558 0.0220 Inf 0.0254 0.1182
## 4 0.0334 0.0172 Inf 0.0120 0.0894
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
The above summary reflects the back-transformed estimates, with no bias adjustment. However, the model estimates two independent sources of random variation that probably should be taken into account:
lme4::VarCorr(cbpp.glmer)
## Groups Name Std.Dev.
## unit (Intercept) 0.89107
## herd (Intercept) 0.18396
Notably, the over-dispersion SD is considerably greater than the herd SD. Suppose we want to estimate the marginal probabilities of CBPP incidence, averaged over herds and over-dispersion variations. For this purpose, we need the combined effect of these variations; so we compute the overall SD via the Pythagorean theorem:
total.SD = sqrt(0.89107^2 + 0.18396^2)
Accordingly, here are the bias-adjusted estimates of the marginal probabilities:
summary(emm, type = "response", bias.adjust = TRUE, sigma = total.SD)
## period prob SE df asymp.LCL asymp.UCL
## 1 0.2216 0.0462 Inf 0.1426 0.321
## 2 0.0823 0.0292 Inf 0.0400 0.159
## 3 0.0751 0.0282 Inf 0.0351 0.151
## 4 0.0458 0.0230 Inf 0.0168 0.117
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## Bias adjustment applied based on sigma = 0.90986
These estimates are somewhat larger than the unadjusted estimates (actually, any estimates greater than 0.5 would have been adjusted downward). These adjusted estimates are more appropriate for describing the marginal incidence of CBPP for all herds. In fact, these estimates are fairly close to those obtained directly from the incidences in the data:
cases <- with(cbpp, tapply(incidence, period, sum))
trials <- with(cbpp, tapply(size, period, sum))
cases / trials
## 1 2 3 4
## 0.21942446 0.08018868 0.07106599 0.04516129
Left as an exercise: Revisit the InsectSprays example, but (using methods similar to those above) create a unit variable and fit an over-dispersion model. Compare the results with and without bias adjustment, and evaluate those results against the earlier ones. This is simpler than the CBPP example because there is only one random effect.
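A starting sketch for setting up that exercise (the comparison itself is left to you):
InsectSprays2 <- transform(InsectSprays, unit = 1:nrow(InsectSprays))
is.glmer <- glmer(count ~ spray + (1 | unit), family = poisson, data = InsectSprays2)
# then compare emmeans(is.glmer, "spray", type = "response") with and without
# bias.adjust = TRUE, using the SD from lme4::VarCorr(is.glmer) as sigma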