This is an R package for fitting semiparametric dynamic frailty models with the EM algorithm. The hazard for individual \(j\) from cluster \(i\) is specified as: \[ \lambda_{ij}(t | Z_i(t)) = Z_i(t) \exp(\beta^\top x_{ij}(t)) \lambda_0(t). \] The model used here is described in detail in Putter & van Houwelingen (2015). The distribution of \(Z_i(t)\) is described by two parameters: \(\theta\), an inverse-variability parameter of \(Z_i(t)\) at a fixed \(t\), and \(\lambda\), which describes the autocorrelation of the process, so that for \(t_1 \leq t_2\) \[ \mathrm{cor}(Z_i(t_1), Z_i(t_2)) = \exp(-\lambda (t_2 - t_1)). \]
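As a quick illustration, the autocorrelation function can be written out directly in R (a minimal sketch; frailty_autocor is our own name, not part of the package):
# autocorrelation of the frailty at two time points t1 <= t2,
# a direct transcription of the formula above
frailty_autocor <- function(lambda, t1, t2) exp(-lambda * (t2 - t1))
frailty_autocor(lambda = 0.1, t1 = 0, t2 = 10)  # exp(-1), about 0.37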
The estimation proceeds in two layers: for fixed \((\theta, \lambda)\), the profile likelihood is calculated by maximizing the likelihood with respect to \((\beta, \lambda_0)\); this profile likelihood is then itself maximized over \((\theta, \lambda)\).
The development version can be installed from GitHub:
devtools::install_github("tbalan/dynfrail")
The following packages are needed to build dynfrail:
install.packages(c("RcppArmadillo", "tibble", "magrittr", "dplyr", "tidyr"))
The functioning of the package is described in the documentation of the main fitting function, dynfrail(). dynfrail() has a friendly syntax very similar to the frailtyEM package: next to the formula and data arguments, the distribution argument is used to specify the distribution parameters and the control argument is used for controlling the precision of the estimation. dynfrail_prep() and dynfrail_fit() are used internally by dynfrail() but are made available to the user. The first prepares the input of dynfrail() to make it suitable for the actual EM algorithm; the second performs one EM algorithm for fixed \((\theta, \lambda)\) to estimate the corresponding maximum over \((\beta, \lambda_0)\).
We take the asthma data set from the parfm package and keep a subset where each individual is censored after 3 events.
library(parfm)
library(dplyr)
data(asthma)
small_asthma <- asthma %>%
  group_by(Patid) %>%
  mutate(linenr = 1:n()) %>%
  filter(linenr <= 3)
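As a quick sanity check (small_asthma is still grouped by Patid, so n() counts rows per patient):
# each patient should contribute at most 3 rows after the filter
small_asthma %>%
  summarise(n_lines = n()) %>%
  pull(n_lines) %>%
  max()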
This is a recurrent events data set in Andersen-Gill format with one covariate (Drug).
head(small_asthma)
## # A tibble: 6 x 7
## # Groups: Patid [2]
## Patid Begin End Status Drug Fevent linenr
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 0 15 1 0 1 1
## 2 1 22 90 1 0 0 2
## 3 1 96 325 1 0 0 3
## 4 2 0 180 1 1 1 1
## 5 2 189 267 1 1 0 2
## 6 2 273 581 1 1 0 3
This is a fit with a gamma frailty that is piecewise constant on 2 intervals (n_ints = 1; the frailty is piecewise constant on n_ints + 1 intervals):
library(dynfrail)
m2 <- dynfrail(Surv(Begin, End, Status) ~ Drug + cluster(Patid), data = small_asthma,
distribution = dynfrail_dist(n_ints = 1))
m2
## Call:
## dynfrail(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
## data = small_asthma, distribution = dynfrail_dist(n_ints = 1))
##
## log-likelihood: -3098.97
## theta: 1.742121 || 1/theta: 0.5740131
## lambda: 0.1
##
## coef exp(coef) se(coef) z p
## Drug -0.17224 0.84178 0.11797 -1.46002 0.1443
From the output we see that the estimated frailty variance (1/\(\theta\)) is around 0.57, and the estimated autocorrelation of the frailty at two time points 10 days apart is \(\exp(-0.1 \times 10) \approx 0.37\).
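These numbers can be reproduced directly from the printed output:
1 / 1.742121    # frailty variance = 1/theta, about 0.57
exp(-0.1 * 10)  # autocorrelation at a 10-day lag, exp(-1), about 0.37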
An inverse Gaussian fit with 3 intervals for the frailty works like this:
m3_ig <- dynfrail(Surv(Begin, End, Status) ~ Drug + cluster(Patid), data = small_asthma,
distribution = dynfrail_dist(dist = "pvf", n_ints = 2))
m3_ig
## Call:
## dynfrail(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
## data = small_asthma, distribution = dynfrail_dist(dist = "pvf",
## n_ints = 2))
##
## log-likelihood: -3099.887
## theta: 1.184186 || 1/theta: 0.8444616
## lambda: 0.09999966
##
## coef exp(coef) se(coef) z p
## Drug -0.15043 0.86034 0.11111 -1.35387 0.1758
The interpretation is similar to that of the previous model.
The baseline hazard can be plotted like this:
with(m3_ig, plot(tev, cumsum(hazard), main = "Baseline cumulative hazard", ylab = "H0", xlab = "time"))
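The same plot can also be made with ggplot2; a minimal sketch using the tev and hazard components accessed above:
library(ggplot2)
data.frame(time = m3_ig$tev, H0 = cumsum(m3_ig$hazard)) %>%
  ggplot(aes(x = time, y = H0)) +
  geom_step() +
  labs(title = "Baseline cumulative hazard", x = "time", y = "H0") +
  theme_classic()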
The empirical Bayes frailty estimates are stored in the frail_id element:
m3_ig$frail_id %>%
select(id, tstart, tstop, frail) %>%
head()
## id tstart tstop frail
## 1 1 0 15 1.0766615
## 2 1 22 90 1.0766615
## 3 1 96 123 1.0766615
## 4 1 123 278 0.4800052
## 5 1 278 325 1.1675580
## 6 2 0 123 0.5418477
Let’s plot the frailty estimates of individual 1:
library(ggplot2)
m3_ig$frail_id %>%
filter(id == 1) %>%
ggplot() + geom_segment(aes(x = tstart, xend = tstop, y = frail, yend = frail)) +
ylim(c(0, 3)) +
theme_classic()
For the first 3 individuals, the estimated frailties look like this:
m3_ig$frail_id %>%
filter(id <= 3) %>%
mutate(id = as.factor(id)) %>%
ggplot() + geom_segment(aes(x = tstart, xend = tstop, y = frail, yend = frail, colour = id)) +
ylim(c(0, 3)) +
theme_classic()
Now for 5 piecewise-constant intervals, again with inverse Gaussian frailty. This takes a while to run:
m5_ig <- dynfrail(Surv(Begin, End, Status) ~ Drug + cluster(Patid), data = small_asthma,
distribution = dynfrail_dist(dist = "pvf", n_ints = 4))
m5_ig
## Call:
## dynfrail(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
## data = small_asthma, distribution = dynfrail_dist(dist = "pvf",
## n_ints = 4))
##
## log-likelihood: -3096.869
## theta: 0.863205 || 1/theta: 1.158473
## lambda: 0.003033928
##
## coef exp(coef) se(coef) z p
## Drug -0.18569 0.83053 0.12955 -1.43338 0.1517
Now the first 3 individuals look like this:
m5_ig$frail_id %>%
filter(id <= 3) %>%
mutate(id = as.factor(id)) %>%
ggplot() + geom_segment(aes(x = tstart, xend = tstop, y = frail, yend = frail, colour = id)) +
ylim(c(0, 3)) +
theme_classic()
Look at the log-likelihoods:
c(m3_ig$loglik[2], m5_ig$loglik[2])
## [1] -3099.887 -3096.869
The model with 5 piecewise constant intervals has a higher log-likelihood than the one with 3 piecewise constant intervals.
We saw where the maximum likelihood of model m5_ig lies:
m5_ig
## Call:
## dynfrail(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
## data = small_asthma, distribution = dynfrail_dist(dist = "pvf",
## n_ints = 4))
##
## log-likelihood: -3096.869
## theta: 0.863205 || 1/theta: 1.158473
## lambda: 0.003033928
##
## coef exp(coef) se(coef) z p
## Drug -0.18569 0.83053 0.12955 -1.43338 0.1517
Say we want the likelihood for \(\theta = 3\) (variance of the frailty 0.33). This can be done in two steps:
args_5 <- dynfrail_prep(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
data = small_asthma, distribution = dynfrail_dist(dist = "pvf",
n_ints = 4))
lik_5 <- do.call(dynfrail_fit, c(logfrailtypar = list(c(log(3), m5_ig$loglambda)), args_5))
lik_5
## [1] 3106.414
Naturally, this likelihood is smaller than the maximum likelihood (note that dynfrail_fit() appears to return minus the log-likelihood, hence the positive value). This is useful, though: for example, in this way it can be seen how the likelihood varies in \(\lambda\). Furthermore, dynfrail_fit() may be plugged into any maximizer in this way.
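For instance, here is a minimal sketch with optim(), assuming (as the positive value above suggests) that dynfrail_fit() returns minus the log-likelihood, so that minimizing it maximizes the profile likelihood:
# starting values on the log scale: theta = 1, lambda = 0.1
opt_5 <- optim(par = c(log(1), log(0.1)),
               fn = function(p) do.call(dynfrail_fit, c(logfrailtypar = list(p), args_5)))
exp(opt_5$par)  # (theta, lambda) at the optimum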
With n_ints = 0, the frailty is constant over time and the result is the same as the regular shared frailty model:
m1 <- dynfrail(Surv(Begin, End, Status) ~ Drug + cluster(Patid), data = small_asthma,
distribution = dynfrail_dist(n_ints = 0))
m1
## Call:
## dynfrail(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
## data = small_asthma, distribution = dynfrail_dist(n_ints = 0))
##
## log-likelihood: -3104.823
## theta: 2.122279 || 1/theta: 0.4711915
## lambda: 0.1
##
## coef exp(coef) se(coef) z p
## Drug -0.15740 0.85436 0.13008 -1.21001 0.2263
Compare with the (gamma) shared frailty model:
library(frailtyEM)
m1_sf <- emfrail(Surv(Begin, End, Status) ~ Drug + cluster(Patid), data = small_asthma)
summary(m1_sf)
## Call:
## emfrail(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
## data = small_asthma)
##
## Regression coefficients:
## coef exp(coef) se(coef) adjusted se z p
## Drug -0.15740 0.85436 0.13014 0.13025 -1.20951 0.2265
## Estimated distribution: gamma / left truncation: FALSE
##
## Fit summary:
## Commenges-Andersen test for heterogeneity: p-val 2.47e-09
## (marginal) no-frailty Log-likelihood: -3123.292
## (marginal) Log-likelihood: -3104.823
## LRT: 1/2 * pchisq(36.9), p-val 6.1e-10
##
## Frailty summary:
## theta = 2.122 (0.47) / 95% CI: [1.435, 3.508]
## variance = 0.471 / 95% CI: [0.285, 0.697]
## Kendall's tau: 0.191 / 95% CI: [0.125, 0.258]
## Median concordance: 0.187 / 95% CI: [0.121, 0.256]
## E[log Z]: -0.254 / 95% CI: [-0.387, -0.149]
## Var[log Z]: 0.599 / 95% CI: [0.329, 0.992]
## Confidence intervals based on the likelihood function
Note that in this case the likelihood is completely flat in \(\lambda\)! This can be checked as was shown in the previous part.
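A minimal sketch of such a check, reusing the two-step approach from above with the n_ints = 0 model and \(\theta\) fixed at its estimate:
args_1 <- dynfrail_prep(formula = Surv(Begin, End, Status) ~ Drug + cluster(Patid),
                        data = small_asthma, distribution = dynfrail_dist(n_ints = 0))
# with a single interval the frailty never changes within a cluster,
# so these values should all be (nearly) identical
sapply(c(0.01, 0.1, 1), function(lambda)
  do.call(dynfrail_fit, c(logfrailtypar = list(c(log(2.122279), log(lambda))), args_1)))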
The information matrix is calculated with Louis' formula, for fixed \(\theta, \lambda\). Denoting all the other parameters by \(\gamma\), this means: \[ I = -\mathrm{E} \left[ \frac{d^2 l}{d \gamma \, d \gamma^\top} \right] - \mathrm{E}\left[ \frac{d l}{d \gamma} \frac{d l}{d \gamma^\top} \right] \] The first part involves the expectations of the second derivatives of the complete-data log-likelihood. That is: \[ l = \sum_k \left[ \sum_i \delta_{ki} ( \log h_{0ki} + \beta^\top x_{ki}) - \sum_i z_{ki} e^{\beta^\top x_{ki}} \Lambda_{0ki} \right] \] where \(k\) indexes the individual/cluster and \(i\) indexes a line in the data set. The lines are not those of the original data set, but of a data set that was split at the time points where the frailty changes. Then \(z_{ki}\) is the frailty on that line and \(\Lambda_{0ki}\) is the cumulative baseline hazard for that line.
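A direct transcription of this complete-data log-likelihood in R (a sketch; the argument names are our own, with one entry per line of the split data set):
# delta: event indicators; h0: baseline hazard on each line;
# x: covariate matrix; z: frailty values; Lambda0: cumulative baseline hazards
loglik_complete <- function(delta, h0, x, beta, z, Lambda0) {
  eta <- drop(x %*% beta)
  events <- delta == 1
  sum(log(h0[events]) + eta[events]) - sum(z * exp(eta) * Lambda0)
}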
The derivatives are as follows: \[
\frac{\partial l}{\partial \beta} = \sum_k \left[\sum_i x_{ki} \delta_{ki} - \sum_i x_{ki} z_{ki} e^{\beta^\top x_{ki}}\Lambda_{0ki} \right]
\] \[
\frac{\partial l}{\partial h_t} = \sum_k \left[\sum_i \frac{\delta_{ki}}{h_{0ki}} 1(R_{ki} = t) - \sum_i z_{ki} e^{\beta^\top x_{ki}} 1(t \in (L_{ki}, R_{ki}]) \right]
\] where \(R_{ki}\) is the tstop of the line and \(L_{ki}\) is the tstart of the line.
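In the same notation, the score with respect to \(\beta\) can be sketched as:
score_beta <- function(delta, x, beta, z, Lambda0) {
  eta <- drop(x %*% beta)
  # sum of covariates over events minus the frailty-weighted expected counterpart
  colSums(x * delta) - colSums(x * z * exp(eta) * Lambda0)
}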
The first matrix, that of the second derivatives, is easy to calculate; it is written entirely in R, so the code is pretty easy to decipher.
For the second matrix it gets a bit trickier. Denote the part within brackets of \(\partial l / \partial \beta\) by \(B_k\). Then we can write: \[ \mathrm{E} \left[\frac{\partial l}{\partial \beta} \frac{\partial l}{\partial \beta^\top} \right] = \mathrm{E} \left[\sum_k B_k \sum_l B_l^\top \right] = \sum_k \sum_{l \neq k} \mathrm{E} B_k \, \mathrm{E} B_l^\top + \sum_k \mathrm{E} B_k B_k^\top \] Now we use the fact that the score function is 0 at the maximum likelihood (so \(\sum_k \mathrm{E} B_k = 0\)), which means this can be written as \[ \mathrm{E} \left[\frac{\partial l}{\partial \beta} \frac{\partial l}{\partial \beta^\top} \right] = \sum_k \left[ \mathrm{E} B_k B_k^\top - \mathrm{E} B_k \, \mathrm{E} B_k^\top \right] \] We can further split \(B_k\) into \(B_{k1}\) and \(B_{k2}\), where the first part does not depend on the frailty. All expectations that involve only \(B_{k1}\) are the same in both terms of this equation, so they cancel out. As for the frailty, we know that the estimates are independent across clusters. Either way, we can write this term as \[ \sum_k \sum_{i,j} \left( \mathrm{E} [Z_{ki} Z_{kj}] - \mathrm{E} Z_{ki} \, \mathrm{E} Z_{kj} \right) xelpH_{ki} \, xelpH_{kj}^\top \] where \(xelpH_{ki} = x_{ki} e^{\beta^\top x_{ki}} \Lambda_{0ki}\).
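The frailty part of this sum can be sketched, for one cluster, as follows (EZZ would hold the joint moments \(\mathrm{E}[Z_{ki} Z_{kj}]\), EZ the means \(\mathrm{E} Z_{ki}\), and xelpH the matrix with rows \(x_{ki} e^{\beta^\top x_{ki}} \Lambda_{0ki}\); all names are ours, not the package's):
cov_correction <- function(EZZ, EZ, xelpH) {
  # sum_{i,j} (E[Z_i Z_j] - E[Z_i] E[Z_j]) xelpH_i xelpH_j^T,
  # written as a quadratic form in the centered joint-moment matrix
  t(xelpH) %*% (EZZ - tcrossprod(EZ)) %*% xelpH
}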
The rest of the combinations are handled in a similar fashion. Here is an outline of how the algorithm actually works: