This vignette is part of the R package mipred
. It documents data analysis for the calibration of predictive models using mipred
functions for binary outcome prediction when predictors contain missing values.
The package mipred
contains two basic functions. The first is mipred.cv
, which estimates cross-validated predictions when predictors contain missing values and using multiple imputation. The second is mipred
, which allows users to apply the same methodology to predict outcome for novel observations, based on past calibration data. Both the new observations, as well as the calibration data may contain missing observations in the predictor data. This document describes data analysis approaches using the mipred
package functions for the above objectives. We first discuss cross-validation of prediction with mipred.cv
, using both the averaging
and rubin
methods as described in the paper by (Mertens, Banzato, and Wreede 2018) to estimate the expected prediction performance on future data. We subsequently describe use of the mipred
function to estimated predictions on new patient data, based on past data. Finally, mipred
package functionality and options are discussed.
The CLL data is included in the mipred
package as a data frame. Refer to the published paper (ref) for some more detail.
data(cll)
head(cll)
#> id age10 perfstat remstat cyto asct
#> 1 1 0.08 <NA> CR other no prior ASCT
#> 2 2 0.82 Karnofsky 90 PR no abnormality no prior ASCT
#> 3 3 -0.28 Karnofsky 80 SD/PD other no prior ASCT
#> 4 4 -0.40 Karnofsky 90 PR other no prior ASCT
#> 5 5 -0.36 <NA> SD/PD del11q no prior ASCT
#> 6 6 0.89 Karnofsky 90 PR <NA> prior ASCT
#> donor sex_match cond srv5y srv5y_s
#> 1 Matched Related PATfemaleDONfemale NMA 60.02 0
#> 2 matched UD PATfemaleDONmale RIC 39.97 0
#> 3 matched UD PATfemaleDONfemale NMA 8.09 1
#> 4 matched UD PATfemaleDONmale RIC 47.79 0
#> 5 Matched Related PATmaleDONfemale RIC 6.11 1
#> 6 matched UD PATmaleDONmale RIC 31.36 0
The CLL data considers a survival problem subject to (right) censoring. To get the same data as discussed in the paper for binary outcome analysis, we first apply administrative censoring and generate the one-year binary outcome as below.
cll_bin <- cll # Generate a new data copy
cll_bin$srv5y_s[cll_bin$srv5y>12] <- 0 # Apply an administrative censorship at t=12 months
cll_bin$srv5y[cll_bin$srv5y>12] <- 12
cll_bin$Status[cll_bin$srv5y_s==1] <- 1 # Define the new binary "Status" outcome variable
cll_bin$Status[cll_bin$srv5y_s==0] <- 0 # Encoding is 1:Dead, 0:Alive
Cross-tabulate the number of patients artificially censored at 12 months versus the new binary outcome status indicator.
binary.outcome<-cll_bin$Status
artificially.censored<-(cll_bin$Status==0 & cll_bin$srv5y<12)
tableout <- table(artificially.censored,binary.outcome)
tableout
#> binary.outcome
#> artificially.censored 0 1
#> FALSE 465 184
#> TRUE 45 0
The percentage artificially censored is 100*45/694= 6.5, with 694=465+184+45 the total sample size. There are 184 events corresponding to 26.5 percent of the dataset.
Remove the original survival outcomes from the data frame before proceeding.
This completely defines the data.frame for use in further analysis. Aside from the (binary) outcome variable (Status
), we have a single continuous predictor (age10
). All other predictor variables (7 in total) are categorical, which gives 8 predictors in total. The first column contains a numeric identification variable with unique integers (1:694
) for each individual.
It makes sense to first check the missing data pattern as well as some simple exploratory data summaries on predictors. Load the mice package if not already loaded.
#> id age10 asct donor Status sex_match cond remstat perfstat cyto
#> 453 1 1 1 1 1 1 1 1 1 1 0
#> 137 1 1 1 1 1 1 1 1 1 0 1
#> 36 1 1 1 1 1 1 1 1 0 1 1
#> 12 1 1 1 1 1 1 1 1 0 0 2
#> 27 1 1 1 1 1 1 1 0 1 1 1
#> 10 1 1 1 1 1 1 1 0 1 0 2
#> 1 1 1 1 1 1 1 1 0 0 1 2
#> 2 1 1 1 1 1 1 1 0 0 0 3
#> 1 1 1 1 1 1 1 0 1 0 1 2
#> 5 1 1 1 1 1 1 0 1 0 0 3
#> 1 1 1 1 1 1 1 0 0 0 1 3
#> 1 1 1 1 1 1 1 0 0 0 0 4
#> 1 1 1 1 1 1 0 1 1 1 1 1
#> 3 1 1 1 1 1 0 1 1 1 0 2
#> 2 1 1 1 1 1 0 1 1 0 1 2
#> 1 1 1 1 1 1 0 1 1 0 0 3
#> 1 1 1 1 1 1 0 0 1 0 1 3
#> 0 0 0 0 0 8 9 42 63 171 293
There are 293 missing values in total. There are 241=694-453 records containing missing values. Most missing values occur in variables cyto
(171), perfstat
(63) and remstat
(42). Age is the only continuous predictor. Since we wish to investigate predictions and (cross-)validation in particular, it makes sense to explore the distribution of factor levels for the categorical predictors.
table(cll_bin$cyto)
#>
#> del17p del11q other no abnormality
#> 144 142 166 71
table(cll_bin$perfstat)
#>
#> Karnofsky 100 Karnofsky 90 Karnofsky 80 Karnofsky <=70
#> 186 307 113 25
table(cll_bin$remstat)
#>
#> CR PR SD/PD
#> 84 344 224
table(cll_bin$asct)
#>
#> no prior ASCT prior ASCT
#> 620 74
table(cll_bin$donor)
#>
#> Matched Related matched UD partially mismatched UD
#> 283 327 84
table(cll_bin$sex_match)
#>
#> PATmaleDONmale PATmaleDONfemale PATfemaleDONmale
#> 363 145 98
#> PATfemaleDONfemale
#> 80
table(cll_bin$cond)
#>
#> NMA RIC MAC
#> 223 360 102
These tables by default exclude the missing values. Note how the lowest factor level of remstat (Karnofsky <=70) is very sparse with only 25 observations. Similar output which retains the information on missing values can also be obtained from
summary(cll_bin)
#> id age10 perfstat remstat
#> Length:694 Min. :-3.58000 Karnofsky 100 :186 CR : 84
#> Class :character 1st Qu.:-0.58750 Karnofsky 90 :307 PR :344
#> Mode :character Median : 0.04000 Karnofsky 80 :113 SD/PD:224
#> Mean :-0.06703 Karnofsky <=70: 25 NA's : 42
#> 3rd Qu.: 0.51000 NA's : 63
#> Max. : 1.91000
#> cyto asct donor
#> del17p :144 no prior ASCT:620 Matched Related :283
#> del11q :142 prior ASCT : 74 matched UD :327
#> other :166 partially mismatched UD: 84
#> no abnormality: 71
#> NA's :171
#>
#> sex_match cond Status
#> PATmaleDONmale :363 NMA :223 Min. :0.0000
#> PATmaleDONfemale :145 RIC :360 1st Qu.:0.0000
#> PATfemaleDONmale : 98 MAC :102 Median :0.0000
#> PATfemaleDONfemale: 80 NA's: 9 Mean :0.2651
#> NA's : 8 3rd Qu.:1.0000
#> Max. :1.0000
We first estimate prediction results using internal validation on the cll_bin data, using all predictors and running 10-fold cross-validation (argument folds
) with 10 imputations (argument nimp
) for each prediction. Note the -1
notation in cll_bin[,-1]
is to remove the first column containing the observation identification number.
predcv_av <-
mipred.cv(
Status ~ age10 + perfstat + remstat + cyto + asct + donor + sex_match + cond,
family = binomial,
data = cll_bin[, -1],
nimp = 10,
folds = 10
)
The above code implements cross-validation for the averaging approach as described in the paper by Mertens, Banzato, and Wreede (2018). In brief, to generate 10 imputations, 10 copies of the data matrix are made and a separate (in this case 10-fold) fold-partition is defined for each matrix. In each matrix copy, each fold is then considered in turn as a validation set (with the remainder of the data as calibration) and the outcomes are removed (artificially set to missing) in the left-out fold. A single imputation is then generated on this data matrix. The (imputed) training potion of the data is then used to fit a (logistic regression) model, which is applied to the imputed data in the left-out (training) fold. This process is repeated for all folds. By applying the above calculation to all matrices, we obtain 10 cross-validated predictions, each on a single imputation (10 in total), for each observation.
The returned object is a list containing three elements: the function call
, as well as the predictions on the response scale (fitted probabilities in this case, in matrix pred
) and the linear predictor (linpred
). The prediction method used is the averaging method by default. The two prediction matrices are each saved in a matrix with 694 rows by 10 columns, with each column containing prediction results based on the prediction model calibrated on a single imputation. Rows correspond to the observations in the data matrix, in same order. The cross-validation fold-definitions used are different from column to column in the averaging approach. The function linking both matrices pred
and linpred
is the logit link (by default of the family specification family = binomial
).
head(predcv_av$pred)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> 1 0.1128680 0.1269674 0.1167276 0.1071034 0.09534024 0.1364446 0.1249438
#> 2 0.1842251 0.3527772 0.2507300 0.2949510 0.34232462 0.3103954 0.2619670
#> 3 0.2664314 0.2036986 0.2538854 0.2654940 0.25080963 0.2997654 0.2550819
#> 4 0.1787659 0.2079587 0.1481280 0.1637103 0.18579302 0.1472017 0.1553044
#> 5 0.3385423 0.3818292 0.3271067 0.5398961 0.37388643 0.4239001 0.3289980
#> 6 0.4020641 0.3119868 0.3349398 0.3885224 0.38323241 0.3616578 0.3441534
#> [,8] [,9] [,10]
#> 1 0.1343942 0.09427873 0.1635920
#> 2 0.2080130 0.27142416 0.3150706
#> 3 0.2654622 0.27826623 0.4013794
#> 4 0.1438300 0.18015161 0.1738653
#> 5 0.2840912 0.45497986 0.4279299
#> 6 0.3767491 0.37957767 0.2956087
head(predcv_av$linpred)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> 1 -2.0617743 -1.9280424 -2.0237904 -2.1206760 -2.2501070 -1.8451392
#> 2 -1.4879802 -0.6068539 -1.0947229 -0.8714580 -0.6529520 -0.7982714
#> 3 -1.0128045 -1.3633364 -1.0779962 -1.0176060 -1.0942989 -0.8484150
#> 4 -1.5247311 -1.3372738 -1.7493596 -1.6308770 -1.4775813 -1.7567195
#> 5 -0.6697969 -0.4817915 -0.7213004 0.1599245 -0.5155797 -0.3067832
#> 6 -0.3968719 -0.7908471 -0.6859269 -0.4535274 -0.4758507 -0.5681759
#> [,7] [,8] [,9] [,10]
#> 1 -1.9464241 -1.8626526 -2.2624760 -1.6317413
#> 2 -1.0357701 -1.3369445 -0.9874089 -0.7765190
#> 3 -1.0716894 -1.0177692 -0.9530780 -0.3997211
#> 4 -1.6935891 -1.7838370 -1.5153206 -1.5584771
#> 5 -0.7127206 -0.9242577 -0.1805696 -0.2903023
#> 6 -0.6448395 -0.5033696 -0.4913412 -0.8682974
To obtain predictions calculated from pooled Rubin’s rules models, we need to set the method option explicitly to “rubin”.
predcv_rb <-
mipred.cv(
Status ~ age10 + perfstat + remstat + cyto + asct + donor + sex_match + cond,
family = binomial,
data = cll_bin[, -1],
nimp = 10,
folds = 10,
method = "rubin"
)
For both the “rubin” and “averaging” method, the above codes generate 10 predictions (1 for each imputation, of which we have 10 in total). With the averaging method, these will typically differ, because a separate prediction model is generated for any left-out datum based on each single-imputed dataset. The “rubin” approach uses the pooled model as single prediction model. Nevertheless, any record to be predicted which itself contains missing data will also be imputed. Thus, there will generally also be 10 distinct predictions for such observations with the (pooled) rubin approach. Records which are fully observed will have the same predictions across all imputations as we are applying the same - pooled - model on the same fully observed record of predictors.
head(predcv_rb$pred)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> 1 0.1128963 0.1016203 0.1016203 0.1016203 0.1128963 0.1128963 0.1128963
#> 2 0.2723166 0.2723166 0.2723166 0.2723166 0.2723166 0.2723166 0.2723166
#> 3 0.2329543 0.2329543 0.2329543 0.2329543 0.2329543 0.2329543 0.2329543
#> 4 0.1825556 0.1825556 0.1825556 0.1825556 0.1825556 0.1825556 0.1825556
#> 5 0.3234484 0.4475113 0.4475113 0.3160147 0.3234484 0.3160147 0.3234484
#> 6 0.3282684 0.3255040 0.3282684 0.3077205 0.3255040 0.4103372 0.3255040
#> [,8] [,9] [,10]
#> 1 0.1016203 0.1016203 0.1016203
#> 2 0.2723166 0.2723166 0.2723166
#> 3 0.2329543 0.2329543 0.2329543
#> 4 0.1825556 0.1825556 0.1825556
#> 5 0.3234484 0.3234484 0.4475113
#> 6 0.3255040 0.3282684 0.4103372
head(predcv_rb$linpred)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> 1 -2.0614922 -2.1793497 -2.1793497 -2.1793497 -2.0614922 -2.0614922
#> 2 -0.9829007 -0.9829007 -0.9829007 -0.9829007 -0.9829007 -0.9829007
#> 3 -1.1917041 -1.1917041 -1.1917041 -1.1917041 -1.1917041 -1.1917041
#> 4 -1.4991283 -1.4991283 -1.4991283 -1.4991283 -1.4991283 -1.4991283
#> 5 -0.7379689 -0.2107314 -0.2107314 -0.7721476 -0.7379689 -0.7721476
#> 6 -0.7160274 -0.7285910 -0.7160274 -0.8107981 -0.7285910 -0.3625714
#> [,7] [,8] [,9] [,10]
#> 1 -2.0614922 -2.1793497 -2.1793497 -2.1793497
#> 2 -0.9829007 -0.9829007 -0.9829007 -0.9829007
#> 3 -1.1917041 -1.1917041 -1.1917041 -1.1917041
#> 4 -1.4991283 -1.4991283 -1.4991283 -1.4991283
#> 5 -0.7379689 -0.7379689 -0.7379689 -0.2107314
#> 6 -0.7285910 -0.7285910 -0.7160274 -0.3625714
Note how in the above prediction matrices pred
and linpred
the entries in rows 2-4 are the same for all 10 predictions, because these prediction are calculated from the same pooled model on observations which do not contain missing values. Observations 1, 5 and 6 however do contain missing values and hence, the imputations for these missing data also change across the 10 predictions, even though the prediction model itself is fixed. The final prediction for any record can be obtained by taking averages, medians or other suitable statistic across the distinct predictions. Here we use means for example, but other applications may well require another combination.
We can investigate some plots to get a feel for the calibrated predictions. Before proceeding, we first generate a missingness indicator for each observation, to distinguish predictions on fully observed records from those on records containing missing values, in tables and figures.
missrownrs<-sort(unique (unlist (lapply (cll_bin, function (x) which (is.na (x))))))
miss <- matrix(0, nrow=nrow(cll_bin), ncol=1)
miss[missrownrs]<-1
The below figure plots the combined predictions versus the status outcome and also compares predictions between the averaging and rubin combination method. Plotted predictions are colored red for observations containing missing predictor data and black otherwise. In the second row of plots, the left-most figure plots predictions from the rubin method versus those from the averaging approach. The second right-most figure, plots differences between predictions from both approaches versus the averaged predictions obtained from both the rubin and averaging method.
par(mfrow=c(2,2),pty="s")
boxplot(predfinal_av~cll_bin$Status)
title("averaging method")
boxplot(predfinal_rb~cll_bin$Status)
title("rubin method")
matplot(predfinal_av,predfinal_rb,pch=19,type="n",xlab="averaging method",ylab="rubin method")
matpoints(predfinal_av[miss==1],predfinal_rb[miss==1],col=2,pch=1)
matpoints(predfinal_av[miss==0],predfinal_rb[miss==0],col=1,pch=1)
title("rubin versus averaging method")
matplot((predfinal_av+predfinal_rb)/2,predfinal_av-predfinal_rb,pch=19,type="n",
xlab="average prediction",ylab="prediction difference")
matpoints((predfinal_av[miss==1]+predfinal_rb[miss==1])/2,
predfinal_av[miss==1]-predfinal_rb[miss==1],col=2,pch=1)
matpoints((predfinal_av[miss==0]+predfinal_rb[miss==0])/2,
predfinal_av[miss==0]-predfinal_rb[miss==0],col=1,pch=1)
title("differences versus average prediction")
The combined predictions as obtained from the mean seem to be similar between both approaches, but some variation is observed between predictions from both methods. Most of this variation is due to between-imputation variation. This raises some obvious questions.
We can investigate the impact of increased numbers of imputations for each prediction combination approach by running multiple replicates of the above analysis and with different numbers of imputations. Note the below code can take considerable time to run.
repslist_av <- vector("list", 3)
m <- c(1, 10, 100)
for (counter in 1:3) {
reps_av <- array(NA, dim = c(nrow(cll_bin), m[counter], 10))
for (rep in 1:10) {
reps_av[, , rep] <-
mipred.cv(
Status ~ age10 + perfstat + remstat + cyto + asct + donor + sex_match + cond,
family = binomial,
data = cll_bin[,-1],
nimp = m[counter],
folds = 10
)[[2]]
}
repslist_av[[counter]] <- reps_av
}
Now generate same analysis for the “rubin” method.
repslist_rb <- vector("list", 3)
m <- c(1, 10, 100)
for (counter in 1:3) {
reps_rb <- array(NA, dim = c(nrow(cll_bin), m[counter], 10))
for (rep in 1:10) {
reps_rb[, , rep] <-
mipred.cv(
Status ~ age10 + perfstat + remstat + cyto + asct + donor + sex_match + cond,
family = binomial,
data = cll_bin[,-1],
nimp = m[counter],
folds = 10,
method = "rubin"
)[[2]]
}
repslist_rb[[counter]] <- reps_rb
}
We first investigate the averaging method. To investigate the spread of individual predictions, calculated from models on a single imputation, we first plot the these constituent (individual) predictions (from repslist_av) versus the combined predictions (the means) for the averaging method. Use distinct plotting symbols for predictions based on fully observed records versus records that contained missing values. Use only the first replicate from nimp=100
(the third component of repslist_av
, corresponding to results for 100 imputations. Calculate the final prediction for each patient for this replicate by combining the individual predictions using means.
Now make the plot.
par(mfrow=c(1,2),pty="s")
matplot(avcv_mean3,repslist_av[[3]][,,1],pch=19,type="n",
xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_mean3[miss==1],repslist_av[[3]][miss==1,,1],col=2,pch=1)
matpoints(avcv_mean3[miss==0],repslist_av[[3]][miss==0,,1],col=1,pch=1)
matplot(avcv_mean3,
repslist_av[[3]][,,1]-avcv_mean3%*%matrix(1,nrow=1,ncol=ncol(repslist_av[[3]][,,1])),
pch=19,type="n",xlab="ordered mean probability",ylab="mean-centered individual imputed predictions")
matpoints(avcv_mean3[miss==1],
repslist_av[[3]][miss==1,,1]-
avcv_mean3[miss==1]%*%matrix(1,nrow=1,ncol=ncol(repslist_av[[3]][,,1])),
col=2,pch=1)
matpoints(avcv_mean3[miss==0],
repslist_av[[3]][miss==0,,1]-
avcv_mean3[miss==0]%*%matrix(1,nrow=1,ncol=ncol(repslist_av[[3]][,,1])),
col=1,pch=1)
The left plot shows individual predictions (from single-imputation-based models) versus the mean prediction. The right plot shows mean-centered predictions versus mean prediction. As can be seen, the individual predictions have very high levels of variation (around the mean prediction). Predictions on (partially missing records - plotted red) are much more variable than those on fully-observed records. Variation tends to decrease as the mean prediction approaches either 0 or 1.
Now investigate the Rubin method results in the same manner. Again only consider a single replication for nimp=100
. Plot the constituent (individual) predictions (from repslist_rb) versus the combined predictions (the means) for the rubin method. Use distinct plotting symbols for predictions based on fully observed records versus records that contained missing values.
First, calculate the final prediction for each patient for this replicate by combining the individual predictions using means.
Now make the plots.
par(mfrow=c(1,2),pty="s")
matplot(rbcv_mean3,repslist_rb[[3]][,,1],pch=19,type="n",
xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_mean3[miss==1],repslist_rb[[3]][miss==1,,1],col=2,pch=1)
matpoints(rbcv_mean3[miss==0],repslist_rb[[3]][miss==0,,1],col=1,pch=1)
matplot(rbcv_mean3,
repslist_rb[[3]][,,1]-rbcv_mean3%*%matrix(1,nrow=1,ncol=ncol(repslist_rb[[3]][,,1])),
pch=19,type="n",
xlab="ordered mean probability",ylab="mean-centered individual imputed predictions")
matpoints(rbcv_mean3[miss==1],
repslist_rb[[3]][miss==1,,1]-
rbcv_mean3[miss==1]%*%matrix(1,nrow=1,ncol=ncol(repslist_rb[[3]][,,1])),
col=2,pch=1)
matpoints(rbcv_mean3[miss==0],
repslist_rb[[3]][miss==0,,1]-
rbcv_mean3[miss==0]%*%matrix(1,nrow=1,ncol=ncol(repslist_rb[[3]][,,1])),
col=1,pch=1)
The left plot shows individual predictions (from the Rubin-rule pooled model on each of the [possibly imputed] records) versus the mean prediction. The right plot shows mean-centered predictions versus mean prediction. As can be seen, the individual predictions have very high levels of variation (around the mean level) when predicting on records which are partially observed (contain missing values). Note how there is no variability in predictions on fully-observed records now between the imputations, because we are predicting from a single pooled model. Predictions on (partially missing records - plotted red) are however variable because we also need to impute the missing data each time we want to predict (so there are
nimp=100
distinct predictions in that case). Variation tends to decrease as the mean prediction approaches either 0 or 1.
Next we study the differences (variation) between the predictions obtained from the averaging method when generating replicates of the prediction, calculated from the same number of imputations, but with distinct imputations. We repeat this analysis for nimp=1
, nimp=10
and nimp=100
imputations. In other words, we replicate the prediction analysis and investigate the impact of increasing the number of imputations used on the stability of the prediction.
In the below plots, the top row of plots show the replicated predictions versus the mean prediction. The bottom row of plots displays the mean-centered replicate predictions versus mean prediction. First calculate the mean across all imputations and replications.
avcv_means1 <- apply(repslist_av[[1]],1,mean) # one imputation
avcv_means2 <- apply(repslist_av[[2]],1,mean) # ten imputations
avcv_means3 <- apply(repslist_av[[3]],1,mean) # one hundred imputations
Now make the plots.
par(mfrow=c(2,3),pty="s")
matplot(avcv_means1,apply(repslist_av[[1]],c(1,3),mean),pch=19,type="n",
xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means1[miss==1],apply(repslist_av[[1]][miss==1,,,drop=FALSE],c(1,3),mean),
col=2,pch=1)
matpoints(avcv_means1[miss==0],apply(repslist_av[[1]][miss==0,,,drop=FALSE],c(1,3),mean),
col=1,pch=1)
title('single imputation')
matplot(avcv_means2,apply(repslist_av[[2]],c(1,3),mean),pch=19,type="n",
xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means2[miss==1],apply(repslist_av[[2]][miss==1,,],c(1,3),mean),
col=2,pch=1)
matpoints(avcv_means2[miss==0],apply(repslist_av[[2]][miss==0,,],c(1,3),mean),
col=1,pch=1)
title('10 imputations')
matplot(avcv_means3,apply(repslist_av[[3]],c(1,3),mean),pch=19,type="n",
xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means3[miss==1],apply(repslist_av[[3]][miss==1,,],c(1,3),mean),
col=2,pch=1)
matpoints(avcv_means3[miss==0],apply(repslist_av[[3]][miss==0,,],c(1,3),mean),col=1,pch=1)
title('100 imputations')
matplot(avcv_means1,apply(repslist_av[[1]],c(1,3),mean)-avcv_means1%*%matrix(1,nrow=1,ncol=10),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means1[miss==1],
apply(repslist_av[[1]][miss==1,,,drop=FALSE],c(1,3),mean)-
avcv_means1[miss==1]%*%matrix(1,nrow=1,ncol=10),
col=2,pch=1)
matpoints(avcv_means1[miss==0],
apply(repslist_av[[1]][miss==0,,,drop=FALSE],c(1,3),mean)-
avcv_means1[miss==0]%*%matrix(1,nrow=1,ncol=10),
col=1,pch=1)
matplot(avcv_means2,
apply(repslist_av[[2]],c(1,3),mean)-avcv_means2%*%matrix(1,nrow=1,ncol=10),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means2[miss==1],
apply(repslist_av[[2]][miss==1,,],c(1,3),mean)-
avcv_means2[miss==1]%*%matrix(1,nrow=1,ncol=10),
col=2,pch=1)
matpoints(avcv_means2[miss==0],
apply(repslist_av[[2]][miss==0,,],c(1,3),mean)-
avcv_means2[miss==0]%*%matrix(1,nrow=1,ncol=10),
col=1,pch=1)
matplot(avcv_means3,
apply(repslist_av[[3]],c(1,3),mean)-avcv_means3%*%matrix(1,nrow=1,ncol=10),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means3[miss==1],
apply(repslist_av[[3]][miss==1,,],c(1,3),mean)-
avcv_means3[miss==1]%*%matrix(1,nrow=1,ncol=10),
col=2,pch=1)
matpoints(avcv_means3[miss==0],
apply(repslist_av[[3]][miss==0,,],c(1,3),mean)-
avcv_means3[miss==0]%*%matrix(1,nrow=1,ncol=10),
col=1,pch=1)
We observe prediction on partially observed records to have lower levels of (between-replication) variation as compared to prediction on fully observed data. Variation reduces substantially when the numbers of imputations used are increased. Observation with mean predicted probabilities close to either 0 or 1 have lower between-replication variation.
Next investigate the dispersal between the predictions for replicates of the analyses with different number of imputations used, for the rubin method. Top row of plots is replicate predictions versus the mean prediction. Bottom row of plot is mean-centered replicate predictions versus mean prediction. First calculate the mean across all imputations and replications.
rbcv_means1 <- apply(repslist_rb[[1]],1,mean)
rbcv_means2 <- apply(repslist_rb[[2]],1,mean)
rbcv_means3 <- apply(repslist_rb[[3]],1,mean)
Now make the plots.
par(mfrow=c(2,3),pty="s")
matplot(rbcv_means1,apply(repslist_rb[[1]],c(1,3),mean),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means1[miss==1],apply(repslist_rb[[1]][miss==1,,,drop=FALSE],c(1,3),mean),
col=2,pch=1)
matpoints(rbcv_means1[miss==0],apply(repslist_rb[[1]][miss==0,,,drop=FALSE],c(1,3),mean),
col=1,pch=1)
title('single imputation')
matplot(rbcv_means2,apply(repslist_rb[[2]],c(1,3),mean),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means2[miss==1],apply(repslist_rb[[2]][miss==1,,],c(1,3),mean),col=2,pch=1)
matpoints(rbcv_means2[miss==0],apply(repslist_rb[[2]][miss==0,,],c(1,3),mean),col=1,pch=1)
title('10 imputations')
matplot(rbcv_means3,apply(repslist_rb[[3]],c(1,3),mean),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means3[miss==1],apply(repslist_rb[[3]][miss==1,,],c(1,3),mean),col=2,pch=1)
matpoints(rbcv_means3[miss==0],apply(repslist_rb[[3]][miss==0,,],c(1,3),mean),col=1,pch=1)
title('100 imputations')
matplot(rbcv_means1,
apply(repslist_rb[[1]],c(1,3),mean)-rbcv_means1%*%matrix(1,nrow=1,ncol=10),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means1[miss==1],
apply(repslist_rb[[1]][miss==1,,,drop=FALSE],c(1,3),mean)-
rbcv_means1[miss==1]%*%matrix(1,nrow=1,ncol=10),
col=2,pch=1)
matpoints(rbcv_means1[miss==0],
apply(repslist_rb[[1]][miss==0,,,drop=FALSE],c(1,3),mean)-
rbcv_means1[miss==0]%*%matrix(1,nrow=1,ncol=10),
col=1,pch=1)
matplot(rbcv_means2,
apply(repslist_rb[[2]],c(1,3),mean)-rbcv_means2%*%matrix(1,nrow=1,ncol=10),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means2[miss==1],
apply(repslist_rb[[2]][miss==1,,],c(1,3),mean)-
rbcv_means2[miss==1]%*%matrix(1,nrow=1,ncol=10),
col=2,pch=1)
matpoints(rbcv_means2[miss==0],
apply(repslist_rb[[2]][miss==0,,],c(1,3),mean)-
rbcv_means2[miss==0]%*%matrix(1,nrow=1,ncol=10),
col=1,pch=1)
matplot(rbcv_means3,
apply(repslist_rb[[3]],c(1,3),mean)-rbcv_means3%*%matrix(1,nrow=1,ncol=10),
pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means3[miss==1],
apply(repslist_rb[[3]][miss==1,,],c(1,3),mean)-
rbcv_means3[miss==1]%*%matrix(1,nrow=1,ncol=10),
col=2,pch=1)
matpoints(rbcv_means3[miss==0],
apply(repslist_rb[[3]][miss==0,,],c(1,3),mean)-
rbcv_means3[miss==0]%*%matrix(1,nrow=1,ncol=10),
col=1,pch=1)
We again observe prediction on partially observed records to have lower levels of (between-replication) variation as compared to prediction on fully observed data. Variation reduces substantially when the numbers of imputations used are increased. Observation with mean predicted probabilities close to either 0 or 1 have lower between-replication variation. Note however that the dispersal seem to be larger as compared to results from the averaging method and this effect is particularly noticeable for the higher numbers of imputations.
To further study the variation due to imputation, we summarize variation using the `R’ statistics as defined in the paper (ref). First define a function that calculates the required statistics on the replicated analyses.
R.statistic <- function(reps, miss){
resmat<-matrix(NA,nrow=2,ncol=7)
dimnames(resmat)<-list(c("missing","observed"),
c("R (range)","q10","median","q90","missing","replicates", "mean"))
means <- apply(reps, 1, mean)
diffs<-reps-means%*%matrix(1,nrow=1,ncol=ncol(reps)) # remove means
# variation between p=0.2 and 0.8, and fully observed records
diffssel<-diffs[means<0.8&means>0.2&miss==0,] # select between 0.2 and 0.8 and observed records
quant<-quantile(as.vector(diffssel),c(.10, 0.5, .90))
resmat[2,2:4]<-quant
resmat[2,1]<-quant[3]-quant[1] # R measure
resmat[2,5]<-0 #missing record? 1=yes, 0=no
resmat[2,6]<-ncol(reps) # number of replicates
resmat[2,7]<-mean(as.vector(diffssel)) # mean
# variation between p=0.2 and 0.8, and missing records
diffssel<-diffs[means<0.8&means>0.2&miss==1,] # select between 0.2 and 0.8 and missing records
quant<-quantile(as.vector(diffssel),c(.10, 0.5, .90))
resmat[1,2:4]<-quant
resmat[1,1]<-quant[3]-quant[1] # R measure
resmat[1,5]<-1 #missing record? 1=yes, 0=no
resmat[1,6]<-ncol(reps) # number of replicates
resmat[1,7]<-mean(as.vector(diffssel)) # mean
resmat
}
Now apply the function to the replicated results from the averaging approach and combine the summary measures in a matrix. Multiply all statistics by a factor 100 so they can be read at percentage scale.
R1.av <- R.statistic(apply(repslist_av[[1]],c(1,3),mean),miss)
R2.av <- R.statistic(apply(repslist_av[[2]],c(1,3),mean),miss)
R3.av <- R.statistic(apply(repslist_av[[3]],c(1,3),mean),miss)
# Make percentages
Rstat.av <-
rbind(R1.av, R2.av, R3.av)*matrix(rep(c(100,100,100,100,1,1,100),6),byrow=T,ncol=7)
Rstat.av.miss <- t(Rstat.av[c(1,3,5),1])
Rstat.av.obsv <- t(Rstat.av[c(2,4,6),1])
Now do the same for the replicated results from the rubin approach and combine the summary measures in a matrix. Multiply all statistics by a factor 100 so they can be read at percentage scale.
R1.rb <- R.statistic(apply(repslist_rb[[1]],c(1,3),mean),miss)
R2.rb <- R.statistic(apply(repslist_rb[[2]],c(1,3),mean),miss)
R3.rb <- R.statistic(apply(repslist_rb[[3]],c(1,3),mean),miss)
# Make percentages
Rstat.rb <-
rbind(R1.rb, R2.rb, R3.rb)*matrix(rep(c(100,100,100,100,1,1,100),6),byrow=T,ncol=7)
Rstat.rb.miss <- t(Rstat.rb[c(1,3,5),1])
Rstat.rb.obsv <- t(Rstat.rb[c(2,4,6),1])
Produce the variance `decay’ plots for both approaches.
par(mfrow=c(1,2),pty="s")
index <- c(1, 10, 100)
# first plot - averaging approach.
matplot(index,cbind(t(Rstat.av.miss), t(Rstat.av.obsv)),
type="n",log="x",pch=NULL,ylim=c(0,15),axes=F,
xlab="number of imputations",ylab="percentage deviation R")
matlines(index,cbind(t(Rstat.av.miss), t(Rstat.av.obsv)),
type="b",lty=c(1,1,2,2),col=1,log="x",pch=c(1,20))
axis(1,at=index)
axis(2)
box()
title("Averaging approach")
# second plot - rubin approach.
matplot(index,cbind(t(Rstat.rb.miss), t(Rstat.rb.obsv)),
type="n",log="x",pch=NULL,ylim=c(0,15),axes=F,
xlab="number of imputations",ylab="percentage deviation R")
matlines(index,cbind(t(Rstat.rb.miss), t(Rstat.rb.obsv)),
type="b",lty=c(1,1,2,2),col=1,log="x",pch=c(1,20))
axis(1,at=c(1,10,100))
axis(2)
box()
title("Rubin approach")
Present the results as table, for approach 1 (prediction-averaging).
knitr::kable(round(Rstat.av, digits=2),
caption="Variance summaries of deviation relative to the mean prediction between
replicate predictions for different imputations with the averaging approach.
All statistics were multiplied by 100.")
R (range) | q10 | median | q90 | missing | replicates | mean | |
---|---|---|---|---|---|---|---|
missing | 15.37 | -7.74 | -0.36 | 7.63 | 1 | 10 | 0 |
observed | 9.13 | -4.62 | 0.02 | 4.51 | 0 | 10 | 0 |
missing | 4.51 | -2.26 | 0.01 | 2.24 | 1 | 10 | 0 |
observed | 2.84 | -1.39 | -0.02 | 1.45 | 0 | 10 | 0 |
missing | 1.43 | -0.72 | -0.01 | 0.72 | 1 | 10 | 0 |
observed | 0.90 | -0.45 | 0.00 | 0.45 | 0 | 10 | 0 |
Present the results as table, for approach 2 (rubin).
knitr::kable(round(Rstat.rb, digits=2),
caption="Variance summaries of deviation relative to the mean prediction between
predictions for different imputations with the rubin approach.
All statistics were multiplied by 100.")
R (range) | q10 | median | q90 | missing | replicates | mean | |
---|---|---|---|---|---|---|---|
missing | 14.56 | -6.82 | -0.53 | 7.74 | 1 | 10 | 0 |
observed | 9.06 | -4.40 | -0.04 | 4.66 | 0 | 10 | 0 |
missing | 7.72 | -3.66 | -0.17 | 4.06 | 1 | 10 | 0 |
observed | 7.27 | -3.62 | -0.01 | 3.65 | 0 | 10 | 0 |
missing | 6.17 | -3.16 | 0.05 | 3.02 | 1 | 10 | 0 |
observed | 6.92 | -3.45 | -0.01 | 3.47 | 0 | 10 | 0 |
Next, we investigate Brier scores and AUCs. We use the pROC
package for the AUC calculations. We again first define a function which calculates the required measures.
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
Briers <- function(status, reps, miss) {
resmat <- matrix(NA, nrow = 3, ncol = 4)
dimnames(resmat) <-
list(c("missing", "all", "complete"), c("Briermean", "Briersd", "AUCmean", "AUCsd"))
statustotal <- status
statusmissing <- status[miss == 1]
statusobserved <- status[miss == 0]
Briers <-
apply(reps, 2, function(x)
mean((statustotal - x) ^ 2)) # get Briers for each replicate analysis
Briersmissing <-
apply(reps[miss == 1, , drop=F], 2, function(x)
mean((statusmissing - x) ^ 2)) # for missing records only
Briersobserved <-
apply(reps[miss == 0, , drop=F], 2, function(x)
mean((statusobserved - x) ^ 2)) # for complete records
AUCs <-
apply(reps, 2, function(x)
auc(roc(statustotal, x))) # get AUCs for each replicate analysis
AUCmissing <-
apply(reps[miss == 1, , drop = F], 2, function(x)
auc(roc(statusmissing, x))) # for missing records only
AUCobserved <-
apply(reps[miss == 0, , drop = F], 2, function(x)
auc(roc(statusobserved, x))) # for complete records
resmat[1,1]<-mean(Briersmissing)
resmat[2,1]<-mean(Briers)
resmat[3,1]<-mean(Briersobserved)
resmat[1,2]<-sd(Briersmissing)
resmat[2,2]<-sd(Briers)
resmat[3,2]<-sd(Briersobserved)
resmat[1,3]<-mean(AUCmissing)
resmat[2,3]<-mean(AUCs)
resmat[3,3]<-mean(AUCobserved)
resmat[1,4]<-sd(AUCmissing)
resmat[2,4]<-sd(AUCs)
resmat[3,4]<-sd(AUCobserved)
resmat
}
B1.av <- Briers(cll_bin$Status, apply(repslist_av[[1]],c(1,3),mean),miss)
B2.av <- Briers(cll_bin$Status, apply(repslist_av[[2]],c(1,3),mean),miss)
B3.av <- Briers(cll_bin$Status, apply(repslist_av[[3]],c(1,3),mean),miss)
B1.rb <- Briers(cll_bin$Status, apply(repslist_rb[[1]],c(1,3),mean),miss)
B2.rb <- Briers(cll_bin$Status, apply(repslist_rb[[2]],c(1,3),mean),miss)
B3.rb <- Briers(cll_bin$Status, apply(repslist_rb[[3]],c(1,3),mean),miss)
Brier.av <- 100*rbind(B1.av, B2.av, B3.av) # Make percentages
Brier.rb <- 100*rbind(B1.rb, B2.rb, B3.rb) # Make percentages
Brier/AUC results as table - for approach 1, averaging
knitr::kable(round(Brier.av, digits=2),
caption = "Brier scores and AUCs calculated on predictions generated from
the averaging method with different numbers of imputations.
Standard deviations are shown for 10 replicate analyses.
All statistics were multiplied by 100.")
Briermean | Briersd | AUCmean | AUCsd | |
---|---|---|---|---|
missing | 20.63 | 0.32 | 63.02 | 0.87 |
all | 18.92 | 0.17 | 61.84 | 0.59 |
complete | 18.02 | 0.17 | 60.83 | 0.81 |
missing | 20.46 | 0.13 | 63.48 | 0.54 |
all | 18.78 | 0.08 | 62.54 | 0.38 |
complete | 17.89 | 0.06 | 61.52 | 0.36 |
missing | 20.41 | 0.04 | 63.89 | 0.21 |
all | 18.76 | 0.01 | 62.73 | 0.09 |
complete | 17.89 | 0.02 | 61.58 | 0.10 |
Brier/AUC results as table - for approach 2, rubin
knitr::kable(round(Brier.rb, digits=2),
caption = "Brier scores and AUCs calculated on predictions generated from
the rubin method with different numbers of imputations.
Standard deviations are shown for 10 replicate analyses.
All statistics were multiplied by 100.")
Briermean | Briersd | AUCmean | AUCsd | |
---|---|---|---|---|
missing | 20.78 | 0.40 | 62.50 | 1.32 |
all | 18.98 | 0.15 | 61.79 | 0.82 |
complete | 18.03 | 0.21 | 61.06 | 1.12 |
missing | 20.53 | 0.27 | 63.39 | 1.37 |
all | 18.85 | 0.17 | 62.21 | 0.88 |
complete | 17.95 | 0.18 | 61.07 | 1.01 |
missing | 20.48 | 0.26 | 63.59 | 1.24 |
all | 18.78 | 0.16 | 62.65 | 0.82 |
complete | 17.88 | 0.15 | 61.65 | 0.86 |
Make plots of Brier scores for both averaging (plotting symbol ‘1’) and rubin (plotting symbol ‘2’) approach versus number of imputations. In the top row of plots, we redraw the y-axis to fit the range of scores for each plot. The bottom row of plots shows exactly the same data, but with y-axes fixed to the range [17,21].
par(mfrow=c(2,3),pty="s")
index <- c(1, 10, 100)
matplot(index,cbind(Brier.av[c(1,4,7),1,drop=F],Brier.rb[c(1,4,7),1,drop=F]),
type="n",log="x",pch=NULL,axes=F,xlab="number of imputations",ylab="mean Brier score")
matlines(index,Brier.av[c(1,4,7),1,drop=F],
type="b",lty=c(1,1,2,2),col=1,log="x",pch="1")
matlines(index,Brier.rb[c(1,4,7),1,drop=F],
type="b",lty=c(1,1,2,2),col=1,log="x",pch="2")
axis(1,at=index)
axis(2)
box()
title("Missing data")
matplot(index,cbind(Brier.av[c(2,5,8),1,drop=F],Brier.rb[c(2,5,8),1,drop=F]),
type="n",log="x",pch=NULL,axes=F,xlab="number of imputations",ylab="mean Brier score")
matlines(index,Brier.av[c(2,5,8),1,drop=F],
type="b",lty=c(1,1,2,2),col=1,log="x",pch="1")
matlines(index,Brier.rb[c(2,5,8),1,drop=F],
type="b",lty=c(1,1,2,2),col=1,log="x",pch="2")
axis(1,at=index)
axis(2)
box()
title("All data")
matplot(index,cbind(Brier.av[c(3,6,9),1,drop=F],Brier.rb[c(3,6,9),1,drop=F]),
type="n",log="x",pch=NULL,axes=F,xlab="number of imputations",ylab="mean Brier score")
matlines(index,Brier.av[c(3,6,9),1,drop=F],
type="b",lty=c(1,1,2,2),col=1,log="x",pch="1")
matlines(index,Brier.rb[c(3,6,9),1,drop=F],
type="b",lty=c(1,1,2,2),col=1,log="x",pch="2")
axis(1,at=index)
axis(2)
box()
title("Fully observed data")
matplot(index,Brier.av[c(1,4,7),1,drop=F],type="n",log="x",
ylim=c(17.5,21),pch=NULL,axes=F,xlab="number of imputations",ylab="mean Brier score")
matlines(index,Brier.av[c(1,4,7),1,drop=F],
type="b",ylim=c(17.5,21),lty=c(1,1,2,2),col=1,log="x",pch="1")
matlines(index,Brier.rb[c(1,4,7),1,drop=F],
type="b",ylim=c(17.5,21),lty=c(1,1,2,2),col=1,log="x",pch="2")
axis(1,at=index)
axis(2)
box()
title("Missing data")
matplot(index,Brier.av[c(2,5,8),1,drop=F],type="n",log="x",
ylim=c(17.5,21),pch=NULL,axes=F,xlab="number of imputations",ylab="mean Brier score")
matlines(index,Brier.av[c(2,5,8),1,drop=F],
type="b",ylim=c(17.5,21),lty=c(1,1,2,2),col=1,log="x",pch="1")
matlines(index,Brier.rb[c(2,5,8),1,drop=F],
type="b",ylim=c(17.5,21),lty=c(1,1,2,2),col=1,log="x",pch="2")
axis(1,at=index)
axis(2)
box()
title("All data")
matplot(index,Brier.av[c(3,6,9),1,drop=F],type="n",log="x",ylim=c(17.5,21),
pch=NULL,axes=F,xlab="number of imputations",ylab="mean Brier score")
matlines(index,Brier.av[c(3,6,9),1,drop=F],
type="b",ylim=c(17.5,21),lty=c(1,1,2,2),col=1,log="x",pch="1")
matlines(index,Brier.rb[c(3,6,9),1,drop=F],
type="b",ylim=c(17.5,21),lty=c(1,1,2,2),col=1,log="x",pch="2")
axis(1,at=index)
axis(2)
box()
title("Fully observed data")
We can see from the top row of plots that there are small reductions in Brier score as the number of imputations increase, but these are small. The bottom row of plots shows that Brier scores barely reduce as number of imputations increase. Predictions generated on records containing missing data (left plots) have higher Brier scores and predictions on fully observed records (right plots) have the lowest. Brier scores calculated across all data (middle plots) are in between those for fully and partially observed records.
In this section we discuss use of mipred
to generate predictions for new observations - which may themselves contain missing values in their predictors - using either the averaging or the rubin approach and based on previously observed calibration data.
Imagine we have predictor information on 3 new patients for whom we wish to predict prognosis. As in the available calibration data studied before, the new patient record contains two missing values, for two patients. The status (outcome) variable is of course as yet unknown and to be predicted. We first load the new predictor values and display the data.
head(cll_bin_new)
#> id age10 perfstat remstat cyto asct
#> 695 695 -0.64 Karnofsky 100 <NA> other no prior ASCT
#> 696 696 0.95 Karnofsky 80 PR no abnormality prior ASCT
#> 697 697 -1.26 Karnofsky 100 CR <NA> prior ASCT
#> donor sex_match cond
#> 695 matched UD PATmaleDONmale RIC
#> 696 Matched Related PATmaleDONmale MAC
#> 697 matched UD PATmaleDONfemale RIC
Predictions can be easily obtained using either of both model approaches, either the averaging or rubin model, as follows. As our previous assessment of the predictive potential has shown that the maximum number of imputations (100) are required to achieve minimum variation for the averaging approach within the range investigated (nimp
=1 to 100), we will choose nimp=100
throughout in this section. (We leave the issue of whether even higher numbers of imputations could further reduce variation for subsequent research.) For the averaging-based prediction model we just write
pred_av <-
mipred(
Status ~ age10 + perfstat + remstat + cyto + asct + donor + sex_match + cond,
family = binomial,
data = cll_bin[, -1],
newdata = cll_bin_new[, -1],
nimp = 100,
folds = 1
)
head(pred_av$pred)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> 695 0.2205908 0.2565672 0.1813967 0.1724127 0.2800818 0.2000125 0.2484732
#> 696 0.3446457 0.3226828 0.3839933 0.3511103 0.3359150 0.3204467 0.3452571
#> 697 0.5144153 0.4679354 0.4826964 0.4824712 0.4869812 0.5158249 0.5391403
#> [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> 695 0.1834750 0.2437937 0.2944592 0.1857825 0.1795917 0.3048260 0.2832052
#> 696 0.3162901 0.3362898 0.3479651 0.3012543 0.3829905 0.3108366 0.3043695
#> 697 0.4688064 0.5098817 0.4721677 0.5166989 0.4760365 0.5097228 0.4876580
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21]
#> 695 0.2473953 0.2470342 0.2330938 0.2495194 0.2107948 0.1729583 0.2606597
#> 696 0.3491989 0.3322354 0.3519739 0.3433898 0.3357030 0.3300267 0.3263113
#> 697 0.4601169 0.5386392 0.5633533 0.4286717 0.4996080 0.4578825 0.4917737
#> [,22] [,23] [,24] [,25] [,26] [,27] [,28]
#> 695 0.2498153 0.3120453 0.1910403 0.2476195 0.3104241 0.1830389 0.1915698
#> 696 0.3445127 0.3463293 0.3396873 0.3293005 0.3997151 0.2976129 0.4068006
#> 697 0.5486854 0.5058294 0.5178237 0.4766270 0.5063139 0.5439874 0.4824686
#> [,29] [,30] [,31] [,32] [,33] [,34] [,35]
#> 695 0.2233264 0.1908162 0.2871296 0.2351029 0.1884479 0.2053054 0.2371369
#> 696 0.3260395 0.3442352 0.3265322 0.3898461 0.3450795 0.3118002 0.3307605
#> 697 0.5545484 0.4799849 0.5358671 0.4457863 0.4904145 0.5072880 0.4694478
#> [,36] [,37] [,38] [,39] [,40] [,41] [,42]
#> 695 0.2883438 0.2654965 0.2496988 0.2423848 0.1811848 0.2591196 0.2431111
#> 696 0.3145142 0.3380308 0.3218346 0.3706435 0.3577158 0.3811084 0.3687986
#> 697 0.5124745 0.5086370 0.4728472 0.4581080 0.4585015 0.6057770 0.4898566
#> [,43] [,44] [,45] [,46] [,47] [,48] [,49]
#> 695 0.2597611 0.2894995 0.2623297 0.2827298 0.2147264 0.1705098 0.1509466
#> 696 0.3892342 0.3754001 0.3807093 0.2860452 0.3439104 0.3275329 0.3388833
#> 697 0.4305788 0.4782986 0.4645583 0.4536501 0.4318985 0.4529227 0.4532270
#> [,50] [,51] [,52] [,53] [,54] [,55] [,56]
#> 695 0.1728357 0.2885496 0.1779934 0.2443862 0.2318284 0.2693758 0.2521346
#> 696 0.3551222 0.3372474 0.3751368 0.3634926 0.3221564 0.4052459 0.3568232
#> 697 0.4659437 0.5406795 0.4915864 0.5626581 0.4591299 0.5160695 0.4950585
#> [,57] [,58] [,59] [,60] [,61] [,62] [,63]
#> 695 0.2464312 0.1843708 0.2376014 0.1871439 0.2564488 0.2724520 0.1923588
#> 696 0.3289145 0.3922183 0.4081546 0.3213925 0.3813590 0.3521960 0.2798270
#> 697 0.4840974 0.5367358 0.4390537 0.4721716 0.4347247 0.4727034 0.5186847
#> [,64] [,65] [,66] [,67] [,68] [,69] [,70]
#> 695 0.2357903 0.2108381 0.3043402 0.2670312 0.1441590 0.2127226 0.2494717
#> 696 0.3803099 0.3012849 0.3461816 0.3331296 0.3231126 0.3474329 0.3834110
#> 697 0.5023695 0.5052779 0.4568016 0.4768271 0.4745900 0.4475892 0.4766399
#> [,71] [,72] [,73] [,74] [,75] [,76] [,77]
#> 695 0.2576544 0.2778810 0.2076618 0.2531831 0.2592721 0.1708103 0.3068865
#> 696 0.3373666 0.3454335 0.3387970 0.3690330 0.3612338 0.3425836 0.2917687
#> 697 0.4525033 0.4930918 0.4680270 0.5206339 0.5271531 0.4904059 0.4716069
#> [,78] [,79] [,80] [,81] [,82] [,83] [,84]
#> 695 0.1524976 0.2681619 0.2755971 0.2813979 0.1863093 0.2410046 0.2230746
#> 696 0.3339773 0.3059217 0.3455054 0.3202211 0.3281367 0.4073111 0.3644811
#> 697 0.5231943 0.5237934 0.5393961 0.5038407 0.5306734 0.5358262 0.4545258
#> [,85] [,86] [,87] [,88] [,89] [,90] [,91]
#> 695 0.1850275 0.2622772 0.1782371 0.2662113 0.1914047 0.2214294 0.1757584
#> 696 0.3543382 0.3361232 0.4335220 0.3500740 0.3481424 0.3629707 0.3377046
#> 697 0.5213668 0.5485466 0.4800342 0.5043201 0.4879257 0.4917349 0.4596503
#> [,92] [,93] [,94] [,95] [,96] [,97] [,98]
#> 695 0.3069806 0.1667261 0.1844025 0.1950383 0.2715060 0.1702186 0.1758933
#> 696 0.3457220 0.3106694 0.3678376 0.3086621 0.3451855 0.3615325 0.3650984
#> 697 0.4839966 0.5561112 0.4901112 0.5183182 0.5179343 0.5637025 0.4832253
#> [,99] [,100]
#> 695 0.2489203 0.2314517
#> 696 0.4283991 0.3416001
#> 697 0.5397991 0.5065713
And similar when predicting from the rubin pooled model, which requires addition of the `method = “rubin”’ option.
pred_rb <-
mipred(
Status ~ age10 + perfstat + remstat + cyto + asct + donor + sex_match + cond,
family = binomial,
data = cll_bin[, -1],
newdata = cll_bin_new[, -1],
nimp = 100,
folds = 1,
method = "rubin"
)
head(pred_rb$pred)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> 695 0.2679987 0.2679987 0.1761605 0.2679987 0.1761605 0.2679987 0.2679987
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.4731416 0.5080968 0.4731416 0.5561966 0.4731416 0.5080968 0.4731416
#> [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> 695 0.2679987 0.2679987 0.1761605 0.1761605 0.1761605 0.2679987 0.1761605
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.4731416 0.4731416 0.5561966 0.5561966 0.5080968 0.4731416 0.5080968
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21]
#> 695 0.1761605 0.1761605 0.2382262 0.2679987 0.1761605 0.2679987 0.2679987
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.4731416 0.4731416 0.4731416 0.5080968 0.5080968 0.4731416 0.5054343
#> [,22] [,23] [,24] [,25] [,26] [,27] [,28]
#> 695 0.2679987 0.1761605 0.2679987 0.1761605 0.2382262 0.2679987 0.2382262
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5561966 0.5054343 0.5561966 0.4731416 0.4731416 0.5080968 0.5561966
#> [,29] [,30] [,31] [,32] [,33] [,34] [,35]
#> 695 0.1761605 0.1761605 0.1761605 0.1761605 0.2382262 0.1761605 0.1761605
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5080968 0.5080968 0.5054343 0.4731416 0.4731416 0.5080968 0.5080968
#> [,36] [,37] [,38] [,39] [,40] [,41] [,42]
#> 695 0.1761605 0.2679987 0.1761605 0.1761605 0.2679987 0.1761605 0.2679987
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.4731416 0.5080968 0.5054343 0.4731416 0.5080968 0.4731416 0.4731416
#> [,43] [,44] [,45] [,46] [,47] [,48] [,49]
#> 695 0.2679987 0.1761605 0.2679987 0.2679987 0.2679987 0.2382262 0.2679987
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5080968 0.4731416 0.4731416 0.4731416 0.4731416 0.5054343 0.5080968
#> [,50] [,51] [,52] [,53] [,54] [,55] [,56]
#> 695 0.1761605 0.2679987 0.2679987 0.2679987 0.2679987 0.2382262 0.1761605
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5561966 0.5054343 0.4731416 0.4731416 0.4731416 0.5054343 0.4731416
#> [,57] [,58] [,59] [,60] [,61] [,62] [,63]
#> 695 0.2679987 0.1761605 0.2382262 0.2679987 0.2679987 0.1761605 0.1761605
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5561966 0.5080968 0.4731416 0.5080968 0.5054343 0.4731416 0.4731416
#> [,64] [,65] [,66] [,67] [,68] [,69] [,70]
#> 695 0.2382262 0.2679987 0.2679987 0.1761605 0.2382262 0.1761605 0.1761605
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.4731416 0.5080968 0.4731416 0.4731416 0.5080968 0.4731416 0.4731416
#> [,71] [,72] [,73] [,74] [,75] [,76] [,77]
#> 695 0.2679987 0.2679987 0.2679987 0.2679987 0.2679987 0.2679987 0.2679987
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5080968 0.4731416 0.5080968 0.5080968 0.4731416 0.5080968 0.4731416
#> [,78] [,79] [,80] [,81] [,82] [,83] [,84]
#> 695 0.1761605 0.2679987 0.2679987 0.2679987 0.2679987 0.2679987 0.1761605
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5080968 0.5561966 0.5080968 0.4731416 0.5080968 0.4731416 0.5080968
#> [,85] [,86] [,87] [,88] [,89] [,90] [,91]
#> 695 0.2679987 0.1761605 0.1761605 0.1761605 0.1761605 0.2679987 0.2382262
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5054343 0.4731416 0.4731416 0.5561966 0.4731416 0.5561966 0.4731416
#> [,92] [,93] [,94] [,95] [,96] [,97] [,98]
#> 695 0.2679987 0.1761605 0.1761605 0.2382262 0.2679987 0.2679987 0.2679987
#> 696 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459 0.3543459
#> 697 0.5054343 0.4731416 0.4731416 0.5080968 0.4731416 0.5054343 0.5080968
#> [,99] [,100]
#> 695 0.1761605 0.1761605
#> 696 0.3543459 0.3543459
#> 697 0.5054343 0.4731416
Calculate summary prediction per individual for each approach by calculating the mean across predictions within individual.
These are the mean predictions calculated.
av_means
#> 695 696 697
#> 0.2309992 0.3468900 0.4952430
rb_means
#> 695 696 697
#> 0.2279884 0.3543459 0.4962898
We can also use the boxplot to display the full set of 100 generated prediction for each individual for the averaging method.
Calculate some summary measures on these calibrated predictions.
summary(t(pred_av[[2]]))
#> 695 696 697
#> Min. :0.1442 Min. :0.2798 Min. :0.4287
#> 1st Qu.:0.1881 1st Qu.:0.3280 1st Qu.:0.4711
#> Median :0.2393 Median :0.3446 Median :0.4904
#> Mean :0.2310 Mean :0.3469 Mean :0.4952
#> 3rd Qu.:0.2623 3rd Qu.:0.3637 3rd Qu.:0.5180
#> Max. :0.3120 Max. :0.4335 Max. :0.6058
We now investigate the rubin method results in more detail. First display the full set of predictions using a boxplot for each observation for the rubin method.
Note all predictions for the second individual are the same, as this person’s predictors were fully observed, and hence, we are applying a single - pooled - model on this fully observed predictor set. For the two other patients, missing values in the predictors are imputed as well, and thus we have 100 imputations of the predictor set for these individuals and hence also 100 predictions.
Calculate some summary measures on calibrated predictions.
summary(t(pred_rb[[2]]))
#> 695 696 697
#> Min. :0.1762 Min. :0.3543 Min. :0.4731
#> 1st Qu.:0.1762 1st Qu.:0.3543 1st Qu.:0.4731
#> Median :0.2382 Median :0.3543 Median :0.5054
#> Mean :0.2280 Mean :0.3543 Mean :0.4963
#> 3rd Qu.:0.2680 3rd Qu.:0.3543 3rd Qu.:0.5081
#> Max. :0.2680 Max. :0.3543 Max. :0.5562
Again, we can note that while the combined predictions look similar between the methods, the dispersal of the constituent predictions about the pooled prediction looks very different because the rubin approach only uses the pooled model. Particularly, there is no variation at all for the middle observation because this record is completely observed without missing values.
There is more functionality in mipred
than demonstrated in the above example data analyses. We discuss a number of features in turn.
A useful simplification is that we may as well use the below code to generate predictions using all predictors present in the data.frame, as opposed to explicitly quoting each predictor in turn in the right-hand side of the formula. For example, for cross-validation with the averaging approach using mipred.cv, the code
specifies identical models to the code used previously. Similarly, to calibrate predictions on new observations using the averaging method, we might as well have used
pred_av <-
mipred(
Status ~ .,
family = binomial,
data = cll_bin[, -1],
newdata = cll_bin_new[, -1],
nimp = 100,
folds = 1
)
Note that in this case however, the use of the -1
notation to deselect the first column (containing the patient identifier) becomes essential, as otherwise it will automatically be added to the model as a predictor.
A special application of formulae which is allowed in mipred
is the use of transformations in the prediction of outcome. Imagine for example that in the immediately preceding model, we feel a quadratic transform is required for age10
and that we only need to consider the age
, gender
and cyto effect
. This could be achieved by specifying the code as
predcv_av <-
mipred.cv(
Status ~ age10 + I(age10 ^ 2) + sex_match + cyto,
family = binomial,
data = cll_bin[,-1],
nimp = 10,
folds = 10
)
This does however raise the interesting question whether the outcome prediction models and imputation model for outcome used internally by mipred should be compatible. In the above example, mipred
uses the quadratic term for age10
when calibrating the outcome models. The imputation models will however remain completely unaffected and identical to those used in the previous model, unless we explicitly force mipred
to employ the quadratic transform for age10
for imputations as well. Mipred
does not enforce compatibility and it is possible to have full flexibility between both. We anticipate however that it will be problematic for most applications to have distinct and/or incompatible models for prediction and outcome imputation, so this aspect should be given some thought in practical applications.
If desired, compatible models can be specified through use of the mice.options
when calling either mipred
or mipred.cv
, specifically the options formulas
, blocks
and predictorMatrix
. Note that similar concerns apply when including additional interactions, log-transforms and so on. Likewise, although not impossible with the software, it is actually ill-advised to exclude predictors from the imputation models when these were previously used in the prediction of the outcome for same reasons.
In the analysis and code shown in the previous section, the full set of predictors is used for both the calibration of the prediction models as well as for the estimation of the imputations. Furthermore, the predictors are entered as a simple linear combination (on associated dummy variates for the categorical measures) in the estimation of the prediction model. The mipred
package however allows for considerable flexibility in specification of both the imputation models as well as for the dependence of prediction models on the available predictors.
For example, if for some reason we would only wish to calibrate regression models on age10
, remstat
and cyto
, while retaining all other predictors for use in the estimation of imputations, then this is simply achieved by specifying
predcv_av <-
mipred.cv(
Status ~ age10 + sex_match + cyto,
family = binomial,
data = cll_bin[,-1],
nimp = 10,
folds = 10
)
for cross-validation and similarly, in the estimation of predictions on future observations with the function mipred
.
It is of interest to note that with a single outcome and when there are p
predictors in total, of which k
predictors contain missing values (\(k \le p\)), mipred
does not calculate a single prediction model. Rather, it calibrates the final prediction model as well as k+1
imputation models (thus k+2 models in total). The reason that k+1
imputation models are fit in full generality is that - for mipred.cv
for example - outcomes are artifically set to missing in the cross-validation when generating imputations on the set-aside validation folds. Thus, an additional imputation model is required for the outcome. Imputations for outcomes are only used when imputing predictors and further discarded.
If all predictors contain missing values, then p+2
models are fit and these may also be explicitly specified. The two arguments (formula
and family
) in the mipred
and mipred.cv
functions only define the first prediction model which is used for the outcome (this is often referred to as the ‘substantive’ model in literature). Without further specification, mice will internally resort to default settings (see mice
documentation) and use of all predictors as well as the outcome for calibration of any missing values. If desired however, then mice options can be specified to tailor the missing values models as required.
Imagine for example, that we wish to use all predictors in the data.frame to predict the outcome, but would wish to remove certain predictors from the calibration of imputation models. This could be done as shown below. To simplify some code generation, we first run a regular mice imputation on the data.frame (excluding the identification variable). We save the result in an object.
Now inspect the mice
imputation object and particularly the predictorMatrix
embedded in this object.
names(imp)
#> [1] "data" "imp" "m"
#> [4] "where" "blocks" "call"
#> [7] "nmis" "method" "predictorMatrix"
#> [10] "visitSequence" "formulas" "post"
#> [13] "blots" "seed" "iteration"
#> [16] "lastSeedValue" "chainMean" "chainVar"
#> [19] "loggedEvents" "version" "date"
imp$predictorMatrix
#> age10 perfstat remstat cyto asct donor sex_match cond Status
#> age10 0 1 1 1 1 1 1 1 1
#> perfstat 1 0 1 1 1 1 1 1 1
#> remstat 1 1 0 1 1 1 1 1 1
#> cyto 1 1 1 0 1 1 1 1 1
#> asct 1 1 1 1 0 1 1 1 1
#> donor 1 1 1 1 1 0 1 1 1
#> sex_match 1 1 1 1 1 1 0 1 1
#> cond 1 1 1 1 1 1 1 0 1
#> Status 1 1 1 1 1 1 1 1 0
The predictorMatrix
allows users to set predictors used in the estimation of imputations. Imagine that we would wish to exclude age10
as an available predictor in the estimation of missing observations. This could be achieved by changing the predictorMatrix
as shown here (see mice
instruction documents for details on this procedure).
predmat <- imp$predictorMatrix
predmat[,c("age10")] <- 0
predmat
#> age10 perfstat remstat cyto asct donor sex_match cond Status
#> age10 0 1 1 1 1 1 1 1 1
#> perfstat 0 0 1 1 1 1 1 1 1
#> remstat 0 1 0 1 1 1 1 1 1
#> cyto 0 1 1 0 1 1 1 1 1
#> asct 0 1 1 1 0 1 1 1 1
#> donor 0 1 1 1 1 0 1 1 1
#> sex_match 0 1 1 1 1 1 0 1 1
#> cond 0 1 1 1 1 1 1 0 1
#> Status 0 1 1 1 1 1 1 1 0
Now we run mipred
with the predictorMatrix
option set as
predcv_av <-
mipred.cv(
Status ~ age10 + sex_match + cyto,
family = binomial,
data = cll_bin[,-1],
nimp = 10,
folds = 10,
mice.options = list(predictorMatrix = predmat)
)
While the above analyses can be implemented, in practice and in most cases, we anticipate it is more likely that we would wish to exclude predictors from the outcome model, but retain these for estimation of imputations. Removing variables from imputation calculations which are however used in the substantive prediction model could be ill-advised.
The default for the cross-validation function mipred.cv
is to use leave-one-out prediction, such that each fold contains a single observation only, with the total number of folds equal to the sample size (nrow(data)
). The minimum number of folds which can be specified is 2, which corresponds to splitting the data into roughly two equally-sized partitions at random.
The prediction function mipred
works similarly and will split the data to be predicted (newdata
) into singleton folds by default (folds=m
, with m=nrow(newdata)
). This implies that for the prediction of each singleton, a completely new set of imputations (as specified by nimp
) will be generated from the available data (data
) and used in the construction of the classifier for the prediction of that singleton. Importantly, this also means that all m-1
other data observations in newdata
are left completely unused, either for the prediction of any observation in the new dataset, nor do they contribute to the imputations. Any other number from 1 to nrow(newdata)
may be specified for the folds on newdata
, where the other most extreme option folds=1
means that no partitioning will take place in newdata
at all. This means that imputations will be generated for the unobserved predictor data in newdata
all at once by adding the full set of new observations to the original data and then generating imputations across the entire combined data set. Likewise, after imputation, prediction models are fit on the (calibration) portion corresponding to data
and the models are then applied to the imputed data in newdata
to generate the predictions. Note that this also implies that the (observed) predictor data from any observation in newdata
will contribute to the imputation of unobserved covariates of any other observation in newdata
. Setting folds
to any other value between 1 or nrow(newdata)
means that the last described procedure is applied to each of the folds in newdata
separately. Data from other folds is completely disregarded when generating predictions for any fold, just as explained for the prediction of singleton folds above.
From the computational point of view, using leave-one-out prediction will obviously be the most expensive and time-consuming option for both functions, as imputations need to be re-calculated for each fold. Computational complexity is discussed further below.
The mipred
and mipred.cv
functions utilize multiple imputations, which involves random sampling. This implies that between different runs of the same code on the same data, results will usually not replicate exactly, although they should be comparable from a qualitative point of view. In some situations it may be required to ensure exact replication of analytic results, which means we should precisely control the random number generation. Since mipred
and mipred.cv
make use of random number generation at two distinct points, the first for the generation of the fold definitions and the second when imputing using the mice
package, we need to set seeds for both. The below code shows seed specification for a call to mipred.cv
, which uses 10 folds and 5 imputations per fold, hence 50 imputations in total.
seed<-round(runif(10*5)*10000) # 10*5 because 10 folds by 5 imputations (for each fold)
# hence 10*5 separate calls to mice
mice.options<-NULL
mice.options$seed<-seed
set.seed(12345) # need to set seed explicitly here as well due to random generation for
# fold definition (outside mice function)
predcv_av <-
mipred.cv(
Status ~ age10 + sex_match + cyto,
family = binomial,
data = cll_bin[,-1],
nimp = 5,
folds = 10,
mice.options=list(maxit=5, printFlag=FALSE, seed=seed)
)
If you specify a seed
vector of length different from 50, then either only the first 50 elements will be used if length(seed)>50
or else, numbers will be recycled. In both cases a warning is generated.
When using the option method="rubin"
, all imputations are directly calculated within a single call to mice
for each fold and hence, we need only specify folds
number of seeds. In this case, the code would become
seed<-round(runif(10)*10000) # 10 because 10 folds and all imputations generated directly for each fold
# the 5 imputations are directly generated in one call to the `mice`function for each fold
mice.options<-NULL
mice.options$seed<-seed
set.seed(12345) # need to set seed explicitly here as well due to random generation for
# fold definition (outside mice function)
predcv_av <-
mipred.cv(
Status ~ age10 + sex_match + cyto,
family = binomial,
data = cll_bin[,-1],
nimp = 5,
folds = 10,
method="rubin",
mice.options=list(maxit=5, printFlag=FALSE, seed=seed)
)
Similar code applies for seed specification when using the function mipred
.
Arguments to mice
can be specified, but must be explicitly stored in an object and declared to the argument mice.options
as also shown in the last example. If for example, we want more verbose printout from the internal calls to mice
, then we could change the printFlag
argument to TRUE
within the list provided to mice.options
. See the mice
and mipred
documentation for a description of the mice
functionality you can control through the mice.options
argument.
Some caution is required with categorical predictors when certain factor levels are sparsely populated. For the CLL
data this is the case for the perfstat
variable, which has only 25 observations at factor level Karnofsky <=70
. This can cause problems with the cross-validation procedure mipred.cv
, when the number of folds is chosen too low, such as in the below code.
mipred.cv(
Status ~ perfstat+remstat+cyto,
family = binomial,
data = cll_bin[,-1],
nimp = 5,
folds = 2
)
The problem arises when all observation at a specific factor level are assigned to one of the folds only. This causes the factor level to be empty in the corresponding calibration set when that fold is chosen as the left-out set and as a consequence, the corresponding regression effect cannot be estimated on the calibration data. This causes an error when new factor levels are encountered when the model is applied to the left-out fold.
We should note that this however is not an actual implementation fault of mipred
, but rather a data analytic issue. The simplest solution is to combine (adjacent) factor levels to reduce sparsity. Alternatively, we may increase the number of folds.
Computational speeds of mipred
and mipred.cv
particularly will depend on the combination of options folds
(number of folds), nimp
(number of imputations) and the mice
option maxit
. The latter is an internal option within the mice
package which specifies the number of iterations for which the chained equations are run, before storing the final sampled values as imputation). Note that mipred
overrides the mice
option maxit
and sets maxit
to value 50 by default, which is different from the default option 5 used by the mice
package. Increasing any of these 3 parameters will increase computation time. Usually, for the same setting of these parameters, running mipred
functions with method="rubin"
will be faster than for the default setting method="averaging"
because fewer calls to mice
are involved (see also above discussion on setting seeds). An easy way to achieve computational gains for the same specification of folds
and nimp
would be to reduce maxit
(possibly to the default mice
setting of five iterations).
We thank EBMT and DKMS for their work in collecting and preparing the CLL data and for approval to share the data. We thank Johannes Schetelig (University Hospital of the Technical University Dresden/DKMS Clinical Trials Unit, Dresden, Germany) for the extensive joint work in collaboration with L. de Wreede on the previous analyses on these data. These results have been published in two publications: Schetelig, Wreede, and Gelder (2017) and Schetelig, Wreede, and Andersen (2017).
Mertens, B.J.A., E. Banzato, and L. de Wreede. 2018. “Construction and Assessment of Prediction Rules for Binary Outcome in the Presence of Missing Predictor Data Using Multiple Imputation: Theoretical Perspective and Data-Based Evaluation.” https://arxiv.org/abs/1810.05099.
Schetelig, J., L.C. de Wreede, and et al. Andersen N.S. 2017. “Centre Characteristics and Procedure-Related Factors Have an Impact on Outcomes of Allogeneic Transplantation for Patients with Cll: A Retrospective Analysis from the European Society for Blood and Marrow Transplantation (Ebmt).” British Journal of Haematology 178: 521–33.
Schetelig, J., L.C. de Wreede, and M. et al. van Gelder. 2017. “Risk Factors for Treatment Failure After Allogeneic Transplantation of Patients with Cll: A Report from the European Society for Blood and Marrow Transplantation.” Bone Marrow Transplantation 52: 552–60.