The aim of this vignette is to introduce the R package missRanger
for imputation of missing values and to explain how to use it for multiple imputation.
missRanger uses the ranger package [1] to do fast missing value imputation by chained random forests. As such, it can be used as an alternative to missForest, the beautiful algorithm introduced by Stekhoven and Buehlmann in [2]. Basically, each variable is imputed by predictions from a random forest using all other variables as covariables. missRanger iterates multiple times over all variables until the average out-of-bag prediction error of the models stops improving.
Why should you consider missRanger?
It is fast.
It is flexible and intuitive to apply: e.g., calling missRanger(data, . ~ 1) would impute all variables univariately, while missRanger(data, Species ~ Sepal.Width) would use Sepal.Width to impute Species.
It can deal with most realistic variable types, even dates and times, without destroying the original data structure.
It combines random forest imputation with predictive mean matching. This generates realistic variability and avoids “new” values like 0.3334 in a 0-1 coded variable. Thus, missRanger can be used in realistic multiple imputation scenarios; see e.g. [4] for the statistical background and the conceptual sketch right after this list.
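To illustrate the idea behind predictive mean matching: for each missing value, the k observed values whose model predictions are closest to the prediction for that missing value serve as donors, and one of them is drawn at random. The following conceptual sketch is purely illustrative (pmm_sketch and its arguments are made up and are not missRanger's actual implementation):
# Conceptual sketch of predictive mean matching (illustration only)
pmm_sketch <- function(pred_mis, pred_obs, y_obs, k = 3) {
  sapply(pred_mis, function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(k)]   # k closest observed predictions
    y_obs[donors][sample.int(k, 1)]                  # draw one observed donor value
  })
}
# Toy example: imputed values are always existing 0/1 values, never e.g. 0.3334
pmm_sketch(pred_mis = c(0.33, 0.71), pred_obs = c(0.1, 0.4, 0.8, 0.9),
           y_obs = c(0, 0, 1, 1), k = 2)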
In the examples below, we will meet two functions from the missRanger package:
generateNA: to replace values in a data set by missing values.
missRanger: to impute missing values in a data frame.
From CRAN:
install.packages("missRanger")
Latest version from github:
library(devtools)
install_github("mayer79/missRanger/release/missRanger")
We first generate a data set with about 20% missing values per column and fill them again by missRanger.
library(missRanger)
set.seed(84553)
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Generate data with missing values in all columns
head(irisWithNA <- generateNA(iris, p = 0.2))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA NA NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# Impute missing values with missRanger
head(irisImputed <- missRanger(irisWithNA, num.trees = 100))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.100000 3.500000 1.503583 0.2000000 setosa
#> 2 4.900000 3.000000 1.400000 0.2845833 setosa
#> 3 4.700000 3.200000 1.300000 0.2000000 setosa
#> 4 5.673567 3.273117 2.505867 0.2000000 setosa
#> 5 5.000000 3.600000 1.400000 0.1914333 setosa
#> 6 5.400000 3.900000 1.509900 0.4000000 setosa
It worked! Unfortunately, the new values look somewhat unnatural due to different rounding. To avoid this, we just set the pmm.k argument to a positive number. All imputations done during the process are then combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values:
head(irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.8 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
Note that missRanger offers a ... argument to pass options to ranger, e.g. num.trees or min.node.size. How would we use its “extra trees” variant with 50 trees?
head(irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, splitrule = "extratrees", num.trees = 50))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.8 2.7 1.3 0.2 setosa
#> 5 5.0 3.6 1.4 0.4 setosa
#> 6 5.4 3.9 1.3 0.4 setosa
It's as simple as that!
Further note that missRanger does not rely on the tidyverse, but you can embed it into a dplyr pipeline (without group_by). Make sure to set verbose = 0 to suppress messages.
require(dplyr)
iris %>%
  generateNA %>%
  as_tibble %>%
  missRanger(verbose = 0) %>%
  head
#> # A tibble: 6 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.310 setosa
#> 2 4.9 3.04 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.21 1.48 0.2 setosa
#> 6 6.69 3.9 3.08 0.4 setosa
By default, missRanger uses all columns in the data set to impute all columns with missing values. To override this behaviour, you can use an intuitive formula interface: the left hand side specifies the variables to be imputed (variable names separated by a +), while the right hand side lists the variables used for imputation.
# Impute all variables with all (default behaviour). Note that variables without
# missing values will be dropped from the left hand side of the formula.
head(m <- missRanger(irisWithNA, formula = . ~ ., pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.3 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.9 3.0 4.1 0.2 setosa
#> 5 5.0 3.6 1.4 0.4 setosa
#> 6 5.4 3.9 1.6 0.4 setosa
# Same call: . ~ . is the default formula
head(m <- missRanger(irisWithNA, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> iter 5: .....
#> iter 6: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.5 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.5 3.4 1.9 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Impute all variables with all except Species
head(m <- missRanger(irisWithNA, . ~ . - Species, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 6.4 3.1 3.8 0.2 setosa
#> 5 5.0 3.6 1.4 0.1 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Impute Sepal.Width by Species
head(m <- missRanger(irisWithNA, Sepal.Width ~ Species, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Width
#> Variables used to impute:
#> iter 1: .
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA 3.0 NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# No success. Why? Species contains missing values and thus can only be used for imputation if it is being imputed as well
head(m <- missRanger(irisWithNA, Sepal.Width + Species ~ Species, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Width, Species
#> Variables used to impute: Species
#> iter 1: ..
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA 3.2 NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# Impute all variables univariately
head(m <- missRanger(irisWithNA, . ~ 1))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute:
#> iter 1: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.9 0.2 setosa
#> 2 4.9 3.0 1.4 2.0 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 4.1 5.6 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 5.9 0.4 setosa
missRanger is based on iteratively fitting random forests for each variable with missing values. Since the underlying random forest implementation ranger uses 500 trees by default, a huge number of trees might be calculated. For larger data sets, the overall process can take a long time.
Here are some tweaks to make things faster; a combined example follows the timing comparison below:
Use fewer trees, e.g. by setting num.trees = 50. Even a single tree might be sufficient; typically, though, the number of iterations until convergence increases with fewer trees.
Use smaller bootstrap samples, e.g. by setting sample.fraction = 0.1.
Use the less greedy splitrule = "extratrees".
Use a low tree depth, e.g. max.depth = 6.
Use large leaves, e.g. min.node.size = 10000.
Use a low max.iter, e.g. 1 or 2.
require(ggplot2) # for diamonds data
dim(diamonds) # 53940 10
diamonds_with_NA <- generateNA(diamonds)
# Takes 270 seconds (10 * 500 trees per iteration!)
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3))
# Takes 19 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50))
# Takes 7 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 1))
# Takes 9 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50, sample.fraction = 0.1))
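These tweaks can be combined. The following sketch passes several of them on to ranger via the ... argument (the settings are illustrative, timings are not measured here, and the right trade-off between speed and imputation quality depends on the data):
# Combining several speed tweaks (illustrative settings)
# splitrule, max.depth and sample.fraction are passed on to ranger;
# max.iter caps the number of chaining iterations as listed above.
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50,
                            sample.fraction = 0.1, splitrule = "extratrees",
                            max.depth = 6, max.iter = 2))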
Use case.weights to weight down the contribution of rows with many missings
Using the case.weights argument, you can pass case weights to the imputation models. This can be useful, e.g., to reduce the contribution of rows with many missing values.
# Count the number of non-missing values per row
non_miss <- rowSums(!is.na(irisWithNA))
table(non_miss)
#> non_miss
#> 1 2 3 4 5
#> 2 6 28 68 46
# No weighting
head(m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.1 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.7 3.8 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.5 0.4 setosa
# Weighted by number of non-missing values per row.
head(m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, case.weights = non_miss))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.1 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.4 3.4 1.4 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.1 0.4 setosa
How to use missRanger in multiple imputation settings?
For machine learning tasks, imputation is typically seen as a fixed data preparation step, like dummy coding. There, multiple imputation is rarely applied as it adds another layer of complexity to the analysis. This might be fine since a good validation scheme will account for the variation introduced by imputation.
For tasks focusing on statistical inference (p values, standard errors, confidence intervals, estimation of effects), the extra variability introduced by imputation has to be accounted for, unless only very few values are missing. One of the standard approaches is to impute the data set multiple times, generating e.g. 10 or 100 versions of a complete data set. The intended analysis (t-test, linear model etc.) is then applied independently to each of the complete data sets, and their results are combined afterward in a pooling step, usually by Rubin's rule [4]. For parameter estimates, averages are taken. Their variance is basically the average of the squared standard errors plus the variance of the parameter estimates across the imputed data sets, leading to inflated standard errors and thus larger p values and wider confidence intervals.
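For intuition, here is a minimal sketch of Rubin's rule for a single coefficient, using made-up numbers; the helper rubin_pool is purely illustrative, and in practice the pooling is done by mice::pool() as shown below.
# Rubin's rule for one parameter, given estimates and standard errors from
# m imputed data sets (illustration with made-up numbers)
rubin_pool <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(std_errors^2)         # average within-imputation variance
  b <- var(estimates)                 # between-imputation variance
  c(estimate = q_bar, std.error = sqrt(u_bar + (1 + 1 / m) * b))
}
rubin_pool(estimates = c(0.48, 0.51, 0.46), std_errors = c(0.10, 0.11, 0.09))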
The package mice [3] takes care of this pooling step. The creation of the multiple complete data sets can be done by mice or also by missRanger. In the latter case, in order to keep the variance of imputed values at a realistic level, we suggest using predictive mean matching on top of the random forest imputations.
The following example shows how easy such a workflow looks.
irisWithNA <- generateNA(iris, p = c(0, 0.1, 0.1, 0.1, 0.1))
# Generate 20 complete data sets
filled <- replicate(20, missRanger(irisWithNA, verbose = 0, num.trees = 100, pmm.k = 5), simplify = FALSE)
# Run a linear model for each of the completed data sets
models <- lapply(filled, function(x) lm(Sepal.Length ~ ., x))
# Pool the results by mice
require(mice)
summary(pooled_fit <- pool(models))
#> estimate std.error statistic df p.value
#> (Intercept) 2.3237581 0.31344494 7.413609 79.09153 1.191021e-10
#> Sepal.Width 0.4820868 0.09965924 4.837352 70.91717 7.413834e-06
#> Petal.Length 0.7602107 0.07799618 9.746768 75.59176 5.329071e-15
#> Petal.Width -0.2594143 0.15886858 -1.632886 77.07308 1.065721e-01
#> Speciesversicolor -0.6312472 0.27946995 -2.258730 59.13447 2.760202e-02
#> Speciesvirginica -0.8448732 0.38541621 -2.192106 59.56188 3.229175e-02
# Compare with model on original data
summary(lm(Sepal.Length ~ ., data = iris))
#>
#> Call:
#> lm(formula = Sepal.Length ~ ., data = iris)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.79424 -0.21874 0.00899 0.20255 0.73103
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
#> Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
#> Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
#> Petal.Width -0.31516 0.15120 -2.084 0.03889 *
#> Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
#> Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3068 on 144 degrees of freedom
#> Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
#> F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
The standard errors and p values from the multiple imputation are larger than those of the model fitted on the original data. This realistically reflects the additional uncertainty introduced by the presence of missing values.
There is no obvious way to deal with survival variables as covariables in imputation models.
Options discussed in White and Royston [5] include:
Use both the status variable s and the (censored) time variable t
Use s and log(t)
Use surv(t), and, optionally, s
By surv(t), we denote the Nelson-Aalen survival estimate at each value of t. The third option is the most elegant one as it explicitly deals with the censoring information. We provide some additional details on it in the example below.
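As a side note, the Nelson-Aalen estimator is usually stated in terms of the cumulative hazard; the mice package exports a helper nelsonaalen() that returns it per observation and can serve as an alternative covariable. A minimal sketch, assuming the mice and survival packages are available (vet2 is just a scratch copy so that the example below is unaffected):
# Sketch: Nelson-Aalen cumulative hazard per observation via mice::nelsonaalen()
require(survival)
require(mice)
vet2 <- veteran
vet2$hazard <- nelsonaalen(vet2, timevar = time, statusvar = status)
head(vet2[, c("time", "status", "hazard")])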
require(survival)
require(dplyr)
head(veteran)
#> trt celltype time status karno diagtime age prior
#> 1 1 squamous 72 1 60 7 69 0
#> 2 1 squamous 411 1 70 5 64 10
#> 3 1 squamous 228 1 60 3 38 0
#> 4 1 squamous 126 1 60 9 63 10
#> 5 1 squamous 118 1 70 11 65 10
#> 6 1 squamous 10 1 20 5 49 0
set.seed(653)
# For illustration, we use data from a randomized two-arm trial
# about lung cancer. The aim is to estimate the treatment effect
# of "trt" with reliable inference using Cox regression. Unfortunately,
# we generated missing values in the covariables "age" and "karno" (performance
# status). One approach is to use multiple imputation, see the section above.
# It is recommended to use the model response in the imputation models -
# even if it sounds wrong. In case of a censored survival response
# (i.e. consisting of a time/status pair), an elegant
# possibility is to represent it by the Nelson-Aalen survival estimates [5].
# Add the Nelson-Aalen survival probabilities "surv" to the data set
veteran2 <- summary(survfit(Surv(time, status) ~ 1, data = veteran),
times = veteran$time)[c("time", "surv")] %>%
as_tibble %>%
right_join(veteran, by = "time")
# Add missing values to some columns. We do not add missing values
# in the survival information as this is usually the response of the (Cox-)
# modelling process following the imputation.
veteran_with_NA <- generateNA(veteran2, p = c(age = 0.1, karno = 0.1, diagtime = 0.1))
# Generate 20 complete data sets and remove "surv"
filled <- replicate(20, missRanger(veteran_with_NA, . ~ . - time - status,
verbose = 0, pmm.k = 3, num.trees = 50), simplify = FALSE)
filled <- lapply(filled, function(data) {data$surv <- NULL; data})
# Run a Cox proportional hazards regression for each of the completed data sets
models <- lapply(filled, function(x) coxph(Surv(time, status) ~ ., x))
# Pool the results by mice
require(mice)
summary(pooled_fit <- pool(models))
#> estimate std.error statistic df
#> trt 0.201959473 0.159267360 1.2680531 6176.473
#> celltypesmallcell 0.836528671 0.214067771 3.9077749 53541.171
#> celltypeadeno 1.079369471 0.238737193 4.5211618 5714.099
#> celltypelarge 0.327385574 0.243147111 1.3464506 5849.810
#> karno -0.035056963 0.004379610 -8.0045848 2621.609
#> diagtime -0.003081890 0.006138537 -0.5020561 10297.018
#> age -0.005642109 0.007954351 -0.7093111 775.675
#> prior 0.007778027 0.018826526 0.4131419 12385.627
#> p.value
#> trt 2.048268e-01
#> celltypesmallcell 9.326449e-05
#> celltypeadeno 6.274303e-06
#> celltypelarge 1.782094e-01
#> karno 1.776357e-15
#> diagtime 6.156388e-01
#> age 4.783446e-01
#> prior 6.795098e-01
# Compare with Cox regression on the original data
summary(coxph(Surv(time, status) ~ ., veteran))
#> Call:
#> coxph(formula = Surv(time, status) ~ ., data = veteran)
#>
#> n= 137, number of events= 128
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> trt 2.946e-01 1.343e+00 2.075e-01 1.419 0.15577
#> celltypesmallcell 8.616e-01 2.367e+00 2.753e-01 3.130 0.00175 **
#> celltypeadeno 1.196e+00 3.307e+00 3.009e-01 3.975 7.05e-05 ***
#> celltypelarge 4.013e-01 1.494e+00 2.827e-01 1.420 0.15574
#> karno -3.282e-02 9.677e-01 5.508e-03 -5.958 2.55e-09 ***
#> diagtime 8.132e-05 1.000e+00 9.136e-03 0.009 0.99290
#> age -8.706e-03 9.913e-01 9.300e-03 -0.936 0.34920
#> prior 7.159e-03 1.007e+00 2.323e-02 0.308 0.75794
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> trt 1.3426 0.7448 0.8939 2.0166
#> celltypesmallcell 2.3669 0.4225 1.3799 4.0597
#> celltypeadeno 3.3071 0.3024 1.8336 5.9647
#> celltypelarge 1.4938 0.6695 0.8583 2.5996
#> karno 0.9677 1.0334 0.9573 0.9782
#> diagtime 1.0001 0.9999 0.9823 1.0182
#> age 0.9913 1.0087 0.9734 1.0096
#> prior 1.0072 0.9929 0.9624 1.0541
#>
#> Concordance= 0.736 (se = 0.021 )
#> Likelihood ratio test= 62.1 on 8 df, p=2e-10
#> Wald test = 62.37 on 8 df, p=2e-10
#> Score (logrank) test = 66.74 on 8 df, p=2e-11
Originally, missRanger could deal only with factors and numeric variables. Since release 2.1.0, most reasonable types are supported, including dates, date-times etc. If there are problems with some special column type, you still have the option to convert it yourself or to exclude it via the formula interface explained above.
ir <- iris
ir$s <- iris$Species == "setosa"
ir$dt <- seq(Sys.time(), by = "1 min", length.out = 150)
ir$d <- seq(Sys.Date(), by = "1 d", length.out = 150)
ir$ch <- as.character(iris$Species)
head(ir <- generateNA(ir, c(rep(0.2, 7), 0, 0)))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species s
#> 1 5.1 3.5 1.4 0.2 setosa TRUE
#> 2 4.9 3.0 1.4 NA setosa TRUE
#> 3 4.7 3.2 1.3 0.2 setosa TRUE
#> 4 4.6 3.1 NA NA setosa TRUE
#> 5 NA NA 1.4 0.2 setosa TRUE
#> 6 NA 3.9 1.7 0.4 setosa TRUE
#> dt d ch
#> 1 2019-06-30 22:35:44 2019-06-30 setosa
#> 2 2019-06-30 22:36:44 2019-07-01 setosa
#> 3 2019-06-30 22:37:44 2019-07-02 setosa
#> 4 2019-06-30 22:38:44 2019-07-03 setosa
#> 5 <NA> 2019-07-04 setosa
#> 6 2019-06-30 22:40:44 2019-07-05 setosa
head(m <- missRanger(ir, pmm.k = 4))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species, s, dt
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species, s, dt, d, ch
#> iter 1: .......
#> iter 2: .......
#> iter 3: .......
#> iter 4: .......
#> iter 5: .......
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species s
#> 1 5.1 3.5 1.4 0.2 setosa TRUE
#> 2 4.9 3.0 1.4 0.1 setosa TRUE
#> 3 4.7 3.2 1.3 0.2 setosa TRUE
#> 4 4.6 3.1 1.5 0.2 setosa TRUE
#> 5 4.9 3.8 1.4 0.2 setosa TRUE
#> 6 5.1 3.9 1.7 0.4 setosa TRUE
#> dt d ch
#> 1 2019-06-30 22:35:44 2019-06-30 setosa
#> 2 2019-06-30 22:36:44 2019-07-01 setosa
#> 3 2019-06-30 22:37:44 2019-07-02 setosa
#> 4 2019-06-30 22:38:44 2019-07-03 setosa
#> 5 2019-06-30 22:36:44 2019-07-04 setosa
#> 6 2019-06-30 22:40:44 2019-07-05 setosa
[1] Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.
[2] Stekhoven, D.J. and Buehlmann, P. (2012). MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
[3] Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/
[4] Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
[5] White, I.R. and Royston, P. (2009). Imputing missing covariate values for the Cox model. Statistics in Medicine, 28(15), 1982-1998.