The aim of this vignette is to introduce the R package missRanger
for imputation of missing values and to explain how to use it for multiple imputation.
missRanger uses the ranger package [1] to do fast missing value imputation by chained random forests. As such, it can be used as an alternative to missForest, the beautiful algorithm introduced by Stekhoven and Buehlmann in [2]. Basically, each variable is imputed by predictions from a random forest using all other variables as covariables. missRanger iterates multiple times over all variables until the average out-of-bag prediction error of the models stops improving.
Why should you consider missRanger?
It is fast.
It is flexible and intuitive to apply: e.g., calling missRanger(data, . ~ 1) would impute all variables univariately, while missRanger(data, Species ~ Sepal.Width) would use Sepal.Width to impute Species.
It can deal with most realistic variable types, even dates and times, without destroying the original data structure.
It combines random forest imputation with predictive mean matching. This generates realistic variability and avoids “new” values like 0.3334 in a 0-1 coded variable. Thus, missRanger can be used in realistic multiple imputation scenarios; see e.g. [4] for the statistical background and the conceptual sketch right after this list.
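To illustrate the idea behind predictive mean matching: for each missing value, the k observed values whose model predictions are closest to the prediction for that missing value serve as donors, and one of them is drawn at random. The following conceptual sketch is purely illustrative (pmm_sketch and its arguments are made up and are not missRanger's actual implementation):
# Conceptual sketch of predictive mean matching (illustration only)
pmm_sketch <- function(pred_mis, pred_obs, y_obs, k = 3) {
  sapply(pred_mis, function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(k)]   # k closest observed predictions
    y_obs[donors][sample.int(k, 1)]                  # draw one observed donor value
  })
}
# Toy example: imputed values are always existing 0/1 values, never e.g. 0.3334
pmm_sketch(pred_mis = c(0.33, 0.71), pred_obs = c(0.1, 0.4, 0.8, 0.9),
           y_obs = c(0, 0, 1, 1), k = 2)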
In the examples below, we will meet two functions from the missRanger package:
generateNA: to replace values in a data set by missing values.
missRanger: to impute missing values in a data frame.
From CRAN:
install.packages("missRanger")
Latest version from github:
library(devtools)
install_github("mayer79/missRanger/release/missRanger")
We first generate a data set with about 20% missing values per column and fill them again by missRanger.
library(missRanger)
set.seed(84553)
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Generate data with missing values in all columns
head(irisWithNA <- generateNA(iris, p = 0.2))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA NA NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# Impute missing values with missRanger
head(irisImputed <- missRanger(irisWithNA, num.trees = 100))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.100000 3.500000 1.503583 0.2000000 setosa
#> 2 4.900000 3.000000 1.400000 0.2845833 setosa
#> 3 4.700000 3.200000 1.300000 0.2000000 setosa
#> 4 5.673567 3.273117 2.505867 0.2000000 setosa
#> 5 5.000000 3.600000 1.400000 0.1914333 setosa
#> 6 5.400000 3.900000 1.509900 0.4000000 setosa
It worked! Unfortunately, the new values look somewhat unnatural due to different rounding. To avoid this, we just set the pmm.k argument to a positive number. All imputations done during the process are then combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values:
head(irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.8 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
Note that missRanger offers a ... argument to pass options to ranger, e.g. num.trees or min.node.size. How would we use its “extra trees” variant with 50 trees?
head(irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, splitrule = "extratrees", num.trees = 50))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.8 2.7 1.3 0.2 setosa
#> 5 5.0 3.6 1.4 0.4 setosa
#> 6 5.4 3.9 1.3 0.4 setosa
It's as simple as that!
Further note that missRanger does not rely on the tidyverse, but you can embed it into a dplyr pipeline (without group_by). Make sure to set verbose = 0 to suppress messages.
require(dplyr)
iris %>%
  generateNA %>%
  as_tibble %>%
  missRanger(verbose = 0) %>%
  head
#> # A tibble: 6 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.310 setosa
#> 2 4.9 3.04 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.21 1.48 0.2 setosa
#> 6 6.69 3.9 3.08 0.4 setosa
By default, missRanger uses all columns in the data set to impute all columns with missing values. To override this behaviour, you can use an intuitive formula interface: the left hand side specifies the variables to be imputed (variable names separated by a +), while the right hand side lists the variables used for imputation.
# Impute all variables with all (default behaviour). Note that variables without
# missing values will be dropped from the left hand side of the formula.
head(m <- missRanger(irisWithNA, formula = . ~ ., pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.3 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.9 3.0 4.1 0.2 setosa
#> 5 5.0 3.6 1.4 0.4 setosa
#> 6 5.4 3.9 1.6 0.4 setosa
# Same call: . ~ . is the default formula
head(m <- missRanger(irisWithNA, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> iter 5: .....
#> iter 6: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.5 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.5 3.4 1.9 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Impute all variables with all except Species
head(m <- missRanger(irisWithNA, . ~ . - Species, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 6.4 3.1 3.8 0.2 setosa
#> 5 5.0 3.6 1.4 0.1 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Impute Sepal.Width by Species
head(m <- missRanger(irisWithNA, Sepal.Width ~ Species, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Width
#> Variables used to impute:
#> iter 1: .
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA 3.0 NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# No success. Why? Species contains missing values and thus can only be used for imputation if it is being imputed as well
head(m <- missRanger(irisWithNA, Sepal.Width + Species ~ Species, pmm.k = 3, num.trees = 10))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Width, Species
#> Variables used to impute: Species
#> iter 1: ..
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA 3.2 NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# Impute all variables univariately
head(m <- missRanger(irisWithNA, . ~ 1))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute:
#> iter 1: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.9 0.2 setosa
#> 2 4.9 3.0 1.4 2.0 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 4.1 5.6 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 5.9 0.4 setosa
missRanger is based on iteratively fitting random forests for each variable with missing values. Since the underlying random forest implementation ranger uses 500 trees by default, a huge number of trees might be calculated. For larger data sets, the overall process can take a long time.
Here are some tweaks to make things faster; a combined example follows the timing comparison below:
Use fewer trees, e.g. by setting num.trees = 50. Even a single tree might be sufficient; typically, though, the number of iterations until convergence increases with fewer trees.
Use smaller bootstrap samples, e.g. by setting sample.fraction = 0.1.
Use the less greedy splitrule = "extratrees".
Use a low tree depth, e.g. max.depth = 6.
Use large leaves, e.g. min.node.size = 10000.
Use a low max.iter, e.g. 1 or 2.
require(ggplot2) # for diamonds data
dim(diamonds) # 53940 10
diamonds_with_NA <- generateNA(diamonds)
# Takes 270 seconds (10 * 500 trees per iteration!)
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3))
# Takes 19 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50))
# Takes 7 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 1))
# Takes 9 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50, sample.fraction = 0.1))
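These tweaks can be combined. The following sketch passes several of them on to ranger via the ... argument (the settings are illustrative, timings are not measured here, and the right trade-off between speed and imputation quality depends on the data):
# Combining several speed tweaks (illustrative settings)
# splitrule, max.depth and sample.fraction are passed on to ranger;
# max.iter caps the number of chaining iterations as listed above.
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50,
                            sample.fraction = 0.1, splitrule = "extratrees",
                            max.depth = 6, max.iter = 2))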
Use case.weights to weight down the contribution of rows with many missings
Using the case.weights argument, you can pass case weights to the imputation models. This can be useful, e.g., to reduce the contribution of rows with many missing values.
# Count the number of non-missing values per row
non_miss <- rowSums(!is.na(irisWithNA))
table(non_miss)
#> non_miss
#> 1 2 3 4 5
#> 2 6 28 68 46
# No weighting
head(m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> iter 4: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.1 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.7 3.8 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.5 0.4 setosa
# Weighted by number of non-missing values per row.
head(m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, case.weights = non_miss))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1: .....
#> iter 2: .....
#> iter 3: .....
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.1 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.4 3.4 1.4 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.1 0.4 setosa
How to use missRanger in multiple imputation settings?
For machine learning tasks, imputation is typically seen as a fixed data preparation step, like dummy coding. There, multiple imputation is rarely applied as it adds another layer of complexity to the analysis. This might be fine since a good validation scheme will account for the variation introduced by imputation.
For tasks focusing on statistical inference (p values, standard errors, confidence intervals, estimation of effects), the extra variability introduced by imputation has to be accounted for, unless only very few values are missing. One of the standard approaches is to impute the data set multiple times, generating e.g. 10 or 100 versions of a complete data set. The intended analysis (t-test, linear model etc.) is then applied independently to each of the complete data sets, and their results are combined afterward in a pooling step, usually by Rubin's rule [4]. For parameter estimates, averages are taken. Their variance is basically the average of the squared standard errors plus the variance of the parameter estimates across the imputed data sets, leading to inflated standard errors and thus larger p values and wider confidence intervals.
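For intuition, here is a minimal sketch of Rubin's rule for a single coefficient, using made-up numbers; the helper rubin_pool is purely illustrative, and in practice the pooling is done by mice::pool() as shown below.
# Rubin's rule for one parameter, given estimates and standard errors from
# m imputed data sets (illustration with made-up numbers)
rubin_pool <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(std_errors^2)         # average within-imputation variance
  b <- var(estimates)                 # between-imputation variance
  c(estimate = q_bar, std.error = sqrt(u_bar + (1 + 1 / m) * b))
}
rubin_pool(estimates = c(0.48, 0.51, 0.46), std_errors = c(0.10, 0.11, 0.09))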
The package mice [3] takes care of this pooling step. The creation of the multiple complete data sets can be done by mice or also by missRanger. In the latter case, in order to keep the variance of imputed values at a realistic level, we suggest using predictive mean matching on top of the random forest imputations.
The following example shows how easy such a workflow looks.
irisWithNA <- generateNA(iris, p = c(0, 0.1, 0.1, 0.1, 0.1))
# Generate 20 complete data sets
filled <- replicate(20, missRanger(irisWithNA, verbose = 0, num.trees = 100, pmm.k = 5), simplify = FALSE)
# Run a linear model for each of the completed data sets
models <- lapply(filled, function(x) lm(Sepal.Length ~ ., x))
# Pool the results by mice
require(mice)
summary(pooled_fit <- pool(models))
#> estimate std.error statistic df p.value
#> (Intercept) 2.3237581 0.31344494 7.413609 79.09153 1.191021e-10
#> Sepal.Width 0.4820868 0.09965924 4.837352 70.91717 7.413834e-06
#> Petal.Length 0.7602107 0.07799618 9.746768 75.59176 5.329071e-15
#> Petal.Width -0.2594143 0.15886858 -1.632886 77.07308 1.065721e-01
#> Speciesversicolor -0.6312472 0.27946995 -2.258730 59.13447 2.760202e-02
#> Speciesvirginica -0.8448732 0.38541621 -2.192106 59.56188 3.229175e-02
# Compare with model on original data
summary(lm(Sepal.Length ~ ., data = iris))
#>
#> Call:
#> lm(formula = Sepal.Length ~ ., data = iris)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.79424 -0.21874 0.00899 0.20255 0.73103
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
#> Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
#> Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
#> Petal.Width -0.31516 0.15120 -2.084 0.03889 *
#> Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
#> Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3068 on 144 degrees of freedom
#> Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
#> F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
The standard errors and p values from the multiple imputation are larger than those of the model fitted on the original data. This realistically reflects the additional uncertainty introduced by the presence of missing values.
There is no obvious way to deal with survival variables as covariables in imputation models.
Options discussed in White and Royston [5] include:
Use both the status variable s and the (censored) time variable t
Use s and log(t)
Use surv(t), and, optionally, s
By surv(t), we denote the Nelson-Aalen survival estimate at each value of t. The third option is the most elegant one as it explicitly deals with the censoring information. We provide some additional details on it in the example below.
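As a side note, the Nelson-Aalen estimator is usually stated in terms of the cumulative hazard; the mice package exports a helper nelsonaalen() that returns it per observation and can serve as an alternative covariable. A minimal sketch, assuming the mice and survival packages are available (vet2 is just a scratch copy so that the example below is unaffected):
# Sketch: Nelson-Aalen cumulative hazard per observation via mice::nelsonaalen()
require(survival)
require(mice)
vet2 <- veteran
vet2$hazard <- nelsonaalen(vet2, timevar = time, statusvar = status)
head(vet2[, c("time", "status", "hazard")])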
require(survival)
require(dplyr)
head(veteran)
#> trt celltype time status karno diagtime age prior
#> 1 1 squamous 72 1 60 7 69 0
#> 2 1 squamous 411 1 70 5 64 10
#> 3 1 squamous 228 1 60 3 38 0
#> 4 1 squamous 126 1 60 9 63 10
#> 5 1 squamous 118 1 70 11 65 10
#> 6 1 squamous 10 1 20 5 49 0
set.seed(653)
# For illustration, we use data from a randomized two-arm trial
# about lung cancer. The aim is to estimate the treatment effect
# of "trt" with reliable inference using Cox regression. Unfortunately,
# we generated missing values in the covariables "age" and "karno" (performance
# status). One approach is to use multiple imputation, see the section above.
# It is recommended to use the model response in the imputation models -
# even if it sounds wrong. In case of a censored survival response
# (i.e. consisting of a time/status pair), an elegant
# possibility is to represent it by the Nelson-Aalen survival estimates [5].
# Add the Nelson-Aalen survival probabilities "surv" to the data set
veteran2 <- summary(survfit(Surv(time, status) ~ 1, data = veteran),
times = veteran$time)[c("time", "surv")] %>%
as_tibble %>%
right_join(veteran, by = "time")
# Add missing values to some columns. We do not add missing values
# in the survival information as this is usually the response of the (Cox-)
# modelling process following the imputation.
veteran_with_NA <- generateNA(veteran2, p = c(age = 0.1, karno = 0.1, diagtime = 0.1))
# Generate 20 complete data sets and remove "surv"
filled <- replicate(20, missRanger(veteran_with_NA, . ~ . - time - status,
verbose = 0, pmm.k = 3, num.trees = 50), simplify = FALSE)
filled <- lapply(filled, function(data) {data$surv <- NULL; data})
# Run a Cox proportional hazards regression for each of the completed data sets
models <- lapply(filled, function(x) coxph(Surv(time, status) ~ ., x))
# Pool the results by mice
require(mice)
summary(pooled_fit <- pool(models))
#> estimate std.error statistic df
#> trt 0.201959473 0.159267360 1.2680531 6176.473
#> celltypesmallcell 0.836528671 0.214067771 3.9077749 53541.171
#> celltypeadeno 1.079369471 0.238737193 4.5211618 5714.099
#> celltypelarge 0.327385574 0.243147111 1.3464506 5849.810
#> karno -0.035056963 0.004379610 -8.0045848 2621.609
#> diagtime -0.003081890 0.006138537 -0.5020561 10297.018
#> age -0.005642109 0.007954351 -0.7093111 775.675
#> prior 0.007778027 0.018826526 0.4131419 12385.627
#> p.value
#> trt 2.048268e-01
#> celltypesmallcell 9.326449e-05
#> celltypeadeno 6.274303e-06
#> celltypelarge 1.782094e-01
#> karno 1.776357e-15
#> diagtime 6.156388e-01
#> age 4.783446e-01
#> prior 6.795098e-01
# Compare with Cox regression on the original data
summary(coxph(Surv(time, status) ~ ., veteran))
#> Call:
#> coxph(formula = Surv(time, status) ~ ., data = veteran)
#>
#> n= 137, number of events= 128
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> trt 2.946e-01 1.343e+00 2.075e-01 1.419 0.15577
#> celltypesmallcell 8.616e-01 2.367e+00 2.753e-01 3.130 0.00175 **
#> celltypeadeno 1.196e+00 3.307e+00 3.009e-01 3.975 7.05e-05 ***
#> celltypelarge 4.013e-01 1.494e+00 2.827e-01 1.420 0.15574
#> karno -3.282e-02 9.677e-01 5.508e-03 -5.958 2.55e-09 ***
#> diagtime 8.132e-05 1.000e+00 9.136e-03 0.009 0.99290
#> age -8.706e-03 9.913e-01 9.300e-03 -0.936 0.34920
#> prior 7.159e-03 1.007e+00 2.323e-02 0.308 0.75794
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> trt 1.3426 0.7448 0.8939 2.0166
#> celltypesmallcell 2.3669 0.4225 1.3799 4.0597
#> celltypeadeno 3.3071 0.3024 1.8336 5.9647
#> celltypelarge 1.4938 0.6695 0.8583 2.5996
#> karno 0.9677 1.0334 0.9573 0.9782
#> diagtime 1.0001 0.9999 0.9823 1.0182
#> age 0.9913 1.0087 0.9734 1.0096
#> prior 1.0072 0.9929 0.9624 1.0541
#>
#> Concordance= 0.736 (se = 0.021 )
#> Likelihood ratio test= 62.1 on 8 df, p=2e-10
#> Wald test = 62.37 on 8 df, p=2e-10
#> Score (logrank) test = 66.74 on 8 df, p=2e-11
Originally, missRanger could deal only with factors and numeric variables. Since release 2.1.0, most reasonable types are supported, including dates, date-times etc. If there are problems with some special column type, you still have the option to convert it yourself or to exclude it via the formula interface explained above.
ir <- iris
ir$s <- iris$Species == "setosa"
ir$dt <- seq(Sys.time(), by = "1 min", length.out = 150)
ir$d <- seq(Sys.Date(), by = "1 d", length.out = 150)
ir$ch <- as.character(iris$Species)
head(ir <- generateNA(ir, c(rep(0.2, 7), 0, 0)))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species s
#> 1 5.1 3.5 1.4 0.2 setosa TRUE
#> 2 4.9 3.0 1.4 NA setosa TRUE
#> 3 4.7 3.2 1.3 0.2 setosa TRUE
#> 4 4.6 3.1 NA NA setosa TRUE
#> 5 NA NA 1.4 0.2 setosa TRUE
#> 6 NA 3.9 1.7 0.4 setosa TRUE
#> dt d ch
#> 1 2019-06-30 22:35:44 2019-06-30 setosa
#> 2 2019-06-30 22:36:44 2019-07-01 setosa
#> 3 2019-06-30 22:37:44 2019-07-02 setosa
#> 4 2019-06-30 22:38:44 2019-07-03 setosa
#> 5 <NA> 2019-07-04 setosa
#> 6 2019-06-30 22:40:44 2019-07-05 setosa
head(m <- missRanger(ir, pmm.k = 4))
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species, s, dt
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species, s, dt, d, ch
#> iter 1: .......
#> iter 2: .......
#> iter 3: .......
#> iter 4: .......
#> iter 5: .......
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species s
#> 1 5.1 3.5 1.4 0.2 setosa TRUE
#> 2 4.9 3.0 1.4 0.1 setosa TRUE
#> 3 4.7 3.2 1.3 0.2 setosa TRUE
#> 4 4.6 3.1 1.5 0.2 setosa TRUE
#> 5 4.9 3.8 1.4 0.2 setosa TRUE
#> 6 5.1 3.9 1.7 0.4 setosa TRUE
#> dt d ch
#> 1 2019-06-30 22:35:44 2019-06-30 setosa
#> 2 2019-06-30 22:36:44 2019-07-01 setosa
#> 3 2019-06-30 22:37:44 2019-07-02 setosa
#> 4 2019-06-30 22:38:44 2019-07-03 setosa
#> 5 2019-06-30 22:36:44 2019-07-04 setosa
#> 6 2019-06-30 22:40:44 2019-07-05 setosa
[1] Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.
[2] Stekhoven, D.J. and Buehlmann, P. (2012). MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
[3] Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/
[4] Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
[5] White, I.R. and Royston, P. (2009). Imputing missing covariate values for the Cox model. Statistics in Medicine, 28(15), 1982-1998.