This package offers a number of commonly used single imputation methods, each with a similar and hopefully simple interface. The methods that rely on additional packages are listed in the overview table in the installation notes that follow.
The basic version of the package can be installed with
install.packages('simputation')
The package uses several other packages for model building. To keep the package lightweight, many of these are not installed immediately when you install simputation. To install all dependencies do
install.packages('simputation', dependencies=TRUE)
Alternatively, one can manually install extra dependencies. Below is an overview of what packages are used.
function | model | package | R recommended |
---|---|---|---|
impute_rlm | M-estimation | MASS | yes |
impute_en | ridge/elasticnet/lasso | glmnet | no |
impute_cart | CART | rpart | yes |
impute_rf | random forest | randomForest | no |
impute_rhd | random hot deck | VIM (optional) | no |
impute_shd | sequential hot deck | VIM (optional) | no |
impute_knn | k nearest neighbours | VIM (optional) | no |
impute_mf | missForest | missForest | no |
impute_em | mv-normal | norm | no |
Packages in the R recommended suite are installed by default when R is installed (unless this was explicitly prevented at installation). The hotdeck imputation methods (knn, rhd, shd) have their own implementations in simputation. The VIM package can be used as an optional backend for these.
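If you prefer to install extra dependencies by hand, it can help to first check which of the backend packages from the table are already present. The following is a minimal sketch using only base R; the variable names `backends`, `installed` and `absent` are chosen here for illustration.

```r
# Backend packages named in the overview table.
backends <- c("MASS", "glmnet", "rpart", "randomForest",
              "VIM", "missForest", "norm")

# requireNamespace() returns FALSE, without an error, when a package is not installed.
installed <- vapply(backends, requireNamespace, logical(1), quietly = TRUE)
absent <- backends[!installed]

if (length(absent) > 0) {
  message("Not installed: ", paste(absent, collapse = ", "))
  # Uncomment to install only the missing backends:
  # install.packages(absent)
}
```

The install call is left commented out so that running the check does not change your library by accident.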
A call to an imputation function has the following structure.
impute_<model>(data, formula, [model-specific options])
The output is similar to the data argument, except that empty values are imputed (where possible) using the specified model.
The formula argument specifies the variables to be imputed, the model specification for <model> and possibly the grouping of the dataset.
The structure of a formula object is as follows:
IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]
where the part between [] is optional.
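To see how the three parts of such a formula relate to each other, one can take a formula apart with base R. This is only an illustrative sketch of how R parses the expression; it is not a description of how simputation processes formulas internally, and the names `imputed`, `model_spec` and `grouping` are made up for this example.

```r
f <- Sepal.Length + Sepal.Width ~ Petal.Length | Species

# A two-sided formula is a call: f[[2]] is the left-hand side, f[[3]] the right-hand side.
imputed <- all.vars(f[[2]])

# `|` binds more tightly than `~`, so the whole right-hand side is a call to `|`.
rhs <- f[[3]]
has_groups <- is.call(rhs) && identical(rhs[[1]], as.name("|"))

model_spec <- if (has_groups) rhs[[2]] else rhs
grouping   <- if (has_groups) all.vars(rhs[[3]]) else character(0)

imputed     # variables to impute
model_spec  # model specification
grouping    # grouping variables
```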
In the following, we assume that the reader already has some familiarity with the use of formulas in R (e.g. when specifying linear models) and statistical models commonly used in imputation.
First create a copy of the iris dataset with some empty values in columns 1 (Sepal.Length), 2 (Sepal.Width) and 5 (Species).
library(simputation)
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
head(dat,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 NA 3.5 1.4 0.2 setosa
## 2 NA 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.6 NA 1.5 0.2 setosa
## 5 5.0 NA 1.4 0.2 setosa
## 6 5.4 NA 1.7 0.4 setosa
## 7 4.6 NA 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 <NA>
## 9 4.4 2.9 1.4 0.2 <NA>
## 10 4.9 3.1 1.5 0.1 <NA>
To impute Sepal.Length using a linear model, use the impute_lm function.
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
head(da1,3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.076579 3.5 1.4 0.2 setosa
## 2 4.675654 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
Observe that the 3rd value is not imputed. This is because one of the predictor variables is missing, so the linear model does not produce an output. simputation does not report such cases but simply returns the partly imputed result. The remaining value can be imputed using a new linear model or, as shown below, using the group median.
da2 <- impute_median(da1, Sepal.Length ~ Species)
head(da2,3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.076579 3.5 1.4 0.2 setosa
## 2 4.675654 3.0 1.4 0.2 setosa
## 3 5.000000 NA 1.3 0.2 setosa
Here, Species is used to group the data before computing the medians.
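For readers who want to see what the two steps so far amount to (linear-model imputation followed by group-median imputation), here is a rough base-R equivalent. It is a conceptual sketch only, not simputation's implementation; the helper `fill_median` and the variable `n_left` are hypothetical names introduced for this example.

```r
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Step 1: fit on the complete rows, then predict where Sepal.Length is missing.
fit  <- lm(Sepal.Length ~ Sepal.Width + Species, data = dat)
miss <- is.na(dat$Sepal.Length)
dat$Sepal.Length[miss] <- predict(fit, newdata = dat[miss, ])

# Row 3 has no Sepal.Width, so predict() returns NA and that value stays missing.
n_left <- sum(is.na(dat$Sepal.Length))

# Step 2: fill what is left with the median of the record's Species group.
fill_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}
dat$Sepal.Length <- ave(dat$Sepal.Length, dat$Species, FUN = fill_median)
```

Counting the remaining missing values after each step, as done with `n_left`, is a simple way to notice records that a model could not impute, since no message is issued for them.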
Finally, we impute the Species variable using a decision tree model. All variables except Species are used as predictors.
da3 <- impute_cart(da2, Species ~ .)
head(da3,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.076579 3.5 1.4 0.2 setosa
## 2 4.675654 3.0 1.4 0.2 setosa
## 3 5.000000 NA 1.3 0.2 setosa
## 4 4.600000 NA 1.5 0.2 setosa
## 5 5.000000 NA 1.4 0.2 setosa
## 6 5.400000 NA 1.7 0.4 setosa
## 7 4.600000 NA 1.4 0.3 setosa
## 8 5.000000 3.4 1.5 0.2 setosa
## 9 4.400000 2.9 1.4 0.2 setosa
## 10 4.900000 3.1 1.5 0.1 setosa
Using the %>% operator from the popular magrittr package allows for a very compact specification of the above examples.
library(magrittr)
da4 <- dat %>%
  impute_lm(Sepal.Length ~ Sepal.Width + Species) %>%
  impute_median(Sepal.Length ~ Species) %>%
  impute_cart(Species ~ .)
The simputation package allows users to specify an imputation model for multiple variables at once. For example, to impute both Sepal.Length and Sepal.Width with a similar robust linear model, do the following.
da5 <- impute_rlm(dat, Sepal.Length + Sepal.Width ~ Petal.Length + Species)
head(da5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.945416 3.500000 1.4 0.2 setosa
## 2 4.945416 3.000000 1.4 0.2 setosa
## 3 4.854056 3.378979 1.3 0.2 setosa
## 4 4.600000 3.440107 1.5 0.2 setosa
## 5 5.000000 3.409543 1.4 0.2 setosa
## 6 5.400000 3.501236 1.7 0.4 setosa
The function will model Sepal.Length and Sepal.Width against the predictor variables independently and impute them. The order of variables in the specification is therefore not important for the result.
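The independence of the per-variable models can be illustrated with a small loop built on MASS::rlm. This is a sketch under the assumption that each target is modelled against the original data only; `impute_each` is a hypothetical helper written for this example and is not part of simputation.

```r
library(MASS)  # provides rlm(); part of the R recommended suite

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

impute_each <- function(data, targets, predictors) {
  out <- data
  for (v in targets) {
    # Each model is fitted on the original data, never on values
    # imputed in an earlier iteration of the loop.
    fit  <- rlm(reformulate(predictors, response = v), data = data)
    miss <- is.na(data[[v]])
    out[miss, v] <- predict(fit, newdata = data[miss, , drop = FALSE])
  }
  out
}

a <- impute_each(dat, c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Species"))
b <- impute_each(dat, c("Sepal.Width", "Sepal.Length"), c("Petal.Length", "Species"))

# Compare the results obtained with the two orderings of the target variables.
identical(a, b)
```

Because no model ever sees a value imputed for another target, swapping the targets cannot change the fitted models or the predictions.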
In general, the left-hand side of the model formula is analyzed by simputation, combined appropriately with the right-hand side and then passed through to the underlying modeling routine. simputation also understands the "." syntax, which stands for "every variable not otherwise present", and the "-" sign to remove variables from a formula. For example, the next expression imputes every variable except Species with the group mean plus a normally distributed random residual.
da6 <- impute_lm(dat, . - Species ~ 0 + Species, add_residual = "normal")
head(da6)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.277045 3.500000 1.4 0.2 setosa
## 2 5.833579 3.000000 1.4 0.2 setosa
## 3 5.706201 3.512426 1.3 0.2 setosa
## 4 4.600000 3.922893 1.5 0.2 setosa
## 5 5.000000 3.276839 1.4 0.2 setosa
## 6 5.400000 3.594306 1.7 0.4 setosa
Here, Species on the right-hand side defines the grouping variable.
Use | in the formula argument to specify groups.
# New data set, leaving Species intact
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
# split dat into groups according to 'Species', impute, combine and return.
da8 <- impute_lm(dat, Sepal.Length ~ Petal.Width | Species)
head(da8)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.968092 3.5 1.4 0.2 setosa
## 2 4.968092 3.0 1.4 0.2 setosa
## 3 4.968092 NA 1.3 0.2 setosa
## 4 4.600000 NA 1.5 0.2 setosa
## 5 5.000000 NA 1.4 0.2 setosa
## 6 5.400000 NA 1.7 0.4 setosa
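If one or more grouping variables are specified (multiple are specified by separating them with +), imputation takes place as follows: the data are split into groups according to the values of the grouping variables, each group is imputed separately with the specified model, and the imputed groups are combined and returned.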
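The split, impute, combine procedure can be sketched in base R with split(), lapply() and unsplit(). The helper `impute_group` is a hypothetical name used for illustration, and the code is a conceptual equivalent rather than the package's own implementation.

```r
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# Impute Sepal.Length within one group, using a model fitted on that group only.
impute_group <- function(d) {
  miss <- is.na(d$Sepal.Length)
  if (any(miss)) {
    fit <- lm(Sepal.Length ~ Petal.Width, data = d)
    d$Sepal.Length[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
  }
  d
}

parts  <- split(dat, dat$Species)      # split
parts  <- lapply(parts, impute_group)  # impute
result <- unsplit(parts, dat$Species)  # combine, restoring the original row order

head(result)
```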
Simputation also integrates with the dplyr package and recognizes grouping specified with group_by.
library(magrittr)
library(dplyr)
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
dat %>% group_by(Species) %>%
  impute_lm(Sepal.Length ~ Petal.Width)
The impute_proxy function is somewhat special since it allows you to define an imputation method in the right-hand side of the formula object. Below we implement a robust ratio imputation (for what it's worth) as an example.
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
dat <- impute_proxy(dat, Sepal.Length ~ median(Sepal.Length,na.rm=TRUE)/median(Sepal.Width, na.rm=TRUE) * Sepal.Width | Species)
head(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.147059 3.5 1.4 0.2 setosa
## 2 4.411765 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.600000 NA 1.5 0.2 setosa
## 5 5.000000 NA 1.4 0.2 setosa
## 6 5.400000 NA 1.7 0.4 setosa
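To make explicit what the ratio expression in the proxy example computes, the same numbers can be reproduced by hand in base R. This is a sketch for illustration; `ratio` and `proxy` are names introduced here and do not come from the package.

```r
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# Per-group ratio of medians, ignoring missing values.
ratio <- with(dat,
  tapply(Sepal.Length, Species, median, na.rm = TRUE) /
  tapply(Sepal.Width,  Species, median, na.rm = TRUE))

# Proxy value for every record: its group's ratio times its own Sepal.Width.
proxy <- ratio[as.character(dat$Species)] * dat$Sepal.Width

# Only missing Sepal.Length values are replaced. Where Sepal.Width is
# missing as well (row 3), the proxy is NA and the value stays missing.
miss <- is.na(dat$Sepal.Length)
dat$Sepal.Length[miss] <- proxy[miss]

head(dat)
```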
Imputation with a model that has already been trained can be done with the impute function. To use it, train your model in the way you are used to.
m <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
Next, use this model to impute a dataset.
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
head(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 NA 3.5 1.4 0.2 setosa
## 2 NA 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.6 NA 1.5 0.2 setosa
## 5 5.0 NA 1.4 0.2 setosa
## 6 5.4 NA 1.7 0.4 setosa
dat <- impute(dat, Sepal.Length ~ m)
head(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.063856 3.5 1.4 0.2 setosa
## 2 4.662076 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.600000 NA 1.5 0.2 setosa
## 5 5.000000 NA 1.4 0.2 setosa
## 6 5.400000 NA 1.7 0.4 setosa
That's really all there is to it.
The VIM package offers fast implementations for sequential and random hotdeck procedures (based on the data.table package). It also offers somewhat finer control over certain features such as donor selection. For this reason, the sequential, random, and k-nearest neighbours hotdeck imputation procedures can be told to use VIM as backend.
dat <- data.frame(
foo = c(1,2,NA,4)
, bar = c(1,NA,8,NA)
)
# sequential hotdeck imputation, no sorting variables
impute_shd(dat, . ~ 1, pool="complete")
impute_shd(dat, . ~ 1, pool="univariate")
impute_shd(dat, .~1, backend="VIM")
Note that VIM uses last observation carried forward by default, and the specification of the donor pool is on a per-variable basis (this cannot be changed). See ?impute_shd for the full specification.
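As a point of reference for what "last observation carried forward" on a per-variable basis means, here is a small base-R sketch applied to the example data. The function `locf` is a hypothetical helper written for this illustration; it is neither VIM's nor simputation's implementation, and details such as the treatment of leading missing values may differ there.

```r
# Carry the last observed value forward within a single vector.
locf <- function(x) {
  observed <- x[!is.na(x)]
  idx <- cumsum(!is.na(x))  # position of the most recent observed value
  idx[idx == 0] <- NA       # nothing observed yet: leave as missing
  observed[idx]
}

dat <- data.frame(
  foo = c(1, 2, NA, 4),
  bar = c(1, NA, 8, NA)
)

# Each variable is treated on its own, so donors are chosen per column.
filled <- as.data.frame(lapply(dat, locf))
filled
```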