Tutorial to prepare train and test sets using dataPreparation

2020-02-12

1 Introduction

1.1 Purpose of this vignette

This vignette is a tutorial to prepare a train and a test set using the dataPreparation package.

In this tutorial, the following points are going to be covered:

- splitting the data set into a train and a test set;
- filtering useless variables;
- scaling;
- discretization;
- encoding categorical variables;
- controlling that both sets have the same shape.

Using the dataPreparation package, those sets will be prepared in a few simple steps.

1.2 Data set

For this tutorial, the UCI adult data set will be used.

The goal with this data set is to predict the income of individuals based on 14 variables.

Let’s have a look at the data set:

data("adult")
print(head(adult, n = 4))
#   age    type_employer fnlwgt education education_num            marital
# 1  39        State-gov  77516 Bachelors            13      Never-married
# 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
# 3  38          Private 215646   HS-grad             9           Divorced
# 4  53          Private 234721      11th             7 Married-civ-spouse
#          occupation  relationship  race  sex capital_gain capital_loss
# 1      Adm-clerical Not-in-family White Male         2174            0
# 2   Exec-managerial       Husband White Male            0            0
# 3 Handlers-cleaners Not-in-family White Male            0            0
# 4 Handlers-cleaners       Husband Black Male            0            0
#   hr_per_week       country income
# 1          40 United-States  <=50K
# 2          13 United-States  <=50K
# 3          40 United-States  <=50K
# 4          40 United-States  <=50K
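Before preparing anything, a quick look at the dimensions and column types can also be useful (a minimal sanity check; output is omitted since it depends on your copy of the data set):

# Quick sanity check: number of rows/columns and type of each column
print(dim(adult))
print(sapply(adult, class))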

2 Preparing data

2.1 Splitting train and test

To avoid introducing a bias in the test set by using information from the training data, the train-test split should be performed before (most) data preparation steps.

To simulate a train and a test set, we are going to randomly split this data set into 80% train and 20% test.

# Random sample indexes (a seed is set so the split is reproducible)
set.seed(42)
train_index <- sample(1:nrow(adult), floor(0.8 * nrow(adult)))
test_index <- setdiff(1:nrow(adult), train_index)

# Build X_train, y_train, X_test, y_test
X_train <- adult[train_index, -15]
y_train <- adult[train_index, "income"]

X_test <- adult[test_index, -15]
y_test <- adult[test_index, "income"]
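A quick check that the split is a clean partition of the rows (a minimal sketch):

# The two index sets should be disjoint and together cover every row
stopifnot(length(intersect(train_index, test_index)) == 0)
stopifnot(length(train_index) + length(test_index) == nrow(adult))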

2.2 Filter useless variables

The first thing to do, in order to make computation fast, is to filter useless variables:

- constant columns,
- duplicated columns (columns that are copies of one another),
- columns that are bijections of another column (carrying exactly the same information).

Let’s identify them:

constant_cols <- whichAreConstant(adult)
# [1] "whichAreConstant: it took me 0s to identify 0 constant column(s)"
double_cols <- whichAreInDouble(adult)
# [1] "whichAreInDouble: it took me 0s to identify 0 column(s) to drop."
bijections_cols <- whichAreBijection(adult)
# [1] "whichAreBijection: education_num is a bijection of education. I put it in drop list."
# [1] "whichAreBijection: it took me 0.07s to identify 1 column(s) to drop."

We only found one bijection: variable education_num, which is simply a numeric index for variable education. Let’s drop it:

X_train$education_num <- NULL
X_test$education_num <- NULL
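
To convince yourself that the two columns really were a bijection, a contingency table on the raw data makes it visible; each education level should hit exactly one education_num value (a quick check, output omitted):

# Each row of this table should contain a single non-zero cell
print(table(adult$education, adult$education_num))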

2.3 Scaling

Most machine learning algorithms handle scaled data better than unscaled data.

To perform scaling (that is, setting the mean to 0 and the standard deviation to 1), the function fastScale is available.

Since it is highly recommended to apply the same scaling on train and test, you should first compute the scales using the function build_scales:

scales <- build_scales(dataSet = X_train, cols = c("capital_gain", "capital_loss"), verbose = TRUE)
# [1] "build_scales: I will compute scale on  2 numeric columns."
# [1] "build_scales: it took me: 0s to compute scale for 2 numeric columns."
print(scales)
# $capital_gain
# $capital_gain$mean
# [1] 1085.825
# 
# $capital_gain$sd
# [1] 7428.122
# 
# 
# $capital_loss
# $capital_loss$mean
# [1] 85.09924
# 
# $capital_loss$sd
# [1] 398.067

As one can see, these two columns have very different means and standard deviations. Let’s apply the scaling to them:

X_train <- fastScale(dataSet = X_train, scales = scales, verbose = TRUE)
# [1] "fastScale: I will scale 2 numeric columns."
# [1] "fastScale: it took me: 0s to scale 2 numeric columns."
X_test <- fastScale(dataSet = X_test, scales = scales, verbose = TRUE)
# [1] "fastScale: I will scale 2 numeric columns."
# [1] "fastScale: it took me: 0s to scale 2 numeric columns."

And now let’s have a look at the result:

print(head(X_train[, c("capital_gain", "capital_loss")]))
#    capital_gain capital_loss
# 1:    0.4009324   -0.2137812
# 2:   -0.1461776    4.5643086
# 3:   -0.1461776    3.7152054
# 4:   -0.1461776   -0.2137812
# 5:    0.8363049   -0.2137812
# 6:   -0.1461776   -0.2137812
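
One can also verify the effect of the scaling: on the train set the mean and standard deviation are 0 and 1 by construction, while on the test set they are only approximately so, since the scales were computed on the train set (a quick check, output omitted):

# Train: mean ~0 and sd ~1 by construction
print(mean(X_train$capital_gain))
print(sd(X_train$capital_gain))
# Test: only approximately 0 and 1
print(mean(X_test$capital_gain))
print(sd(X_test$capital_gain))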

2.4 Discretization

One might want to discretize the variable age, either using an equal-frequency/equal-width method, or some hand-written bins.

To compute equal-frequency bins, build_bins is available:

bins <- build_bins(dataSet = X_train, cols = "age", n_bins = 10, type = "equal_freq")
# [1] "fastDiscretization: I will build splits for 1 numeric columns using, equal_freq method."
# [1] "fastDiscretization: it took me: 0s to build splits for 1 numeric columns."
print(bins)
# $age
#  [1] -Inf   22   26   30   33   37   41   45   51   58  Inf

To make it easy to use, this package follows the same two-step logic as for scaling: build the bins once, then apply them to both sets with fastDiscretization.

Instead of the equal-frequency bins computed above, let’s apply our own bins:

X_train <- fastDiscretization(dataSet = X_train, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fastDiscretization: I will discretize 1 numeric columns using, bins."
# [1] "fastDiscretization: it took me: 0.08s to transform 1 numeric columns into, binarised columns."
X_test <- fastDiscretization(dataSet = X_test, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fastDiscretization: I will discretize 1 numeric columns using, bins."
# [1] "fastDiscretization: it took me: 0.01s to transform 1 numeric columns into, binarised columns."

Here, bins have been defined to compute the following age groups: [0, 18[, [18, 25[, [25, 45[, [45, 62[ and [62, +Inf[.

Let’s check it:

print(table(X_train$age))
# 
#    [0, 18[   [18, 25[   [25, 45[   [45, 62[ [62, +Inf[ 
#        319       4156      13264       6645       1664

2.5 Encoding categorical

One thing to do when you are using a machine learning algorithm such as a logistic regression or a neural network is to encode factor variables. One way to do that is to perform one-hot-encoding. For example:

ID  var
 1  A
 2  B
 3  C
 4  C

Would become:

ID  var.A  var.B  var.C
 1      1      0      0
 2      0      1      0
 3      0      0      1
 4      0      0      1
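
For reference, base R can produce the same encoding on this toy example using model.matrix (a standalone sketch, independent of dataPreparation):

# Toy example: one-hot encode 'var' with base R
df <- data.frame(ID = 1:4, var = c("A", "B", "C", "C"))
print(model.matrix(~ var - 1, data = df))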

To perform it, one could use dataPreparation::one_hot_encoder, which leverages data.table to do it in a fast and RAM-efficient way. Since it is important to have the same columns in train and test, one should first compute the encoding:

encoding <- build_encoding(dataSet = X_train, cols = "auto", verbose = TRUE)
# [1] "build_encoding: I will compute encoding on 9 character and factor columns."
# [1] "build_encoding: it took me: 0s to compute encoding for 9 character and factor columns."

The argument cols = "auto" means that build_encoding will automatically select all columns that are either character or factor and prepare their encoding.
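
If you prefer to control exactly which columns get encoded, cols also accepts an explicit vector of column names (a sketch, not run here so that the encoding above is kept):

# Alternative: restrict the encoding to chosen columns
# encoding <- build_encoding(dataSet = X_train, cols = c("sex", "race"), verbose = TRUE)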

And then one can apply them to both tables:

X_train <- one_hot_encoder(dataSet = X_train, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0.02s to transform 9 column(s)."
X_test <- one_hot_encoder(dataSet = X_test, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0.01s to transform 9 column(s)."

The function is simply given the data set to transform, the encoding built on the train set, and drop = TRUE to remove the original columns once they are encoded.

Even if it’s not kept in the log, a progress bar is displayed so that you can see that the function is running and how fast. This progress bar is available in most functions of this package, and it can be really helpful when handling very large data sets.
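
Conversely, if you want to silence these logs, most functions accept verbose = FALSE (a sketch, not run here):

# Same call as above, without any log
# X_test <- one_hot_encoder(dataSet = X_test, encoding = encoding, drop = TRUE, verbose = FALSE)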

Let’s check the dimensions of X:

print("Dimensions of X_train: ")
# [1] "Dimensions of X_train: "
print(dim(X_train))
# [1] 26048   111
print("Dimensions of X_test: ")
# [1] "Dimensions of X_test: "
print(dim(X_test))
# [1] 6513  111
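
Since the same encoding was applied to both sets, they should share exactly the same columns; this can be asserted directly (a minimal check):

# No column should be present in one set and missing from the other
stopifnot(length(setdiff(names(X_train), names(X_test))) == 0)
stopifnot(length(setdiff(names(X_test), names(X_train))) == 0)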

2.6 Filtering variables

Since a lot of columns have been created, another round of filtering could be relevant:

bijections <- whichAreBijection(dataSet = X_train, verbose = TRUE)
# [1] "whichAreBijection: sex.Male is a bijection of sex.Female. I put it in drop list."
# [1] "whichAreBijection: it took me 8.86s to identify 1 column(s) to drop."

It found that column sex.Male is a bijection of column sex.Female, which is not surprising. One can confirm that the two dummy columns are complementary (a quick check, assuming the encoded columns are 0/1 indicators):
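
# On every row, exactly one of the two dummies should be set
stopifnot(all(X_train$sex.Male + X_train$sex.Female == 1))

Let’s drop one of them: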

X_train$sex.Male <- NULL
X_test$sex.Male <- NULL

3 Controlling shape

Last but not least, it is very important to make sure that the train and test sets have the same shape (for example, the same columns).

To make sure of that, one can use the following function:

X_test <- sameShape(X_test, referenceSet = X_train, verbose = TRUE)
# [1] "sameShape: verify that every column is present."
# [1] "sameShape: drop unwanted columns."
# [1] "sameShape: verify that every column is in the right type."
# [1] "sameShape: verify that every factor as the right number of levels."

No warning has been raised: everything is OK.

4 Conclusion

We presented some of the functions of the dataPreparation package. There are a few more available, and they have additional parameters to make them easier to use. So if you liked it, please go check the package documentation (by installing the package or on CRAN).

We hope that this package is helpful and that it helped you prepare your data faster.

If you would like to give us some feedback, report issues, or request features for this package, please tell us on GitHub. Also, if you want to contribute, please don’t hesitate to contact us.