Tutorial to prepare train and test sets using dataPreparation

2020-02-12

1 Introduction

1.1 Purpose of this vignette

This vignette is a tutorial to prepare a train and a test set using the dataPreparation package.

In this tutorial, the following points are going to be covered:

- splitting the data set into a train and a test set;
- filtering useless variables;
- scaling;
- discretization;
- encoding categorical variables;
- controlling that both sets have the same shape.

Using the dataPreparation package, those sets will be prepared in a few simple steps.

1.2 Data set

For this tutorial, the UCI adult data set will be used.

The goal with this data set is to predict the income of individuals based on 14 variables.

Let’s have a look at the data set:

data("adult")
print(head(adult, n = 4))
#   age    type_employer fnlwgt education education_num            marital
# 1  39        State-gov  77516 Bachelors            13      Never-married
# 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
# 3  38          Private 215646   HS-grad             9           Divorced
# 4  53          Private 234721      11th             7 Married-civ-spouse
#          occupation  relationship  race  sex capital_gain capital_loss
# 1      Adm-clerical Not-in-family White Male         2174            0
# 2   Exec-managerial       Husband White Male            0            0
# 3 Handlers-cleaners Not-in-family White Male            0            0
# 4 Handlers-cleaners       Husband Black Male            0            0
#   hr_per_week       country income
# 1          40 United-States  <=50K
# 2          13 United-States  <=50K
# 3          40 United-States  <=50K
# 4          40 United-States  <=50K
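Before preparing anything, a quick look at the dimensions and column types can also be useful (a minimal sanity check; output is omitted since it depends on your copy of the data set):

# Quick sanity check: number of rows/columns and type of each column
print(dim(adult))
print(sapply(adult, class))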

2 Preparing data

2.1 Splitting train and test

To avoid introducing a bias in the test set by using information from the training data, the train-test split should be performed before (most) data preparation steps.

To simulate a train and a test set, we are going to randomly split this data set into 80% train and 20% test.

# Random sample indexes (a seed is set so the split is reproducible)
set.seed(42)
train_index <- sample(1:nrow(adult), floor(0.8 * nrow(adult)))
test_index <- setdiff(1:nrow(adult), train_index)

# Build X_train, y_train, X_test, y_test
X_train <- adult[train_index, -15]
y_train <- adult[train_index, "income"]

X_test <- adult[test_index, -15]
y_test <- adult[test_index, "income"]
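A quick check that the split is a clean partition of the rows (a minimal sketch):

# The two index sets should be disjoint and together cover every row
stopifnot(length(intersect(train_index, test_index)) == 0)
stopifnot(length(train_index) + length(test_index) == nrow(adult))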

2.2 Filter useless variables

The first thing to do, in order to make computation fast, is to filter useless variables:

- constant columns,
- duplicated columns (columns that are copies of one another),
- columns that are bijections of another column (carrying exactly the same information).

Let’s identify them:

constant_cols <- whichAreConstant(adult)
# [1] "whichAreConstant: it took me 0s to identify 0 constant column(s)"
double_cols <- whichAreInDouble(adult)
# [1] "whichAreInDouble: it took me 0s to identify 0 column(s) to drop."
bijections_cols <- whichAreBijection(adult)
# [1] "whichAreBijection: education_num is a bijection of education. I put it in drop list."
# [1] "whichAreBijection: it took me 0.07s to identify 1 column(s) to drop."

We only found one bijection: variable education_num, which is simply a numeric index for variable education. Let’s drop it:

X_train$education_num <- NULL
X_test$education_num <- NULL
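
To convince yourself that the two columns really were a bijection, a contingency table on the raw data makes it visible; each education level should hit exactly one education_num value (a quick check, output omitted):

# Each row of this table should contain a single non-zero cell
print(table(adult$education, adult$education_num))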

2.3 Scaling

Most machine learning algorithms handle scaled data better than unscaled data.

To perform scaling (that is, setting the mean to 0 and the standard deviation to 1), the function fastScale is available.

Since it is highly recommended to apply the same scaling on train and test, you should first compute the scales using the function build_scales:

scales <- build_scales(dataSet = X_train, cols = c("capital_gain", "capital_loss"), verbose = TRUE)
# [1] "build_scales: I will compute scale on  2 numeric columns."
# [1] "build_scales: it took me: 0s to compute scale for 2 numeric columns."
print(scales)
# $capital_gain
# $capital_gain$mean
# [1] 1085.825
# 
# $capital_gain$sd
# [1] 7428.122
# 
# 
# $capital_loss
# $capital_loss$mean
# [1] 85.09924
# 
# $capital_loss$sd
# [1] 398.067

As one can see, these two columns have very different means and standard deviations. Let’s apply the scaling to them:

X_train <- fastScale(dataSet = X_train, scales = scales, verbose = TRUE)
# [1] "fastScale: I will scale 2 numeric columns."
# [1] "fastScale: it took me: 0s to scale 2 numeric columns."
X_test <- fastScale(dataSet = X_test, scales = scales, verbose = TRUE)
# [1] "fastScale: I will scale 2 numeric columns."
# [1] "fastScale: it took me: 0s to scale 2 numeric columns."

And now let’s have a look at the result:

print(head(X_train[, c("capital_gain", "capital_loss")]))
#    capital_gain capital_loss
# 1:    0.4009324   -0.2137812
# 2:   -0.1461776    4.5643086
# 3:   -0.1461776    3.7152054
# 4:   -0.1461776   -0.2137812
# 5:    0.8363049   -0.2137812
# 6:   -0.1461776   -0.2137812
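
One can also verify the effect of the scaling: on the train set the mean and standard deviation are 0 and 1 by construction, while on the test set they are only approximately so, since the scales were computed on the train set (a quick check, output omitted):

# Train: mean ~0 and sd ~1 by construction
print(mean(X_train$capital_gain))
print(sd(X_train$capital_gain))
# Test: only approximately 0 and 1
print(mean(X_test$capital_gain))
print(sd(X_test$capital_gain))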

2.4 Discretization

One might want to discretize the variable age, either using an equal-frequency/equal-width method, or some hand-written bins.

To compute equal-frequency bins, build_bins is available:

bins <- build_bins(dataSet = X_train, cols = "age", n_bins = 10, type = "equal_freq")
# [1] "fastDiscretization: I will build splits for 1 numeric columns using, equal_freq method."
# [1] "fastDiscretization: it took me: 0s to build splits for 1 numeric columns."
print(bins)
# $age
#  [1] -Inf   22   26   30   33   37   41   45   51   58  Inf

To make it easy to use, this package follows the same two-step logic as for scaling: build the bins once, then apply them to both sets with fastDiscretization.

Instead of the equal-frequency bins computed above, let’s apply our own bins:

X_train <- fastDiscretization(dataSet = X_train, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fastDiscretization: I will discretize 1 numeric columns using, bins."
# [1] "fastDiscretization: it took me: 0.08s to transform 1 numeric columns into, binarised columns."
X_test <- fastDiscretization(dataSet = X_test, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fastDiscretization: I will discretize 1 numeric columns using, bins."
# [1] "fastDiscretization: it took me: 0.01s to transform 1 numeric columns into, binarised columns."

Here, bins have been defined to compute the following age groups: [0, 18[, [18, 25[, [25, 45[, [45, 62[ and [62, +Inf[.

Let’s check it:

print(table(X_train$age))
# 
#    [0, 18[   [18, 25[   [25, 45[   [45, 62[ [62, +Inf[ 
#        319       4156      13264       6645       1664

2.5 Encoding categorical

One thing to do when you are using a machine learning algorithm such as a logistic regression or a neural network is to encode factor variables. One way to do that is to perform one-hot-encoding. For example:

ID  var
 1  A
 2  B
 3  C
 4  C

Would become:

ID  var.A  var.B  var.C
 1      1      0      0
 2      0      1      0
 3      0      0      1
 4      0      0      1
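
For reference, base R can produce the same encoding on this toy example using model.matrix (a standalone sketch, independent of dataPreparation):

# Toy example: one-hot encode 'var' with base R
df <- data.frame(ID = 1:4, var = c("A", "B", "C", "C"))
print(model.matrix(~ var - 1, data = df))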

To perform it, one could use dataPreparation::one_hot_encoder, which leverages data.table to do it in a fast and RAM-efficient way. Since it is important to have the same columns in train and test, one should first compute the encoding:

encoding <- build_encoding(dataSet = X_train, cols = "auto", verbose = TRUE)
# [1] "build_encoding: I will compute encoding on 9 character and factor columns."
# [1] "build_encoding: it took me: 0s to compute encoding for 9 character and factor columns."

The argument cols = "auto" means that build_encoding will automatically select all columns that are either character or factor and prepare their encoding.
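
If you prefer to control exactly which columns get encoded, cols also accepts an explicit vector of column names (a sketch, not run here so that the encoding above is kept):

# Alternative: restrict the encoding to chosen columns
# encoding <- build_encoding(dataSet = X_train, cols = c("sex", "race"), verbose = TRUE)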

And then one can apply them to both tables:

X_train <- one_hot_encoder(dataSet = X_train, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0.02s to transform 9 column(s)."
X_test <- one_hot_encoder(dataSet = X_test, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0.01s to transform 9 column(s)."

The function is simply given the data set to transform, the encoding built on the train set, and drop = TRUE to remove the original columns once they are encoded.

Even if it’s not kept in the log, a progress bar is displayed so that you can see that the function is running and how fast. This progress bar is available in most functions of this package, and it can be really helpful when handling very large data sets.
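
Conversely, if you want to silence these logs, most functions accept verbose = FALSE (a sketch, not run here):

# Same call as above, without any log
# X_test <- one_hot_encoder(dataSet = X_test, encoding = encoding, drop = TRUE, verbose = FALSE)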

Let’s check the dimensions of X:

print("Dimensions of X_train: ")
# [1] "Dimensions of X_train: "
print(dim(X_train))
# [1] 26048   111
print("Dimensions of X_test: ")
# [1] "Dimensions of X_test: "
print(dim(X_test))
# [1] 6513  111
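
Since the same encoding was applied to both sets, they should share exactly the same columns; this can be asserted directly (a minimal check):

# No column should be present in one set and missing from the other
stopifnot(length(setdiff(names(X_train), names(X_test))) == 0)
stopifnot(length(setdiff(names(X_test), names(X_train))) == 0)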

2.6 Filtering variables

Since a lot of columns have been created, another round of filtering could be relevant:

bijections <- whichAreBijection(dataSet = X_train, verbose = TRUE)
# [1] "whichAreBijection: sex.Male is a bijection of sex.Female. I put it in drop list."
# [1] "whichAreBijection: it took me 8.86s to identify 1 column(s) to drop."

It found that column sex.Male is a bijection of column sex.Female, which is not surprising. One can confirm that the two dummy columns are complementary (a quick check, assuming the encoded columns are 0/1 indicators):
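
# On every row, exactly one of the two dummies should be set
stopifnot(all(X_train$sex.Male + X_train$sex.Female == 1))

Let’s drop one of them: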

X_train$sex.Male <- NULL
X_test$sex.Male <- NULL

3 Controlling shape

Last but not least, it is very important to make sure that the train and test sets have the same shape (for example, the same columns).

To make sure of that, one can use the following function:

X_test <- sameShape(X_test, referenceSet = X_train, verbose = TRUE)
# [1] "sameShape: verify that every column is present."
# [1] "sameShape: drop unwanted columns."
# [1] "sameShape: verify that every column is in the right type."
# [1] "sameShape: verify that every factor as the right number of levels."

No warning has been raised: everything is OK.

4 Conclusion

We presented some of the functions of the dataPreparation package. There are a few more available, and they have additional parameters to make them easier to use. So if you liked it, please go check the package documentation (by installing the package or on CRAN).

We hope that this package is helpful and that it helped you prepare your data faster.

If you would like to give us some feedback, report issues, or request features for this package, please tell us on GitHub. Also, if you want to contribute, please don’t hesitate to contact us.