This vignette is a tutorial on preparing a train and a test set using the dataPreparation package.
In this tutorial the following points are going to be covered. Using the dataPreparation package, those sets will be:

- split from the raw data without leaking information from train to test,
- stripped of useless variables,
- scaled, discretized and one hot encoded,
- checked to make sure they have the same shape.
For this tutorial, the UCI adult data set will be used. The goal with this data set is to predict the income of individuals based on 14 variables. Let’s have a look at the data set:
data("adult")
print(head(adult, n = 4))
# age type_employer fnlwgt education education_num marital
# 1 39 State-gov 77516 Bachelors 13 Never-married
# 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
# 3 38 Private 215646 HS-grad 9 Divorced
# 4 53 Private 234721 11th 7 Married-civ-spouse
# occupation relationship race sex capital_gain capital_loss
# 1 Adm-clerical Not-in-family White Male 2174 0
# 2 Exec-managerial Husband White Male 0 0
# 3 Handlers-cleaners Not-in-family White Male 0 0
# 4 Handlers-cleaners Husband Black Male 0 0
# hr_per_week country income
# 1 40 United-States <=50K
# 2 13 United-States <=50K
# 3 40 United-States <=50K
# 4 40 United-States <=50K
To avoid introducing a bias into the test set through information taken from the train set, the train-test split should be performed before (most) data preparation steps. To simulate a train and a test set, we are going to randomly split this data set into 80% train and 20% test.
# Random sample indexes
train_index <- sample(1:nrow(adult), 0.8 * nrow(adult))
test_index <- setdiff(1:nrow(adult), train_index)
# Build X_train, y_train, X_test, y_test
X_train <- adult[train_index, -15]
y_train <- adult[train_index, "income"]
X_test <- adult[test_index, -15]
y_test <- adult[test_index, "income"]
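Note that sample() draws random indexes, so the split differs between runs. To make it reproducible, one might set a seed before sampling (a small addition, not part of the original code):
# Call this before the sample() above so the same split is drawn every run
set.seed(42)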
The first thing to do, in order to keep computations fast, is to filter out useless variables: constant columns, columns in double, and bijections. Let’s identify them:
constant_cols <- whichAreConstant(adult)
# [1] "whichAreConstant: it took me 0s to identify 0 constant column(s)"
double_cols <- whichAreInDouble(adult)
# [1] "whichAreInDouble: it took me 0s to identify 0 column(s) to drop."
bijections_cols <- whichAreBijection(adult)
# [1] "whichAreBijection: education_num is a bijection of education. I put it in drop list."
# [1] "whichAreBijection: it took me 0.07s to identify 1 column(s) to drop."
We found only one bijection: variable education_num, which is an index for variable education. Let’s drop it:
X_train$education_num = NULL
X_test$education_num = NULL
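As a shortcut, the package also provides fastFilterVariables, which chains this kind of checks and drops the flagged columns in one call (a sketch; see the package documentation for the exact list of checks performed):
# One-call alternative: identify and drop useless columns (constants, doubles, bijections)
adult_filtered <- fastFilterVariables(dataSet = adult, verbose = TRUE)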
Most machine learning algorithms handle scaled data better than unscaled data. To perform scaling (that is, setting the mean to 0 and the standard deviation to 1), the function fastScale is available. Since it is highly recommended to apply the same scaling to train and test, you should first compute the scales using the function build_scales:
scales <- build_scales(dataSet = X_train, cols = c("capital_gain", "capital_loss"), verbose = TRUE)
# [1] "build_scales: I will compute scale on 2 numeric columns."
# [1] "build_scales: it took me: 0s to compute scale for 2 numeric columns."
print(scales)
# $capital_gain
# $capital_gain$mean
# [1] 1085.825
#
# $capital_gain$sd
# [1] 7428.122
#
#
# $capital_loss
# $capital_loss$mean
# [1] 85.09924
#
# $capital_loss$sd
# [1] 398.067
As one can see, those two columns have very different means and standard deviations. Let’s apply scaling to them:
X_train <- fastScale(dataSet = X_train, scales = scales, verbose = TRUE)
# [1] "fastScale: I will scale 2 numeric columns."
# [1] "fastScale: it took me: 0s to scale 2 numeric columns."
X_test <- fastScale(dataSet = X_test, scales = scales, verbose = TRUE)
# [1] "fastScale: I will scale 2 numeric columns."
# [1] "fastScale: it took me: 0s to scale 2 numeric columns."
And now let’s have a look at the result:
print(head(X_train[, c("capital_gain", "capital_loss")]))
# capital_gain capital_loss
# 1: 0.4009324 -0.2137812
# 2: -0.1461776 4.5643086
# 3: -0.1461776 3.7152054
# 4: -0.1461776 -0.2137812
# 5: 0.8363049 -0.2137812
# 6: -0.1461776 -0.2137812
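Should you later need the original units back (for instance to interpret results), fastScale can also reverse the transformation. The sketch below assumes the way argument available in recent versions of the package:
# Reverse the scaling using the same scales object (assumes `way = "unscale"` is supported)
X_train_unscaled <- fastScale(dataSet = copy(X_train), scales = scales, way = "unscale", verbose = FALSE)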
One might want to discretize the variable age, either using an equal-frequency/equal-width method or some hand-written bins. To compute equal-frequency bins, build_bins is available:
bins <- build_bins(dataSet = X_train, cols = "age", n_bins = 10, type = "equal_freq")
# [1] "fastDiscretization: I will build splits for 1 numeric columns using, equal_freq method."
# [1] "fastDiscretization: it took me: 0s to build splits for 1 numeric columns."
print(bins)
# $age
# [1] -Inf 22 26 30 33 37 41 45 51 58 Inf
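For comparison, equal-width bins can be built with the same function by changing the type argument (sketch; output not shown):
# Equal-width alternative: 10 bins of identical width over the range of age
bins_width <- build_bins(dataSet = X_train, cols = "age", n_bins = 10, type = "equal_width")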
To make the package easy to use, two conventions hold throughout:

- dataSet will always denote the data.table on which you want to perform something,
- cols will always denote the columns on which you want to apply the function. It can also be set to “auto” to apply the function to all relevant columns.

Let’s apply our own bins:
X_train <- fastDiscretization(dataSet = X_train, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fastDiscretization: I will discretize 1 numeric columns using, bins."
# [1] "fastDiscretization: it took me: 0.08s to transform 1 numeric columns into, binarised columns."
X_test <- fastDiscretization(dataSet = X_test, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fastDiscretization: I will discretize 1 numeric columns using, bins."
# [1] "fastDiscretization: it took me: 0.01s to transform 1 numeric columns into, binarised columns."
Here, bins have been defined to compute custom age groups. Let’s check the result:
print(table(X_train$age))
#
# [0, 18[ [18, 25[ [25, 45[ [45, 62[ [62, +Inf[
# 319 4156 13264 6645 1664
One thing to do when you are using a machine learning algorithm such as a logistic regression or a neural network is to encode factor variables. One way to do that is to perform one-hot encoding. For example:
ID | var |
---|---|
1 | A |
2 | B |
3 | C |
4 | C |
Would become:
ID | var.A | var.B | var.C |
---|---|---|---|
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
To perform it, one can use dataPreparation::one_hot_encoder, which uses the power of data.table to do it in a fast and RAM-efficient way. Since it is important to have the same columns in train and test, one will first compute the encoding:
encoding <- build_encoding(dataSet = X_train, cols = "auto", verbose = TRUE)
# [1] "build_encoding: I will compute encoding on 9 character and factor columns."
# [1] "build_encoding: it took me: 0s to compute encoding for 9 character and factor columns."
The argument cols = “auto” means that build_encoding will automatically select all columns that are either character or factor to prepare encoding.
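If you only want to encode a subset of columns, you can pass their names explicitly instead of “auto” (a minimal sketch):
# Encode only two chosen columns instead of every character/factor column
encoding_small <- build_encoding(dataSet = X_train, cols = c("race", "sex"), verbose = FALSE)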
And then one can apply them to both tables:
X_train <- one_hot_encoder(dataSet = X_train, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0.02s to transform 9 column(s)."
X_test <- one_hot_encoder(dataSet = X_test, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0.01s to transform 9 column(s)."
Even if it is not kept in the log, a progress bar is displayed so you can see that the function is running and how fast. This progress bar is available in most functions of this package; it can be really helpful when handling very large data sets.
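Conversely, if you prefer quiet runs (for example inside a script), these functions accept verbose = FALSE, which silences the log (sketch):
# Same check as earlier, but without any printed log
constant_cols <- whichAreConstant(adult, verbose = FALSE)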
Let’s check the dimensions of X_train and X_test:
print("Dimensions of X_train: ")
# [1] "Dimensions of X_train: "
print(dim(X_train))
# [1] 26048 111
print("Dimensions of X_test: ")
# [1] "Dimensions of X_test: "
print(dim(X_test))
# [1] 6513 111
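As a quick sanity check (an addition, not part of the original vignette), one can assert that both sets ended up with the same number of columns:
# Stop with an error if train and test column counts diverge
stopifnot(ncol(X_train) == ncol(X_test))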
Since a lot of columns have been created, a filtering could be relevant:
bijections <- whichAreBijection(dataSet = X_train, verbose = TRUE)
# [1] "whichAreBijection: sex.Male is a bijection of sex.Female. I put it in drop list."
# [1] "whichAreBijection: it took me 8.86s to identify 1 column(s) to drop."
It found that column sex.Male is a bijection of column sex.Female, which is not surprising. Let’s drop one of them:
X_train$sex.Male = NULL
X_test$sex.Male = NULL
Last but not least, it is very important to make sure that the train and test sets have the same shape (for example, the same columns). To ensure that, one can apply the sameShape function, using the train set as the reference:
X_test <- sameShape(X_test, referenceSet = X_train, verbose = TRUE)
# [1] "sameShape: verify that every column is present."
# [1] "sameShape: drop unwanted columns."
# [1] "sameShape: verify that every column is in the right type."
# [1] "sameShape: verify that every factor as the right number of levels."
# [1] FALSE
# (this line is repeated once per checked column; truncated here for readability)
No warning has been raised: everything is OK.
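To illustrate what sameShape would fix, here is a hypothetical example in which a column has been lost from the test set and gets rebuilt (the column and variable names are chosen for illustration):
# Hypothetical: drop a column from a copy of X_test, then let sameShape rebuild it
X_test_broken <- copy(X_test)
X_test_broken[["capital_gain"]] <- NULL
X_test_fixed <- sameShape(X_test_broken, referenceSet = X_train, verbose = TRUE)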
We presented some of the functions of the dataPreparation package. A few more are available, and they offer additional parameters to make them easier to use. So if you liked this tutorial, please check the package documentation (by installing the package or on CRAN).
We hope that this package is helpful and that it helped you prepare your data faster.
If you would like to give us some feedback, report issues, or suggest features for this package, please tell us on GitHub. Also, if you want to contribute, please don’t hesitate to contact us.