This vignette introduces the dataPreparation package (v0.2): what it offers and how simple it is to use.

1 Introduction

1.1 Package presentation

Built on the data.table package, dataPreparation lets you do most of the painful data preparation for a data science project with a minimal amount of code.

This package is:

fast (it uses data.table and exponential search)
RAM efficient (it performs operations by reference and column-wise to avoid copying data)
stable (most exceptions are handled)
verbose (it logs a lot)

data.table and other dependencies are handled at installation.

1.2 Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare one's data. Preparing a data set for a data science project can be long and tricky. The main steps are the following:

Read: load the data set (this package doesn't cover this step: for csv files we recommend data.table::fread)
Correct: most of the time there are some mistakes after reading, such as wrong formats, and one has to correct them
Transform: create new features from date, categorical, character… columns in order to have information usable by a ML algorithm (i.e. numeric or categorical)
Filter: get rid of useless information in order to speed up computation
Handle NA: replace missing values
Pre model transformation: specific manipulations for the chosen model (handling NA, discretization, one hot encoding, scaling…)
Shape: put your data set in a nice shape usable by a ML algorithm

Here are the functions available in this package to tackle those issues:

Correct	Transform	Filter	Pre model manipulation	Shape
unFactor	generateDateDiffs	fastFilterVariables	fastHandleNa	shapeSet
findAndTransformDates	generateFactorFromDate	whichAreConstant	fastDiscretization	sameShape
findAndTransformNumerics	aggregateByKey	whichAreInDouble	fastScale	setAsNumericMatrix
setColAsCharacter	generateFromFactor	whichAreBijection		one_hot_encoder
setColAsNumeric	generateFromCharacter
setColAsDate	fastRound
setColAsFactor	target_encode

All of those functions are integrated in the full pipeline function prepareSet.

In this tutorial we will detail all those steps and how to treat them with this package using an example data set.

1.3 Tutorial data

For this tutorial, we are going to use a messy version of the adult data set.

data(messy_adult)
print(head(messy_adult, n = 4))

#        date1      date2        date3              date4    num1   num2 constant
# 1:      <NA> 1510441200  24-Mar-2017     26-march, 2017  1.9309 0,0864        1
# 2: 2017-26-9 1490482800  01-Feb-2017  03-february, 2017 -0.4273 0,6345        1
# 3:      <NA> 1510614000  18-Sep-2017 20-september, 2017  0.6093 1,8958        1
# 4:  2017-6-1         NA  25-Jun-2017      27-june, 2017 -0.5138 0,4505        1
#                                mail    num3 age    type_employer fnlwgt
# 1:          pierre.caroline@aol.com  1,9309  39        State-gov  77516
# 2:           pierre.lucas@yahoo.com -0,4273  50 Self-emp-not-inc  83311
# 3: caroline.caroline@protonmail.com  0,6093  38          Private 215646
# 4:         marie.caroline@gmail.com -0,5138  53          Private 234721
#    education education_num            marital        occupation  relationship
# 1: Bachelors            13      Never-married      Adm-clerical Not-in-family
# 2: Bachelors            13 Married-civ-spouse   Exec-managerial       Husband
# 3:   HS-grad             9           Divorced Handlers-cleaners Not-in-family
# 4:      11th             7 Married-civ-spouse Handlers-cleaners       Husband
#     race  sex capital_gain capital_loss hr_per_week       country income
# 1: White Male         2174            0          40 United-States  <=50K
# 2: White Male            0            0          13 United-States  <=50K
# 3: White Male            0            0          40 United-States  <=50K
# 4: Black Male            0            0          40 United-States  <=50K

We added 9 really ugly columns to the data set:

4 date columns with various formats or time stamps, and NAs
1 constant column
3 numeric columns with different decimal separators
1 email address column

The same information can be contained in two different columns.

2 Correct functions

2.1 Identifying factors that shouldn’t be

When reading a data set, R often stores strings as factors even when it shouldn’t. In this tutorial data set, mail is a factor but shouldn’t be. The unFactor function detects this automatically:

print(class(messy_adult$mail))

# "factor"

messy_adult <- unFactor(messy_adult)

# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0s to unfactor 1 column(s)."

print(class(messy_adult$mail))

# "character"
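To make the idea concrete, here is a minimal base R sketch of this kind of check. It is not the package's implementation: the helper name unfactor_sketch and the max_levels threshold are assumptions made for illustration, since this vignette does not state the rule unFactor applies.

```r
# Hypothetical helper: turn factor columns with "too many" levels back into
# character. The threshold is an illustrative assumption, not the package's rule.
unfactor_sketch <- function(df, max_levels = 3) {
  for (col in names(df)) {
    if (is.factor(df[[col]]) && nlevels(df[[col]]) > max_levels) {
      df[[col]] <- as.character(df[[col]])
    }
  }
  df
}

# Toy data: mail has one level per row, sex has only two levels.
toy <- data.frame(
  mail = factor(c("a@x.com", "b@x.com", "c@x.com", "d@x.com")),
  sex  = factor(c("Male", "Female", "Male", "Male"))
)
toy <- unfactor_sketch(toy)
print(sapply(toy, class))
```

In this toy example only mail is converted, because sex has few levels and is a legitimate factor.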

2.2 Identifying and transforming date columns

The next thing to do is to identify columns that are dates (the first 4) and transform them.

messy_adult <- findAndTransformDates(messy_adult)

# "findAndTransformDates: It took me 0.61s to identify formats"
# "findAndTransformDates: It took me 0.07s to transform 4 columns to a Date format."

Let’s have a look at the transformation performed on those 4 columns:

date1_prev	date2_prev	date3_prev	date4_prev	transfo	date1	date2	date3	date4
NA	1510441200	24-Mar-2017	26-march, 2017	=>	NA	2017-11-12 00:00:00	2017-03-24	2017-03-26
2017-26-9	1490482800	01-Feb-2017	03-february, 2017	=>	2017-09-26	2017-03-26 00:00:00	2017-02-01	2017-02-03
NA	1510614000	18-Sep-2017	20-september, 2017	=>	NA	2017-11-14 00:00:00	2017-09-18	2017-09-20
2017-6-1	NA	25-Jun-2017	27-june, 2017	=>	2017-01-06	NA	2017-06-25	2017-06-27
NA	1494457200	26-Jan-2017	28-january, 2017	=>	NA	2017-05-11 01:00:00	2017-01-26	2017-01-28
2017-18-7	1494370800	04-Apr-2017	06-april, 2017	=>	2017-07-18	2017-05-10 01:00:00	2017-04-04	2017-04-06

As one can see, even though the formats were different and somewhat ugly, they were all handled.
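One simple way to picture format detection is to try a list of candidate formats and keep the first one that parses every non-missing value. The sketch below does this in base R. The helper guess_date_format and its candidate list are assumptions for illustration; the search performed by findAndTransformDates is not described in this vignette and may differ.

```r
# Hypothetical helper: return the first candidate format that parses every
# non-missing value, or NA if none does.
guess_date_format <- function(x, formats = c("%Y-%m-%d", "%Y-%d-%m", "%d-%m-%Y")) {
  x <- x[!is.na(x)]
  for (fmt in formats) {
    parsed <- as.Date(x, format = fmt)
    if (!anyNA(parsed)) return(fmt)
  }
  NA_character_
}

# Values in the style of the date1 column (year-day-month, with NAs).
date1 <- c(NA, "2017-26-9", NA, "2017-6-1", "2017-18-7")
fmt <- guess_date_format(date1)
date1_parsed <- as.Date(date1, format = fmt)
print(fmt)
print(date1_parsed)

# Time stamps such as date2 are seconds since 1970-01-01; the printed
# clock time depends on the time zone chosen (UTC here).
print(as.POSIXct(1510441200, origin = "1970-01-01", tz = "UTC"))
```

Here "%Y-%m-%d" is rejected because 26 is not a valid month, so the year-day-month format is selected.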

2.3 Identifying and transforming numeric columns

And now the same thing with numeric columns:

messy_adult <- findAndTransformNumerics(messy_adult)

# "findAndTransformNumerics: It took me 0s to identify 3 numerics column(s), i will set them as numerics"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the column num1."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the column num2."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I am doing the column num3."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "findAndTransformNumerics: It took me 0.04s to transform 3 column(s) to a numeric format."

num1_prev	num2_prev	num3_prev	transfo	num1	num2	num3
1.9309	0,0864	1,9309	=>	1.9309	0.0864	1.9309
-0.4273	0,6345	-0,4273	=>	-0.4273	0.6345	-0.4273
0.6093	1,8958	0,6093	=>	0.6093	1.8958	0.6093
-0.5138	0,4505	-0,5138	=>	-0.5138	0.4505	-0.5138
1.0563	1,342	1,0563	=>	1.0563	1.3420	1.0563
-0.9377	-0,0421	-0,9377	=>	-0.9377	-0.0421	-0.9377

So now our data set is a bit less ugly.
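The decimal separator issue itself can be illustrated in a few lines of base R. This is only a sketch under simple assumptions (the helper to_numeric_sketch is hypothetical and does not handle thousands separators); it is not how findAndTransformNumerics is implemented.

```r
# Hypothetical helper: accept both "." and "," as decimal separator.
to_numeric_sketch <- function(x) {
  as.numeric(gsub(",", ".", as.character(x), fixed = TRUE))
}

# Values in the style of the num2 column.
num2 <- c("0,0864", "0,6345", "-0,0421")
print(to_numeric_sketch(num2))
```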

3 Filter functions

The idea now is to identify useless columns:

constant columns: they take the same value on every line,
double columns: they have an exact copy in the data set,
bijection columns: another column contains exactly the same information (but maybe coded differently), for example col1: Men/Women, col2: M/W.

3.1 Look for constant variables

constant_cols <- whichAreConstant(messy_adult)

# "whichAreConstant: constant is constant."
# "whichAreConstant: it took me 0s to identify 1 constant column(s)"
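The definition of a constant column translates directly into base R. The sketch below uses a hypothetical helper, which_constant_sketch, that returns column names. That return type is an assumption of the sketch and not necessarily what whichAreConstant returns.

```r
# Hypothetical helper: names of the columns holding at most one distinct value.
which_constant_sketch <- function(df) {
  is_const <- vapply(df, function(col) length(unique(col)) <= 1, logical(1))
  names(df)[is_const]
}

toy <- data.frame(constant = c(1, 1, 1), age = c(39, 50, 38))
print(which_constant_sketch(toy))
```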

3.2 Look for columns in double

double_cols <- whichAreInDouble(messy_adult)

# "whichAreInDouble: num3 is exactly equal to num1. I put it in drop list."
# "whichAreInDouble: it took me 0.01s to identify 1 column(s) to drop."
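Exact copies can be found in base R with duplicated() applied to the list of columns. This is a sketch only: the helper which_in_double_sketch is hypothetical, it keeps the first occurrence and flags later copies, and it says nothing about how whichAreInDouble achieves its speed.

```r
# Hypothetical helper: names of the columns that exactly repeat an earlier column.
which_in_double_sketch <- function(df) {
  names(df)[duplicated(as.list(df))]
}

toy <- data.frame(
  num1 = c(1.9309, -0.4273, 0.6093),
  num2 = c(0.0864, 0.6345, 1.8958),
  num3 = c(1.9309, -0.4273, 0.6093)
)
print(which_in_double_sketch(toy))
```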

3.3 Look for columns that are bijections of one another

bijections_cols <- whichAreBijection(messy_adult)

# "whichAreBijection: date4 is a bijection of date3. I put it in drop list."
# "whichAreBijection: num3 is a bijection of num1. I put it in drop list."
# "whichAreBijection: education_num is a bijection of education. I put it in drop list."
# "whichAreBijection: it took me 0.17s to identify 3 column(s) to drop."

To check this, let’s have a look at the columns concerned:

constant	date3	date4	num1	num3	education	education_num
1	2017-03-24	2017-03-26	1.9309	1.9309	Bachelors	13
1	2017-02-01	2017-02-03	-0.4273	-0.4273	Bachelors	13
1	2017-09-18	2017-09-20	0.6093	0.6093	HS-grad	9
1	2017-06-25	2017-06-27	-0.5138	-0.5138	11th	7
1	2017-01-26	2017-01-28	1.0563	1.0563	Bachelors	13
1	2017-04-04	2017-04-06	-0.9377	-0.9377	Masters	14

Indeed:

constant was built as a constant: it contains only 1,
num1 and num3 are equal,
date3 and date4 are separated by 2 days: date4 doesn’t contain any new information for a ML algorithm,
education and education_num contain the same information, one as a key index, the other as the corresponding character value. whichAreBijection keeps the character column.
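A bijection between two columns can be tested by counting distinct values: the two columns carry the same information when the number of distinct pairs equals the number of distinct values in each column. The base R sketch below illustrates that criterion. The helper is_bijection_sketch is hypothetical and is not the package's algorithm.

```r
# Hypothetical helper: TRUE when each value of a maps to exactly one value of b
# and vice versa.
is_bijection_sketch <- function(a, b) {
  n_pairs <- nrow(unique(data.frame(a = a, b = b)))
  n_pairs == length(unique(a)) && n_pairs == length(unique(b))
}

education     <- c("Bachelors", "Bachelors", "HS-grad", "11th", "Masters")
education_num <- c(13, 13, 9, 7, 14)
sex           <- c("Male", "Female", "Male", "Male", "Female")

print(is_bijection_sketch(education, education_num))
print(is_bijection_sketch(education, sex))
```

The first pair is a bijection; the second is not, because "Bachelors" appears with two different values of sex.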

3.4 Filter them all

To directly filter all of them:

ncols <- ncol(messy_adult)
messy_adult <- fastFilterVariables(messy_adult)
print(paste0("messy_adult now have ", ncol(messy_adult), " columns; so ", ncols - ncol(messy_adult), " less than before."))

# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I delete 1 column(s) that are in double in dataSet."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 2 column(s) that are bijections of another column in dataSet."
# "messy_adult now have 20 columns; so 4 less than before."

4 useless columns have been deleted. Without those useless columns, your machine learning algorithm will at least be faster and maybe give better results.

4 Transform functions

Before sending this to a machine learning algorithm, a few transformations should be performed.

The idea with the functions presented here is to perform those transformations in a RAM efficient way.

4.1 Date differences

Since most machine learning algorithms do not handle dates directly, one needs to transform them or drop them. One way to transform dates is to compute the differences between every pair of dates.

We can also add an analysis date to compare dates with the date your data is from. For example, if you have a birth date you may want to compute age as today - birth date.

messy_adult <- generateDateDiffs(messy_adult, cols = "auto", analysisDate = as.Date("2018-01-01"), units = "days")

# "generateDateDiffs: I will generate difference between dates."
# "generateDateDiffs: It took me 0s to create 6 column(s)."

…	date1.Minus.date3	date1.Minus.analysisDate	date2.Minus.date3	date2.Minus.analysisDate	date3.Minus.analysisDate
…	NA	NA	232.95833	-50	-282.9583
…	237	-96.95833	52.95833	-281	-333.9583
…	NA	NA	56.95833	-48	-104.9583
…	-170	-359.95833	NA	NA	-189.9583
…	NA	NA	104.95833	-235	-339.9583
…	105	-166.95833	35.95833	-236	-271.9583
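For two of these columns, the same kind of computation can be written by hand with base R's difftime(). This sketch uses plain Date values taken from the earlier tables, so the differences are whole days. The column names simply mimic the ones above.

```r
date1 <- as.Date(c("2017-09-26", NA, "2017-01-06"))
date3 <- as.Date(c("2017-02-01", "2017-09-18", "2017-06-25"))
analysis_date <- as.Date("2018-01-01")

# Pairwise differences in days; NA propagates when a date is missing.
diffs <- data.frame(
  date1.Minus.date3        = as.numeric(difftime(date1, date3, units = "days")),
  date1.Minus.analysisDate = as.numeric(difftime(date1, analysis_date, units = "days"))
)
print(diffs)
```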

4.2 Transforming dates into aggregates

Another way to work around dates would be to aggregate them at some level. This time drop is set to TRUE in order to drop date columns.

messy_adult <- generateFactorFromDate(messy_adult, cols = "auto", type = "quarter", drop = TRUE)

# "generateFactorFromDate: I will create a factor column from each date column."
# "generateFactorFromDate: It took me 0.04s to transform 3 column(s)."

…	date1.quarter	date2.quarter	date3.quarter
…	QNA	Q4	Q1
…	Q3	Q1	Q1
…	QNA	Q4	Q3
…	Q1	QNA	Q2
…	QNA	Q2	Q1
…	Q3	Q2	Q2
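In base R, the quarter of a date is available through quarters(). The sketch below builds a quarter factor for a few date3 values from the earlier table; it illustrates the transformation only and is not the package's code.

```r
date3 <- as.Date(c("2017-03-24", "2017-02-01", "2017-09-18", "2017-06-25"))

# quarters() returns "Q1".."Q4"; storing the result as a factor keeps all
# four levels even when some quarters are absent from the data.
date3_quarter <- factor(quarters(date3), levels = c("Q1", "Q2", "Q3", "Q4"))
print(date3_quarter)
```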

4.3 Generate features from character columns

Most machine learning algorithms do not handle character columns, so one should transform them. The generateFromCharacter function builds some new features from them, and then drops them.

messy_adult <- generateFromCharacter(messy_adult, cols = "auto", drop = TRUE)

# "generateFromCharacter: it took me: 0.01s to transform 1 character columns into, 3 new columns."

mail.notnull	mail.num	mail.order
FALSE	200	1
FALSE	200	1
FALSE	200	1
FALSE	200	1
FALSE	200	1
FALSE	200	1
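The vignette does not define these three columns. One plausible reading of their names, and it is only an assumption, is: notnull flags non-missing values, num counts how often each value occurs, and order ranks values by decreasing frequency. Under that assumption, a base R sketch could look like this (char_features_sketch is a hypothetical helper):

```r
# Hypothetical helper implementing the assumed meaning of the three features.
char_features_sketch <- function(x) {
  keys   <- unique(x)                        # distinct values, NA included
  idx    <- match(x, keys)                   # position of each value in keys
  counts <- tabulate(idx, nbins = length(keys))
  data.frame(
    notnull = !is.na(x) & x != "",           # FALSE for NA or empty strings
    num     = counts[idx],                   # number of occurrences of the value
    order   = rank(-counts, ties.method = "min")[idx]  # 1 = most frequent value
  )
}

mail <- c(NA, "a@x.com", NA, "b@x.com", NA)
feats <- char_features_sketch(mail)
print(feats)
```

With this reading, the rows shown above (notnull FALSE, num 200, order 1) would correspond to lines where mail is missing and NA is the most frequent value.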

4.4 Aggregate according to a key

To model something by country, one would want to compute an aggregation of this table in order to have one line per country.

agg_adult <- aggregateByKey(messy_adult, key = "country")

# "aggregateByKey: I start to aggregate"
# "aggregateByKey: 139 columns have been constructed. It took 0.24 seconds. "

country	max.age	education.Assoc-acdm	…
?	90	10	…
Cambodia	65	0	…
Canada	80	1	…
China	75	0	…
Columbia	75	4	…
Cuba	82	3	…

This function is useful every time you have more than one line per individual.
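To show what such an aggregation produces, here is a base R sketch on toy data that builds two of the column families seen above: a maximum for a numeric column and a count per level for a categorical column. The toy values are made up, and the sketch does not reproduce the full set of aggregates built by aggregateByKey.

```r
toy <- data.frame(
  country   = c("Cuba", "Cuba", "Canada", "Canada", "Canada"),
  age       = c(82, 30, 80, 45, 17),
  education = c("HS-grad", "Bachelors", "Bachelors", "Bachelors", "Masters")
)

# Numeric column: one summary value per key.
max_age <- aggregate(age ~ country, data = toy, FUN = max)
names(max_age)[2] <- "max.age"

# Categorical column: one count column per level.
edu_count <- as.data.frame.matrix(table(toy$country, toy$education))
names(edu_count) <- paste0("education.", names(edu_count))
edu_count$country <- rownames(edu_count)

# One line per country.
agg <- merge(max_age, edu_count, by = "country")
print(agg)
```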

4.5 Rounding

One might want to round numeric variables in order to save some RAM, or for algorithmic reasons:

messy_adult <- fastRound(messy_adult, digits = 2)

num1	num2	age	type_employer	fnlwgt	education	…
0.59	-0.50	60	Private	173960	Bachelors	…
NA	-0.60	25	Private	371987	Bachelors	…
NA	0.48	26	Private	94936	Assoc-acdm	…
0.02	2.83	28	Private	166481	7th-8th	…
-0.87	-0.39	45	Self-emp-inc	197332	Some-college	…
1.20	-0.74	31	Private	244147	HS-grad	…

5 Handling NA values

Then, let’s handle NAs:

messy_adult <- fastHandleNa(messy_adult)

#    num1  num2 age type_employer   ...       country income date1.Minus.date2
# 1: 0.59 -0.50  60       Private   ... United-States  <=50K           -173.96
# 2: 0.00 -0.60  25       Private   ... United-States  <=50K             23.04
# 3: 0.00  0.48  26       Private   ... United-States  <=50K            -73.96
# 4: 0.02  2.83  28       Private   ...   Puerto-Rico  <=50K           -234.96
#    date1.Minus.date3 date1.Minus.analysisDate date2.Minus.date3
# 1:                65                  -293.96            238.96
# 2:              -117                  -334.96           -140.04
# 3:               -33                  -138.96             40.96
# 4:              -228                  -336.96              6.96
#    date2.Minus.analysisDate date3.Minus.analysisDate date1.quarter
# 1:                     -120                  -358.96            Q1
# 2:                     -358                  -217.96            Q1
# 3:                      -65                  -105.96            Q3
# 4:                     -102                  -108.96            Q1
#    date2.quarter date3.quarter mail.notnull mail.num mail.order
# 1:            Q3            Q1        FALSE      200          1
# 2:            Q1            Q2        FALSE      200          1
# 3:            Q4            Q3        FALSE      200          1
# 4:            Q3            Q3        FALSE      200          1

It sets default values in place of NAs. If you want to use specific values (constants, or even a function, for example the mean of the values), check the fastHandleNa documentation.
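The two options mentioned here, a constant or a function of the observed values, can be sketched in base R for numeric columns. The helper handle_na_sketch and its numeric_fill argument are hypothetical and do not mirror the fastHandleNa signature. The default of 0 is chosen because the output above shows missing num1 values replaced by 0.

```r
# Hypothetical helper: fill NAs in numeric columns with a constant, or with
# the result of a function applied to the non-missing values.
handle_na_sketch <- function(df, numeric_fill = 0) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!is.numeric(x) || !anyNA(x)) next
    fill <- if (is.function(numeric_fill)) numeric_fill(x[!is.na(x)]) else numeric_fill
    x[is.na(x)] <- fill
    df[[col]] <- x
  }
  df
}

toy <- data.frame(num1 = c(0.59, NA, NA, 0.02), age = c(60, 25, 26, 28))
filled_zero <- handle_na_sketch(toy)
filled_mean <- handle_na_sketch(toy, numeric_fill = mean)
print(filled_zero)
print(filled_mean)
```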

6 Shape functions

There are two types of machine learning algorithms in R: those that accept data.table and factor, and those that only accept a numeric matrix.

Transforming a data set into something acceptable for a machine learning algorithm can be tricky.

The shapeSet function does it for you: you just have to choose whether you want a data.table or a numerical_matrix.

First with data.table:

clean_adult <- shapeSet(copy(messy_adult), finalForm = "data.table", verbose = FALSE)

# "setColAsFactor: num1 has more than 10 values, i don't transform it."
# "setColAsFactor: num2 has more than 10 values, i don't transform it."
# "setColAsFactor: age has more than 10 values, i don't transform it."
# "setColAsFactor: fnlwgt has more than 10 values, i don't transform it."
# "setColAsFactor: capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date2 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date3.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: mail.num has more than 10 values, i don't transform it."
# "setColAsFactor: mail.order has more than 10 values, i don't transform it."

print(table(sapply(clean_adult, class)))

# 
#  factor integer numeric 
#      12       1      15

As one can see, only numeric (integer included) and factor columns remain.

Now with numerical_matrix:

clean_adult <- shapeSet(copy(messy_adult), finalForm = "numerical_matrix", verbose = FALSE)

# "setColAsFactor: num1 has more than 10 values, i don't transform it."
# "setColAsFactor: num2 has more than 10 values, i don't transform it."
# "setColAsFactor: age has more than 10 values, i don't transform it."
# "setColAsFactor: fnlwgt has more than 10 values, i don't transform it."
# "setColAsFactor: capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date2 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date3.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: mail.num has more than 10 values, i don't transform it."
# "setColAsFactor: mail.order has more than 10 values, i don't transform it."

num1	num2	age	…
0.59	-0.50	60	…
0.00	-0.60	25	…
0.00	0.48	26	…
0.02	2.83	28	…
-0.87	-0.39	45	…
1.20	-0.74	31	…

As one can see, with finalForm = "numerical_matrix" every character and factor column has been binarized.
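Binarizing a factor means creating one 0/1 column per level. In base R this can be sketched with model.matrix(), shown below on toy data. It illustrates the idea only: shapeSet may name and order the resulting columns differently.

```r
toy <- data.frame(
  age = c(39, 50, 38),
  sex = factor(c("Male", "Female", "Male"))
)

# "- 1" removes the intercept so that the factor gets one indicator column
# per level instead of being coded against a reference level.
m <- model.matrix(~ age + sex - 1, data = toy)
print(m)
```

The result is a plain numeric matrix, which is the input format expected by algorithms that do not accept factors.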

7 Full pipeline

Doing it all with one function is possible.

To do that, we will reload the ugly data set and perform the aggregation.

data("messy_adult")
agg_adult <- prepareSet(messy_adult, finalForm = "data.table", key = "country", analysisDate = Sys.Date(), digits = 2)

# "prepareSet: step one: correcting mistakes."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 1 column(s) that are bijections of another column in dataSet."
# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0s to unfactor 1 column(s)."
# "findAndTransformNumerics: It took me 0s to identify 3 numerics column(s), i will set them as numerics"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the column num1."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the column num2."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I am doing the column num3."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "findAndTransformNumerics: It took me 0.04s to transform 3 column(s) to a numeric format."
# "findAndTransformDates: It took me 0.33s to identify formats"
# "findAndTransformDates: It took me 0.07s to transform 4 columns to a Date format."
# "prepareSet: step two: transforming dataSet."
# "generateDateDiffs: I will generate difference between dates."
# "generateDateDiffs: It took me 0.01s to create 10 column(s)."
# "generateFactorFromDate: I will create a factor column from each date column."
# "generateFactorFromDate: It took me 0.05s to transform 4 column(s)."
# "generateFromCharacter: it took me: 0.01s to transform 1 character columns into, 3 new columns."
# "aggregateByKey: I start to aggregate"
# "aggregateByKey: 193 columns have been constructed. It took 0.27 seconds. "
# "prepareSet: step three: filtering dataSet."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 3 constant column(s) in result."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I delete 6 column(s) that are in double in result."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 46 column(s) that are bijections of another column in result."
# "prepareSet: step four: handling NA."
# "prepareSet: step five: shaping result."
# "setColAsFactor: I will set some columns to factor."
# "setColAsFactor: it took me: 0s to transform 0 column(s) to factor."
# "shapeSet: Transforming numerical variables into factors when length(unique(col)) <= 10."
# "setColAsFactor: nbr_lines has more than 10 values, i don't transform it."
# "setColAsFactor: max.age has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.? has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.Local-gov has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.Private has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.Self-emp-not-inc has more than 10 values, i don't transform it."
# "setColAsFactor: education.11th has more than 10 values, i don't transform it."
# "setColAsFactor: education.5th-6th has more than 10 values, i don't transform it."
# "setColAsFactor: education.7th-8th has more than 10 values, i don't transform it."
# "setColAsFactor: education.Bachelors has more than 10 values, i don't transform it."
# "setColAsFactor: education.HS-grad has more than 10 values, i don't transform it."
# "setColAsFactor: education.Masters has more than 10 values, i don't transform it."
# "setColAsFactor: education.Some-college has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Divorced has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Married-civ-spouse has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Married-spouse-absent has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Never-married has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Separated has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Adm-clerical has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Craft-repair has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Exec-managerial has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Handlers-cleaners has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Machine-op-inspct has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Other-service has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Prof-specialty has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Sales has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Transport-moving has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Husband has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Not-in-family has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Other-relative has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Own-child has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Unmarried has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Wife has more than 10 values, i don't transform it."
# "setColAsFactor: race.Asian-Pac-Islander has more than 10 values, i don't transform it."
# "setColAsFactor: race.Black has more than 10 values, i don't transform it."
# "setColAsFactor: race.Other has more than 10 values, i don't transform it."
# "setColAsFactor: race.White has more than 10 values, i don't transform it."
# "setColAsFactor: sex.Female has more than 10 values, i don't transform it."
# "setColAsFactor: sex.Male has more than 10 values, i don't transform it."
# "setColAsFactor: mean.capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: max.capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: mean.capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: max.capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: sd.capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: min.hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: max.hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: income.<=50K has more than 10 values, i don't transform it."
# "setColAsFactor: income.>50K has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.NA has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.NA has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date3.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: max.mail.num has more than 10 values, i don't transform it."
# "setColAsFactor: min.mail.order has more than 10 values, i don't transform it."
# "setColAsFactor: max.mail.order has more than 10 values, i don't transform it."
# "shapeSet: Previous distribution of column types:"
# col_class_init
#  factor numeric 
#       1     137 
# "shapeSet: Current distribution of column types:"
# col_class_end
#  factor numeric 
#      37     101

As one can see, all the previous steps have been performed.

Let’s have a look at the result:

# "138 columns have been built; for 42 countries."

country	nbr_lines	mean.num2	sd.num2	min.age	…
?	529	0	0	17	…
Cambodia	16	0.08	0.78	25	…
Canada	108	0	0	17	…
China	67	0	0	22	…
Columbia	53	0	0	21	…
Cuba	88	0	0	21	…

Tutorial

2020-02-12