If you created a dataset to build a classification model, you must cleanse the data first. After you create the dataset, you should do the following:
The alookr package makes these steps fast and easy:
To illustrate basic use of the alookr package, create the data_exam dataset with the sample() function. The data_exam dataset includes 5 variables.
The variables are as follows:
id: character
year: character
count: numeric
alpha: character
flag: character
# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
paste(c(sample(letters, 5), x), collapse = ""))
year <- "2018"
set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)
set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)
set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)
data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(data_exam)
'data.frame': 1000 obs. of 5 variables:
$ id : chr "htjuw1" "bnvmk2" "ylqnc3" "xgbhu4" ...
$ year : chr "2018" "2018" "2018" "2018" ...
$ count: int 3 8 5 9 10 1 6 9 6 5 ...
$ alpha: chr "h" "u" "k" "w" ...
$ flag : chr "N" "N" "N" "N" ...
# summary of dataset
summary(data_exam)
id year count alpha
Length:1000 Length:1000 Min. : 1.000 Length:1000
Class :character Class :character 1st Qu.: 3.000 Class :character
Mode :character Mode :character Median : 5.000 Mode :character
Mean : 5.474
3rd Qu.: 8.000
Max. :10.000
flag
Length:1000
Class :character
Mode :character
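cleanse() cleans up the dataset before fitting the classification model.
cleanse() performs the following operations:
remove variables whose unique value is one
remove variables with high unique rate
converts character variables to factor
remove variables that contain missing values (only when the missing argument is TRUE)
For example, we can cleanse all variables in data_exam: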
# cleansing dataset
newDat <- cleanse(data_exam)
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● id = 1000(1)
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● alpha
● flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 8 5 9 10 1 6 9 6 5 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 8 21 11 23 25 2 14 24 15 12 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
remove variables whose unique value is one: The year variable has only one value, “2018”. It carries no information for fitting the model, so it was removed.
remove variables with high unique rate: If a categorical variable has a very large number of levels, it is not suitable for a classification model. Such a variable is highly likely to be an identifier of the data. So cleanse() removes categorical (or character) variables with a high unique rate, defined as “number of levels / number of observations”. Here the id variable was removed because its unique rate is 1000 / 1000 = 1.
converts character variables to factor: The character variables alpha and flag are converted to factors.
For example, changing the threshold of the unique rate with the uniq_thres argument changes which categorical variables are removed:
# cleansing dataset
newDat <- cleanse(data_exam, uniq_thres = 0.03)
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● id = 1000(1)
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● alpha
● flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 8 5 9 10 1 6 9 6 5 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 8 21 11 23 25 2 14 24 15 12 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
The alpha variable was not removed, because its unique rate, 26 / 1000 = 0.026, is below the threshold of 0.03.
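The unique rate comparison can be reproduced by hand. The following is a minimal sketch in base R, not the code that alookr runs internally: unique_rate() is a hypothetical helper introduced here for illustration, and the strict ">" comparison against the threshold is an assumption.

```r
# Rebuild the sample data so that this sketch runs on its own
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))
year <- "2018"
set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)
set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)
set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)
data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)

# unique_rate() is a hypothetical helper, not part of alookr:
# number of distinct values divided by number of observations
unique_rate <- function(x) length(unique(x)) / length(x)

# Compute the rate for the character columns only
is_chr <- vapply(data_exam, is.character, logical(1))
rates <- vapply(data_exam[is_chr], unique_rate, numeric(1))
print(rates)

# Columns whose rate exceeds the chosen threshold (0.03, as in the call above);
# the strict ">" is an assumption about how the comparison is made
high_rate <- names(rates)[rates > 0.03]
print(high_rate)
```

Under these assumptions only id exceeds the threshold, which matches the variables that cleanse() reported as removed for a high unique rate.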
If you do not want to apply the unique value and unique rate checks, set the uniq argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, uniq = FALSE)
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● id
● year
● alpha
● flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 5 variables:
$ id : Factor w/ 1000 levels "abety794","abkoe306",..: 301 59 929 890 904 694 997 465 134 124 ...
$ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
$ count: int 3 8 5 9 10 1 6 9 6 5 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 8 21 11 23 25 2 14 24 15 12 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
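With the unique checks switched off, identifier-like and constant variables stay in the data as factors, so it can help to look at the level counts yourself. This is a minimal sketch in base R on a small stand-in data frame; the data frame df and its values are hypothetical and are not the vignette data.

```r
# Small stand-in data frame (hypothetical values, not the vignette data)
df <- data.frame(
  id   = factor(c("a1", "b2", "c3", "d4")),
  year = factor(rep("2018", 4)),
  flag = factor(c("N", "N", "Y", "N"))
)

# Number of levels in each factor column
n_levels <- vapply(df, nlevels, integer(1))
print(n_levels)

# Factors with a single level carry no information for a classifier
single_level <- names(n_levels)[n_levels == 1]
print(single_level)

# Factors with as many levels as rows behave like identifiers
id_like <- names(n_levels)[n_levels == nrow(df)]
print(id_like)
```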
If you do not want to force type conversion of character variables to factor, set the char argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, char = FALSE)
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● id = 1000(1)
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 8 5 9 10 1 6 9 6 5 ...
$ alpha: chr "h" "u" "k" "w" ...
$ flag : chr "N" "N" "N" "N" ...
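If the character variables are left as they are, the conversion can still be done by hand later. The following is a minimal sketch in base R of converting every character column to a factor; it uses a small stand-in data frame with hypothetical values and is not the code that cleanse() runs internally.

```r
# Small stand-in data frame (hypothetical values, not the vignette data)
df <- data.frame(
  count = c(3L, 8L, 5L),
  alpha = c("h", "u", "k"),
  flag  = c("N", "N", "Y"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor;
# columns of other types are left untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# structure after conversion
str(df)
```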
If you want to remove variables that contain missing values, set the missing argument to TRUE. The following example removes the flag variable, which contains a missing value.
data_exam$flag[1] <- NA
# cleansing dataset
newDat <- cleanse(data_exam, missing = TRUE)
─ Checking missing value ───────────────── included NA ─
remove variables whose included NA
● flag
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● id = 1000(1)
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● alpha
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 2 variables:
$ count: int 3 8 5 9 10 1 6 9 6 5 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 8 21 11 23 25 2 14 24 15 12 ...
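To see which variables contain missing values before deciding whether to drop them, the NA values can be counted per column. This is a minimal sketch in base R on a small stand-in data frame with hypothetical values; it illustrates the check and is not the code that cleanse() runs internally.

```r
# Small stand-in data frame (hypothetical values, not the vignette data)
df <- data.frame(
  count = c(3L, 8L, 5L),
  alpha = c("h", "u", "k"),
  flag  = c(NA, "N", "Y"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
na_count <- colSums(is.na(df))
print(na_count)

# Names of the columns that contain at least one NA
has_na <- names(na_count)[na_count > 0]
print(has_na)

# Keep only the columns without missing values
df_complete <- df[, na_count == 0, drop = FALSE]
str(df_complete)
```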