How to use TfIdfVectorizer in R?

Manish Saraswat

2020-04-27

In this tutorial, we’ll look at how to create a tf-idf feature matrix in R in two simple steps with superml. Superml gets its speed from parallel computation and optimised functions from the data.table R package. A tf-idf matrix can be used as features for a machine learning model. We can also use tf-idf features as an embedding to represent the given texts.

Install

You can install the latest CRAN version using (recommended):

install.packages("superml")

You can install the development version directly from GitHub using:

devtools::install_github("saraswatmks/superml")

Caveats on superml installation

For machine learning, superml builds on existing R packages. Hence, installing superml does not install all of its dependencies. However, when you train a model, superml will automatically install the required package if it’s not found. Still, if you want to install all dependencies at once, you can simply do:

install.packages("superml", dependencies=TRUE)
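If you would rather know up front which packages are already available, a minimal sketch in base R is shown below. The package list is only an assumption for illustration: xgboost is included because XGBTrainer is used later in this tutorial, and you can extend the vector with whatever backends you plan to use.

```r
# Packages to check; the list is an assumption for this tutorial, not
# superml's full dependency list.
pkgs <- c("superml", "xgboost")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed
```

Any package reported as FALSE can be installed with install.packages() before you start training.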

Sample Data

First, we’ll create some sample data. Feel free to run it alongside on your laptop and check the results.

library(superml)

# should be a vector of texts
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work')

# generate more sentences
n <- 10
sents <- rep(sents, n) 
length(sents)
#> [1] 50

As a sample, we’ve generated 50 documents. Let’s create the features now. For ease of use, superml follows an API layout similar to Python’s scikit-learn.

# initialise the class
tfv <- TfIdfVectorizer$new(max_features = 10, remove_stopwords = FALSE)

# generate the matrix
tf_mat <- tfv$fit_transform(sents)

head(tf_mat, 3)
#>      work      home       and go     going     where       you again       am
#> [1,]    0 0.7159943 0.3579971  0 0.3579971 0.0000000 0.0000000     0 0.480654
#> [2,]    0 0.0000000 0.0000000  0 0.4563106 0.4563106 0.4563106     0 0.000000
#> [3,]    1 0.0000000 0.0000000  0 0.0000000 0.0000000 0.0000000     0 0.000000
#>            are
#> [1,] 0.0000000
#> [2,] 0.6126516
#> [3,] 0.0000000

A few observations:

remove_stopwords defaults to TRUE. We set it to FALSE here since most of the words in our dummy sents are stopwords.
max_features = 10 selects the top 10 features (tokens) based on frequency.
The returned matrix is normalised by default, since norm = TRUE is the default.
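To see where these numbers come from, here is a minimal base R sketch that recomputes the weights of the second document by hand. It assumes a scikit-learn style smoothed idf, log((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. That formula is inferred from the output printed above, not taken from superml's source, and the punctuation in the second sentence is removed by hand here.

```r
# Same corpus as above, with the punctuation of sentence 2 stripped by hand
docs <- rep(c("i am going home and home",
              "where are you going",
              "how does it work",
              "transform your work and go work again",
              "home is where you go from to work"), 10)

tokens <- strsplit(docs, " ", fixed = TRUE)
n_docs <- length(docs)

# the four tokens of the second document
vocab <- c("going", "where", "you", "are")

# document frequency: number of documents containing each token
df <- vapply(vocab, function(w) {
  sum(vapply(tokens, function(tk) w %in% tk, logical(1)))
}, numeric(1))

# smoothed idf (assumed formula, see the note above)
idf <- log((1 + n_docs) / (1 + df)) + 1

# term frequency of each token in the second document
tf <- vapply(vocab, function(w) sum(tokens[[2]] == w), numeric(1))

# tf-idf weights, scaled so the row has unit L2 norm
w <- tf * idf
w <- w / sqrt(sum(w^2))
round(w, 7)
```

Under these assumptions the three tokens going, where and you get the same weight, and the rarer token are gets a larger one, which agrees with row 2 of the matrix above.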

Now, let’s generate the matrix using its ngram_range features.

# initialise the class
tfv <- TfIdfVectorizer$new(min_df = 0.4, remove_stopwords = FALSE, ngram_range = c(1, 3))

# generate the matrix
tf_mat <- tfv$fit_transform(sents)

head(tf_mat, 3)
#>      work      home       and go     going     where       you
#> [1,]    0 0.8164966 0.4082483  0 0.4082483 0.0000000 0.0000000
#> [2,]    0 0.0000000 0.0000000  0 0.5773503 0.5773503 0.5773503
#> [3,]    1 0.0000000 0.0000000  0 0.0000000 0.0000000 0.0000000

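A few observations:

ngram_range = c(1, 3) sets the lower and upper bounds, respectively, on the size of the resulting n-gram tokens.
min_df = 0.4 keeps only the tokens that occur in at least 40% of the documents. In this sample, no two- or three-word sequence reaches that threshold, so only single-word tokens remain.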
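The interaction between the two parameters can be sketched in base R. The helper make_ngrams() below is hypothetical and written only for illustration; it is not superml's implementation, and the sentences are cleaned of punctuation by hand.

```r
# Hypothetical helper: all n-grams of sizes n_min..n_max from a token vector
make_ngrams <- function(tokens, n_min = 1, n_max = 3) {
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# n-grams of a single sentence: 4 unigrams, 3 bigrams, 2 trigrams
grams <- make_ngrams(strsplit("how does it work", " ", fixed = TRUE)[[1]], 1, 3)
grams

# the five distinct sentences, punctuation removed by hand
sents <- c("i am going home and home",
           "where are you going",
           "how does it work",
           "transform your work and go work again",
           "home is where you go from to work")

# share of documents containing each n-gram (counted once per document)
doc_grams <- lapply(strsplit(sents, " ", fixed = TRUE),
                    function(tk) unique(make_ngrams(tk, 1, 3)))
share <- table(unlist(doc_grams)) / length(sents)

# apply the min_df = 0.4 filter
kept <- names(share)[share >= 0.4]
kept
```

The tokens that survive the filter are the same seven single words that appear as columns in the matrix above.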

Next, let’s use the tf-idf features to train a machine learning model.

library(data.table)
library(superml)

# use sents from above
sents <- c('i am going home and home',
           'where are you going.? //// ',
           'how does it work',
           'transform your work and go work again',
           'home is where you go from to work',
           'how does it work')

# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))

head(test, 3)
#>                                 text target
#> 1: home is where you go from to work      0
#> 2:                  how does it work      1
#> 3:                  how does it work      0

# initialise the class
tfv <- TfIdfVectorizer$new(min_df = 0.3, remove_stopwords = FALSE, ngram_range = c(1,3))

# we fit on train data
tfv$fit(train$text)

train_tf_features <- tfv$transform(train$text)
test_tf_features <- tfv$transform(test$text)

dim(train_tf_features)
#> [1]  6 15
dim(test_tf_features)
#> [1]  6 15

head(train_tf_features, 3)
#>           work      home       and      does   does it does it work go
#> [1,] 0.0000000 0.8164966 0.4082483 0.0000000 0.0000000    0.0000000  0
#> [2,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000    0.0000000  0
#> [3,] 0.2478085 0.0000000 0.0000000 0.3425257 0.3425257    0.3425257  0
#>          going       how  how does how does it        it   it work     where
#> [1,] 0.4082483 0.0000000 0.0000000   0.0000000 0.0000000 0.0000000 0.0000000
#> [2,] 0.5773503 0.0000000 0.0000000   0.0000000 0.0000000 0.0000000 0.5773503
#> [3,] 0.0000000 0.3425257 0.3425257   0.3425257 0.3425257 0.3425257 0.0000000
#>            you
#> [1,] 0.0000000
#> [2,] 0.5773503
#> [3,] 0.0000000

head(test_tf_features, 3)
#>           work      home and      does   does it does it work        go going
#> [1,] 0.3401651 0.4701829   0 0.0000000 0.0000000    0.0000000 0.4701829     0
#> [2,] 0.2478085 0.0000000   0 0.3425257 0.3425257    0.3425257 0.0000000     0
#> [3,] 0.2478085 0.0000000   0 0.3425257 0.3425257    0.3425257 0.0000000     0
#>            how  how does how does it        it   it work     where       you
#> [1,] 0.0000000 0.0000000   0.0000000 0.0000000 0.0000000 0.4701829 0.4701829
#> [2,] 0.3425257 0.3425257   0.3425257 0.3425257 0.3425257 0.0000000 0.0000000
#> [3,] 0.3425257 0.3425257   0.3425257 0.3425257 0.3425257 0.0000000 0.0000000
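Fitting on the training text and then transforming both sets is what keeps the train and test matrices on the same 15 columns. The idea can be sketched in base R with two hypothetical helpers, fit_vocab() and transform_texts(). They are illustrative only, use the same assumed smoothed idf and L2 normalisation as before, and are not superml's code.

```r
# Hypothetical helper: learn the vocabulary and idf weights from training text
fit_vocab <- function(texts) {
  tokens <- strsplit(texts, " ", fixed = TRUE)
  vocab <- sort(unique(unlist(tokens)))
  df <- vapply(vocab, function(w) {
    sum(vapply(tokens, function(tk) w %in% tk, logical(1)))
  }, numeric(1))
  list(vocab = vocab, idf = log((1 + length(texts)) / (1 + df)) + 1)
}

# Hypothetical helper: project any text onto the learned vocabulary
transform_texts <- function(model, texts) {
  tokens <- strsplit(texts, " ", fixed = TRUE)
  m <- t(vapply(tokens, function(tk) {
    tf <- vapply(model$vocab, function(w) sum(tk == w), numeric(1))
    w <- tf * model$idf
    nrm <- sqrt(sum(w^2))
    if (nrm > 0) w / nrm else w   # leave all-zero rows untouched
  }, numeric(length(model$vocab))))
  colnames(m) <- model$vocab
  m
}

model <- fit_vocab(c("how does it work", "home is where you go"))

# new text: the first row shares tokens with the training text, the second does not
new_mat <- transform_texts(model, c("how does home work", "completely unseen words"))
dim(new_mat)
new_mat[2, ]
```

Tokens that were never seen during fitting simply have no column, so a document made up entirely of new words becomes a row of zeros while the column layout stays fixed.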

# ensure the input to classifier is a data.table or data.frame object
x_train <- data.table(cbind(train_tf_features, target = train$target))
x_test <- data.table(test_tf_features)

xgb <- XGBTrainer$new(n_estimators = 10, objective = "binary:logistic")
xgb$fit(x_train, "target")
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.500000
#> Will train until train_error hasn't improved in 50 rounds.
#>
#> [10] train-error:0.500000

predictions <- xgb$predict(x_test)
predictions
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5

Summary

In this tutorial, we discussed how to use superml’s TfIdfVectorizer to create a tf-idf matrix and train a machine learning model on it.