1. Start here

Thomas W. Jones

2019-04-17

Why textmineR?

textmineR was created with three principles in mind:

  1. Maximum interoperability within R’s ecosystem
  2. Scalability in terms of object storage and computation time
  3. Syntax that is idiomatic to R

R has many packages for text mining and natural language processing (NLP). The CRAN task view on natural language processing lists 53 unique packages. Some of these packages are interoperable. Some are not.

textmineR strives for maximum interoperability in three ways. First, it uses the dgCMatrix class from the popular Matrix package for document term matrices (DTMs) and term co-occurrence matrices (TCMs). The Matrix package is an R “recommended” package with nearly 500 packages that depend on, import, or suggest it. Compare that to the slam package used by tm and its derivatives: slam has an order of magnitude fewer dependents and is simply not as well integrated. Matrix also has methods that make the syntax for manipulating its matrices nearly identical to base R. This greatly reduces the cognitive burden on programmers.
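
As a quick illustration, here is a minimal sketch of the kind of base-R-style operations that work directly on a dgCMatrix (the small matrix below is built by hand purely for demonstration):

library(Matrix)

# a small sparse matrix, stored as a dgCMatrix
m <- Matrix(c(1, 0, 0, 2, 0, 3), nrow = 2, sparse = TRUE)

dim(m)      # dimensions, just like a base matrix
m[1, ]      # subsetting with the usual bracket syntax
colSums(m)  # column sums operate on the sparse representation directly
t(m) %*% m  # matrix multiplication stays sparse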

Second, textmineR relies on base R objects for corpus and metadata storage. More precisely, it leaves that storage to the user: textmineR’s core functions CreateDtm and CreateTcm take a simple character vector as input. Users may store their corpora as character vectors, lists, or data frames. There is no need to learn a new ‘Corpus’ class.

Third and last, textmineR represents the output of topic models in a consistent way, a list containing two matrices. This is described in more detail in the next section. Several topic models are supported and the simple representation means that textmineR’s utility functions are usable with outputs from other packages, so long as they are represented as matrices of probabilities. (Again, see the next section for more detail.)
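
For a rough idea of what that looks like, here is a sketch (the details, including the meaning of phi and theta, are covered in the next vignette; the call below assumes a DTM like the one built later in this vignette):

# sketch: fit an LDA topic model and inspect the two probability matrices
model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200)

dim(model$phi)    # topics x tokens: P(token | topic)
dim(model$theta)  # documents x topics: P(topic | document)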

textmineR achieves scalability through three means. First, sparse matrices (like the dgCMatrix) offer significant memory savings. Second, textmineR uses Rcpp throughout for speed. Finally, textmineR uses parallel processing by default where possible. It offers a function, TmParallelApply, which implements a framework for parallel processing whose syntax is the same on Windows and Unix-like operating systems. TmParallelApply is used liberally within textmineR and is exposed for users.
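
A minimal sketch of TmParallelApply is below; only the X, FUN, and cpus arguments are shown (see help(TmParallelApply) for the full interface).

# sketch: apply a function over a list in parallel, with the same syntax
# on Windows and Unix-like systems
TmParallelApply(X = 1:4,
                FUN = function(x) x ^ 2,
                cpus = 2)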

textmineR does make some tradeoffs of performance for syntactic simplicity. textmineR is designed to run on a single node in a cluster computing environment. It can (and will by default) use all available cores of that node. If performance is your number one concern, see text2vec. textmineR uses some of text2vec under the hood.

textmineR strives for syntax that is idiomatic to R. This is, admittedly, a nebulous concept. textmineR does not create new classes where existing R classes exist. It strives for a functional programming paradigm. And it attempts to group closely-related sequential steps into single functions. This means that users will not have to make several temporary objects along the way. As an example, compare making a document term matrix in textmineR (example below) with tm or text2vec.

As a side note: textmineR’s framework for NLP does not need to be exclusive to textmineR. Text mining packages in R can be interoperable by following a few conventions. First, use dgCMatrix for DTMs and TCMs. Second, write text mining models so that they can take a dgCMatrix as input. Finally, keep non-base R classes to a minimum, especially for corpus and metadata management.

Corpus management

Creating a DTM

The basic object of analysis for most text mining applications is a document term matrix, or DTM. This is a matrix where every row represents a document and every column represents a token (word, bi-gram, stem, etc.).

You can create a DTM with textmineR by passing a character vector. There are options for stopword removal, creation of n-grams, and other standard data cleaning. There is an option for passing a stemming or lemmatization function if you desire. (See help(CreateDtm) for an example using Porter’s word stemmer.)
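
For instance, a sketch of passing Porter’s word stemmer (from the SnowballC package) through CreateDtm’s stem_lemma_function argument, patterned on the example in help(CreateDtm), might look like this:

# sketch: stem tokens while building a DTM
# stem_lemma_function takes a function over a character vector of tokens
docs <- c("the cats are running", "a cat ran across the road")

dtm_stemmed <- CreateDtm(doc_vec = docs,
                         doc_names = c("doc_1", "doc_2"),
                         stem_lemma_function = function(x) SnowballC::wordStem(x, language = "porter"))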

The code below uses a dataset of movie reviews included with the text2vec package. This dataset is used for sentiment analysis. In addition to the text of the reviews, there is a binary variable indicating positive or negative sentiment. More on this later…

library(textmineR)
#> Loading required package: Matrix
#> 
#> Attaching package: 'textmineR'
#> The following object is masked from 'package:Matrix':
#> 
#>     update
#> The following object is masked from 'package:stats':
#> 
#>     update

# load movie_review dataset from text2vec
data(movie_review, package = "text2vec")

str(movie_review)
#> 'data.frame':    5000 obs. of  3 variables:
#>  $ id       : chr  "5814_8" "2381_9" "7759_3" "3630_4" ...
#>  $ sentiment: int  1 1 0 0 1 1 0 0 0 1 ...
#>  $ review   : chr  "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd docu"| __truncated__ "\\\"The Classic War of the Worlds\\\" by Timothy Hines is a very entertaining film that obviously goes to great"| __truncated__ "The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A s"| __truncated__ "It must be assumed that those who praised this film (\\\"the greatest filmed opera ever,\\\" didn't I read some"| __truncated__ ...

# let's take a sample so the demo will run quickly
# note: textmineR is generally quite scalable, depending on your system
set.seed(123)
s <- sample(1:nrow(movie_review), 500)

movie_review <- movie_review[ s , ]

# create a document term matrix 
dtm <- CreateDtm(doc_vec = movie_review$review, # character vector of documents
                 doc_names = movie_review$id, # document names, optional
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # stopwords from the stopwords package
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # by default, this will be the max number of cpus available

Even though a dgCMatrix isn’t a traditional dense matrix, it has methods that let you work with it much like a standard R matrix.

head(colnames(dtm))
#> [1] "making_debut"       "bureau_lowly"       "scenes_thing"
#> [4] "injections"         "frying_pan"         "renounced_assassin"

head(rownames(dtm))
#> [1] "2595_9"  "8892_2"  "8620_8"  "2892_10" "232_1"   "4364_1"

Basic corpus statistics

The code below calculates some basic corpus statistics. textmineR has a built-in function, TermDocFreq, for getting term frequencies across the corpus. It gives term frequencies (equivalent to colSums(dtm)), the number of documents in which each term appears (equivalent to colSums(dtm > 0)), and an inverse document frequency (IDF) vector. The IDF vector can be used to create a TF-IDF matrix, as sketched below.
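
As a rough sketch of that last step, a TF-IDF matrix can be built by scaling each column of the DTM by its IDF weight (tf_mat here is the same object created in the next chunk):

# sketch: construct a TF-IDF matrix from the DTM and the idf vector
tf_mat <- TermDocFreq(dtm = dtm)

tfidf <- t(dtm[ , tf_mat$term ]) * tf_mat$idf  # scale each term's counts by its idf
tfidf <- t(tfidf)                              # back to documents-by-terms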


# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)

str(tf_mat) 
#> 'data.frame':    55459 obs. of  4 variables:
#>  $ term     : chr  "making_debut" "bureau_lowly" "scenes_thing" "injections" ...
#>  $ term_freq: num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ doc_freq : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ idf      : num  6.21 6.21 6.21 6.21 6.21 ...
# look at the most frequent tokens
head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)
#>        term term_freq doc_freq       idf
#> br       br      2148      312 0.4716049
#> br_br br_br      1078      312 0.4716049
#> movie movie       878      310 0.4780358
#> film   film       835      284 0.5656339
#> good   good       333      203 0.9014021
#> story story       277      167 1.0966143
#> time   time       271      180 1.0216512
#> bad     bad       199      118 1.4439235
#> great great       195      138 1.2873544
#> made   made       173      137 1.2946272
# look at the most frequent bigrams
tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]
head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)
#>                            term term_freq doc_freq       idf
#> br_br                     br_br      1078      312 0.4716049
#> br_film                 br_film        48       41 2.5010360
#> br_movie               br_movie        41       36 2.6310892
#> film_br                 film_br        32       26 2.9565116
#> movie_br               movie_br        29       27 2.9187712
#> special_effects special_effects        21       19 3.2701691
#> good_movie           good_movie        16       15 3.5065579
#> long_time             long_time        15       15 3.5065579
#> high_school         high_school        15       10 3.9120230
#> scooby_doo           scooby_doo        15        1 6.2146081

It looks like we have stray html tags (“<br>”) in the documents. These aren’t giving us any relevant information about content. (Except, perhaps, that these documents were originally part of web pages.)

The most intuitive approach, perhaps, is to strip these tags from the documents, re-construct the document term matrix, and re-calculate the objects above. However, a simpler approach is to remove the tokens containing “br” from the DTM we already have. This is much more computationally efficient and gives the same result.

# remove offending tokens from the DTM
dtm <- dtm[ , ! stringr::str_detect(colnames(dtm),
                                    "(^br$)|(_br$)|(^br_)") ]

# re-construct tf_mat and tf_bigrams
tf_mat <- TermDocFreq(dtm)

tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]
head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)
#>        term term_freq doc_freq       idf
#> movie movie       878      310 0.4780358
#> film   film       835      284 0.5656339
#> good   good       333      203 0.9014021
#> story story       277      167 1.0966143
#> time   time       271      180 1.0216512
#> bad     bad       199      118 1.4439235
#> great great       195      138 1.2873544
#> made   made       173      137 1.2946272
#> watch watch       158      127 1.3704210
#> films films       153       93 1.6820086
head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)
#>                            term term_freq doc_freq      idf
#> special_effects special_effects        21       19 3.270169
#> good_movie           good_movie        16       15 3.506558
#> long_time             long_time        15       15 3.506558
#> high_school         high_school        15       10 3.912023
#> scooby_doo           scooby_doo        15        1 6.214608
#> low_budget           low_budget        15       13 3.649659
#> watch_movie         watch_movie        14       13 3.649659
#> make_film             make_film        14       13 3.649659
#> years_ago             years_ago        14       13 3.649659
#> movie_good           movie_good        13       13 3.649659

We can also calculate how many tokens each document contains from the DTM. Note that this reflects the modifications we made in constructing the DTM (removing stop words, punctuation, numbers, etc.).

# summary of document lengths
doc_lengths <- rowSums(dtm)

summary(doc_lengths)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    23.0    96.0   140.5   186.1   245.0   768.0

Often, it’s useful to prune your vocabulary by removing tokens that appear in only a small number of documents. This greatly reduces the vocabulary size (see Zipf’s law) and improves computation time.

# remove any tokens that were in 3 or fewer documents
dtm <- dtm[ , colSums(dtm > 0) > 3 ] # alternatively: dtm[ , tf_mat$doc_freq > 3 ]

tf_mat <- tf_mat[ tf_mat$term %in% colnames(dtm) , ]

tf_bigrams <- tf_bigrams[ tf_bigrams$term %in% colnames(dtm) , ]

The movie review data set contains more than just the text of the reviews. It also contains a variable tagging each review as positive (movie_review$sentiment \(= 1\)) or negative (movie_review$sentiment \(= 0\)). We can examine the terms associated with positive and negative reviews. If we wanted, we could use them to build a simple classifier.

However, as we will see immediately below, looking at only the most frequent terms in each category is not helpful. Because of Zipf’s law, the most frequent terms in just about any category will be the same.

# what words are most associated with sentiment?
tf_sentiment <- list(positive = TermDocFreq(dtm[ movie_review$sentiment == 1 , ]),
                     negative = TermDocFreq(dtm[ movie_review$sentiment == 0 , ]))

These are basically the same. Not helpful at all.

head(tf_sentiment$positive[ order(tf_sentiment$positive$term_freq, decreasing = TRUE) , ], 10)
#>                term term_freq doc_freq       idf
#> movie         movie       358      128 0.5990082
#> film           film       349      125 0.6227247
#> story         story       143       82 1.0443192
#> good           good       138       83 1.0321978
#> time           time       125       82 1.0443192
#> great         great       119       79 1.0815906
#> watch         watch        82       59 1.3735010
#> love           love        71       49 1.5592182
#> life           life        69       49 1.5592182
#> character character        69       53 1.4807465
head(tf_sentiment$negative[ order(tf_sentiment$negative$term_freq, decreasing = TRUE) , ], 10)
#>          term term_freq doc_freq       idf
#> movie   movie       520      182 0.3832420
#> film     film       486      159 0.5183445
#> good     good       195      120 0.7997569
#> bad       bad       164       90 1.0874390
#> time     time       146       98 1.0022812
#> story   story       134       85 1.1445974
#> made     made       111       83 1.1684081
#> people people       104       68 1.3677410
#> acting acting       102       79 1.2178008
#> make     make        89       70 1.3387534

That was unhelpful. Instead, we need to re-weight the terms in each class. We’ll use a probabilistic reweighting, described below.

The most frequent words in each class are proportional to \(P(word|sentiment_j)\). As we saw above, that puts the words in nearly the same order as \(P(word)\) overall. However, we can use the difference between those probabilities to get a new ordering. That difference is

\[\begin{align} P(word|sentiment_j) - P(word) \end{align}\]

You can interpret the difference in (1) as follows: words with positive values are more probable in that sentiment class than in the corpus overall; words with negative values are less probable; and words with values near zero are (approximately) statistically independent of sentiment. Since most of the top words sorted by \(P(word|sentiment_j)\) are the same frequent words we see overall, they are close to statistically independent of sentiment and get pushed towards zero.

For those paying close attention, this difference should give a similar ordering to pointwise mutual information (PMI), defined as \(PMI = \log\frac{P(word|sentiment_j)}{P(word)}\). However, I prefer the difference, as it is bounded between \(-1\) and \(1\).
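
For the curious, here is a sketch of a PMI-style reweighting for comparison; it reuses tf_sentiment from above and defines p_words the same way the next chunk does.

# sketch: PMI-style reweighting of positive-sentiment terms, for comparison
p_words <- colSums(dtm) / sum(dtm)  # P(word) over the whole corpus

p_word_pos <- tf_sentiment$positive$term_freq / sum(tf_sentiment$positive$term_freq)  # P(word | positive)

pmi_positive <- log(p_word_pos / p_words[ tf_sentiment$positive$term ])  # unbounded, unlike the difference

head(tf_sentiment$positive$term[ order(pmi_positive, decreasing = TRUE) ], 10)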

The difference method is applied to both words overall and bi-grams in the code below.


# let's reweight by probability by class
p_words <- colSums(dtm) / sum(dtm) # alternatively: tf_mat$term_freq / sum(tf_mat$term_freq)

tf_sentiment$positive$conditional_prob <- 
  tf_sentiment$positive$term_freq / sum(tf_sentiment$positive$term_freq)

tf_sentiment$positive$prob_lift <- tf_sentiment$positive$conditional_prob - p_words

tf_sentiment$negative$conditional_prob <- 
  tf_sentiment$negative$term_freq / sum(tf_sentiment$negative$term_freq)

tf_sentiment$negative$prob_lift <- tf_sentiment$negative$conditional_prob - p_words
# let's look again with new weights
head(tf_sentiment$positive[ order(tf_sentiment$positive$prob_lift, decreasing = TRUE) , ], 10)
#>                term term_freq doc_freq      idf conditional_prob prob_lift
#> great         great       119       79 1.081591        0.0081168 0.0022971
#> heart         heart        42       17 2.617825        0.0028647 0.0015217
#> story         story       143       82 1.044319        0.0097538 0.0014868
#> life           life        69       49 1.559218        0.0047064 0.0012444
#> excellent excellent        38       33 1.954531        0.0025919 0.0012191
#> beautiful beautiful        39       28 2.118834        0.0026601 0.0011977
#> find           find        51       41 1.737466        0.0034786 0.0009418
#> world         world        49       38 1.813452        0.0033422 0.0008950
#> watch         watch        82       59 1.373501        0.0055931 0.0008776
#> years         years        60       43 1.689838        0.0040925 0.0008693
head(tf_sentiment$negative[ order(tf_sentiment$negative$prob_lift, decreasing = TRUE) , ], 10)
#>          term term_freq doc_freq       idf conditional_prob prob_lift
#> bad       bad       164       90 1.0874390        0.0087021 0.0027631
#> movie   movie       520      182 0.3832420        0.0275921 0.0013886
#> people people       104       68 1.3677410        0.0055184 0.0011313
#> worst   worst        55       48 1.7160476        0.0029184 0.0011277
#> script script        62       50 1.6752257        0.0032898 0.0009023
#> acting acting       102       79 1.2178008        0.0054123 0.0008759
#> film     film       486      159 0.5183445        0.0257880 0.0008678
#> guy       guy        55       33 2.0907411        0.0029184 0.0008591
#> thing   thing        66       56 1.5618970        0.0035021 0.0007564
#> awful   awful        35       24 2.4091948        0.0018572 0.0007529
# what about bi-grams?
tf_sentiment_bigram <- lapply(tf_sentiment, function(x){
  x <- x[ stringr::str_detect(x$term, "_") , ]
  x[ order(x$prob_lift, decreasing = TRUE) , ]
})
head(tf_sentiment_bigram$positive, 10)
#>                              term term_freq doc_freq      idf conditional_prob prob_lift
#> highly_recommend highly_recommend        11       11 3.053143        0.0007503 0.0003922
#> big_screen             big_screen         8        5 3.841601        0.0005457 0.0002771
#> real_life               real_life         9        8 3.371597        0.0006139 0.0002557
#> world_war               world_war         8        5 3.841601        0.0005457 0.0002174
#> watched_movie       watched_movie         7        7 3.505128        0.0004775 0.0002089
#> enjoy_watching     enjoy_watching         6        6 3.659279        0.0004092 0.0002003
#> years_ago               years_ago         9        8 3.371597        0.0006139 0.0001961
#> makes_movie           makes_movie         5        5 3.841601        0.0003410 0.0001918
#> loved_movie           loved_movie         5        5 3.841601        0.0003410 0.0001918
#> movie_worth           movie_worth         6        6 3.659279        0.0004092 0.0001705
head(tf_sentiment_bigram$negative, 10)
#>                    term term_freq doc_freq      idf conditional_prob prob_lift
#> good_thing   good_thing        11       11 3.189353        0.0005837 0.0002554
#> waste_time   waste_time        12       11 3.189353        0.0006367 0.0002488
#> acting_bad   acting_bad        10        9 3.390024        0.0005306 0.0002322
#> bad_acting   bad_acting         9        9 3.390024        0.0004776 0.0002090
#> worst_movie worst_movie        10       10 3.284664        0.0005306 0.0002023
#> read_book     read_book         8        6 3.795489        0.0004245 0.0001857
#> comic_book   comic_book         8        6 3.795489        0.0004245 0.0001857
#> great_idea   great_idea         7        7 3.641338        0.0003714 0.0001625
#> bad_guys       bad_guys         8        6 3.795489        0.0004245 0.0001559
#> make_sense   make_sense         8        6 3.795489        0.0004245 0.0001559