step_ngram()
gained an argument min_num_tokens
to be able to return multiple n-grams together. (#90)step_text_normalization()
to perform unicode normalization on character vectors. (#86)step_word_embeddings()
got a argument aggregation_default
to specify value in cases where no words matches embedding.step_tokenize()
got an engine
argument to specify packages other then tokenizers to tokenize.spacyr
have been added as an engine to step_tokenize()
.step_lemma()
has been added to extract lemma attribute from tokenlists.step_pos_filter()
has been added to allow filtering of tokens bases on their pat of speech tags.step_ngram()
has been added to generate ngrams from tokenlists.step_stem()
not correctly uses the options argument. (Thanks to @grayskripko for finding bug, #64)step_word2vec()
have been changed to step_lda()
to reflect what is actually happening.step_word_embeddings()
has been added. Allows for use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional “meaning” space. (@jonthegeek, #20)step_tfidf()
calculations are slightly changed due to flaw in original implementation https://github.com/dselivanov/text2vec/issues/280.step_textfeatures()
have been added, allows for multiple numerical features to be pulled from text.step_sequence_onehot()
have been added, allows for one hot encoding of sequences of fixed width.step_word2vec()
have been added, calculates word2vec dimensions.step_tokenmerge()
have been added, combines multiple list columns into one list-columns.step_texthash()
now correctly accepts signed
argument.step_tf()
and step_tfidf()
.First CRAN version