Read text files with readtext()

1. Introduction

This vignette walks you through importing a variety of text files into R using the readtext package. Currently, readtext supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF and Microsoft Word formatted files (.pdf, .doc, .docx).

readtext also handles multiple files and file types, for instance through a “glob” expression, and can read files from a URL or from an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually, you do not have to specify the format of the files explicitly: readtext infers it from the file extension.
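
To get a feel for what a glob expression selects before handing it to readtext, the pattern can be tried with base R's Sys.glob(). This is only a sketch: the scratch folder and the file names are made up for illustration, and readtext's own matching may differ in details.

```r
# Create a scratch folder with three empty files of mixed types
# (hypothetical names, used only for this illustration)
scratch <- tempfile("glob_demo_")
dir.create(scratch)
invisible(file.create(file.path(scratch, c("a.txt", "b.txt", "c.csv"))))

# "*.txt" selects only the two plain text files
txt_matches <- basename(Sys.glob(file.path(scratch, "*.txt")))
txt_matches

# "*" selects every file in the folder
all_matches <- basename(Sys.glob(file.path(scratch, "*")))
all_matches
```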

The readtext package comes with a data directory called extdata that contains examples of all file types listed above. This vignette uses that directory throughout.

# Load the package
library(readtext)

# Get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

The extdata directory contains several subfolders that include different text files. In the following examples, we load one or more files stored in each of these folders. The paste0() function concatenates the path of the extdata folder with the name of each subfolder. When reading in your own text files, you will need to point to your own data directory (see ?setwd).

2. Reading one or more text files

2.1 Plain text files (.txt)

The folder “txt” contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.

# Read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
## readtext object consisting of 13 documents and 0 docvars.
## # Description: df[,2] [13 × 2]
##   doc_id            text                         
##   <chr>             <chr>                        
## 1 UDHR_chinese.txt  "\"世界人权宣言\n联合国\"..."
## 2 UDHR_czech.txt    "\"VŠEOBECNÁ \"..."          
## 3 UDHR_danish.txt   "\"Den 10. de\"..."          
## 4 UDHR_english.txt  "\"Universal \"..."          
## 5 UDHR_french.txt   "\"Déclaratio\"..."          
## 6 UDHR_georgian.txt "\"FLFVBFYBC \"..."          
## # … with 7 more rows

We can specify document-level metadata (docvars) based on the file names or on a separate data.frame. Below we take the docvars from the filenames (docvarsfrom = "filenames") and name each variable (docvarnames = c("unit", "context", "year", "language", "party")). The argument dvsep = "_" sets the separator (a regular expression character string) that delimits the docvar elements in the filenames.

# Manifestos with docvars from filenames
readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
         docvarsfrom = "filenames", 
         docvarnames = c("unit", "context", "year", "language", "party"),
         dvsep = "_", 
         encoding = "ISO-8859-1")
## readtext object consisting of 17 documents and 5 docvars.
## # Description: df[,7] [17 × 7]
##   doc_id                  text                unit  context  year language party
##   <chr>                   <chr>               <chr> <chr>   <int> <chr>    <chr>
## 1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..." EU    euro     2004 de       PSE  
## 2 EU_euro_2004_de_V.txt   "\"Gemeinsame\"..." EU    euro     2004 de       V    
## 3 EU_euro_2004_en_PSE.txt "\"PES · PSE \"..." EU    euro     2004 en       PSE  
## 4 EU_euro_2004_en_V.txt   "\"Manifesto\n\"..… EU    euro     2004 en       V    
## 5 EU_euro_2004_es_PSE.txt "\"PES · PSE \"..." EU    euro     2004 es       PSE  
## 6 EU_euro_2004_es_V.txt   "\"Manifesto\n\"..… EU    euro     2004 es       V    
## # … with 11 more rows
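
The way a file name maps onto docvars can be illustrated in base R. The helper below, filenames_to_docvars(), is hypothetical and not part of readtext; it is a minimal sketch that assumes every file name contains exactly one piece per docvar name. Unlike the readtext output above, it leaves every value as a character string (so year is not converted to an integer).

```r
# Hypothetical helper (not readtext's implementation): split file names
# into one column per docvar name
filenames_to_docvars <- function(filenames, docvarnames, dvsep = "_") {
  # Drop directory and file extension, then split on the separator
  stems <- tools::file_path_sans_ext(basename(filenames))
  parts <- strsplit(stems, dvsep)
  # Each file name must yield exactly one piece per docvar name
  stopifnot(all(lengths(parts) == length(docvarnames)))
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(out) <- docvarnames
  out
}

docvars_toy <- filenames_to_docvars(
  c("EU_euro_2004_de_PSE.txt", "EU_euro_2004_en_V.txt"),
  docvarnames = c("unit", "context", "year", "language", "party")
)
docvars_toy
```

If a file name has more or fewer pieces than there are names, the sketch stops with an error rather than guessing which piece belongs where.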

readtext can also recurse through subdirectories. In our example, the folder txt/movie_reviews contains two subfolders (called neg and pos). We can load all texts included in both folders.

# Recurse through subdirectories
readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))
## readtext object consisting of 10 documents and 0 docvars.
## # Description: df[,2] [10 × 2]
##   doc_id              text                
##   <chr>               <chr>               
## 1 neg_cv000_29416.txt "\"plot : two\"..." 
## 2 neg_cv001_19502.txt "\"the happy \"..." 
## 3 neg_cv002_17424.txt "\"it is movi\"..." 
## 4 neg_cv003_12683.txt "\" \" quest f\"..."
## 5 neg_cv004_12641.txt "\"synopsis :\"..." 
## 6 pos_cv000_29590.txt "\"films adap\"..." 
## # … with 4 more rows
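
Before reading a nested folder, it can help to preview which files sit below it. The sketch below uses only base R and a made-up folder layout (a temporary directory with neg and pos subfolders holding toy files); it is not how readtext walks directories internally.

```r
# Build a toy folder tree with two subfolders (illustrative names only)
root <- tempfile("reviews_")
dir.create(file.path(root, "neg"), recursive = TRUE)
dir.create(file.path(root, "pos"))
writeLines("toy negative review", file.path(root, "neg", "cv000.txt"))
writeLines("toy positive review", file.path(root, "pos", "cv000.txt"))

# List every .txt file below root, including those in subfolders
found <- list.files(root, pattern = "\\.txt$", recursive = TRUE)
found

# The subfolder name can serve as a label for each file
labels <- dirname(found)
labels
```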

2.2 Comma- or tab-separated values (.csv, .tab, .tsv)

readtext can read comma-separated values (.csv files) that contain textual data. We pass the texts column of our .csv file as the text_field: this is the column that contains the actual text. The other columns of the original .csv file (Year, President, FirstName) are treated as document-level variables by default.

# Read in comma-separated values
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
## readtext object consisting of 5 documents and 3 docvars.
## # Description: df[,5] [5 × 5]
##   doc_id            text                 Year President  FirstName
##   <chr>             <chr>               <int> <chr>      <chr>    
## 1 inaugCorpus.csv.1 "\"Fellow-Cit\"..."  1789 Washington George   
## 2 inaugCorpus.csv.2 "\"Fellow cit\"..."  1793 Washington George   
## 3 inaugCorpus.csv.3 "\"When it wa\"..."  1797 Adams      John     
## 4 inaugCorpus.csv.4 "\"Friends an\"..."  1801 Jefferson  Thomas   
## 5 inaugCorpus.csv.5 "\"Proceeding\"..."  1805 Jefferson  Thomas
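
The split between the text column and the docvar columns can be tried on a small file of your own. The sketch below writes a toy .csv (made-up content, hypothetical object names) to a temporary file and uses base R to show which columns remain once the text_field is set aside. The final step calls readtext on the same file and is skipped when the package is not installed.

```r
# Write a tiny CSV with one text column and two metadata columns
tmp_csv <- tempfile(fileext = ".csv")
toy <- data.frame(
  texts = c("First toy document.", "Second toy document."),
  Year = c(1789L, 1793L),
  President = c("Washington", "Washington"),
  stringsAsFactors = FALSE
)
write.csv(toy, tmp_csv, row.names = FALSE)

# Everything except the text_field column would become a docvar
text_field <- "texts"
csv_cols <- names(read.csv(tmp_csv, stringsAsFactors = FALSE))
docvar_cols <- setdiff(csv_cols, text_field)
docvar_cols

# With readtext available, read the same file as documents plus docvars
if (requireNamespace("readtext", quietly = TRUE)) {
  rt_toy <- readtext::readtext(tmp_csv, text_field = text_field)
  print(names(rt_toy))
}
```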

The same procedure applies to tab-separated values.

# Read in tab-separated values
readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech")
## readtext object consisting of 33 documents and 9 docvars.
## # Description: df[,11] [33 × 11]
##   doc_id text  speechID memberID partyID constID title date  member_name
##   <chr>  <chr>    <int>    <int>   <int>   <int> <chr> <chr> <chr>      
## 1 dails… "\"M…        1      977      22     158 1. C… 1919… Count Geor…
## 2 dails… "\"I…        2     1603      22     103 1. C… 1919… Mr. Pádrai…
## 3 dails… "\"'…        3      116      22     178 1. C… 1919… Mr. Cathal…
## 4 dails… "\"T…        4      116      22     178 2. C… 1919… Mr. Cathal…
## 5 dails… "\"L…        5      116      22     178 3. A… 1919… Mr. Cathal…
## 6 dails… "\"-…        6      116      22     178 3. A… 1919… Mr. Cathal…
## # … with 27 more rows, and 2 more variables: party_name <chr>, const_name <chr>

2.3 JSON data (.json)

You can also read .json data. Again you need to specify the text_field.

# Read in JSON data
readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts")
## readtext object consisting of 3 documents and 3 docvars.
## # Description: df[,5] [3 × 5]
##   doc_id                  text                 Year President  FirstName
##   <chr>                   <chr>               <int> <chr>      <chr>    
## 1 inaugural_sample.json.1 "\"Fellow-Cit\"..."  1789 Washington George   
## 2 inaugural_sample.json.2 "\"Fellow cit\"..."  1793 Washington George   
## 3 inaugural_sample.json.3 "\"When it wa\"..."  1797 Adams      John

2.4 PDF files

readtext can also read in and convert .pdf files.

In the example below, we load all .pdf files stored in the UDHR folder and specify that the docvars are taken from the filenames. We call the document-level variables document and language, and set the delimiter (dvsep).

# Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"), 
                    docvarsfrom = "filenames", 
                    docvarnames = c("document", "language"),
                    dvsep = "_"))
## readtext object consisting of 11 documents and 2 docvars.
## # Description: df[,4] [11 × 4]
##   doc_id           text                          document language
##   <chr>            <chr>                         <chr>    <chr>   
## 1 UDHR_chinese.pdf "\"世界人权宣言\n联合国\"..." UDHR     chinese 
## 2 UDHR_czech.pdf   "\"VŠEOBECNÁ \"..."           UDHR     czech   
## 3 UDHR_danish.pdf  "\"Den 10. de\"..."           UDHR     danish  
## 4 UDHR_english.pdf "\"Universal \"..."           UDHR     english 
## 5 UDHR_french.pdf  "\"Déclaratio\"..."           UDHR     french  
## 6 UDHR_greek.pdf   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR     greek   
## # … with 5 more rows

2.5 Microsoft Word files (.doc, .docx)

Microsoft Word formatted files are converted using the antiword package for older .doc files, and using XML for newer .docx files.

# Read in Word data (.docx)
readtext(paste0(DATA_DIR, "/word/*.docx"))
## readtext object consisting of 2 documents and 0 docvars.
## # Description: df[,2] [2 × 2]
##   doc_id                      text               
##   <chr>                       <chr>              
## 1 UK_2015_EccentricParty.docx "\"The Eccent\"..."
## 2 UK_2015_LoonyParty.docx     "\"The Offici\"..."

2.6 Text from URLs

You can also read in text directly from a URL.

# Note: Example required: which URL should we use?

2.7 Text from archive files (.zip, .tar, .tar.gz, .tar.bz)

Finally, it is possible to include text from archives.

# Note: Archive file required. The only zip archive included in readtext has 
# different encodings and is difficult to import (see section 4.2).
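
Until a suitable archive ships with the package, the idea can be sketched with a small archive built on the fly. The example below is an assumption-laden illustration: the file names and contents are made up, the archive is created with base R's tar() using its internal implementation, and the readtext call at the end is skipped when the package is not installed.

```r
# Create two toy text files in a scratch folder (illustrative content)
src <- tempfile("archive_src_")
dir.create(src)
writeLines("first toy document", file.path(src, "doc1.txt"))
writeLines("second toy document", file.path(src, "doc2.txt"))

# Bundle them into a .tar archive; paths are stored relative to src
tar_path <- tempfile(fileext = ".tar")
old_wd <- setwd(src)
tar(tar_path, files = c("doc1.txt", "doc2.txt"), tar = "internal")
setwd(old_wd)

# Inspect the archive contents without extracting them
archived <- basename(untar(tar_path, list = TRUE))
archived

# With readtext available, pass the archive path like any other file
if (requireNamespace("readtext", quietly = TRUE)) {
  rt_tar <- readtext::readtext(tar_path)
  print(rt_tar)
}
```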

# load quanteda, which provides corpus()
library(quanteda)

# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")

# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
## Corpus consisting of 5 documents, showing 5 documents:
## 
##               Text Types Tokens Sentences Year  President FirstName
##  inaugCorpus.csv.1   625   1539        23 1789 Washington    George
##  inaugCorpus.csv.2    96    147         4 1793 Washington    George
##  inaugCorpus.csv.3   826   2577        37 1797      Adams      John
##  inaugCorpus.csv.4   717   1923        41 1801  Jefferson    Thomas
##  inaugCorpus.csv.5   804   2380        45 1805  Jefferson    Thomas

# load stringi, which provides the stri_* functions
library(stringi)

# Make some text with page numbers
sample_text_a <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, 
page 1 
with the newspaper from a boy named quick Seamus, in his mouth.
page 2
The quicker brown fox jumped over 2 lazy dogs."

sample_text_a
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, \npage 1 \nwith the newspaper from a boy named quick Seamus, in his mouth.\npage 2\nThe quicker brown fox jumped over 2 lazy dogs."

# Remove "page" and respective digit
sample_text_a2 <- unlist(stri_split_fixed(sample_text_a, '\n'), use.names = FALSE)
sample_text_a2 <- stri_replace_all_regex(sample_text_a2, "page \\d*", "")
sample_text_a2 <- stri_trim_both(sample_text_a2)
sample_text_a2 <- sample_text_a2[sample_text_a2 != '']
stri_paste(sample_text_a2, collapse = '\n')
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,\nwith the newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."

sample_text_b <- "The quick brown fox named Seamus 
- 1 - 
jumps over the lazy dog also named Seamus, with 
- 2 - 
the newspaper from a boy named quick Seamus, in his mouth. 
- 33 - 
The quicker brown fox jumped over 2 lazy dogs."

sample_text_b
## [1] "The quick brown fox named Seamus \n- 1 - \njumps over the lazy dog also named Seamus, with \n- 2 - \nthe newspaper from a boy named quick Seamus, in his mouth. \n- 33 - \nThe quicker brown fox jumped over 2 lazy dogs."

sample_text_b2 <- unlist(stri_split_fixed(sample_text_b, '\n'), use.names = FALSE)
sample_text_b2 <- stri_replace_all_regex(sample_text_b2, "[-] \\d* [-]", "")
sample_text_b2 <- stri_trim_both(sample_text_b2)
sample_text_b2 <- sample_text_b2[sample_text_b2 != '']
stri_paste(sample_text_b2, collapse = '\n')
## [1] "The quick brown fox named Seamus\njumps over the lazy dog also named Seamus, with\nthe newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."
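
The two clean-up passes above follow the same steps: split on line breaks, delete the page marker, trim, drop empty lines, and paste the rest back together. As a sketch, those steps can be wrapped in one function. The name strip_page_markers() is hypothetical, and the version below swaps the stringi calls for base R equivalents so that it runs without extra packages; the toy inputs are shortened stand-ins for the sample texts.

```r
# Hypothetical helper: remove page markers that match a regular expression
strip_page_markers <- function(text, pattern) {
  # Split the text into lines
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Delete the marker and trim leftover whitespace
  lines <- trimws(gsub(pattern, "", lines))
  # Drop lines that are now empty and rejoin the rest
  paste(lines[lines != ""], collapse = "\n")
}

cleaned_a <- strip_page_markers(
  "first line,\npage 1 \nsecond line.\npage 2\nthird line.",
  "page [0-9]+"
)
cleaned_b <- strip_page_markers(
  "first line \n- 1 - \nsecond line \n- 33 - \nthird line.",
  "- [0-9]+ -"
)
cleaned_a
cleaned_b
```

Requiring at least one digit ([0-9]+ rather than \\d*) keeps the pattern from also deleting the bare word "page " when no number follows it.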

txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"), 
                 encoding = fileencodings,
                 docvarsfrom = "filenames", 
                 docvarnames = c("document", "language", "input_encoding"))
print(txts, n = 50)
## readtext object consisting of 36 documents and 3 docvars.
## # Description: df[,5] [36 × 5]
##    doc_id                 text                          document   language input_encoding
##    <chr>                  <chr>                         <chr>      <chr>    <chr>         
##  1 IndianTreaty_English_… "\"WHEREAS, t\"..."           IndianTre… English  UTF-16LE      
##  2 IndianTreaty_English_… "\"ARTICLE 1.\"..."           IndianTre… English  UTF-8-BOM     
##  3 UDHR_Arabic_ISO-8859-… "\"الديباجة\nل\"..…           UDHR       Arabic   ISO-8859-6    
##  4 UDHR_Arabic_UTF-8.txt  "\"الديباجة\nل\"..…           UDHR       Arabic   UTF-8         
##  5 UDHR_Arabic_WINDOWS-1… "\"الديباجة\nل\"..…           UDHR       Arabic   WINDOWS-1256  
##  6 UDHR_Chinese_GB2312.t… "\"世界人权宣言\n联合国\"..…  UDHR       Chinese  GB2312        
##  7 UDHR_Chinese_GBK.txt   "\"世界人权宣言\n联合国\"..…  UDHR       Chinese  GBK           
##  8 UDHR_Chinese_UTF-8.txt "\"世界人权宣言\n联合国\"..…  UDHR       Chinese  UTF-8         
##  9 UDHR_English_UTF-16BE… "\"Universal \"..."           UDHR       English  UTF-16BE      
## 10 UDHR_English_UTF-16LE… "\"Universal \"..."           UDHR       English  UTF-16LE      
## 11 UDHR_English_UTF-8.txt "\"Universal \"..."           UDHR       English  UTF-8         
## 12 UDHR_English_WINDOWS-… "\"Universal \"..."           UDHR       English  WINDOWS-1252  
## 13 UDHR_French_ISO-8859-… "\"Déclaratio\"..."           UDHR       French   ISO-8859-1    
## 14 UDHR_French_UTF-8.txt  "\"Déclaratio\"..."           UDHR       French   UTF-8         
## 15 UDHR_French_WINDOWS-1… "\"Déclaratio\"..."           UDHR       French   WINDOWS-1252  
## 16 UDHR_German_ISO-8859-… "\"Die Allgem\"..."           UDHR       German   ISO-8859-1    
## 17 UDHR_German_UTF-8.txt  "\"Die Allgem\"..."           UDHR       German   UTF-8         
## 18 UDHR_German_WINDOWS-1… "\"Die Allgem\"..."           UDHR       German   WINDOWS-1252  
## 19 UDHR_Greek_CP1253.txt  "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR       Greek    CP1253        
## 20 UDHR_Greek_ISO-8859-7… "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR       Greek    ISO-8859-7    
## 21 UDHR_Greek_UTF-8.txt   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR       Greek    UTF-8         
## 22 UDHR_Hindi_UTF-8.txt   "\"मानव अधिका\"..."           UDHR       Hindi    UTF-8         
## 23 UDHR_Icelandic_ISO-88… "\"Mannréttin\"..."           UDHR       Iceland… ISO-8859-1    
## 24 UDHR_Icelandic_UTF-8.… "\"Mannréttin\"..."           UDHR       Iceland… UTF-8         
## 25 UDHR_Icelandic_WINDOW… "\"Mannréttin\"..."           UDHR       Iceland… WINDOWS-1252  
## 26 UDHR_Japanese_CP932.t… "\"『世界人権宣言』\n \"..…   UDHR       Japanese CP932         
## 27 UDHR_Japanese_ISO-202… "\"『世界人権宣言』\n \"..…   UDHR       Japanese ISO-2022-JP   
## 28 UDHR_Japanese_UTF-8.t… "\"『世界人権宣言』\n \"..…   UDHR       Japanese UTF-8         
## 29 UDHR_Japanese_WINDOWS… "\"『世界人権宣言』\n \"..…   UDHR       Japanese WINDOWS-936   
## 30 UDHR_Korean_ISO-2022-… "\"세 계 인 권 선 \"...…      UDHR       Korean   ISO-2022-KR   
## 31 UDHR_Korean_UTF-8.txt  "\"세 계 인 권 선 \"...…      UDHR       Korean   UTF-8         
## 32 UDHR_Russian_ISO-8859… "\"Всеобщая д\"..."           UDHR       Russian  ISO-8859-5    
## 33 UDHR_Russian_KOI8-R.t… "\"Всеобщая д\"..."           UDHR       Russian  KOI8-R        
## 34 UDHR_Russian_UTF-8.txt "\"Всеобщая д\"..."           UDHR       Russian  UTF-8         
## 35 UDHR_Russian_WINDOWS-… "\"Всеобщая д\"..."           UDHR       Russian  WINDOWS-1251  
## 36 UDHR_Thai_UTF-8.txt    "\"ปฏิญญาสากล\"..."           UDHR       Thai     UTF-8         
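
The call above passes encoding = fileencodings without showing how that vector is built. Since the file names follow the pattern document_language_encoding.txt, one plausible way to derive the encodings is to split each file name on the underscore and keep the third piece. The sketch below assumes exactly that, uses three of the file names shown in the output, and stores the result under the hypothetical name toy_encodings so it does not overwrite fileencodings.

```r
# Three file names following <document>_<language>_<encoding>.txt
toy_filenames <- c("IndianTreaty_English_UTF-16LE.txt",
                   "UDHR_Arabic_UTF-8.txt",
                   "UDHR_Chinese_GBK.txt")

# Strip the extension, split on "_", and keep the third element
stems <- gsub("\\.txt$", "", toy_filenames)
parts <- strsplit(stems, "_", fixed = TRUE)
toy_encodings <- vapply(parts, `[`, character(1), 3)
toy_encodings

# Flag encodings that iconv does not know on this system
# (the result is platform dependent)
not_available <- toy_encodings[!(toy_encodings %in% iconvlist())]
not_available
```

Checking against iconvlist() first is a cheap way to spot encodings that the local system cannot convert before attempting the import.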

corpus_txts <- corpus(txts)
summary(corpus_txts, 5)
## Corpus consisting of 36 documents, showing 5 documents:
## 
##                                Text Types Tokens Sentences     document
##   IndianTreaty_English_UTF-16LE.txt   617   2577       155 IndianTreaty
##  IndianTreaty_English_UTF-8-BOM.txt   645   3092       154 IndianTreaty
##          UDHR_Arabic_ISO-8859-6.txt   753   1555        86         UDHR
##               UDHR_Arabic_UTF-8.txt   753   1555        86         UDHR
##        UDHR_Arabic_WINDOWS-1256.txt   753   1555        86         UDHR
##  language input_encoding
##   English       UTF-16LE
##   English      UTF-8-BOM
##    Arabic     ISO-8859-6
##    Arabic          UTF-8
##    Arabic   WINDOWS-1256