How to use rodham

Get started

The function search_emails fetches the list of emails that were released. The list is available either by calling the Wall Street Journal’s API or through the built-in dataset (recommended).

library(rodham)

# get list of emails
data("emails")

# or fetch the list from the API instead:
em <- search_emails()

identical(emails, em)

## [1] FALSE

Simple Network

Using the list of emails (data("emails")), we can plot the email network with edges_emails, which returns a list of edges meant for a directed network.

edges <- edges_emails(emails)
knitr::kable(head(edges))

	from	to	freq
2	Cheryl Mills	Hillary Clinton	4895
105	Jake Sullivan	Hillary Clinton	4538
57	Huma Abedin	Hillary Clinton	3726
139	Hillary Clinton	Jake Sullivan	1641
143	Hillary Clinton	Cheryl Mills	1312
153	Hillary Clinton	Huma Abedin	951

The freq column is the number of occurrences of each edge (the number of emails). The list of edges alone is enough to build a simple network.

g <- igraph::graph_from_data_frame(edges) # directed by default
# plot network; labels are hidden by giving them a fully transparent colour
plot(g, layout = igraph::layout_with_fr(g),
     vertex.label.color = hsv(h = 0, s = 0, v = 0, alpha = 0.0),
     vertex.size = log1p(igraph::degree(g)) * 2, edge.arrow.size = 0.1,
     edge.arrow.width = 0.1, edge.width = log1p(igraph::E(g)$freq) / 4,
     vertex.frame.color = "#FFFFFF")
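
To make the freq column described above more concrete, here is a minimal sketch of how such directed edge counts could be derived from raw sender/recipient pairs using base R only. The data and the names msgs and toy_edges are invented for illustration; this is not necessarily how edges_emails computes its result.

```r
# Toy sender/recipient pairs (made-up data, not from the rodham dataset)
msgs <- data.frame(
  from = c("A", "A", "B", "A", "C"),
  to   = c("B", "B", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Count how often each directed (from, to) pair occurs
toy_edges <- aggregate(freq ~ from + to,
                       data = transform(msgs, freq = 1L),
                       FUN = sum)

# Sort by decreasing frequency, as in the table above
toy_edges <- toy_edges[order(-toy_edges$freq), ]
toy_edges
```

Because the pairs are directed, A to B and B to A are counted as separate edges.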

Get emails content

The fast way

The steps above gather a reasonable amount of metadata on the emails but not their actual content. To get the content we need to download the emails, as released, in PDF format and extract the text. First we need xpdf to extract the content; you can either download it manually from the download section or try get_xpdf (only tested on Windows). get_xpdf downloads and unzips the extractor, then returns the full path to the pdftotext.exe file required for the next step.

xpdf <- get_xpdf(dest = "C:/") # get extractor
# or if you downloaded manually point to pdftotext
xpdf <- "your/path/xpdfbin-win-3.04/bin64/pdftotext"
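
Since a wrong extractor path only shows up later as a failed extraction, it can help to check it up front. The sketch below uses a hypothetical helper, find_extractor, which is not part of rodham: it accepts an explicit path if the file exists and otherwise looks for pdftotext on the system PATH.

```r
# Hypothetical helper (not part of rodham): resolve a usable pdftotext path
find_extractor <- function(path = NULL) {
  # Prefer an explicit path when the file is actually there
  # (on Windows the file on disk is usually pdftotext.exe)
  if (!is.null(path) && file.exists(path)) {
    return(normalizePath(path))
  }
  # Otherwise look for pdftotext on the system PATH; "" means not found
  on_path <- unname(Sys.which("pdftotext"))
  if (nzchar(on_path)) on_path else NA_character_
}

found <- find_extractor("your/path/xpdfbin-win-3.04/bin64/pdftotext")
if (is.na(found)) message("pdftotext not found; download xpdf first")
```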

Once we have the extractor we can fetch some emails using get_emails. The function requires you to select a specific release; the valid ones are:

Benghazi
June
July
August
September
October
November
January 7
February 13
January 19
February 29
December
Non-disclosure
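
A typo in the release name is easy to make, so it may be worth checking the name before starting a long download. The following is a minimal sketch with a hypothetical helper, check_release, which is not part of rodham; whether get_emails itself matches names case-sensitively is not stated here, so the strict comparison is an assumption of this sketch.

```r
# Release names copied from the list above
valid_releases <- c("Benghazi", "June", "July", "August", "September",
                    "October", "November", "January 7", "February 13",
                    "January 19", "February 29", "December", "Non-disclosure")

# Hypothetical helper (not part of rodham): stop early on an unknown release
check_release <- function(release) {
  if (!release %in% valid_releases) {
    stop("Unknown release: ", release, call. = FALSE)
  }
  invisible(release)
}

check_release("Benghazi")                     # passes silently
try(check_release("benghazi"), silent = TRUE) # errors: case matters in this sketch
```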

dir.create("./rodham") # save.dir must exist before calling get_emails
emails_bengh <- get_emails(release = "Benghazi", save.dir = "./rodham", extractor = xpdf)

get_emails downloads, unzips and extracts the content from all emails; note that this may take some time. The files are extracted to a folder named after the requested release, and the full path to that folder is returned (for future use).
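
Once the extraction has finished, the returned path can be inspected with base R. The sketch below does not download anything: it builds a temporary folder with two fake files as a stand-in for the path returned by get_emails. The folder and file names are made up, and the .txt extension is assumed from the later loading step.

```r
# Stand-in for the path returned by get_emails (a temporary folder with
# two fake files so the example runs without downloading anything)
release_dir <- file.path(tempdir(), "Benghazi_demo")
dir.create(release_dir, showWarnings = FALSE)
writeLines("first email",  file.path(release_dir, "doc_1.txt"))
writeLines("second email", file.path(release_dir, "doc_2.txt"))

# List the extracted text files and count them
txt_files <- list.files(release_dir, pattern = "\\.txt$", full.names = TRUE)
length(txt_files)
basename(txt_files)
```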

Step by step

Alternatively, you may want to proceed step by step. This is particularly useful if your temp folder requires superuser rights or if you want to keep the PDF files.

# download specific release
dl <- download_emails("August") # returns full path to zip

pdf <- "emails_pdf" # directory where pdf will be extracted to
txt <- "emails.text" # directory where txt will be extracted to

# create directories
dir.create(pdf)
dir.create(txt)

unzip(dl, exdir = pdf)

# get emails released in august
extract_emails(pdf, save.dir = txt, extractor = xpdf)
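
When the step-by-step route is run more than once, dir.create warns if a directory already exists. A small wrapper avoids that; ensure_dir is a hypothetical helper, not part of rodham, and the directory name used below is only an example.

```r
# Hypothetical helper (not part of rodham): create a directory only if needed
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  # Report whether the directory is there now
  invisible(dir.exists(path))
}

demo_dir <- file.path(tempdir(), "emails_pdf_demo")
ensure_dir(demo_dir) # creates the directory
ensure_dir(demo_dir) # second call is a silent no-op
```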

Load the emails

Now we can read the .txt files into R as a named list, where each email is named after its corresponding file.

contents <- load_emails(emails_bengh)
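
To illustrate the shape of such a named list, here is a minimal base R sketch that reads a folder of .txt files into a list named after the files. The file names and contents are made up, and whether load_emails keeps or drops the file extension in the names is an assumption; this is not the package's implementation.

```r
# Build a tiny folder of fake .txt emails (file names are made up)
txt_dir <- file.path(tempdir(), "emails_txt_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("From: A", "Hello"), file.path(txt_dir, "email_1.txt"))
writeLines(c("From: B", "Hi"),    file.path(txt_dir, "email_2.txt"))

# Read every file into a list of character vectors (one element per line)
files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)
demo <- lapply(files, readLines)

# Name each element after its file, without the extension
names(demo) <- tools::file_path_sans_ext(basename(files))
str(demo)
```

Individual emails can then be accessed by name, for example demo$email_1.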

You can clean the emails with clean_content, which removes some comments and other unwanted lines.

cont <- get_content(contents)
cont <- clean_content(cont)
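
As a rough idea of what line-based cleaning can look like, the sketch below drops empty lines and lines matching unwanted patterns from a character vector. Everything here is hypothetical: drop_lines is not part of rodham, the marker line and pattern are invented, and the rules clean_content actually applies may differ.

```r
# Toy email body; the marker line is invented for illustration
body <- c("UNCLASSIFIED", "", "Please call me tomorrow.", "   ", "Thanks")

# Hypothetical cleaner: drop empty lines and lines matching unwanted patterns
drop_lines <- function(x, patterns = "^UNCLASSIFIED") {
  keep <- nzchar(trimws(x)) # FALSE for empty or whitespace-only lines
  for (p in patterns) {
    keep <- keep & !grepl(p, x)
  }
  x[keep]
}

drop_lines(body)
```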

Methods

get_content