How to use rodham

John Coene

2017-07-18

What is this anyway?

rodham aims to ease access to and analysis of Hillary Rodham Clinton’s personal emails, which the author considers important in light of recent events.

Get started

The function search_emails fetches the list of emails that were released. The list is available either by calling the Wall Street Journal’s API or via the built-in dataset (recommended).

library(rodham)

# get list of emails
data("emails")

# or fetch the list from the API (the result need not be identical to the built-in dataset):
em <- search_emails()

identical(emails, em)
## [1] FALSE

Simple Network

Using the list of emails (data("emails")), we can plot the network of emails with edges_emails, which returns a list of edges meant for a directed network.

edges <- edges_emails(emails)
knitr::kable(head(edges))
     from             to               freq
2    Cheryl Mills     Hillary Clinton  4895
105  Jake Sullivan    Hillary Clinton  4538
57   Huma Abedin      Hillary Clinton  3726
139  Hillary Clinton  Jake Sullivan    1641
143  Hillary Clinton  Cheryl Mills     1312
153  Hillary Clinton  Huma Abedin       951

The freq column corresponds to the number of occurrences of each edge (number of emails). The list of edges alone allows building a simple network.

# build a directed graph from the edge list
g <- igraph::graph_from_data_frame(edges, directed = TRUE)
# plot network; labels are hidden via a fully transparent label color
plot(g, layout = igraph::layout_with_fr(g),
     vertex.label.color = hsv(h = 0, s = 0, v = 0, alpha = 0.0),
     vertex.size = log1p(igraph::degree(g)) * 2, edge.arrow.size = 0.1,
     edge.arrow.width = 0.1, edge.width = log1p(igraph::E(g)$freq)/4,
     vertex.frame.color = "#FFFFFF")
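
To make the edge list format concrete, here is a minimal base R sketch that derives from, to and freq columns from a few made-up sender/recipient pairs. The data and the names msgs and toy_edges are illustrative assumptions; this is not how edges_emails is implemented.

```r
# Toy sender/recipient pairs (made-up data, not from the rodham dataset)
msgs <- data.frame(
  from = c("A", "A", "B", "A", "C"),
  to   = c("B", "B", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Count how many times each directed from -> to pair occurs
msgs$freq <- 1L
toy_edges <- aggregate(freq ~ from + to, data = msgs, FUN = sum)

# Sort by descending frequency, as in the table above
toy_edges <- toy_edges[order(-toy_edges$freq), ]
toy_edges
```

A data frame of this shape can be passed to igraph in the same way as edges.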

Get emails content

The fast way

In the above we gather a reasonable amount of metadata on the emails, but we do not get their actual content. To get it we need to download the emails, as released, in PDF format and extract the text. First we need xpdf to extract the content: you can either download it manually from the download section or attempt to use get_xpdf (only tested on Windows). get_xpdf downloads and unzips the extractor, then returns the full path to the pdftotext.exe file required for the next step.

xpdf <- get_xpdf(dest = "C:/") # get extractor
# or if you downloaded manually point to pdftotext
xpdf <- "your/path/xpdfbin-win-3.04/bin64/pdftotext"
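
Since get_xpdf is only tested on Windows, on other systems one option is to look for a pdftotext that is already installed. The sketch below assumes pdftotext is installed and on the PATH; find_pdftotext and xpdf_path are hypothetical names, not part of rodham, and the helper relies only on base R's Sys.which.

```r
# Hypothetical helper (not part of rodham): look up pdftotext on the PATH.
find_pdftotext <- function() {
  path <- unname(Sys.which("pdftotext"))
  # Sys.which() typically returns "" when the program is not found
  if (nzchar(path)) path else NA_character_
}

xpdf_path <- find_pdftotext()
if (is.na(xpdf_path)) {
  message("pdftotext not found on the PATH; download xpdf manually instead")
}
```

If a path is found, it could be used as the extractor in the next step.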

Once we have the extractor we can fetch some emails using get_emails. The function requires you to select a specific release (see ?get_emails for the valid values); the example below fetches the "Benghazi" release.

dir.create("./rodham") # the directory passed to save.dir must exist
emails_bengh <- get_emails(release = "Benghazi", save.dir = "./rodham", extractor = xpdf)

get_emails downloads, unzips and extracts the content from all emails of the release; note that this may take some time. The files are extracted to a folder named after the requested release, and the full path to that folder is returned (for future use).

Step by step

Alternatively, you may want to proceed step by step. This is particularly useful if your temp folder requires superuser privileges or if you want to keep the PDF files.

# download specific release
dl <- download_emails("August") # returns full path to zip

pdf <- "emails_pdf" # directory where pdf will be extracted to
txt <- "emails.text" # directory where txt will be extracted to

# create directories
dir.create(pdf)
dir.create(txt)

unzip(dl, exdir = pdf)

# extract the text of the emails released in August
extract_emails(pdf, save.dir = txt, extractor = xpdf)
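
For intuition about the extraction step, the sketch below shows one way to run pdftotext over a directory of PDFs from R. It is an assumption about the general approach, not rodham's implementation; pdf_dir_to_txt is a hypothetical helper, and it relies on the pdftotext command-line form that takes a PDF file followed by a text file.

```r
# Hypothetical helper (not rodham's implementation): run pdftotext on every
# PDF found in pdf_dir and write one .txt file per PDF into txt_dir.
pdf_dir_to_txt <- function(pdf_dir, txt_dir, extractor) {
  pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE,
                     recursive = TRUE, ignore.case = TRUE)
  vapply(pdfs, function(f) {
    out <- file.path(txt_dir, sub("\\.pdf$", ".txt", basename(f),
                                  ignore.case = TRUE))
    # command-line form: pdftotext <PDF-file> <text-file>
    system2(extractor, args = c(shQuote(f), shQuote(out)))
    out
  }, character(1), USE.NAMES = FALSE)
}

# With an empty directory nothing is converted and the extractor is never run
empty_dir <- tempfile("pdf_")
dir.create(empty_dir)
converted <- pdf_dir_to_txt(empty_dir, tempdir(), extractor = "pdftotext")
length(converted)
```

The function returns the paths of the text files it wrote, which makes it easy to check how many emails were converted.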

Load the emails

Now we can read the .txt files into R as a named list in which each email is named after its corresponding file.

contents <- load_emails(emails_bengh)
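
The structure of such a named list can be illustrated with base R alone. The sketch below is not load_emails itself; it uses made-up files in a temporary directory, and txt_dir, files and toy_contents are hypothetical names.

```r
# Create two small text files in a temporary directory (made-up content)
txt_dir <- tempfile("emails_txt_")
dir.create(txt_dir)
writeLines(c("From: A", "Hello"), file.path(txt_dir, "doc_1.txt"))
writeLines(c("From: B", "Hi there"), file.path(txt_dir, "doc_2.txt"))

# Read every .txt file into a list, one character vector of lines per file
files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)
toy_contents <- lapply(files, readLines, warn = FALSE)

# Name each element after its file
names(toy_contents) <- basename(files)
names(toy_contents)
```

Each element can then be accessed by file name, for example toy_contents[["doc_1.txt"]].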

You can clean the emails with clean_content, which removes some comments and other unwanted lines.

cont <- get_content(contents)
cont <- clean_content(cont)
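
The exact rules applied by clean_content are not described here. As a rough illustration of line-based cleaning in general, the sketch below drops lines that match a pattern as well as blank lines; drop_lines, the sample text and the pattern are all hypothetical and chosen only for the example.

```r
# Made-up email lines; the "UNCLASSIFIED" marker line is illustrative only
toy_email <- c(
  "UNCLASSIFIED (illustrative marker line)",
  "From: A",
  "",
  "See you at noon.",
  "   "
)

# Hypothetical cleaner: drop lines matching a pattern and blank lines
drop_lines <- function(x, pattern) {
  keep <- !grepl(pattern, x) & nzchar(trimws(x))
  x[keep]
}

cleaned <- drop_lines(toy_email, pattern = "^UNCLASSIFIED")
cleaned
```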

Methods

get_content