rodham
aims to ease access to, and analysis of, Hillary Rodham Clinton’s personal emails, which the author deems important in light of recent events.
The function search_emails fetches the list of emails that were released. The list is available either by calling the Wall Street Journal’s API or via the built-in dataset (recommended).
library(rodham)
# get list of emails
data("emails")
# the same list can be fetched with search_emails(); the two objects need not be identical:
em <- search_emails()
identical(emails, em)
## [1] FALSE
Using the list of emails (data("emails")) we can plot the network of emails using edges_emails, which returns a list of edges meant for a directed network.
edges <- edges_emails(emails)
knitr::kable(head(edges))
| | from | to | freq |
|---|---|---|---|
| 2 | Cheryl Mills | Hillary Clinton | 4895 |
| 105 | Jake Sullivan | Hillary Clinton | 4538 |
| 57 | Huma Abedin | Hillary Clinton | 3726 |
| 139 | Hillary Clinton | Jake Sullivan | 1641 |
| 143 | Hillary Clinton | Cheryl Mills | 1312 |
| 153 | Hillary Clinton | Huma Abedin | 951 |
The freq column corresponds to the number of occurrences of each edge (number of emails). The list of edges alone allows building a simple network.
# build a directed graph; the freq column becomes an edge attribute
g <- igraph::graph_from_data_frame(edges)
# plot network
plot(g, layout = igraph::layout_with_fr(g),
     vertex.label.color = hsv(h = 0, s = 0, v = 0, alpha = 0.0), # alpha = 0 hides the labels
     vertex.size = log1p(igraph::degree(g)) * 2, edge.arrow.size = 0.1,
     edge.arrow.width = 0.1, edge.width = log1p(igraph::E(g)$freq) / 4,
     vertex.frame.color = "#FFFFFF")
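Beyond plotting, the edge list can be summarised directly. The following is a minimal sketch that totals the emails per sender with base R; it uses a hand-copied sample of the six rows printed in the table above (named edges_sample here for illustration) rather than the full output of edges_emails.

```r
# Hand-copied from the six rows printed above; not the full edge list.
edges_sample <- data.frame(
  from = c("Cheryl Mills", "Jake Sullivan", "Huma Abedin",
           "Hillary Clinton", "Hillary Clinton", "Hillary Clinton"),
  to   = c("Hillary Clinton", "Hillary Clinton", "Hillary Clinton",
           "Jake Sullivan", "Cheryl Mills", "Huma Abedin"),
  freq = c(4895, 4538, 3726, 1641, 1312, 951),
  stringsAsFactors = FALSE
)

# Sum the number of emails sent by each sender.
sent <- aggregate(freq ~ from, data = edges_sample, FUN = sum)

# Order senders from most to least active.
sent <- sent[order(-sent$freq), ]
print(sent)
```

The same call on the full edges data frame should work as long as it keeps the from, to and freq columns shown above.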
In the above we gather a reasonable amount of metadata on the emails, but we do not get their actual content. To do so we need to download the emails (as released, in PDF format) and extract the text. First we are going to need xpdf to extract the content; you can either download it manually from its download section or try get_xpdf (only tested on Windows). get_xpdf downloads and unzips the extractor, then returns the full path to the pdftotext.exe file required for the next step.
xpdf <- get_xpdf(dest = "C:/") # get extractor
# or if you downloaded manually point to pdftotext
xpdf <- "your/path/xpdfbin-win-3.04/bin64/pdftotext"
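Before moving on it can help to confirm that the extractor really is where you think it is. The sketch below is not part of rodham: find_pdftotext is a hypothetical helper that checks a manually supplied path and then falls back to looking for pdftotext on the system PATH with base R's Sys.which. Whether a pdftotext found on the PATH works with the package is an assumption, not something stated here.

```r
# Hypothetical helper (not part of rodham): locate a pdftotext binary
# without downloading anything.
find_pdftotext <- function(manual_path = NULL) {
  # Prefer an explicitly supplied path if the file is really there.
  if (!is.null(manual_path) && file.exists(manual_path)) {
    return(normalizePath(manual_path))
  }
  # Otherwise look for pdftotext on the system PATH.
  on_path <- unname(Sys.which("pdftotext"))
  if (nzchar(on_path)) {
    return(on_path)
  }
  # Nothing found: signal this with NA rather than failing.
  NA_character_
}

xpdf_path <- find_pdftotext("your/path/xpdfbin-win-3.04/bin64/pdftotext")
if (is.na(xpdf_path)) message("pdftotext not found; install xpdf first")
```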
Once we have the extractor we can fetch some emails using get_emails. The function requires you to select a specific release; the examples here use "Benghazi" and "August".
dir.create("./rodham") # directory must exist
emails_bengh <- get_emails(release = "Benghazi", save.dir = "./rodham", extractor = xpdf)
get_emails downloads, unzips and extracts the content from all emails; note that this may take some time. The files will be extracted to a folder named after the requested release, and its full path is returned (for future use).
Alternatively you may want to proceed step by step. This is particularly useful if your temp folder requires superuser rights or if you want to keep the PDF files.
# download specific release
dl <- download_emails("August") # returns full path to zip
pdf <- "emails_pdf" # directory where pdf will be extracted to
txt <- "emails.text" # directory where txt will be extracted to
# create directories
dir.create(pdf)
dir.create(txt)
unzip(dl, exdir = pdf)
# extract the text of the emails released in August
extract_emails(pdf, save.dir = txt, extractor = xpdf)
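When running these steps more than once, dir.create warns if the directory already exists. One way around that, sketched below with a hypothetical helper named ensure_dir (not part of rodham), is to create a directory only when it is missing. The demonstration writes under tempdir() and uses illustrative directory names.

```r
# Hypothetical helper: create a directory only when it is missing.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Demonstrated under tempdir() so nothing is written to the working directory.
base_dir <- file.path(tempdir(), "rodham_demo")
pdf_dir <- ensure_dir(file.path(base_dir, "emails_pdf"))
txt_dir <- ensure_dir(file.path(base_dir, "emails_txt"))

# Calling it again on an existing directory is harmless.
ensure_dir(pdf_dir)
```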
Now we can read the .txt files into R as a named list, where each email is named after its corresponding file.
contents <- load_emails(emails_bengh)
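To make the shape of that result concrete, here is a rough base R equivalent. It is only a sketch of the idea (read every .txt file, name each element after its file) and not how load_emails is implemented; the two placeholder files and the name contents_demo are made up for the example.

```r
# Write two small placeholder files so the sketch is self-contained.
demo_dir <- file.path(tempdir(), "rodham_txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("From: A", "Body one"), file.path(demo_dir, "email_1.txt"))
writeLines(c("From: B", "Body two"), file.path(demo_dir, "email_2.txt"))

# List the .txt files and read each one into a character vector of lines.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
contents_demo <- lapply(files, readLines, warn = FALSE)

# Name each element after its file, dropping directory and extension.
names(contents_demo) <- tools::file_path_sans_ext(basename(files))

names(contents_demo)
```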
You can clean the emails with clean_content; it’ll remove some comments and other unwanted lines.
cont <- get_content(contents)
cont <- clean_content(cont)
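The general idea of dropping unwanted lines can be sketched in base R as follows. Which lines clean_content actually removes is not stated here, so the sample email and the patterns below are purely illustrative assumptions.

```r
# Made-up lines standing in for one extracted email.
email_lines <- c(
  "UNCLASSIFIED U.S. Department of State",
  "From: Example Sender",
  "",
  "Subject: Example subject",
  "Please see the attached schedule.",
  "RELEASE IN FULL"
)

# Illustrative patterns only; the lines clean_content removes may differ.
unwanted <- c("^UNCLASSIFIED", "^RELEASE IN", "^[[:space:]]*$")

# Flag every line matching at least one unwanted pattern.
drop <- Reduce(`|`, lapply(unwanted, grepl, x = email_lines))

# Keep only the lines that matched none of the patterns.
cleaned <- email_lines[!drop]
print(cleaned)
```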
get_content