Here we’ll examine an example application of the widyr package, particularly the pairwise_cor and pairwise_dist functions. We’ll use the data on United Nations General Assembly voting from the unvotes package:
library(dplyr)
library(unvotes)
un_votes
## # A tibble: 738,764 x 4
## rcid country country_code vote
## <int> <chr> <chr> <fct>
## 1 3 United States of America US yes
## 2 3 Canada CA no
## 3 3 Cuba CU yes
## 4 3 Haiti HT yes
## 5 3 Dominican Republic DO yes
## 6 3 Mexico MX yes
## 7 3 Guatemala GT yes
## 8 3 Honduras HN yes
## 9 3 El Salvador SV yes
## 10 3 Nicaragua NI yes
## # … with 738,754 more rows
This dataset has one row for each country for each roll call vote. We’re interested in finding pairs of countries that tended to vote similarly.
Notice that the vote column is a factor, with levels (in order) “yes”, “abstain”, and “no”:
levels(un_votes$vote)
## [1] "yes" "abstain" "no"
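The level order matters because converting a factor to numeric returns its integer level codes. As a minimal sketch, using a hypothetical toy factor (`toy_votes` is not part of unvotes) with the same level order, "abstain" ends up between "yes" and "no":

```r
# Hypothetical toy factor with the same level order as un_votes$vote
toy_votes <- factor(c("yes", "no", "abstain", "yes"),
                    levels = c("yes", "abstain", "no"))

# as.numeric() on a factor returns the integer level codes:
# yes = 1, abstain = 2, no = 3
toy_codes <- as.numeric(toy_votes)
toy_codes
## [1] 1 3 2 1
```

With this coding, two countries that vote alike get similar numbers, which is what makes a correlation a reasonable measure of agreement.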
We can then measure how closely each pair of countries agrees across roll call votes with the pairwise_cor function, after converting the vote factor to its numeric level codes (1, 2, 3):
library(widyr)
cors <- un_votes %>%
  mutate(vote = as.numeric(vote)) %>%
  pairwise_cor(country, rcid, vote, use = "pairwise.complete.obs", sort = TRUE)
cors
## # A tibble: 39,800 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 Slovakia Czech Republic 0.989
## 2 Czech Republic Slovakia 0.989
## 3 Lithuania Estonia 0.971
## 4 Estonia Lithuania 0.971
## 5 Lithuania Latvia 0.970
## 6 Latvia Lithuania 0.970
## 7 Germany Liechtenstein 0.968
## 8 Liechtenstein Germany 0.968
## 9 Slovakia Slovenia 0.966
## 10 Slovenia Slovakia 0.966
## # … with 39,790 more rows
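To make the idea behind this computation concrete, here is a rough base R sketch on hypothetical toy data (the `toy` table and its values are invented for illustration). It is meant to show the concept of a pairwise correlation across a shared feature, not widyr's actual implementation: the data is spread into one column per country, and every pair of columns is correlated using only the roll calls in which both countries voted.

```r
# Hypothetical toy data: three countries, four roll calls, votes coded 1-3.
# Country C did not vote in roll call 4.
toy <- data.frame(
  rcid    = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3),
  country = c(rep("A", 4), rep("B", 4), rep("C", 3)),
  vote    = c(1, 3, 1, 2, 1, 3, 1, 2, 3, 1, 3),
  stringsAsFactors = FALSE
)

# One row per roll call, one column per country; the missing vote becomes NA
wide <- tapply(toy$vote, list(toy$rcid, toy$country), mean)

# Correlate every pair of columns, dropping roll calls where either is NA
toy_cor <- cor(wide, use = "pairwise.complete.obs")
toy_cor
```

In this toy example A and B always vote identically, so their correlation is 1, while C votes opposite to A in every roll call they share, giving a correlation of -1. The `use = "pairwise.complete.obs"` argument is what allows countries with different voting histories to be compared at all.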
We could, for example, find the countries that the US is most and least in agreement with:
US_cors <- cors %>%
  filter(item1 == "United States of America")
# Most in agreement
US_cors
## # A tibble: 199 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 United States of America United Kingdom of Great Britain and Northern Ireland 0.576
## 2 United States of America Canada 0.559
## 3 United States of America Israel 0.540
## 4 United States of America Netherlands 0.515
## 5 United States of America Luxembourg 0.505
## 6 United States of America Australia 0.502
## 7 United States of America Belgium 0.496
## 8 United States of America Italy 0.467
## 9 United States of America New Zealand 0.458
## 10 United States of America Japan 0.458
## # … with 189 more rows
# Least in agreement
US_cors %>%
  arrange(correlation)
## # A tibble: 199 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 United States of America Belarus -0.358
## 2 United States of America Czechoslovakia -0.330
## 3 United States of America Cuba -0.306
## 4 United States of America Russian Federation -0.301
## 5 United States of America Egypt -0.247
## 6 United States of America India -0.243
## 7 United States of America Syrian Arab Republic -0.238
## 8 United States of America Afghanistan -0.229
## 9 United States of America Ukraine -0.225
## 10 United States of America Yemen Arab Republic -0.224
## # … with 189 more rows
These correlations can be particularly informative when visualized on a map.
library(maps)
library(fuzzyjoin)
library(countrycode)
library(ggplot2)
world_data <- map_data("world") %>%
  regex_full_join(iso3166, by = c("region" = "mapname")) %>%
  filter(region != "Antarctica")
US_cors %>%
  mutate(a2 = countrycode(item2, "country.name", "iso2c")) %>%
  full_join(world_data, by = "a2") %>%
  ggplot(aes(long, lat, group = group, fill = correlation)) +
  geom_polygon(color = "gray", size = .1) +
  scale_fill_gradient2() +
  coord_quickmap() +
  theme_void() +
  labs(title = "Correlation of each country's UN votes with the United States",
       subtitle = "Blue indicates agreement, red indicates disagreement",
       fill = "Correlation w/ US")
Another useful kind of visualization is a network plot, which can be created with Thomas Pedersen’s ggraph package. We can filter for pairs of countries with correlations above a particular threshold.
library(ggraph)
library(igraph)
cors_filtered <- cors %>%
  filter(correlation > .6)
continents <- tibble(country = unique(un_votes$country)) %>%
  filter(country %in% cors_filtered$item1 |
           country %in% cors_filtered$item2) %>%
  mutate(continent = countrycode(country, "country.name", "continent"))
set.seed(2017)
cors_filtered %>%
  graph_from_data_frame(vertices = continents) %>%
  ggraph() +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(aes(color = continent), size = 3) +
  geom_node_text(aes(label = name), check_overlap = TRUE, vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "Network of countries with correlated United Nations votes")
Choosing the threshold for filtering correlations (or other measures of similarity) typically requires some trial and error. Setting too high a threshold will make a graph too sparse, while too low a threshold will make a graph too crowded.
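One way to shorten the trial and error is to tabulate how many edges and nodes survive at several candidate thresholds before drawing anything. The sketch below uses a small hypothetical table (`toy_cors`, with invented values) in place of the real correlations; the same idea could be applied to a table with item1, item2, and correlation columns.

```r
# Hypothetical pairwise correlations, one row per unordered pair
toy_cors <- data.frame(
  item1 = c("A", "A", "A", "B", "B", "C"),
  item2 = c("B", "C", "D", "C", "D", "D"),
  correlation = c(0.95, 0.72, 0.40, 0.65, 0.55, 0.81),
  stringsAsFactors = FALSE
)

# Candidate thresholds to compare
thresholds <- c(0.5, 0.6, 0.7, 0.8, 0.9)

# For each threshold, count the surviving edges and the nodes they touch
summary_by_threshold <- do.call(rbind, lapply(thresholds, function(th) {
  kept <- toy_cors[toy_cors$correlation > th, ]
  data.frame(
    threshold = th,
    edges = nrow(kept),
    nodes = length(unique(c(kept$item1, kept$item2)))
  )
}))
summary_by_threshold
```

A sharp drop in the node count from one threshold to the next suggests the graph is becoming too sparse, while an edge count far larger than the node count suggests it may be too crowded to read.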