library(reclin)
library(dplyr)
Using reclin for deduplication will be demonstrated with an example data set of town names. Towns with names containing 'rdam' or 'rdm' have been selected; this should contain most records concerning the two largest cities in the Netherlands: Amsterdam and Rotterdam.
data("town_names")
head(town_names)
#> name official_name
#> 1 alblasserdam Alblasserdam
#> 2 amsterdam Amsterdam
#> 3 amsterdam-z.o. Amsterdam
#> 4 amsterdam-zuidoost Amsterdam
#> 5 amsterdam z-o Amsterdam
#> 6 amsterdam z.o. Amsterdam
First, we do a little bit of cleanup of the names. We have a lot of names of the form 'amsterdam z.o.', 'amsterdam zo', etc. Removing non-alphanumeric characters will probably help. Also, some of the O's are written as 0's (zeros).
town_names$clean_name <- gsub("[^[:alnum:]]", "", town_names$name)
town_names$clean_name <- gsub("0", "o", town_names$clean_name)
We will now compare all records in the data set to each other. First, we generate all possible pairs of records. However, as it is not necessary to compare the first record to the second and also the second to the first, we only keep pairs for which the second index is larger than the first; with three records, for example, only the pairs (1,2), (1,3) and (2,3) remain. This is done by filter_pairs_for_deduplication. We then compare the names using jaro_winkler and select the pairs for which the Jaro-Winkler similarity is above 0.88. This threshold was determined by eyeballing the data; usually values around 0.9 work well.
p <- pair_blocking(town_names, town_names) %>%
  filter_pairs_for_deduplication() %>%    # keep each pair only once (second index > first)
  compare_pairs("clean_name", default_comparator = jaro_winkler()) %>%
  score_simsum() %>%                      # sum the comparison scores (here just clean_name)
  select_threshold(0.88)                  # flag pairs with a score above 0.88
head(p)
#> ldat with 170 236 rows and 5 columns
#> x y clean_name simsum select
#> 1 1 2 0.6679894 0.6679894 FALSE
#> 2 1 3 0.6208514 0.6208514 FALSE
#> 3 2 3 0.9393939 0.9393939 TRUE
#> 4 1 4 0.5459851 0.5459851 FALSE
#> 5 2 4 0.8431373 0.8431373 FALSE
#> 6 3 4 0.8823529 0.8823529 TRUE
#> 7 1 5 0.6208514 0.6208514 FALSE
#> 8 2 5 0.9393939 0.9393939 TRUE
#> 9 3 5 1.0000000 1.0000000 TRUE
#> 10 4 5 0.8823529 0.8823529 TRUE
#> : : : : : :
#> 170227 574 584 0.8000000 0.8000000 FALSE
#> 170228 575 584 0.9333333 0.9333333 TRUE
#> 170229 576 584 0.7544974 0.7544974 FALSE
#> 170230 577 584 0.7277778 0.7277778 FALSE
#> 170231 578 584 0.6110294 0.6110294 FALSE
#> 170232 579 584 0.7073593 0.7073593 FALSE
#> 170233 580 584 0.5111111 0.5111111 FALSE
#> 170234 581 584 0.5296296 0.5296296 FALSE
#> 170235 582 584 0.7357298 0.7357298 FALSE
#> 170236 583 584 0.9296296 0.9296296 TRUE
We have now selected some town names that we consider the same: records 2 and 3 belong together (pair 3 in the output above), as do records 3 and 4 (pair 6). However, records 2 and 4 are not classified as belonging to the same town (pair 5).
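As a side note, reclin's string comparators build on the stringdist package, so a single pair's score can be reproduced by hand. A minimal check, assuming jaro_winkler() uses stringdist's default settings for the 'jw' method:
library(stringdist)
# Reproduce the score of pair 5 ('amsterdam' vs 'amsterdamzuidoost') by hand:
stringsim(town_names$clean_name[2], town_names$clean_name[4], method = "jw")
#> [1] 0.8431373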
In our final step we want to assign each record in our original data set town_names to a group, with each group containing all records that refer to the same town. The function deduplicate_equivalence does that. It uses the 'rules' derived above (2 and 3 belong to the same group, 3 and 4 belong to the same group, etc.) to assign each record to a group. It will, therefore, also assign records 2 and 4 to the same group. For those familiar with graph theory: it derives all connected subgraphs and assigns all nodes within a subgraph the same identifier.
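For illustration only, the same component logic can be sketched with the igraph package before we apply deduplicate_equivalence itself (igraph is not used by reclin; the edge list below is hypothetical and just mirrors the pairs discussed above):
library(igraph)
# The selected pairs act as undirected edges between record indices:
g <- graph_from_edgelist(matrix(c(2, 3, 3, 4), ncol = 2, byrow = TRUE),
                         directed = FALSE)
components(g)$membership  # records 2, 3 and 4 share one component id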
res <- deduplicate_equivalence(p)
head(res)
#> name official_name clean_name duplicate_groups
#> 1 alblasserdam Alblasserdam alblasserdam 541
#> 2 amsterdam Amsterdam amsterdam 581
#> 3 amsterdam-z.o. Amsterdam amsterdamzo 581
#> 4 amsterdam-zuidoost Amsterdam amsterdamzuidoost 581
#> 5 amsterdam z-o Amsterdam amsterdamzo 581
#> 6 amsterdam z.o. Amsterdam amsterdamzo 581
As we can see, records 2 to 6 are assigned to the same group. We can calculate the number of groups and compare that to the original number of town names:
length(unique(res$duplicate_groups))
#> [1] 53
length(unique(res$duplicate_groups))/nrow(res)
#> [1] 0.09075342
We are left with only 53 groups of town names: a reduction of approximately 90 percent. For this small number of remaining groups it is feasible to derive the correct names manually, or, if such information were available, we could use the most frequent name in each group as the group name.
Let's assume that we are able to correctly determine the group names. This means that we assign the most frequent official name to each group:
res <- res %>%
  group_by(duplicate_groups, official_name) %>%
  mutate(n = n()) %>%                 # count each official name within a group
  group_by(duplicate_groups) %>%
  mutate(group_name = first(official_name, order_by = desc(n)))  # most frequent name wins
We can then calculate the precision and recall for each official name: precision is the fraction of records assigned to a group that truly carry that group's name, and recall is the fraction of records with a given official name that end up in the corresponding group.
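A toy example with made-up labels (not from the data) makes these definitions concrete; the real computation follows below:
truth    <- c("A", "A", "A", "B", "B")   # hypothetical official names
assigned <- c("A", "A", "B", "B", "B")   # hypothetical group names
sum(assigned == "A" & truth == "A") / sum(assigned == "A")  # precision for A: 2/2 = 1
sum(assigned == "A" & truth == "A") / sum(truth == "A")     # recall for A: 2/3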
precision <- res %>% group_by(group_name) %>%
  summarise(precision = sum(group_name == official_name)/n())   # per assigned group name
precision_recall <- res %>% group_by(official_name) %>%
  summarise(recall = sum(group_name == official_name)/n()) %>%  # per true official name
  left_join(precision, by = c("official_name" = "group_name")) %>%
  mutate(precision = ifelse(is.na(precision), 0, precision))    # names never assigned get precision 0
precision_recall
#> # A tibble: 19 x 3
#> official_name recall precision
#> <fct> <dbl> <dbl>
#> 1 Alblasserdam 1 1
#> 2 Amsterdam 0.997 0.964
#> 3 Amsterdam-Duivendrecht 1 1
#> 4 Botlek Rotterdam 0.5 1
#> 5 Diemen 0.167 1
#> 6 Europoort Rotterdam 1 1
#> 7 Hoogvliet Rotterdam 0.725 1
#> 8 Leerdam 1 0.778
#> 9 Naarden 1 1
#> 10 Nieuw Amsterdam 0.75 1
#> 11 Pernis Rotterdam 0.688 1
#> 12 Rotterdam 0.993 0.876
#> 13 Rotterdam Albrandswaard 1 1
#> 14 Rotterdam Poortugaal 1 1
#> 15 Rozenburg 1 1
#> 16 Schiedam 0 0
#> 17 Spaarndam 1 1
#> 18 Veendam 0 0
#> 19 Zwammerdam 0.4 1
Overall precision and recall, averaged over the official names:
summarise(precision_recall, mean(recall), mean(precision))
#> # A tibble: 1 x 2
#> `mean(recall)` `mean(precision)`
#> <dbl> <dbl>
#> 1 0.748 0.875