library(sdglinkage)
set.seed(1234)
In this vignette, we show how to use sdglinkage to generate a synthetic version of sensitive identifiers for the development of linkage methods. This is particularly useful for people at a trusted third party who have access to sensitive identifiers such as names and ID numbers and would like to share a synthetic yet realistic version of those identifiers with a wider audience (e.g. the ALSPAC dataset). People who do not have access to sensitive identifiers should see the vignette Generation_of_Gold_Standard_File_and_Linkage_Files.
For confidentiality reasons, we are unable to release our experimental datasets. Instead, for demonstration purposes, we create two identifier datasets, a gold standard file (real_gsf) and a linkage file (real_lf), and treat them as our ‘real’ datasets.
This is what real_gsf looks like:
real_gsf <- data.frame(sex=sample(c('male', 'female'), 100, replace = TRUE))
real_gsf <- add_variable(real_gsf, "nhsid")
real_gsf <- real_gsf[,c(2, 1)]
real_gsf$race <- sample(1:6, 100, replace = TRUE)
real_gsf <- add_variable(real_gsf, "dob", age_dependency = FALSE)
real_gsf <- add_variable(real_gsf, "firstname", country = "uk", gender_dependency= TRUE, age_dependency = TRUE)
real_gsf <- add_variable(real_gsf, "lastname", country = "uk")
head(real_gsf)
## nhsid sex race dob firstname lastname
## 1 0521968372 male 1 1936-11-05 alexander kiely
## 2 0698713257 female 1 1977-05-09 nikita veasey
## 3 1756849021 female 1 1945-11-01 lydia fibbs
## 4 5426301786 female 2 1971-03-01 chloe platts
## 5 9586341704 female 4 1995-05-15 louise hayhurst
## 6 6948320511 female 1 1936-01-15 chloe conroy
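This is what real_lf looks like. We can see some errors here: in row 1 the lastname ‘kiely’ was mistyped as ‘kieky’; in row 3 the race is missing and the dob was transposed from ‘1945-11-01’ to ‘1945-01-11’; and in row 4 the firstname ‘chloe’ was flagged for a variant error but appears as ‘chloe_lack_of_record’, which suggests that no variant was on record for that name.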
error_occurrence_flags <- data.frame(tmp=character(100))
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.90, 0.10), "race_missing")
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.55, 0.45), "dob_trans_date")
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.65, 0.35), "firstname_variant")
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.75, 0.25), "lastname_typo")
error_occurrence_flags$tmp <-NULL
real_lf <- damage_gold_standard(real_gsf, error_occurrence_flags)$linkage_file
## encoding error to: race_missing_flag
## encoding error to: dob_trans_date_flag
## encoding error to: firstname_variant_flag
## encoding error to: lastname_typo_flag
head(real_lf)
## nhsid sex race dob firstname lastname
## 1 0521968372 male 1 1936-11-05 alexander kieky
## 2 0698713257 female 1 1977-05-09 nikita veasey
## 3 1756849021 female NA 1945-01-11 lydia fibbs
## 4 5426301786 female 2 1971-03-01 chloe_lack_of_record platts
## 5 9586341704 female 4 1995-05-15 louise hayhurst
## 6 6948320511 female 1 1936-15-01 chloe_lack_of_record conroy
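The damage step can be pictured as a two-stage process: draw a 0/1 flag per record, then corrupt only the flagged values. The sketch below does this for a transposed date in base R. `transpose_date` is a hypothetical helper written for illustration, not a function from sdglinkage, and the package's own implementation may differ.

```r
set.seed(1)

# Hypothetical helper: swap month and day in a "YYYY-MM-DD" string.
transpose_date <- function(dob) {
  sub("^([0-9]{4})-([0-9]{2})-([0-9]{2})$", "\\1-\\3-\\2", dob)
}

dob <- c("1936-11-05", "1977-05-09", "1945-11-01", "1971-03-01")

# Draw one flag per record: 0 = keep, 1 = transpose (probability 0.45).
flag <- sample(c(0, 1), length(dob), replace = TRUE, prob = c(0.55, 0.45))

# Corrupt only the flagged records.
damaged <- ifelse(flag == 1, transpose_date(dob), dob)

data.frame(dob, flag, damaged)
```

Keeping the flags separate from the damaged values is what later makes it possible to check whether detected errors agree with the errors that were actually entered.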
In the real world, we often do not know where errors were introduced in a dataset. For a poorly maintained dataset, we have to compare its identifiers manually against those in a reference dataset. This clerical work is usually tedious and error-prone.
In this section, we show how to use sdglinkage to detect inconsistencies between real_lf and real_gsf and to classify them into error categories.
Here we use nhsid as the unique identifier to link real_gsf and real_lf, and compare the variables race, dob, firstname and lastname.
vars = list(c('race', 'race'), c('dob', 'dob'), c('firstname', 'firstname'), c('lastname', 'lastname'))
diffs.table = compare_two_df(real_gsf, real_lf, vars, 'nhsid')
diffs.table
## var.x var.y nhsid values.x values.y row.x row.y
## 1 race race 0832751693 3 NA 51 51
## 2 race race 174039562 6 NA 41 41
## 3 race race 1756849021 1 NA 3 3
## 4 race race 1850293465 2 NA 71 71
## 5 race race 5120983464 3 NA 36 36
## 6 race race 7143680591 4 NA 37 37
## 7 dob dob 0923168575 1986-12-17 1986-17-12 59 59
## 8 dob dob 1756849021 1945-11-01 1945-01-11 3 3
## 9 dob dob 1803529768 1929-06-24 1929-24-06 62 62
## 10 dob dob 204189735 1994-01-17 1994-17-01 17 17
## 11 dob dob 2053841966 1960-12-15 1960-15-12 8 8
## 12 dob dob 2057136485 2005-05-08 2005-08-05 46 46
## 13 dob dob 245071698 1960-04-21 1960-21-04 57 57
## 14 dob dob 2680473515 1957-07-20 1957-20-07 68 68
## 15 dob dob 2846701393 1906-11-06 1906-06-11 56 56
## 16 dob dob 3506742817 1932-09-14 1932-14-09 10 10
## 17 dob dob 3709241863 1940-02-18 1940-18-02 58 58
## 18 dob dob 4267518300 2002-03-12 2002-12-03 73 73
## 19 dob dob 4361879206 1914-04-19 1914-19-04 50 50
## 20 dob dob 4706352185 1923-03-20 1923-20-03 42 42
## 21 dob dob 4836019257 1908-08-01 1908-01-08 81 81
## 22 dob dob 4960238178 1931-08-24 1931-24-08 31 31
## 23 dob dob 5283174964 2008-05-27 2008-27-05 70 70
## 24 dob dob 5346879129 1962-09-19 1962-19-09 85 85
## 25 dob dob 5401873697 1936-06-03 1936-03-06 97 97
## 26 dob dob 5971084267 1909-05-10 1909-10-05 79 79
## 27 dob dob 6047953816 1908-08-12 1908-12-08 39 39
## 28 dob dob 6349518020 1976-03-31 1976-31-03 53 53
## 29 dob dob 6948320511 1936-01-15 1936-15-01 6 6
## 30 dob dob 7431625898 1963-05-16 1963-16-05 83 83
## 31 dob dob 7493562083 1966-04-28 1966-28-04 33 33
## 32 dob dob 7895412604 1960-10-07 1960-07-10 34 34
## 33 dob dob 8349720514 1927-03-10 1927-10-03 14 14
## 34 dob dob 853617420 1989-01-12 1989-12-01 20 20
## 35 dob dob 8690413251 2014-06-08 2014-08-06 40 40
## 36 dob dob 910268537 2016-06-04 2016-04-06 32 32
## 37 dob dob 9187263041 2014-11-26 2014-26-11 61 61
## 38 dob dob 9271345800 1979-09-16 1979-16-09 77 77
## 39 dob dob 9273156088 1915-01-29 1915-29-01 27 27
## 40 dob dob 9360481521 1930-01-21 1930-21-01 93 93
## 41 dob dob 9467821052 1951-05-17 1951-17-05 21 21
## 42 dob dob 9821670350 1910-06-01 1910-01-06 100 100
## 43 firstname firstname 0571948324 charlie charlee 90 90
## 44 firstname firstname 0739814257 elliot elliot_l.... 87 87
## 45 firstname firstname 174039562 siobhan siobhan_.... 41 41
## 46 firstname firstname 204189735 zachary zachariah 17 17
## 47 firstname firstname 2548970612 ben benji 64 64
## 48 firstname firstname 2846701393 chloe chloe_la.... 56 56
## 49 firstname firstname 2846951306 jake jake_lac.... 7 7
## 50 firstname firstname 2874560138 steven stevan 48 48
## 51 firstname firstname 2961547032 isabella isobelle 44 44
## 52 firstname firstname 4102376585 madeleine magda 60 60
## 53 firstname firstname 4361879206 heather heather_.... 50 50
## 54 firstname firstname 4836019257 anouska anouska_.... 81 81
## 55 firstname firstname 5014268390 alex alexys 67 67
## 56 firstname firstname 5283174964 dia dia_lack.... 70 70
## 57 firstname firstname 5346879129 alexander lexi 85 85
## 58 firstname firstname 5401873697 amin amin_lac.... 97 97
## 59 firstname firstname 5426301786 chloe chloe_la.... 4 4
## 60 firstname firstname 5971084267 alexander aleksandar 79 79
## 61 firstname firstname 6321408751 leon lennie 95 95
## 62 firstname firstname 6327854915 bethany beth 80 80
## 63 firstname firstname 6584017397 alexander aleksander 49 49
## 64 firstname firstname 6948320511 chloe chloe_la.... 6 6
## 65 firstname firstname 7109324656 alexander aleksandr 78 78
## 66 firstname firstname 7159842367 zac zach 89 89
## 67 firstname firstname 7462018594 jake jake_lac.... 13 13
## 68 firstname firstname 7493562083 tom thomas 33 33
## 69 firstname firstname 8402915736 chloe chloe_la.... 92 92
## 70 firstname firstname 9187263041 megan meghan 61 61
## 71 firstname firstname 9271345800 jake jake_lac.... 77 77
## 72 firstname firstname 9340862716 blake blayke 65 65
## 73 firstname firstname 9360481521 ian aian 93 93
## 74 firstname firstname 9562843017 bethany beth 66 66
## 75 firstname firstname 985743021 lauren laurel 29 29
## 76 lastname lastname 0382179544 gilson giloon 72 72
## 77 lastname lastname 0521968372 kiely kieky 1 1
## 78 lastname lastname 0739814257 millar millvr 87 87
## 79 lastname lastname 2846951306 thacker thafker 7 7
## 80 lastname lastname 2961547032 mcbain mcbaun 44 44
## 81 lastname lastname 4102376585 gorman gprman 60 60
## 82 lastname lastname 4267518300 mitchell mihchell 73 73
## 83 lastname lastname 5120983464 woods eoods 36 36
## 84 lastname lastname 5283174964 tinkler tinkoer 70 70
## 85 lastname lastname 5904371284 harrison harrispn 74 74
## 86 lastname lastname 5971084267 walmsley walms4ey 79 79
## 87 lastname lastname 6210975836 howe yowe 28 28
## 88 lastname lastname 6327854915 dempsey dempsuy 80 80
## 89 lastname lastname 6804731255 ross rosz 91 91
## 90 lastname lastname 6815243974 moore moofe 43 43
## 91 lastname lastname 7042953819 willett willeqt 30 30
## 92 lastname lastname 7325960840 bell bekl 9 9
## 93 lastname lastname 8402915736 cox cos 92 92
## 94 lastname lastname 853617420 hobbs hob9s 20 20
## 95 lastname lastname 9340862716 harrington hajrington 65 65
## 96 lastname lastname 9467821052 goff hoff 21 21
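To make the idea behind this comparison concrete, here is a minimal base R sketch that joins two data frames on a key and lists every value that differs. `compare_by_key` is a hypothetical name used only for this illustration; it is not how compare_two_df is necessarily implemented, and the toy data are made up.

```r
gsf <- data.frame(
  id       = c("a", "b", "c"),
  dob      = c("1945-11-01", "1977-05-09", "1936-01-15"),
  lastname = c("fibbs", "veasey", "conroy"),
  stringsAsFactors = FALSE
)
lf <- data.frame(
  id       = c("a", "b", "c"),
  dob      = c("1945-01-11", "1977-05-09", NA),
  lastname = c("fibbs", "veasey", "conriy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per (variable, record) that differs.
compare_by_key <- function(x, y, vars, key) {
  merged <- merge(x, y, by = key, suffixes = c(".x", ".y"))
  pieces <- lapply(vars, function(v) {
    vx <- as.character(merged[[paste0(v, ".x")]])
    vy <- as.character(merged[[paste0(v, ".y")]])
    # A value differs if exactly one side is NA, or both are present but unequal.
    changed <- xor(is.na(vx), is.na(vy)) |
      (!is.na(vx) & !is.na(vy) & vx != vy)
    data.frame(var = rep(v, sum(changed)),
               key = merged[[key]][changed],
               values.x = vx[changed],
               values.y = vy[changed],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

diffs <- compare_by_key(gsf, lf, vars = c("dob", "lastname"), key = "id")
diffs
```

Treating NA explicitly matters here: a plain `vx != vy` returns NA when either side is missing, so missing values would otherwise be silently dropped from the difference table.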
Here we show how to append error flags to real_gsf based on the differences between real_gsf and real_lf.
We first check whether race is missing in real_lf; if so, the individual is flagged as 1 in the newly built race_missing_flag variable. The same principle applies to the other errors and variables.
real_gsf_with_flags = acquire_error_flag(real_gsf, diffs.table, 'race', 'missing')
real_gsf_with_flags = acquire_error_flag(real_gsf_with_flags, diffs.table, 'dob', 'trans_date')
real_gsf_with_flags = acquire_error_flag(real_gsf_with_flags, diffs.table, 'firstname', 'variant')
real_gsf_with_flags = acquire_error_flag(real_gsf_with_flags, diffs.table, 'lastname', 'typo')
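As an illustration of what classifying a difference into an error category can involve, the base R sketch below labels each pair of dob values as missing, transposed, some other change, or unchanged. Both helper functions are hypothetical and written for this example only; acquire_error_flag may use different rules.

```r
# Hypothetical helper: swap month and day in a "YYYY-MM-DD" string.
transpose_date <- function(dob) {
  sub("^([0-9]{4})-([0-9]{2})-([0-9]{2})$", "\\1-\\3-\\2", dob)
}

# Hypothetical classifier: x is the reference value, y the observed value.
classify_dob_error <- function(x, y) {
  out <- rep("none", length(x))
  both <- !is.na(x) & !is.na(y)
  out[!is.na(x) & is.na(y)] <- "missing"
  # A transposition is a change that is undone by swapping month and day.
  trans <- both & x != y & y == transpose_date(x)
  out[trans] <- "trans_date"
  out[both & x != y & !trans] <- "other"
  out
}

x <- c("1945-11-01", "1977-05-09", "1936-01-15", "1992-11-11")
y <- c("1945-01-11", NA,           "1936-01-16", "1992-11-11")
classify_dob_error(x, y)
```

Note the last record: a date whose month and day are equal is unchanged by transposition, so a comparison of values alone cannot detect that a transposition was applied to it.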
error_occurrence_flags holds the errors we entered when creating real_lf, and acquired_error_flags holds the errors extracted and classified from real_lf.
head(error_occurrence_flags)
## race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1 0 0 0
## 2 0 0 0
## 3 1 1 0
## 4 0 0 1
## 5 0 0 0
## 6 0 1 1
## lastname_typo_flag
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
acquired_error_flags = real_gsf_with_flags[grep('flag', colnames(real_gsf_with_flags))]
head(acquired_error_flags)
## race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1 0 0 0
## 2 0 0 0
## 3 1 1 0
## 4 0 0 1
## 5 0 0 0
## 6 0 1 1
## lastname_typo_flag
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Let’s compare acquired_error_flags with error_occurrence_flags: if they are identical, our method has extracted the errors that occurred in real_lf and classified them into the correct categories.
all.equal(acquired_error_flags, error_occurrence_flags)
## [1] "Component \"dob_trans_date_flag\": 1 string mismatch"
## [2] "Component \"lastname_typo_flag\": 2 string mismatches"
The one mismatch in the dob_trans_date_flag column occurs because the dob in row 28 is unchanged by transposition: its month and day are both 11.
real_gsf[c(28),]
## nhsid sex race dob firstname lastname
## 28 6210975836 female 2 1992-11-11 chloe howe
real_lf[c(28),]
## nhsid sex race dob firstname lastname
## 28 6210975836 female 2 1992-11-11 chloe yowe
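Let’s fix it by resetting the flag for row 28. The dob_trans_date_flag mismatch disappears, although the output below still reports two mismatches in lastname_typo_flag.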
error_occurrence_flags$dob_trans_date_flag[c(28)] = 0
all.equal(acquired_error_flags, error_occurrence_flags)
## [1] "Component \"lastname_typo_flag\": 2 string mismatches"
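all.equal reports only how many values differ per column, not where they are. A minimal base R sketch for locating the mismatching rows is shown below; the two small data frames are made-up stand-ins for error_occurrence_flags and acquired_error_flags.

```r
# Toy stand-ins for the entered and the acquired flags.
entered <- data.frame(
  dob_trans_date_flag = c("0", "1", "1", "0"),
  lastname_typo_flag  = c("1", "0", "0", "0"),
  stringsAsFactors = FALSE
)
acquired <- data.frame(
  dob_trans_date_flag = c("0", "1", "0", "0"),
  lastname_typo_flag  = c("1", "0", "0", "0"),
  stringsAsFactors = FALSE
)

# For each flag column, list the row numbers where the two versions disagree.
mismatch_rows <- sapply(
  names(entered),
  function(col) which(entered[[col]] != acquired[[col]]),
  simplify = FALSE
)
mismatch_rows
```

The returned row numbers can then be used to inspect the corresponding records in both files, as was done for row 28 above.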
The synthetic data we generate later is sampled from the generator, so it is fully synthesised and cannot be linked back to real-world identifiers. However, the values of sensitive variables such as names are still drawn from the real dataset, which can be worrisome for some parties. Therefore, we also provide functions to replace these sensitive variables with values from another database.
Previously, we generated real_gsf with firstnames from the UK population that depend on the individual’s gender and age. Here we show how to replace them with firstnames from the US population that depend on the individual’s gender and race. We also replace the lastnames with ones from the US population and randomly assign a new nhsid to each individual.
real_gsf_with_flags_replaced = replace_firstname(real_gsf_with_flags, country = 'us', gender_dependency = TRUE, race_dependency = TRUE)
real_gsf_with_flags_replaced = replace_lastname(real_gsf_with_flags_replaced, country = 'us', race_dependency = TRUE)
real_gsf_with_flags_replaced = replace_nhsid(real_gsf_with_flags_replaced)
This is what the original dataset looks like:
head(real_gsf_with_flags[colnames(real_gsf)])
## nhsid sex race dob firstname lastname
## 1 0521968372 male 1 1936-11-05 alexander kiely
## 2 0698713257 female 1 1977-05-09 nikita veasey
## 3 1756849021 female 1 1945-11-01 lydia fibbs
## 4 5426301786 female 2 1971-03-01 chloe platts
## 5 9586341704 female 4 1995-05-15 louise hayhurst
## 6 6948320511 female 1 1936-01-15 chloe conroy
This is what the replaced dataset looks like:
head(real_gsf_with_flags_replaced[colnames(real_gsf)])
## nhsid sex race dob firstname lastname
## 1 9521047364 male 1 1936-11-05 christopher watts
## 2 0938425714 female 1 1977-05-09 olivia adams
## 3 8240139655 female 1 1945-11-01 courtney davis
## 4 7062843598 female 2 1971-03-01 emily chung
## 5 3254169808 female 4 1995-05-15 ashley salazar
## 6 1827306947 female 1 1936-01-15 samantha wilson
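The replacement idea can be sketched in base R as sampling a new value from an external lookup table, conditioned on an attribute of the individual. The lookup list and names below are invented for illustration; the real replace_firstname function draws from its own name databases and may work differently.

```r
set.seed(1)

people <- data.frame(
  sex       = c("male", "female", "female"),
  firstname = c("alexander", "nikita", "lydia"),
  stringsAsFactors = FALSE
)

# Hypothetical external name pool, keyed by sex.
lookup <- list(
  male   = c("christopher", "kevin"),
  female = c("olivia", "emily", "ashley")
)

# Draw one replacement name per person from the pool matching their sex.
people$firstname <- vapply(
  people$sex,
  function(s) sample(lookup[[s]], 1),
  character(1),
  USE.NAMES = FALSE
)
people
```

Because every replacement comes from the external pool, none of the original names survives in the output, while the dependency between sex and firstname is preserved.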
Here we show how to use our generator to generate synthetic identifiers from real_gsf_with_flags_replaced. For more details about the performance of the generator, please see the vignette Synthetic_Data_Generation_and_Evaluation.
# Here we set the variables into the right format for the generator
real_gsf_with_flags_replaced[colnames(real_gsf_with_flags_replaced)] <- lapply(real_gsf_with_flags_replaced[colnames(real_gsf_with_flags_replaced)], factor)
# We use learned bn to train a generator
bn_learn <- gen_bn_learn(real_gsf_with_flags_replaced, "hc")
# syn_gsf is the generated synthetic gold standard file
syn_gsf = bn_learn$gen_data
head(syn_gsf)
## nhsid sex race dob firstname lastname race_missing_flag
## 1 8192354679 female 5 1936-06-03 kaitlyn wright 0
## 2 3689124751 male 1 1995-05-15 jumaana wall 0
## 3 1827306947 female 6 1979-10-27 emily flores 0
## 4 0759186340 female 3 1942-03-30 jasmine kang 0
## 5 295734601 female 1 1982-03-03 kayla romero 0
## 6 0826359140 male 3 1945-11-01 kevin reed 0
## dob_trans_date_flag firstname_variant_flag lastname_typo_flag
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 1 1 0
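To give a feel for what sampling from a learned Bayesian network means, the base R sketch below hand-builds the smallest possible case: a two-node network sex → flag whose probability tables are estimated from toy data and then sampled parent-first. This is an assumption-laden simplification for illustration; gen_bn_learn learns the structure over all variables, and its internals are not shown here.

```r
set.seed(42)

real <- data.frame(
  sex  = c("male", "female", "female", "male", "female", "female"),
  flag = c("0", "1", "1", "0", "0", "1"),
  stringsAsFactors = FALSE
)

# Estimate P(sex) and the conditional table P(flag | sex) from the data.
p_sex <- prop.table(table(real$sex))
cpt   <- prop.table(table(real$sex, real$flag), margin = 1)

# Sample the parent first, then the child given the sampled parent.
n <- 10
syn_sex  <- sample(names(p_sex), n, replace = TRUE, prob = as.numeric(p_sex))
syn_flag <- vapply(
  syn_sex,
  function(s) sample(colnames(cpt), 1, prob = cpt[s, ]),
  character(1),
  USE.NAMES = FALSE
)

syn <- data.frame(sex = syn_sex, flag = syn_flag, stringsAsFactors = FALSE)
syn
```

The synthetic rows are new draws rather than copies of real rows, yet they respect the dependency seen in the toy data: no male record was flagged there, so no synthetic male record is flagged either.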
# syn_lf1 and syn_lf2 are the synthetic linkage files that were damaged by the inferred error occurrence in the syn_gsf
syn_error_occurrence_1 <- bn_flag_inference(syn_gsf, bn_learn$fit_model)
syn_lf1 <- damage_gold_standard(syn_gsf, syn_error_occurrence_1)
## encoding error to: race_missing_flag
## encoding error to: dob_trans_date_flag
## encoding error to: firstname_variant_flag
## encoding error to: lastname_typo_flag
head(syn_lf1$linkage_file)
## nhsid sex race dob firstname lastname
## 1 8192354679 female 5 1936-06-03 kaitlyn wright
## 2 3689124751 male 1 1995-05-15 jumaana_lack_of_record wall
## 3 1827306947 female 6 1979-27-10 emily flores
## 4 0759186340 female 3 1942-03-30 jasmine kang
## 5 295734601 female 1 1982-03-03 koula romerp
## 6 0826359140 male 3 1945-01-11 kevin reed
## race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1 0 1 0
## 2 0 0 1
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 1 1
## lastname_typo_flag
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 0
syn_error_occurrence_2 <- bn_flag_inference(syn_gsf, bn_learn$fit_model)
syn_lf2 <- damage_gold_standard(syn_gsf, syn_error_occurrence_2)
## encoding error to: race_missing_flag
## encoding error to: dob_trans_date_flag
## encoding error to: firstname_variant_flag
## encoding error to: lastname_typo_flag
head(syn_lf2$linkage_file)
## nhsid sex race dob firstname lastname
## 1 8192354679 female 5 1936-06-03 kaitlyn sright
## 2 3689124751 male 1 1995-05-15 jumaana_lack_of_record walk
## 3 1827306947 female 6 1979-10-27 emily flores
## 4 0759186340 female 3 1942-30-03 jasmin kang
## 5 295734601 female 1 1982-03-03 kayla romero
## 6 0826359140 male 3 1945-01-11 kevin reed
## race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1 0 1 0
## 2 0 0 1
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 1 1
## lastname_typo_flag
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 0
Here we give an example of how the generated linkage files can be used for linkage evaluation.
library(reclin)
library(dplyr)
linked_data_set <- pair_blocking(syn_lf1$linkage_file, syn_lf2$linkage_file, "dob") %>%
compare_pairs(by = c("lastname", "firstname", "sex", "race"),
default_comparator = jaro_winkler(0.8)) %>%
score_problink(var = "weight") %>%
select_n_to_m("weight", var = "ntom", threshold = 0) %>%
link()
Out of 100 individuals, only a little over half are matched by the reclin method. This is because the blocking variable ‘dob’ is itself unreliable: the transposed-date error was entered with probability 0.45 when the linkage file was created.
Among the linked pairs tabulated below, 53 are true matches and 1 is a mismatch:
table(linked_data_set$nhsid.x == linked_data_set$nhsid.y)
##
## FALSE TRUE
## 1 53
head(linked_data_set[linked_data_set$nhsid.x != linked_data_set$nhsid.y,],3)
## nhsid.x sex.x race.x dob.x firstname.x lastname.x
## 22 9521047364 female 5 1936-01-15 alex el-baccus
## NA <NA> <NA> <NA> <NA> <NA> <NA>
## NA.1 <NA> <NA> <NA> <NA> <NA> <NA>
## race_missing_flag.x dob_trans_date_flag.x firstname_variant_flag.x
## 22 0 0 0
## NA <NA> <NA> <NA>
## NA.1 <NA> <NA> <NA>
## lastname_typo_flag.x nhsid.y sex.y race.y dob.y firstname.y
## 22 0 7241956087 male 5 1936-01-15 steven
## NA <NA> <NA> <NA> <NA> <NA> <NA>
## NA.1 <NA> <NA> <NA> <NA> <NA> <NA>
## lastname.y race_missing_flag.y dob_trans_date_flag.y
## 22 watts 0 0
## NA <NA> <NA> <NA>
## NA.1 <NA> <NA> <NA>
## firstname_variant_flag.y lastname_typo_flag.y
## 22 0 0
## NA <NA> <NA>
## NA.1 <NA> <NA>
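Because the synthetic linkage files keep the true identifier, linkage quality can be summarised with precision and recall. The base R sketch below uses a hypothetical helper, `linkage_metrics`, and made-up identifiers; it assumes that the number of true matching pairs is known in advance.

```r
# Hypothetical helper: id_x and id_y are the true identifiers of each linked
# pair; n_true_pairs is the number of pairs that should have been linked.
linkage_metrics <- function(id_x, id_y, n_true_pairs) {
  comparable <- !is.na(id_x) & !is.na(id_y)   # ignore unlinked records
  tp <- sum(id_x[comparable] == id_y[comparable])
  fp <- sum(comparable) - tp
  c(precision = tp / (tp + fp), recall = tp / n_true_pairs)
}

id_x <- c("a", "b", "c", "d", NA)
id_y <- c("a", "b", "x", "d", "e")
metrics <- linkage_metrics(id_x, id_y, n_true_pairs = 6)
metrics
```

Applied to the counts in the table above (53 true matches, 1 mismatch), and assuming all 100 individuals appear in both files, this gives a precision of 53/54 (about 0.98) and a recall of 0.53.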