library(sdglinkage)
set.seed(1234)
In this vignette, we show how we can use sdglinkage to generate a realistic synthetic gold standard file and how to damage the gold standard file into multiple copies of linkage files that can be used for linkage research.
Usually, when a trusted third party releases a dataset to a research organisation, it removes sensitive identifiers to prevent the data from being linked back to real individuals. Meanwhile, for research purposes, some organisations publish statistics on the errors that occur in their datasets. This vignette is aimed at people in a research organisation who have access to non-sensitive predictor variables and to statistics on the errors in both the predictor variables and the sensitive identifiers, and who would like to share a synthetic gold standard file and linkage files of this dataset, with realistic identifiers, with a wider audience. People in a trusted third party who have access to sensitive identifiers should see the vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers.
We want to generate two types of files: a gold standard file that gives us the true values of the variables of interest, and linkage files that mimic the errors found in the real data. The following figure outlines the framework for generating these two types of files. In this example, we assume we have access to predictor variables such as sex, age and ethnicity, but not to sensitive identifiers such as nhsid and names. We also assume we know the errors that occurred in all variables. We simulate three types of variables: predictor variables learned together with the encoded error flags, external dependent identifiers, and independent identifiers. These synthetic variables are then merged into a gold standard file, which is further damaged by inferred synthetic errors to give us the synthetic linkage files.
Generating gold standard file and linkage files.
A gold standard file consists of predictor variables and identifier variables. In this section we show how to generate synthetic predictor variables that we have access to, and how to append synthetic external dependent identifiers and independent identifiers that we do not have access to.
If we know where and what type of error occurred in the dataset, we encode the position and type of each error using one-hot encoding. For example, we flag Edward’s name variant error as ‘1’ if his name was recorded as ‘Eddy’. This gives us a training dataset with one-hot encoded error flags. Bear in mind that we should build error flags for all variables of interest, including independent identifiers: even though the values of independent identifiers do not depend on the values of other variables, the occurrence of errors in them can depend on the values of, or the occurrence of errors in, other variables. For example, a person from a minority group is more likely to have a missing value in the nhsid variable than someone who was born in the UK. The result is a training dataset that contains many error flags together with the values of the dependent variables.
In the vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers we show how to extract these errors if we have access to the corrupted files. In this subsection, we show how to encode the error flags when we do not have access to the errors themselves but do have information about the error statistics.
We use the ‘age’, ‘race’ and ‘sex’ variables from the ‘Adult’ dataset as an example of real predictor variables. We also know the error statistics of our target dataset, e.g. 30% of the age values and 50% of the race values are missing.
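To make the encoding concrete, here is a minimal base R sketch. It assumes we hold both a true value and a recorded value for each person; the data frame and its column names are hypothetical and are not part of sdglinkage. For simplicity, any recorded value that differs from the true one is flagged as a variant, whereas a real encoding would distinguish variants from typos and other error types.

```r
# Hypothetical true and recorded values; none of these names come from the package.
records <- data.frame(
  firstname_true     = c("edward", "chloe", "oliver", "steven"),
  firstname_recorded = c("eddy", "chloe", NA, "stephan"),
  stringsAsFactors = FALSE
)

# 1 when the recorded value is missing, 0 otherwise.
records$firstname_missing_flag <- as.integer(is.na(records$firstname_recorded))

# 1 when a value was recorded but differs from the true one, 0 otherwise.
records$firstname_variant_flag <- as.integer(
  !is.na(records$firstname_recorded) &
    records$firstname_recorded != records$firstname_true
)

records
```

Each flag column records one error type for one variable, so a single record can carry several flags at once.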
real_gsf <- adult[c('age', 'race', 'sex')][1:3000,]
real_gsf_with_flags <- add_random_error(real_gsf, prob = c(0.70, 0.30), "age_missing")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "race_missing")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.65, 0.35), "sex_missing")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.90, 0.10), "postcode_trans_char")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_variant")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "lastname_variant")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_typo")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_pho")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_ocr")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_trans_char")
head(real_gsf_with_flags)
## age race sex age_missing_flag race_missing_flag sex_missing_flag
## 1 39 White Male 0 0 0
## 2 50 White Male 0 1 0
## 3 38 White Male 0 1 1
## 4 53 Black Male 0 1 0
## 5 28 Black Female 1 1 0
## 6 37 White Female 0 1 0
## postcode_trans_char_flag firstname_variant_flag lastname_variant_flag
## 1 0 0 1
## 2 0 0 1
## 3 0 1 1
## 4 0 0 1
## 5 1 1 1
## 6 1 0 1
## firstname_typo_flag firstname_pho_flag firstname_ocr_flag
## 1 0 0 0
## 2 1 0 0
## 3 0 1 1
## 4 0 0 1
## 5 1 1 1
## 6 0 1 0
## firstname_trans_char_flag
## 1 1
## 2 1
## 3 0
## 4 1
## 5 1
## 6 0
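Conceptually, each call above appends a column of random 0/1 flags drawn with the given probabilities. The following base R sketch illustrates that idea only; `add_flag_sketch` is a hypothetical helper, not the sdglinkage implementation, and it assumes that `prob` is ordered as (probability of 0, probability of 1), which is consistent with the 30% missing age example.

```r
set.seed(1)

# Hypothetical helper: NOT the sdglinkage implementation of add_random_error().
add_flag_sketch <- function(data, flag_name, prob) {
  # prob = c(P(flag = 0), P(flag = 1)), mirroring the calls above (an assumption).
  data[[paste0(flag_name, "_flag")]] <- sample(
    c(0L, 1L), size = nrow(data), replace = TRUE, prob = prob
  )
  data
}

toy <- data.frame(age = c(39, 50, 38, 53, 28, 37))
toy <- add_flag_sketch(toy, "age_missing", prob = c(0.70, 0.30))
toy

# The observed share of 1s approaches 0.30 only as the number of rows grows.
big <- add_flag_sketch(data.frame(id = 1:10000), "age_missing", c(0.70, 0.30))
mean(big$age_missing_flag)
```

Because the flags are drawn independently of the data, they carry the marginal error rates but no dependency between errors and variable values.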
We use Bayesian networks (BNs) to learn the dependency structure and parameters of the training dataset, and then sample data from the trained model. The generated data preserves not only the relationships between the variables and their statistics, but also the occurrence of errors (this is useful for inferring the errors of the linkage files, which is introduced later).
bn_learn <- gen_bn_learn(real_gsf_with_flags, "hc")
syn_dependent <- bn_learn$gen_data[, !grepl("flag", colnames(bn_learn$gen_data))]
head(syn_dependent)
## age race sex
## 1 45.44689 Amer-Indian-Eskimo Male
## 2 18.04504 White Male
## 3 33.22080 White Male
## 4 32.03759 White Male
## 5 44.81910 White Female
## 6 30.49562 White Female
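The learn-then-sample step can be illustrated with a minimal base R sketch for a two-node network, race -> race_missing_flag. This is not what `gen_bn_learn` does internally: the structure here is fixed by hand rather than found by a hill-climbing search, and the training values are hypothetical. It only shows how the parameters of a discrete network are estimated and how new records are drawn from them.

```r
set.seed(1)

# Toy training data (hypothetical values) for the network race -> race_missing_flag.
train <- data.frame(
  race = c("White", "White", "White", "White", "Black", "Black", "Black", "Black"),
  race_missing_flag = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Parameters of the network: P(race) and P(flag | race).
p_race <- prop.table(table(train$race))
cpt    <- prop.table(table(train$race, train$race_missing_flag), margin = 1)

# Ancestral sampling: draw the parent first, then the child given the parent.
n <- 1000
syn_race <- sample(names(p_race), n, replace = TRUE, prob = as.numeric(p_race))
syn_flag <- rbinom(n, size = 1, prob = cpt[syn_race, "1"])
syn <- data.frame(race = syn_race, race_missing_flag = syn_flag)

# The synthetic data should roughly reproduce the conditional error rates.
tapply(syn$race_missing_flag, syn$race, mean)
```

The synthetic records contain no row of the training data, yet the error rate per group stays close to the one in the training data.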
Here we randomly assign an nhsid and an address to each individual. The nhsid is generated using the Modulus 11 algorithm, and the address is sampled from a real UK address database.
syn_gsf <- add_variable(syn_dependent, "nhsid")
syn_gsf <- add_variable(syn_gsf, "address")
syn_gsf$country <- NULL
syn_gsf$primary_care_trust <- NULL
syn_gsf$longitude <- NULL
syn_gsf$latitude <- NULL
head(syn_gsf)
## age race sex nhsid postcode
## 1 45.44689 Amer-Indian-Eskimo Male 6531420790 M22 5BF
## 2 18.04504 White Male 6749102387 NN7 4LT
## 3 33.22080 White Male 7543092867 SS13 1PZ
## 4 32.03759 White Male 3617948251 AB32 7AT
## 5 44.81910 White Female 4102587969 EX22 7BG
## 6 30.49562 White Female 7653401988 CF14 5FT
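The Modulus 11 check used for the nhsid can be sketched in base R as follows. The first nine digits are weighted by 10 down to 2 and summed, and the check digit is 11 minus the remainder of that sum modulo 11, where a result of 11 becomes 0 and a result of 10 marks the number as invalid. The helper names `nhs_check_digit` and `is_valid_nhsid` are hypothetical and are not sdglinkage functions.

```r
# Check digit for the first nine digits of an NHS number (Modulus 11).
# Returns NA when the check value is 10, which makes the number invalid.
nhs_check_digit <- function(first9) {
  digits <- as.integer(strsplit(first9, "")[[1]])
  stopifnot(length(digits) == 9, !anyNA(digits))
  check <- 11 - (sum(digits * 10:2) %% 11)
  if (check == 11) 0L else if (check == 10) NA_integer_ else as.integer(check)
}

# TRUE when the tenth digit agrees with the check digit of the first nine.
is_valid_nhsid <- function(nhsid) {
  if (!grepl("^[0-9]{10}$", nhsid)) return(FALSE)
  expected <- nhs_check_digit(substr(nhsid, 1, 9))
  !is.na(expected) && expected == as.integer(substr(nhsid, 10, 10))
}

is_valid_nhsid("6531420790")  # first nhsid in the output above
is_valid_nhsid("6749102387")  # second nhsid in the output above
is_valid_nhsid("6749102388")  # last digit altered
```

A generator can draw nine random digits, compute the check digit, and redraw whenever the check value is 10.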
Firstname and lastname are two sensitive identifiers that are often removed when a dataset is released to another organisation. However, several organisations have published databases of names for different populations. We make use of these resources to build a UK firstname database that depends on gender and age, a UK lastname database, a US firstname database that depends on gender and race, and a US lastname database that depends on race.
Here we randomly assign a firstname and a lastname to each individual based on their gender and age and on the frequency of the names. Firstname and lastname are sampled from a real UK database of baby names covering births from 1996 to 2018. Together with the synthetic predictors and independent identifiers, this gives us the synthetic gold standard file.
syn_gsf <- add_variable(syn_gsf, "firstname", country = "uk", gender_dependency = TRUE, age_dependency = TRUE)
syn_gsf <- add_variable(syn_gsf, "lastname", country = "uk")
head(syn_gsf)
## age race sex nhsid postcode firstname
## 1 45.44689 Amer-Indian-Eskimo Male 6531420790 M22 5BF tony
## 2 18.04504 White Male 6749102387 NN7 4LT oliver
## 3 33.22080 White Male 7543092867 SS13 1PZ steven
## 4 32.03759 White Male 3617948251 AB32 7AT leon
## 5 44.81910 White Female 4102587969 EX22 7BG chloe
## 6 30.49562 White Female 7653401988 CF14 5FT chloe
## lastname
## 1 barchard
## 2 ward
## 3 kirkbride
## 4 murden
## 5 munroe
## 6 terry
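Frequency-weighted sampling of a dependent identifier can be illustrated with the base R sketch below. The name table, its frequencies and the helper `sample_firstname` are hypothetical; the sketch conditions on sex only, and an age dependency could be added by also filtering the table on a birth-year column.

```r
set.seed(1)

# Hypothetical name frequency table; real tables would come from published name statistics.
name_freq <- data.frame(
  sex  = c("Male", "Male", "Male", "Female", "Female", "Female"),
  name = c("oliver", "leon", "tony", "chloe", "maja", "amelia"),
  freq = c(50, 10, 5, 40, 8, 30),
  stringsAsFactors = FALSE
)

# Draw one first name per person, weighted by frequency within that person's sex.
sample_firstname <- function(sex, name_freq) {
  vapply(sex, function(s) {
    pool <- name_freq[name_freq$sex == s, ]
    sample(pool$name, size = 1, prob = pool$freq)
  }, character(1), USE.NAMES = FALSE)
}

people <- data.frame(sex = c("Male", "Female", "Female", "Male"),
                     stringsAsFactors = FALSE)
people$firstname <- sample_firstname(people$sex, name_freq)
people
```

Common names are drawn more often than rare ones, so duplicates such as the two records named chloe in the output above are expected.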
The linkage files are copies of the gold standard file that have been damaged by several damage actions. In this section, we show how to generate two linkage files that can be used for linkage activities.
The error occurrence files are inferred from the previously trained model, based on the record of each individual. They guide the damage actions.
syn_error_occurrence1 <- bn_flag_inference(bn_learn$gen_data, bn_learn$fit_model)
head(syn_error_occurrence1)
## age_missing_flag race_missing_flag sex_missing_flag
## 1 0 0 0
## 2 0 0 0
## 3 1 0 1
## 4 0 1 0
## 5 0 0 0
## 6 0 1 1
## postcode_trans_char_flag firstname_variant_flag lastname_variant_flag
## 1 0 0 0
## 2 0 1 0
## 3 0 1 1
## 4 0 1 1
## 5 0 0 1
## 6 0 0 1
## firstname_typo_flag firstname_pho_flag firstname_ocr_flag
## 1 0 0 1
## 2 0 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 0 0
## 6 0 0 1
## firstname_trans_char_flag
## 1 1
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
syn_error_occurrence2 <- bn_flag_inference(bn_learn$gen_data, bn_learn$fit_model)
head(syn_error_occurrence2)
## age_missing_flag race_missing_flag sex_missing_flag
## 1 0 1 1
## 2 0 0 0
## 3 0 1 0
## 4 1 0 0
## 5 1 0 1
## 6 1 0 0
## postcode_trans_char_flag firstname_variant_flag lastname_variant_flag
## 1 1 0 1
## 2 0 1 0
## 3 0 1 1
## 4 0 0 0
## 5 0 0 1
## 6 0 1 0
## firstname_typo_flag firstname_pho_flag firstname_ocr_flag
## 1 0 0 0
## 2 0 1 0
## 3 1 0 1
## 4 1 0 0
## 5 1 1 0
## 6 1 1 0
## firstname_trans_char_flag
## 1 0
## 2 0
## 3 0
## 4 1
## 5 1
## 6 1
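The two error occurrence files above differ even though they come from the same model and the same records, which suggests that the flags are drawn at random conditional on each record. The base R sketch below illustrates that idea under this assumption; the conditional probabilities and the helper `infer_flags` are hypothetical, and this is not the implementation of `bn_flag_inference`.

```r
set.seed(1)

# Hypothetical conditional probabilities P(race_missing_flag = 1 | race);
# in the vignette these would come from the fitted model.
p_missing_given_race <- c(White = 0.25, Black = 0.75)

syn_records <- data.frame(
  race = rep(c("White", "Black"), each = 500),
  stringsAsFactors = FALSE
)

# Draw one flag per record, conditional on that record's own value.
infer_flags <- function(records) {
  rbinom(nrow(records), size = 1, prob = p_missing_given_race[records$race])
}

flags1 <- infer_flags(syn_records)  # would guide the first linkage file
flags2 <- infer_flags(syn_records)  # would guide the second linkage file

# The two draws share similar error rates but not the same damaged records.
tapply(flags1, syn_records$race, mean)
tapply(flags2, syn_records$race, mean)
mean(flags1 != flags2)
```

This is what makes the two linkage files disagree with each other in a realistic way: the same person can be damaged in one file and intact in the other.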
Here we damage the gold standard file based on the inferred occurrence of the errors.
syn_lf1 <- damage_gold_standard(syn_gsf, syn_error_occurrence1)
## encoding error to: age_missing_flag
## encoding error to: race_missing_flag
## encoding error to: sex_missing_flag
## encoding error to: postcode_trans_char_flag
## encoding error to: firstname_variant_flag
## encoding error to: lastname_variant_flag
## encoding error to: firstname_typo_flag
## encoding error to: firstname_pho_flag
## encoding error to: firstname_ocr_flag
## encoding error to: firstname_trans_char_flag
head(syn_lf1$linkage_file)
## age race sex nhsid postcode firstname
## 1 45.44689 Amer-Indian-Eskimo Male 6531420790 M22 5BF t0yn
## 2 18.04504 White Male 6749102387 NN7 4LT lo'l
## 3 NA White <NA> 7543092867 SS13 1PZ 5trephan
## 4 32.03759 <NA> Male 3617948251 AB32 7AT IeInard
## 5 44.81910 White Female 4102587969 EX22 7BG chlo0
## 6 30.49562 <NA> <NA> 7653401988 CF14 5FT cbloe
## lastname
## 1 barchard
## 2 ward
## 3 kirkbride_lack_of_record
## 4 murden_lack_of_record
## 5 munroe_lack_of_record
## 6 terry_lack_of_record
head(syn_lf1$error_log)
## age_missing_flag race_missing_flag sex_missing_flag
## 1 0 0 0
## 2 0 0 0
## 3 <NA> 0 <NA>
## 4 0 <NA> 0
## 5 0 0 0
## 6 0 <NA> <NA>
## postcode_trans_char_flag firstname_variant_flag
## 1 0 0
## 2 0 oliver>olie
## 3 0 steven>stephan
## 4 0 leon>leonhard
## 5 0 0
## 6 0 0
## lastname_variant_flag firstname_typo_flag
## 1 0 0
## 2 0 0
## 3 kirkbride>kirkbride_lack_of_record t<r<2
## 4 murden>murden_lack_of_record o<l<3
## 5 munroe>munroe_lack_of_record e<0<5
## 6 terry>terry_lack_of_record 0
## firstname_pho_flag firstname_ocr_flag firstname_trans_char_flag
## 1 0 o>0>all ny>trans>34
## 2 ie>i>all i>'l>all ol>trans>12
## 3 s>st>all s>5>all 0
## 4 h>@>all l>I>all 0
## 5 0 0 0
## 6 0 h>b>all 0
syn_lf2 <- damage_gold_standard(syn_gsf, syn_error_occurrence2)
## encoding error to: age_missing_flag
## encoding error to: race_missing_flag
## encoding error to: sex_missing_flag
## encoding error to: postcode_trans_char_flag
## encoding error to: firstname_variant_flag
## encoding error to: lastname_variant_flag
## encoding error to: firstname_typo_flag
## encoding error to: firstname_pho_flag
## encoding error to: firstname_ocr_flag
## encoding error to: firstname_trans_char_flag
head(syn_lf2$linkage_file)
## age race sex nhsid postcode firstname
## 1 45.44689 <NA> <NA> 6531420790 M225 BF tony
## 2 18.04504 White Male 6749102387 NN7 4LT olli
## 3 33.22080 <NA> Male 7543092867 SS13 1PZ sgepben
## 4 NA White Male 3617948251 AB32 7AT lemo
## 5 NA White <NA> 4102587969 EX22 7BG nkloe
## 6 NA White Female 7653401988 CF14 5FT chloe_lack_of_eecotr
## lastname
## 1 barchard_lack_of_record
## 2 ward
## 3 kirkbride_lack_of_record
## 4 murden
## 5 munroe_lack_of_record
## 6 terry
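The damage step can be illustrated with a small base R sketch that applies two of the error types, a missing value and a transposition of adjacent characters, wherever the corresponding flag is 1. The helper `transpose_chars` and the toy data are hypothetical; this is not the implementation of `damage_gold_standard`, and the transposition position is fixed here for reproducibility.

```r
# Hypothetical helper; NOT part of sdglinkage.
# Swap the characters at positions i and i + 1.
transpose_chars <- function(x, i) {
  chars <- strsplit(x, "")[[1]]
  stopifnot(i >= 1, i < length(chars))
  chars[c(i, i + 1)] <- chars[c(i + 1, i)]
  paste(chars, collapse = "")
}

gold <- data.frame(
  age = c(45, 18, 33),
  firstname = c("tony", "oliver", "steven"),
  stringsAsFactors = FALSE
)
flags <- data.frame(
  age_missing_flag = c(0, 0, 1),
  firstname_trans_char_flag = c(1, 0, 0)
)

damaged <- gold
# A missing flag blanks the value.
damaged$age[flags$age_missing_flag == 1] <- NA
# A transposition flag swaps two adjacent characters (position fixed at 3 here).
hit <- flags$firstname_trans_char_flag == 1
damaged$firstname[hit] <- vapply(damaged$firstname[hit], transpose_chars,
                                 character(1), i = 3, USE.NAMES = FALSE)
damaged
```

The gold standard data frame is left untouched, so the true values remain available for evaluating a linkage.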
Here we give an example of how the generated linkage files can be used for linkage evaluation.
library(reclin)
library(dplyr)
# 'postcode' is used as the blocking variable.
linked_data_set <- pair_blocking(syn_lf1$linkage_file, syn_lf2$linkage_file, "postcode") %>%
  compare_pairs(by = c("lastname", "firstname", "sex", "race"),
                default_comparator = jaro_winkler(0.8)) %>%
  score_problink(var = "weight") %>%
  select_n_to_m("weight", var = "ntom", threshold = 0) %>%
  link()
Out of 3000 individuals, only 2487 are matched using the method from reclin. This is because the blocking variable ‘postcode’ is itself unreliable: 10% of the postcodes have transposed characters.
Among the 2487 matched records, 2480 are true matches and 7 are mismatched:
# Count the mismatched (FALSE) and correctly matched (TRUE) links
table(linked_data_set$nhsid.x == linked_data_set$nhsid.y)
##
## FALSE TRUE
## 7 2469
# These are the mismatched records
head(linked_data_set[linked_data_set$nhsid.x != linked_data_set$nhsid.y,],7)
## age.x race.x sex.x nhsid.x postcode.x firstname.x
## 28 16.85226 <NA> Male 2650789344 CO11 2HX prince_|ack_of_record
## 251 NA <NA> Male 1457038692 IG5 0PQ 2tah
## 795 NA White <NA> 8502693476 EX5 3NE mrjlje
## 848 NA <NA> Male 6918257439 CT3 1BB findlay_lak_of_rceord
## 907 18.60372 White Male 6537890241 SW6 2UZ finn_lstck_of_record
## 1656 47.24609 White Male 6379140581 GU10 4LN tlyn
## 2131 29.21894 <NA> <NA> 2190384672 IG5 0PQ musatys
## lastname.x age.y race.y sex.y nhsid.y postcode.y
## 28 morley 29.58909 <NA> Male 7621054980 CO11 2HX
## 251 brattle NA <NA> Male 2190384672 IG5 0PQ
## 795 keating_lack_of_record 12.36731 White <NA> 3760598129 EX5 3NE
## 848 calderwood 52.64658 <NA> Male 3425098675 CT3 1BB
## 907 burrows_lack_of_record 29.50669 <NA> Male 8406925736 SW6 2UZ
## 1656 traynor_lack_of_record 44.82525 White Male 0257314989 GU10 4LN
## 2131 ashall_lack_of_record 47.91000 <NA> <NA> 1457038692 IG5 0PQ
## firstname.y lastname.y
## 28 atevan0 worsnop_lack_of_record
## 251 rnustafa_lakk_of_record ashall
## 795 maja_lac_k0f_rec0rd fielding_lack_of_record
## 848 yake_lack_of_rec:rd ogrady_lack_of_record
## 907 |sfie askwith_lack_of_record
## 1656 yaek shepherd_lack_of_record
## 2131 savahrv brattle_lack_of_record
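Because the synthetic nhsid is never damaged in this example, it can serve as the ground truth for summarising linkage quality. The base R sketch below computes precision and recall from a set of linked pairs. The toy pairs and the count of true matches are hypothetical; only the column names nhsid.x and nhsid.y mirror the output above.

```r
# Hypothetical linked pairs; nhsid.x / nhsid.y mimic the column names above.
linked <- data.frame(
  nhsid.x = c("A1", "A2", "A3", "A4", "A5"),
  nhsid.y = c("A1", "A2", "A3", "B9", "A5"),
  stringsAsFactors = FALSE
)
# Assumed number of individuals present in both files (all true matches).
n_true_matches <- 6

true_pos  <- sum(linked$nhsid.x == linked$nhsid.y)  # correctly linked pairs
false_pos <- sum(linked$nhsid.x != linked$nhsid.y)  # wrongly linked pairs

precision <- true_pos / (true_pos + false_pos)  # share of links that are correct
recall    <- true_pos / n_true_matches          # share of true matches recovered
c(precision = precision, recall = recall)
```

Precision penalises wrong links, while recall penalises true matches that were never linked, for example pairs lost because a damaged postcode placed them in different blocks.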