library(sdglinkage)
set.seed(1234)

In this vignette, we show how we can use sdglinkage to generate a realistic synthetic gold standard file and how to damage the gold standard file into multiple copies of linkage files that can be used for linkage research.

Usually, when a trusted third party releases a dataset to a research organisation, it removes sensitive identifiers to prevent the data from being linked back to real individuals. Meanwhile, for research purposes, some organisations publish statistics on the errors that occur in their datasets. This vignette is aimed at people in a research organisation that has access to non-sensitive predictor variables and to statistics on the errors that occurred in both the predictor variables and the sensitive identifiers, and who would like to share a synthetic gold standard file and linkage files of this dataset, with realistic identifiers, with a wider audience. People in a trusted third party that has access to the sensitive identifiers should see the vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers.

Assumptions:
- Real gold standard file (real_gsf): We have a gold standard file with non-sensitive predictor variables that we would like to synthesise.
- Error statistics: We have statistics on the errors that occurred in both the predictor variables and the sensitive identifiers.

Aims:
- To generate synthetic predictor variables.
- To add external identifier variables to the synthetic predictor variables, which gives us our synthetic gold standard file (syn_gsf).
- To damage the synthetic gold standard file according to the error statistics, which gives us the synthetic linkage files (syn_lf).
- To show how these linkage files can be used to evaluate linkage methods.

We produce two types of files: a gold standard file that gives us the true values of the variables of interest, and linkage files that mimic the original error patterns of real data. The following figure outlines the framework for generating these two types of files. In this example, we assume we have access to predictor variables such as sex, age and ethnicity but not to sensitive identifiers such as nhsid and names. We also assume that we know the errors that occurred in all variables. We simulate three types of variables: predictor variables learned together with the encoded error flags, external dependent identifiers, and independent identifiers. These synthetic variables are then merged into a gold standard file, which is further damaged by inferred synthetic errors to give the synthetic linkage files.

Figure: Generating the gold standard file and linkage files.

1 Generate Gold Standard File

A gold standard file consists of predictor variables and identifier variables. In this section we show how to generate synthetic versions of the predictor variables that we have access to, and how to append synthetic external dependent identifiers and independent identifiers that we do not have access to.

1.1 Training Data with One-Hot Encoded Error Flags

If we know where and what type of error occurred in the dataset, we can encode the position and type of each error using one-hot encoding. For example, we flag Edward’s name variant error as ‘1’ if his name was recorded as ‘Eddy’. This gives us a training dataset with one-hot encoded error flags. Bear in mind that we should build error flags for all variables of interest, including independent identifiers: even though the values of independent identifiers are independent of the values of other variables, the occurrence of errors in them can depend on the values of, or the occurrence of errors in, other variables. For example, a person from a minority group may be more likely to have a missing value in the nhsid variable than a person born in the UK. This results in a training dataset that holds many error flags together with the values of the dependent variables.
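As a minimal sketch of this encoding, assume we hold both the true values and the values as recorded. The two toy data frames below are made up for illustration, and the comparison is deliberately simple: it only detects that a value changed or went missing, not which kind of error caused the change.

```r
# Hypothetical toy data: the true values and the values actually recorded.
clean    <- data.frame(firstname = c("edward", "anna", "thomas"),
                       age = c(34, 51, 27),
                       stringsAsFactors = FALSE)
recorded <- data.frame(firstname = c("eddy", "anna", "thomas"),
                       age = c(34, NA, 27),
                       stringsAsFactors = FALSE)

# Flag a name as changed when the recorded value differs from the true value.
firstname_variant_flag <- as.integer(recorded$firstname != clean$firstname)

# Flag age as missing when the recorded value is NA.
age_missing_flag <- as.integer(is.na(recorded$age))

# Training data: true values side by side with their 0/1 error flags.
training <- cbind(clean, firstname_variant_flag, age_missing_flag)
print(training)
```

Each flag column sits next to the true values, so a model trained on this table can learn how the occurrence of an error relates to the other variables.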

In the vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers we show how to extract these errors when we have access to the corrupted files. In this subsection, we show how to encode the error flags when we do not have access to the errors themselves but do have the error statistics.

We use the ‘age’, ‘race’ and ‘sex’ variables from the ‘Adult’ dataset as an example of real predictor variables. We also assume that we know the error statistics of our target dataset, e.g. 30% of the age values and 50% of the race values are missing.

# Take three predictor variables from the first 3000 rows of the 'Adult' dataset.
real_gsf <- adult[c('age', 'race', 'sex')][1:3000,]
# Each call appends a '<name>_flag' column; prob is c(P(no error), P(error)).
real_gsf_with_flags <- add_random_error(real_gsf, prob = c(0.70, 0.30), "age_missing")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "race_missing")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.65, 0.35), "sex_missing")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.90, 0.10), "postcode_trans_char")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_variant")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "lastname_variant")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_typo")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_pho")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_ocr")
real_gsf_with_flags <- add_random_error(real_gsf_with_flags, prob = c(0.50, 0.50), "firstname_trans_char")
head(real_gsf_with_flags)

##   age  race    sex age_missing_flag race_missing_flag sex_missing_flag
## 1  39 White   Male                0                 0                0
## 2  50 White   Male                0                 1                0
## 3  38 White   Male                0                 1                1
## 4  53 Black   Male                0                 1                0
## 5  28 Black Female                1                 1                0
## 6  37 White Female                0                 1                0
##   postcode_trans_char_flag firstname_variant_flag lastname_variant_flag
## 1                        0                      0                     1
## 2                        0                      0                     1
## 3                        0                      1                     1
## 4                        0                      0                     1
## 5                        1                      1                     1
## 6                        1                      0                     1
##   firstname_typo_flag firstname_pho_flag firstname_ocr_flag
## 1                   0                  0                  0
## 2                   1                  0                  0
## 3                   0                  1                  1
## 4                   0                  0                  1
## 5                   1                  1                  1
## 6                   0                  1                  0
##   firstname_trans_char_flag
## 1                         1
## 2                         1
## 3                         0
## 4                         1
## 5                         1
## 6                         0
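To make the role of prob concrete, here is a base R sketch of drawing such a flag. It assumes that each flag is drawn independently for every row with the stated probabilities; add_flag_sketch is a hypothetical helper written for this illustration and is not the implementation of add_random_error.

```r
set.seed(1)

# Hypothetical helper (not part of sdglinkage): append a 0/1 flag column whose
# values are drawn independently with P(flag = 0) = prob[1], P(flag = 1) = prob[2].
add_flag_sketch <- function(data, prob, name) {
  data[[paste0(name, "_flag")]] <- sample(c(0L, 1L), nrow(data),
                                          replace = TRUE, prob = prob)
  data
}

# Made-up ages for 3000 toy records.
toy <- data.frame(age = sample(18:90, 3000, replace = TRUE))
toy <- add_flag_sketch(toy, prob = c(0.70, 0.30), "age_missing")

# The empirical error rate should be close to the requested 30%.
mean(toy$age_missing_flag)
```

Flags drawn this way are independent of the other variables. Dependencies between errors and values only enter when the flags come from real error records.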

1.2 Generate Synthetic Predictor Variables

We use Bayesian networks (BNs) to learn the dependencies and parameters of the training dataset and then sample data from the trained model. The generated data preserve not only the relationships and statistics between the variables but also the occurrence of errors, which is useful later when we infer the linkage files.

bn_learn <- gen_bn_learn(real_gsf_with_flags, "hc")
syn_dependent <- bn_learn$gen_data[, !grepl("flag", colnames(bn_learn$gen_data))]
head(syn_dependent)

##        age               race    sex
## 1 45.44689 Amer-Indian-Eskimo   Male
## 2 18.04504              White   Male
## 3 33.22080              White   Male
## 4 32.03759              White   Male
## 5 44.81910              White Female
## 6 30.49562              White Female
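The idea of learning parameters and then sampling can be sketched in base R for the smallest possible network, race -> race_missing_flag. This is only an illustration under strong assumptions: the structure is fixed by hand rather than learned, the training data are simulated, and none of this is the code behind gen_bn_learn.

```r
set.seed(1)

# Hypothetical training data in which the missing flag depends on race.
n <- 5000
train <- data.frame(
  race = sample(c("White", "Black"), n, replace = TRUE, prob = c(0.8, 0.2)),
  stringsAsFactors = FALSE
)
p_missing <- ifelse(train$race == "White", 0.1, 0.4)
train$race_missing_flag <- rbinom(n, 1, p_missing)

# "Fit" the two-node network: a marginal table for the parent (race) and a
# conditional probability table for the child (flag given race).
p_race <- prop.table(table(train$race))
p_flag_given_race <- prop.table(table(train$race, train$race_missing_flag),
                                margin = 1)

# Ancestral sampling: draw the parent first, then the child given the parent.
syn_race <- sample(names(p_race), n, replace = TRUE, prob = as.numeric(p_race))
syn_flag <- rbinom(n, 1, p_flag_given_race[syn_race, "1"])

# The synthetic data should reproduce the dependency between race and the flag.
rates <- tapply(syn_flag, syn_race, mean)
rates
```

Because the flag is sampled conditionally on race, the synthetic records carry the same error pattern as the training data without copying any training row.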

1.3 Add Synthetic Independent Identifiers Following Rules

Here we randomly assign an nhsid and an address to each individual. The nhsid is generated using the Modulus 11 algorithm, and the address is sampled from a real UK address database.

syn_gsf <- add_variable(syn_dependent, "nhsid")
syn_gsf <- add_variable(syn_gsf, "address")
# Drop the address fields we do not need, keeping only postcode.
syn_gsf$country <- NULL
syn_gsf$primary_care_trust <- NULL
syn_gsf$longitude <- NULL
syn_gsf$latitude <- NULL
head(syn_gsf)

##        age               race    sex      nhsid postcode
## 1 45.44689 Amer-Indian-Eskimo   Male 6531420790  M22 5BF
## 2 18.04504              White   Male 6749102387  NN7 4LT
## 3 33.22080              White   Male 7543092867 SS13 1PZ
## 4 32.03759              White   Male 3617948251 AB32 7AT
## 5 44.81910              White Female 4102587969 EX22 7BG
## 6 30.49562              White Female 7653401988 CF14 5FT
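For readers unfamiliar with Modulus 11, the following base R sketch validates the check digit of a 10-digit NHS number. The function names nhs_check_digit and is_valid_nhs are our own for this illustration and are not exported by sdglinkage.

```r
# Compute the Modulus 11 check digit for the first nine digits of an NHS number.
# Returns NA when the check value is 10, which marks the number as invalid.
nhs_check_digit <- function(first9) {
  digits <- as.integer(strsplit(first9, "")[[1]])
  stopifnot(length(digits) == 9, !anyNA(digits))
  # Weight the digits by 10, 9, ..., 2 and take the remainder modulo 11.
  remainder <- sum(digits * 10:2) %% 11
  check <- 11 - remainder
  if (check == 11) return(0L)
  if (check == 10) return(NA_integer_)
  as.integer(check)
}

# A number is valid when it has ten digits and the tenth matches the check digit.
is_valid_nhs <- function(nhsid) {
  if (!grepl("^[0-9]{10}$", nhsid)) return(FALSE)
  check <- nhs_check_digit(substr(nhsid, 1, 9))
  !is.na(check) && check == as.integer(substr(nhsid, 10, 10))
}

is_valid_nhs("6531420790")  # first nhsid in the output above
is_valid_nhs("6531420791")  # same number with a different last digit
```

A generator can use the same function in the other direction: draw nine random digits, compute the check digit, and redraw whenever the result is NA.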

1.4 Add External Dependent Identifiers From Similar Real Datasets

Firstname and lastname are two sensitive identifiers that are often removed when a dataset is released to another organisation. However, several organisations have published databases of names for different populations. We make use of these resources to build a UK firstname database that depends on gender and age, a UK lastname database, a US firstname database that depends on gender and race, and a US lastname database that depends on race.

Here we randomly assign a firstname and a lastname to each individual based on the values of gender and age and on the frequency of the names. Firstname and lastname are sampled from a real UK database of baby names covering 1996 to 2018. Together with the synthetic predictors and independent identifiers, this gives us the synthetic gold standard file.

syn_gsf <- add_variable(syn_gsf, "firstname", country = "uk", gender_dependency = TRUE, age_dependency = TRUE)
syn_gsf <- add_variable(syn_gsf, "lastname", country = "uk")
head(syn_gsf)

##        age               race    sex      nhsid postcode firstname
## 1 45.44689 Amer-Indian-Eskimo   Male 6531420790  M22 5BF      tony
## 2 18.04504              White   Male 6749102387  NN7 4LT    oliver
## 3 33.22080              White   Male 7543092867 SS13 1PZ    steven
## 4 32.03759              White   Male 3617948251 AB32 7AT      leon
## 5 44.81910              White Female 4102587969 EX22 7BG     chloe
## 6 30.49562              White Female 7653401988 CF14 5FT     chloe
##    lastname
## 1  barchard
## 2      ward
## 3 kirkbride
## 4    murden
## 5    munroe
## 6     terry
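Frequency-weighted sampling that depends on another variable can be sketched as follows. The lookup table, its counts and the helper sample_firstname are all made up for this illustration; the real databases used by add_variable are much larger and, for UK first names, also depend on age.

```r
set.seed(1)

# Hypothetical frequency table; the names and counts are invented.
name_freq <- data.frame(
  firstname = c("oliver", "jack", "chloe", "emily"),
  sex       = c("Male", "Male", "Female", "Female"),
  count     = c(600, 400, 700, 300),
  stringsAsFactors = FALSE
)

# Draw one first name per person, restricted to names of the same sex and
# weighted by how often each name occurs.
sample_firstname <- function(sex, freq) {
  vapply(sex, function(s) {
    candidates <- freq[freq$sex == s, ]
    candidates$firstname[sample.int(nrow(candidates), 1, prob = candidates$count)]
  }, character(1), USE.NAMES = FALSE)
}

people <- data.frame(sex = c("Male", "Female", "Female", "Male"),
                     stringsAsFactors = FALSE)
people$firstname <- sample_firstname(people$sex, name_freq)
people
```

Common names are drawn more often than rare ones, which is why the same name can appear several times in a synthetic file, as ‘chloe’ does in the output above.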

2 Generate Linkage Files

The linkage files are copies of the gold standard file that were damaged by several damage actions. In this section, we show how to generate two linkage files that can be used for linkage activity.

2.1 Infer Multiple Synthetic Error Occurrence Files

The error occurrence files are inferred from the previously trained model, based on the record of each individual. They tell us which damage actions to apply to which records.

syn_error_occurrence1 <- bn_flag_inference(bn_learn$gen_data, bn_learn$fit_model)
head(syn_error_occurrence1)

##   age_missing_flag race_missing_flag sex_missing_flag
## 1                0                 0                0
## 2                0                 0                0
## 3                1                 0                1
## 4                0                 1                0
## 5                0                 0                0
## 6                0                 1                1
##   postcode_trans_char_flag firstname_variant_flag lastname_variant_flag
## 1                        0                      0                     0
## 2                        0                      1                     0
## 3                        0                      1                     1
## 4                        0                      1                     1
## 5                        0                      0                     1
## 6                        0                      0                     1
##   firstname_typo_flag firstname_pho_flag firstname_ocr_flag
## 1                   0                  0                  1
## 2                   0                  1                  1
## 3                   1                  1                  1
## 4                   1                  1                  1
## 5                   1                  0                  0
## 6                   0                  0                  1
##   firstname_trans_char_flag
## 1                         1
## 2                         1
## 3                         0
## 4                         0
## 5                         0
## 6                         0

syn_error_occurrence2 <- bn_flag_inference(bn_learn$gen_data, bn_learn$fit_model)
head(syn_error_occurrence2)

##   age_missing_flag race_missing_flag sex_missing_flag
## 1                0                 1                1
## 2                0                 0                0
## 3                0                 1                0
## 4                1                 0                0
## 5                1                 0                1
## 6                1                 0                0
##   postcode_trans_char_flag firstname_variant_flag lastname_variant_flag
## 1                        1                      0                     1
## 2                        0                      1                     0
## 3                        0                      1                     1
## 4                        0                      0                     0
## 5                        0                      0                     1
## 6                        0                      1                     0
##   firstname_typo_flag firstname_pho_flag firstname_ocr_flag
## 1                   0                  0                  0
## 2                   0                  1                  0
## 3                   1                  0                  1
## 4                   1                  0                  0
## 5                   1                  1                  0
## 6                   1                  1                  0
##   firstname_trans_char_flag
## 1                         0
## 2                         0
## 3                         0
## 4                         1
## 5                         1
## 6                         1

2.2 Damage Gold Standard File According to the Error Occurrence

Here we damage the gold standard file based on the inferred occurrence of the errors.

syn_lf1 <- damage_gold_standard(syn_gsf, syn_error_occurrence1)

## encoding error to:  age_missing_flag

## encoding error to:  race_missing_flag

## encoding error to:  sex_missing_flag

## encoding error to:  postcode_trans_char_flag

## encoding error to:  firstname_variant_flag

## encoding error to:  lastname_variant_flag

## encoding error to:  firstname_typo_flag

## encoding error to:  firstname_pho_flag

## encoding error to:  firstname_ocr_flag

## encoding error to:  firstname_trans_char_flag

head(syn_lf1$linkage_file)

##        age               race    sex      nhsid postcode firstname
## 1 45.44689 Amer-Indian-Eskimo   Male 6531420790  M22 5BF      t0yn
## 2 18.04504              White   Male 6749102387  NN7 4LT      lo'l
## 3       NA              White   <NA> 7543092867 SS13 1PZ  5trephan
## 4 32.03759               <NA>   Male 3617948251 AB32 7AT   IeInard
## 5 44.81910              White Female 4102587969 EX22 7BG     chlo0
## 6 30.49562               <NA>   <NA> 7653401988 CF14 5FT     cbloe
##                   lastname
## 1                 barchard
## 2                     ward
## 3 kirkbride_lack_of_record
## 4    murden_lack_of_record
## 5    munroe_lack_of_record
## 6     terry_lack_of_record

head(syn_lf1$error_log)

##   age_missing_flag race_missing_flag sex_missing_flag
## 1                0                 0                0
## 2                0                 0                0
## 3             <NA>                 0             <NA>
## 4                0              <NA>                0
## 5                0                 0                0
## 6                0              <NA>             <NA>
##   postcode_trans_char_flag firstname_variant_flag
## 1                        0                      0
## 2                        0            oliver>olie
## 3                        0         steven>stephan
## 4                        0          leon>leonhard
## 5                        0                      0
## 6                        0                      0
##                lastname_variant_flag firstname_typo_flag
## 1                                  0                   0
## 2                                  0                   0
## 3 kirkbride>kirkbride_lack_of_record               t<r<2
## 4       murden>murden_lack_of_record               o<l<3
## 5       munroe>munroe_lack_of_record               e<0<5
## 6         terry>terry_lack_of_record                   0
##   firstname_pho_flag firstname_ocr_flag firstname_trans_char_flag
## 1                  0            o>0>all               ny>trans>34
## 2           ie>i>all           i>'l>all               ol>trans>12
## 3           s>st>all            s>5>all                         0
## 4            h>@>all            l>I>all                         0
## 5                  0                  0                         0
## 6                  0            h>b>all                         0

syn_lf2 <- damage_gold_standard(syn_gsf, syn_error_occurrence2)

## encoding error to:  age_missing_flag

## encoding error to:  race_missing_flag

## encoding error to:  sex_missing_flag

## encoding error to:  postcode_trans_char_flag

## encoding error to:  firstname_variant_flag

## encoding error to:  lastname_variant_flag

## encoding error to:  firstname_typo_flag

## encoding error to:  firstname_pho_flag

## encoding error to:  firstname_ocr_flag

## encoding error to:  firstname_trans_char_flag

head(syn_lf2$linkage_file)

##        age  race    sex      nhsid postcode            firstname
## 1 45.44689  <NA>   <NA> 6531420790  M225 BF                 tony
## 2 18.04504 White   Male 6749102387  NN7 4LT                 olli
## 3 33.22080  <NA>   Male 7543092867 SS13 1PZ              sgepben
## 4       NA White   Male 3617948251 AB32 7AT                 lemo
## 5       NA White   <NA> 4102587969 EX22 7BG                nkloe
## 6       NA White Female 7653401988 CF14 5FT chloe_lack_of_eecotr
##                   lastname
## 1  barchard_lack_of_record
## 2                     ward
## 3 kirkbride_lack_of_record
## 4                   murden
## 5    munroe_lack_of_record
## 6                    terry
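The damage step can be pictured as applying one small transformation per flag. The sketch below covers two of the error types, missing values and transposed characters, on made-up rows. transpose_chars is a hypothetical helper, and the swap position is fixed by hand here, whereas the error log above suggests that the package records a position for each damaged value.

```r
# Hypothetical helper: swap the characters at positions pos and pos + 1.
transpose_chars <- function(x, pos) {
  chars <- strsplit(x, "")[[1]]
  if (pos < 1 || pos >= length(chars)) return(x)
  chars[c(pos, pos + 1)] <- chars[c(pos + 1, pos)]
  paste(chars, collapse = "")
}

# Toy gold standard rows and the flags that say which errors to apply.
gold <- data.frame(age = c(45, 18, 33),
                   postcode = c("M22 5BF", "NN7 4LT", "SS13 1PZ"),
                   stringsAsFactors = FALSE)
flags <- data.frame(age_missing_flag = c(0, 0, 1),
                    postcode_trans_char_flag = c(1, 0, 0))

damaged <- gold
# A missing-value flag blanks the field.
damaged$age[flags$age_missing_flag == 1] <- NA
# A transposition flag swaps two adjacent characters (position 4 in this toy case).
hit <- flags$postcode_trans_char_flag == 1
damaged$postcode[hit] <- vapply(damaged$postcode[hit], transpose_chars,
                                character(1), pos = 4, USE.NAMES = FALSE)
damaged
```

With pos = 4 the first postcode becomes ‘M225 BF’, the same kind of damage that appears in the first row of syn_lf2 above.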

3 Use the Synthetic Linkage Files To Evaluate the Performance of Linkage Methods

Here we give an example of how the generated linkage files can be used for linkage evaluation.

library(reclin)
library(dplyr)
# 'postcode' is used as the blocking variable. 
linked_data_set <- pair_blocking(syn_lf1$linkage_file, syn_lf2$linkage_file, "postcode") %>%
  compare_pairs(by = c("lastname", "firstname", "sex", "race"),
                default_comparator = jaro_winkler(0.8)) %>%
  score_problink(var = "weight") %>%
  select_n_to_m("weight", var = "ntom", threshold = 0) %>%
  link()
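Because ‘postcode’ is the blocking variable, only records that agree exactly on postcode are ever compared. The base R sketch below, with made-up records and merge() standing in for the blocking step, shows the consequence: a record whose postcode was damaged in one file never becomes a candidate pair.

```r
# Toy linkage files: person 2's postcode was damaged in the second file.
lf1 <- data.frame(id = 1:3, postcode = c("M22 5BF", "NN7 4LT", "SS13 1PZ"),
                  stringsAsFactors = FALSE)
lf2 <- data.frame(id = 1:3, postcode = c("M22 5BF", "NN74 LT", "SS13 1PZ"),
                  stringsAsFactors = FALSE)

# Blocking: only records that agree exactly on postcode become candidate pairs.
candidates <- merge(lf1, lf2, by = "postcode", suffixes = c(".x", ".y"))

# Person 2 never appears as a candidate pair, so no comparator can recover the link.
setdiff(lf1$id, candidates$id.x)
```

Blocking keeps the number of comparisons manageable, at the cost of losing every true match whose blocking value disagrees between the two files.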

Out of 3000 individuals, only 2476 are linked using the method from reclin. This is because the blocking variable ‘postcode’ is itself unreliable, as 10% of the postcodes have transposed characters.

Among the 2476 linked records, 2469 are true matches and 7 are mismatches:

# Count the mismatches (FALSE) and true matches (TRUE) among the linked records
table(linked_data_set$nhsid.x == linked_data_set$nhsid.y)

## 
## FALSE  TRUE 
##     7  2469

# These are the mismatched records
head(linked_data_set[linked_data_set$nhsid.x != linked_data_set$nhsid.y,],7)

##         age.x race.x sex.x    nhsid.x postcode.x           firstname.x
## 28   16.85226   <NA>  Male 2650789344   CO11 2HX prince_|ack_of_record
## 251        NA   <NA>  Male 1457038692    IG5 0PQ                  2tah
## 795        NA  White  <NA> 8502693476    EX5 3NE                mrjlje
## 848        NA   <NA>  Male 6918257439    CT3 1BB findlay_lak_of_rceord
## 907  18.60372  White  Male 6537890241    SW6 2UZ  finn_lstck_of_record
## 1656 47.24609  White  Male 6379140581   GU10 4LN                  tlyn
## 2131 29.21894   <NA>  <NA> 2190384672    IG5 0PQ               musatys
##                  lastname.x    age.y race.y sex.y    nhsid.y postcode.y
## 28                   morley 29.58909   <NA>  Male 7621054980   CO11 2HX
## 251                 brattle       NA   <NA>  Male 2190384672    IG5 0PQ
## 795  keating_lack_of_record 12.36731  White  <NA> 3760598129    EX5 3NE
## 848              calderwood 52.64658   <NA>  Male 3425098675    CT3 1BB
## 907  burrows_lack_of_record 29.50669   <NA>  Male 8406925736    SW6 2UZ
## 1656 traynor_lack_of_record 44.82525  White  Male 0257314989   GU10 4LN
## 2131  ashall_lack_of_record 47.91000   <NA>  <NA> 1457038692    IG5 0PQ
##                  firstname.y              lastname.y
## 28                   atevan0  worsnop_lack_of_record
## 251  rnustafa_lakk_of_record                  ashall
## 795      maja_lac_k0f_rec0rd fielding_lack_of_record
## 848      yake_lack_of_rec:rd   ogrady_lack_of_record
## 907                    |sfie  askwith_lack_of_record
## 1656                    yaek shepherd_lack_of_record
## 2131                 savahrv  brattle_lack_of_record
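Counts like these are usually summarised as precision and recall. The helper linkage_metrics below is a hypothetical function written for this illustration. It uses the counts printed in the table above and assumes that each of the 3000 individuals appears exactly once in both linkage files, so there are 3000 true pairs to find.

```r
# Precision and recall of a linkage, computed from counts.
linkage_metrics <- function(true_links, false_links, total_true_pairs) {
  c(precision = true_links / (true_links + false_links),
    recall    = true_links / total_true_pairs)
}

# Counts taken from the table above; 3000 is the number of individuals.
metrics <- linkage_metrics(true_links = 2469, false_links = 7,
                           total_true_pairs = 3000)
round(metrics, 3)
```

Precision tells us how many of the links we made are correct, while recall tells us how many of the true pairs we found. Here the links that are made are almost always right, but a noticeable share of true pairs is never linked.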

Generation of Gold Standard File and Linkage Files

Haoyuan Zhang

2/20/2020