library(sdglinkage)
set.seed(1234)

In this vignette, we show how to use sdglinkage to generate a synthetic version of sensitive identifiers for the development of linkage methods. This is particularly useful for a trusted third party that has access to sensitive identifiers such as names and ID numbers and would like to share a synthetic yet realistic version of those identifiers with a wider audience (e.g. the ALSPAC dataset). If you do not have access to sensitive identifiers, please see the vignette Generation_of_Gold_Standard_File_and_Linkage_Files.

  • Assumptions:
    • Real gold standard file (real_gsf): We have a reference file of identifiers that best estimates the current ‘true’ value of the identifiers as recorded in our administrative database.
    • Real linkage files (real_lf): We have a follow-up file of identifiers that has been populated at a different period and collected by different operators.
  • Aim:
    • To detect errors in the real dataset and classify the errors associated with each individual into specific error types.
    • To replace sensitive variables with equivalent variables from another database.
    • To create several versions of synthetic identifiers, including a synthetic gold standard file (syn_gsf) and synthetic linkage files (syn_lfs) that mimic the changes and errors in real_lf while reproducing the variables, and the dependencies across variables, in real_gsf.

1 ‘Real’ Dataset: real_gsf and real_lf

For confidentiality reasons, we are unable to release our experimental datasets. Instead, for demonstration purposes, we create two identifier datasets and treat them as our ‘real’ datasets.

  • real_gsf: The gold standard file consists of the identifiers sex, nhsid, race, dob, a uk firstname that depends on the person’s gender and age, and a uk lastname.
  • real_lf: The linkage file is derived from real_gsf with the following errors: race is missing for 10% of records, the day and month of dob are transposed for 45%, the firstname is entered as a variant for 35%, and the lastname has a typo for 25%.

1.1 real_gsf

This is what the real_gsf looks like:

real_gsf <- data.frame(sex=sample(c('male', 'female'), 100, replace = TRUE))
real_gsf <- add_variable(real_gsf, "nhsid")
real_gsf <- real_gsf[,c(2, 1)]
real_gsf$race <- sample(1:6, 100, replace = TRUE)
real_gsf <- add_variable(real_gsf, "dob", age_dependency = FALSE)
real_gsf <- add_variable(real_gsf, "firstname", country = "uk", gender_dependency= TRUE, age_dependency = TRUE)
real_gsf <- add_variable(real_gsf, "lastname", country = "uk")
head(real_gsf)
##        nhsid    sex race        dob firstname lastname
## 1 0521968372   male    1 1936-11-05 alexander    kiely
## 2 0698713257 female    1 1977-05-09    nikita   veasey
## 3 1756849021 female    1 1945-11-01     lydia    fibbs
## 4 5426301786 female    2 1971-03-01     chloe   platts
## 5 9586341704 female    4 1995-05-15    louise hayhurst
## 6 6948320511 female    1 1936-01-15     chloe   conroy

1.2 real_lf

This is what the real_lf looks like. Some errors are visible here: in row 3 the race is missing and the dob was transposed from ‘1945-11-01’ to ‘1945-01-11’, and in row 4 the firstname ‘chloe’ was changed to ‘chloe_lack_of_record’.

error_occurrence_flags <- data.frame(tmp=character(100))
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.90, 0.10), "race_missing")
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.55, 0.45), "dob_trans_date")
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.65, 0.35), "firstname_variant")
error_occurrence_flags <- add_random_error(error_occurrence_flags, prob = c(0.75, 0.25), "lastname_typo")
error_occurrence_flags$tmp <-NULL
real_lf <- damage_gold_standard(real_gsf, error_occurrence_flags)$linkage_file
## encoding error to:  race_missing_flag
## encoding error to:  dob_trans_date_flag
## encoding error to:  firstname_variant_flag
## encoding error to:  lastname_typo_flag
head(real_lf)
##        nhsid    sex race        dob            firstname lastname
## 1 0521968372   male    1 1936-11-05            alexander    kieky
## 2 0698713257 female    1 1977-05-09               nikita   veasey
## 3 1756849021 female   NA 1945-01-11                lydia    fibbs
## 4 5426301786 female    2 1971-03-01 chloe_lack_of_record   platts
## 5 9586341704 female    4 1995-05-15               louise hayhurst
## 6 6948320511 female    1 1936-15-01 chloe_lack_of_record   conroy
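The transposed-date error seen in rows 3 and 6 can be pictured with a small base R sketch. The helper name `transpose_day_month` is hypothetical and is not an sdglinkage function; it only illustrates swapping the last two fields of a `YYYY-MM-DD` string and applying the swap to records whose flag was drawn with a 45% probability.

```r
# Hypothetical helper (not part of sdglinkage): swap the last two
# fields of a "YYYY-MM-DD" string.
transpose_day_month <- function(dob) {
  sub("^([0-9]{4})-([0-9]{2})-([0-9]{2})$", "\\1-\\3-\\2", dob)
}

set.seed(1)
dob <- c("1945-11-01", "1936-01-15", "1977-05-09")

# Draw a 0/1 flag per record: 55% chance of 0, 45% chance of 1
flag <- sample(c(0, 1), length(dob), replace = TRUE, prob = c(0.55, 0.45))

# Only flagged records are damaged; the others are left untouched
damaged <- ifelse(flag == 1, transpose_day_month(dob), dob)

transpose_day_month("1945-11-01")  # "1945-01-11"
```

Note that the result is no longer a valid calendar date whenever the original day is greater than 12, which is why the linkage file stores dob as text.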

2 Detect and Classify Errors of Real Dataset

In the real world, we often do not know where errors were recorded in a dataset. For a poorly maintained dataset, we have to manually compare its identifiers with those from the reference dataset. This clerical work is usually tedious and error-prone.

In this section, we show how to use sdglinkage to detect the inconsistency between real_lf and real_gsf and to classify the errors into different error categories.

2.1 Compare real_lf with real_gsf

Here we use nhsid as the unique identifier to link real_gsf and real_lf, and compare the variables race, dob, firstname and lastname.

vars = list(c('race', 'race'), c('dob', 'dob'), c('firstname', 'firstname'), c('lastname', 'lastname'))
diffs.table = compare_two_df(real_gsf, real_lf, vars, 'nhsid')
diffs.table
##        var.x     var.y      nhsid   values.x     values.y row.x row.y
## 1       race      race 0832751693          3           NA    51    51
## 2       race      race  174039562          6           NA    41    41
## 3       race      race 1756849021          1           NA     3     3
## 4       race      race 1850293465          2           NA    71    71
## 5       race      race 5120983464          3           NA    36    36
## 6       race      race 7143680591          4           NA    37    37
## 7        dob       dob 0923168575 1986-12-17   1986-17-12    59    59
## 8        dob       dob 1756849021 1945-11-01   1945-01-11     3     3
## 9        dob       dob 1803529768 1929-06-24   1929-24-06    62    62
## 10       dob       dob  204189735 1994-01-17   1994-17-01    17    17
## 11       dob       dob 2053841966 1960-12-15   1960-15-12     8     8
## 12       dob       dob 2057136485 2005-05-08   2005-08-05    46    46
## 13       dob       dob  245071698 1960-04-21   1960-21-04    57    57
## 14       dob       dob 2680473515 1957-07-20   1957-20-07    68    68
## 15       dob       dob 2846701393 1906-11-06   1906-06-11    56    56
## 16       dob       dob 3506742817 1932-09-14   1932-14-09    10    10
## 17       dob       dob 3709241863 1940-02-18   1940-18-02    58    58
## 18       dob       dob 4267518300 2002-03-12   2002-12-03    73    73
## 19       dob       dob 4361879206 1914-04-19   1914-19-04    50    50
## 20       dob       dob 4706352185 1923-03-20   1923-20-03    42    42
## 21       dob       dob 4836019257 1908-08-01   1908-01-08    81    81
## 22       dob       dob 4960238178 1931-08-24   1931-24-08    31    31
## 23       dob       dob 5283174964 2008-05-27   2008-27-05    70    70
## 24       dob       dob 5346879129 1962-09-19   1962-19-09    85    85
## 25       dob       dob 5401873697 1936-06-03   1936-03-06    97    97
## 26       dob       dob 5971084267 1909-05-10   1909-10-05    79    79
## 27       dob       dob 6047953816 1908-08-12   1908-12-08    39    39
## 28       dob       dob 6349518020 1976-03-31   1976-31-03    53    53
## 29       dob       dob 6948320511 1936-01-15   1936-15-01     6     6
## 30       dob       dob 7431625898 1963-05-16   1963-16-05    83    83
## 31       dob       dob 7493562083 1966-04-28   1966-28-04    33    33
## 32       dob       dob 7895412604 1960-10-07   1960-07-10    34    34
## 33       dob       dob 8349720514 1927-03-10   1927-10-03    14    14
## 34       dob       dob  853617420 1989-01-12   1989-12-01    20    20
## 35       dob       dob 8690413251 2014-06-08   2014-08-06    40    40
## 36       dob       dob  910268537 2016-06-04   2016-04-06    32    32
## 37       dob       dob 9187263041 2014-11-26   2014-26-11    61    61
## 38       dob       dob 9271345800 1979-09-16   1979-16-09    77    77
## 39       dob       dob 9273156088 1915-01-29   1915-29-01    27    27
## 40       dob       dob 9360481521 1930-01-21   1930-21-01    93    93
## 41       dob       dob 9467821052 1951-05-17   1951-17-05    21    21
## 42       dob       dob 9821670350 1910-06-01   1910-01-06   100   100
## 43 firstname firstname 0571948324    charlie      charlee    90    90
## 44 firstname firstname 0739814257     elliot elliot_l....    87    87
## 45 firstname firstname  174039562    siobhan siobhan_....    41    41
## 46 firstname firstname  204189735    zachary    zachariah    17    17
## 47 firstname firstname 2548970612        ben        benji    64    64
## 48 firstname firstname 2846701393      chloe chloe_la....    56    56
## 49 firstname firstname 2846951306       jake jake_lac....     7     7
## 50 firstname firstname 2874560138     steven       stevan    48    48
## 51 firstname firstname 2961547032   isabella     isobelle    44    44
## 52 firstname firstname 4102376585  madeleine        magda    60    60
## 53 firstname firstname 4361879206    heather heather_....    50    50
## 54 firstname firstname 4836019257    anouska anouska_....    81    81
## 55 firstname firstname 5014268390       alex       alexys    67    67
## 56 firstname firstname 5283174964        dia dia_lack....    70    70
## 57 firstname firstname 5346879129  alexander         lexi    85    85
## 58 firstname firstname 5401873697       amin amin_lac....    97    97
## 59 firstname firstname 5426301786      chloe chloe_la....     4     4
## 60 firstname firstname 5971084267  alexander   aleksandar    79    79
## 61 firstname firstname 6321408751       leon       lennie    95    95
## 62 firstname firstname 6327854915    bethany         beth    80    80
## 63 firstname firstname 6584017397  alexander   aleksander    49    49
## 64 firstname firstname 6948320511      chloe chloe_la....     6     6
## 65 firstname firstname 7109324656  alexander    aleksandr    78    78
## 66 firstname firstname 7159842367        zac         zach    89    89
## 67 firstname firstname 7462018594       jake jake_lac....    13    13
## 68 firstname firstname 7493562083        tom       thomas    33    33
## 69 firstname firstname 8402915736      chloe chloe_la....    92    92
## 70 firstname firstname 9187263041      megan       meghan    61    61
## 71 firstname firstname 9271345800       jake jake_lac....    77    77
## 72 firstname firstname 9340862716      blake       blayke    65    65
## 73 firstname firstname 9360481521        ian         aian    93    93
## 74 firstname firstname 9562843017    bethany         beth    66    66
## 75 firstname firstname  985743021     lauren       laurel    29    29
## 76  lastname  lastname 0382179544     gilson       giloon    72    72
## 77  lastname  lastname 0521968372      kiely        kieky     1     1
## 78  lastname  lastname 0739814257     millar       millvr    87    87
## 79  lastname  lastname 2846951306    thacker      thafker     7     7
## 80  lastname  lastname 2961547032     mcbain       mcbaun    44    44
## 81  lastname  lastname 4102376585     gorman       gprman    60    60
## 82  lastname  lastname 4267518300   mitchell     mihchell    73    73
## 83  lastname  lastname 5120983464      woods        eoods    36    36
## 84  lastname  lastname 5283174964    tinkler      tinkoer    70    70
## 85  lastname  lastname 5904371284   harrison     harrispn    74    74
## 86  lastname  lastname 5971084267   walmsley     walms4ey    79    79
## 87  lastname  lastname 6210975836       howe         yowe    28    28
## 88  lastname  lastname 6327854915    dempsey      dempsuy    80    80
## 89  lastname  lastname 6804731255       ross         rosz    91    91
## 90  lastname  lastname 6815243974      moore        moofe    43    43
## 91  lastname  lastname 7042953819    willett      willeqt    30    30
## 92  lastname  lastname 7325960840       bell         bekl     9     9
## 93  lastname  lastname 8402915736        cox          cos    92    92
## 94  lastname  lastname  853617420      hobbs        hob9s    20    20
## 95  lastname  lastname 9340862716 harrington   hajrington    65    65
## 96  lastname  lastname 9467821052       goff         hoff    21    21
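To make the idea of a key-based comparison concrete, here is a minimal base R sketch on made-up toy data. The helper `diff_by_key` is hypothetical and is not how `compare_two_df` is necessarily implemented; it merges two data frames on a key and reports, per variable, the cells that differ, treating a value that is missing on one side only as a difference.

```r
# Toy data, made up for illustration only
gsf <- data.frame(id = c("a", "b", "c"),
                  race = c(1, 2, 3),
                  lastname = c("kiely", "fibbs", "howe"),
                  stringsAsFactors = FALSE)
lf <- data.frame(id = c("a", "b", "c"),
                 race = c(1, NA, 3),
                 lastname = c("kieky", "fibbs", "howe"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: long table of differing cells, one row per difference
diff_by_key <- function(x, y, vars, key) {
  m <- merge(x, y, by = key, suffixes = c(".x", ".y"))
  out <- lapply(vars, function(v) {
    vx <- as.character(m[[paste0(v, ".x")]])
    vy <- as.character(m[[paste0(v, ".y")]])
    # Differs if exactly one side is NA, or both are present and unequal
    differs <- xor(is.na(vx), is.na(vy)) |
      (!is.na(vx) & !is.na(vy) & vx != vy)
    data.frame(var = rep(v, sum(differs)),
               key = m[[key]][differs],
               values.x = vx[differs],
               values.y = vy[differs],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

diffs <- diff_by_key(gsf, lf, c("race", "lastname"), "id")
diffs
```

On the toy data this yields two rows: the missing race for id "b" and the changed lastname for id "a".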

2.2 Classify Errors

Here we show how to append error flags to real_gsf based on the differences between real_gsf and real_lf.

We first detect whether the race variable is missing; if so, the individual is flagged as 1 in the newly created race_missing_flag variable. The same principle applies to the remaining errors and variables.

real_gsf_with_flags = acquire_error_flag(real_gsf, diffs.table, 'race', 'missing')
real_gsf_with_flags = acquire_error_flag(real_gsf_with_flags, diffs.table, 'dob', 'trans_date')
real_gsf_with_flags = acquire_error_flag(real_gsf_with_flags, diffs.table, 'firstname', 'variant')
real_gsf_with_flags = acquire_error_flag(real_gsf_with_flags, diffs.table, 'lastname', 'typo')
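As a rough illustration of what classifying a difference can mean, the base R sketch below applies two simple rules to a pair of values. The function `classify_diff` and its rules are assumptions made for this example; the internal logic of `acquire_error_flag` is not shown in this vignette and may differ.

```r
# Hypothetical classifier (assumed rules, not sdglinkage internals):
#   "missing"    - the linkage-file value is NA
#   "trans_date" - swapping day and month of x reproduces y
#   "other"      - any remaining difference
classify_diff <- function(x, y) {
  swapped <- sub("^([0-9]{4})-([0-9]{2})-([0-9]{2})$", "\\1-\\3-\\2", x)
  ifelse(is.na(y), "missing",
         ifelse(swapped == y & x != y, "trans_date", "other"))
}

classify_diff(c("1", "1945-11-01", "kiely"),
              c(NA, "1945-01-11", "kieky"))
# "missing" "trans_date" "other"
```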

error_occurrence_flags holds the errors we introduced when creating real_lf, and acquired_error_flags holds the errors extracted and classified from real_lf.

head(error_occurrence_flags)
##   race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1                 0                   0                      0
## 2                 0                   0                      0
## 3                 1                   1                      0
## 4                 0                   0                      1
## 5                 0                   0                      0
## 6                 0                   1                      1
##   lastname_typo_flag
## 1                  1
## 2                  0
## 3                  0
## 4                  0
## 5                  0
## 6                  0
acquired_error_flags = real_gsf_with_flags[grep('flag', colnames(real_gsf_with_flags))]
head(acquired_error_flags)
##   race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1                 0                   0                      0
## 2                 0                   0                      0
## 3                 1                   1                      0
## 4                 0                   0                      1
## 5                 0                   0                      0
## 6                 0                   1                      1
##   lastname_typo_flag
## 1                  1
## 2                  0
## 3                  0
## 4                  0
## 5                  0
## 6                  0

Let’s compare acquired_error_flags with error_occurrence_flags: if they are identical, our method has successfully extracted the errors that occurred in real_lf and classified them into the correct categories.

all.equal(acquired_error_flags, error_occurrence_flags)
## [1] "Component \"dob_trans_date_flag\": 1 string mismatch" 
## [2] "Component \"lastname_typo_flag\": 2 string mismatches"

There is one mismatch in the dob_trans_date_flag column. It arises because, for this record, the day and the month are both 11, so the dob is unchanged after transposition and the error cannot be detected.

real_gsf[c(28),]
##         nhsid    sex race        dob firstname lastname
## 28 6210975836 female    2 1992-11-11     chloe     howe
real_lf[c(28),]
##         nhsid    sex race        dob firstname lastname
## 28 6210975836 female    2 1992-11-11     chloe     yowe
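Records like this one can be identified up front: a day/month transposition leaves the value unchanged exactly when the day equals the month. The following base R sketch, which assumes dob is stored as a `YYYY-MM-DD` string, checks this for the record above and for one transposable date.

```r
dob <- c("1992-11-11", "1945-11-01")

# Split each date into year, month and day fields
parts <- strsplit(dob, "-")

# TRUE where month and day are equal, so a transposition is invisible
invariant <- vapply(parts, function(p) p[2] == p[3], logical(1))
invariant
# TRUE FALSE
```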

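Let’s correct that flag. The dob_trans_date_flag mismatch is now gone; the two mismatches in lastname_typo_flag remain, so the two sets of flags are not yet fully identical.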

error_occurrence_flags$dob_trans_date_flag[c(28)] = 0
all.equal(acquired_error_flags, error_occurrence_flags)
## [1] "Component \"lastname_typo_flag\": 2 string mismatches"
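`all.equal` only reports how many values differ per column. To find which records are affected, the two flag tables can be compared cell by cell. The sketch below uses small made-up flag tables with hypothetical column names; the same pattern should apply to any two data frames with identical dimensions and column order.

```r
# Toy flag tables, made up for illustration (stored as strings)
injected <- data.frame(dob_flag  = c("0", "1", "0"),
                       typo_flag = c("1", "0", "1"),
                       stringsAsFactors = FALSE)
acquired <- data.frame(dob_flag  = c("0", "1", "0"),
                       typo_flag = c("1", "0", "0"),
                       stringsAsFactors = FALSE)

# Row and column index of every cell that differs
mismatch <- which(as.matrix(injected) != as.matrix(acquired), arr.ind = TRUE)

# Report the affected record and flag by name
mismatch_table <- data.frame(row  = unname(mismatch[, "row"]),
                             flag = colnames(injected)[mismatch[, "col"]],
                             stringsAsFactors = FALSE)
mismatch_table
```

Here the only difference is in row 3 of `typo_flag`; the rows found this way can then be inspected in both files, as was done for row 28.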

3 Masked Sensitive Variables

The synthetic data we generate later is sampled from the generator, which means it is fully synthesised and cannot be linked back to real-world identifiers. However, sensitive variables such as names are still sampled from values in the real dataset, which can be a concern for some parties. Therefore, we also provide functions to replace these sensitive variables with variables from another database.

Previously, we generated real_gsf with firstnames from the uk population that depend on the individual’s gender and age. Here we show how to replace them with firstnames from the us population that depend on the individual’s gender and race. We also replace the lastname with one from the us population and randomly assign a new nhsid to each individual.

real_gsf_with_flags_replaced = replace_firstname(real_gsf_with_flags, country = 'us', gender_dependency = TRUE, race_dependency = TRUE)
real_gsf_with_flags_replaced = replace_lastname(real_gsf_with_flags_replaced, country = 'us', race_dependency = TRUE)
real_gsf_with_flags_replaced = replace_nhsid(real_gsf_with_flags_replaced)

This is what the original dataset looks like:

head(real_gsf_with_flags[colnames(real_gsf)])
##        nhsid    sex race        dob firstname lastname
## 1 0521968372   male    1 1936-11-05 alexander    kiely
## 2 0698713257 female    1 1977-05-09    nikita   veasey
## 3 1756849021 female    1 1945-11-01     lydia    fibbs
## 4 5426301786 female    2 1971-03-01     chloe   platts
## 5 9586341704 female    4 1995-05-15    louise hayhurst
## 6 6948320511 female    1 1936-01-15     chloe   conroy

This is what the replaced dataset looks like:

head(real_gsf_with_flags_replaced[colnames(real_gsf)])
##        nhsid    sex race        dob   firstname lastname
## 1 9521047364   male    1 1936-11-05 christopher    watts
## 2 0938425714 female    1 1977-05-09      olivia    adams
## 3 8240139655 female    1 1945-11-01    courtney    davis
## 4 7062843598 female    2 1971-03-01       emily    chung
## 5 3254169808 female    4 1995-05-15      ashley  salazar
## 6 1827306947 female    1 1936-01-15    samantha   wilson
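The idea of a dependency-aware replacement can be sketched in base R as sampling a new value from a lookup table restricted to the person's own group. The lookup table and names below are invented for illustration and are not the databases shipped with sdglinkage; only the dependency on sex is shown.

```r
set.seed(1)

# Made-up lookup table: candidate firstnames per sex
lookup <- data.frame(sex = c("male", "male", "female", "female"),
                     firstname = c("james", "john", "mary", "linda"),
                     stringsAsFactors = FALSE)

people <- data.frame(sex = c("female", "male", "female"),
                     firstname = c("chloe", "alexander", "lydia"),
                     stringsAsFactors = FALSE)

# Replace each firstname with a random draw from the pool matching the sex
people$firstname <- vapply(people$sex, function(s) {
  pool <- lookup$firstname[lookup$sex == s]
  pool[sample.int(length(pool), 1)]
}, character(1), USE.NAMES = FALSE)

people
```

Because each draw is conditioned on sex, the replaced names keep the sex dependency while no original name survives.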

4 Generate Synthetic Identifiers

Here we show how to use our generator to generate synthetic identifiers from real_gsf_with_flags_replaced. For more details about the performance of the generator, please see the vignette Synthetic_Data_Generation_and_Evaluation.

# Here we set the variables into the right format for the generator
real_gsf_with_flags_replaced[colnames(real_gsf_with_flags_replaced)] <- lapply(real_gsf_with_flags_replaced[colnames(real_gsf_with_flags_replaced)], factor) 

# We use learned bn to train a generator
bn_learn <- gen_bn_learn(real_gsf_with_flags_replaced, "hc")

# syn_gsf is the generated synthetic gold standard file
syn_gsf = bn_learn$gen_data
head(syn_gsf)
##        nhsid    sex race        dob firstname lastname race_missing_flag
## 1 8192354679 female    5 1936-06-03   kaitlyn   wright                 0
## 2 3689124751   male    1 1995-05-15   jumaana     wall                 0
## 3 1827306947 female    6 1979-10-27     emily   flores                 0
## 4 0759186340 female    3 1942-03-30   jasmine     kang                 0
## 5  295734601 female    1 1982-03-03     kayla   romero                 0
## 6 0826359140   male    3 1945-11-01     kevin     reed                 0
##   dob_trans_date_flag firstname_variant_flag lastname_typo_flag
## 1                   1                      0                  0
## 2                   0                      1                  0
## 3                   0                      0                  1
## 4                   0                      0                  0
## 5                   0                      0                  0
## 6                   1                      1                  0
# syn_lf1 and syn_lf2 are the synthetic linkage files that were damaged by the inferred error occurrence in the syn_gsf
syn_error_occurrence_1 <- bn_flag_inference(syn_gsf, bn_learn$fit_model)
syn_lf1 <- damage_gold_standard(syn_gsf, syn_error_occurrence_1)
## encoding error to:  race_missing_flag
## encoding error to:  dob_trans_date_flag
## encoding error to:  firstname_variant_flag
## encoding error to:  lastname_typo_flag
head(syn_lf1$linkage_file)
##        nhsid    sex race        dob              firstname lastname
## 1 8192354679 female    5 1936-06-03                kaitlyn   wright
## 2 3689124751   male    1 1995-05-15 jumaana_lack_of_record     wall
## 3 1827306947 female    6 1979-27-10                  emily   flores
## 4 0759186340 female    3 1942-03-30                jasmine     kang
## 5  295734601 female    1 1982-03-03                  koula   romerp
## 6 0826359140   male    3 1945-01-11                  kevin     reed
##   race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1                 0                   1                      0
## 2                 0                   0                      1
## 3                 0                   0                      0
## 4                 0                   0                      0
## 5                 0                   0                      0
## 6                 0                   1                      1
##   lastname_typo_flag
## 1                  0
## 2                  0
## 3                  1
## 4                  0
## 5                  0
## 6                  0
syn_error_occurrence_2 <- bn_flag_inference(syn_gsf, bn_learn$fit_model)
syn_lf2 <- damage_gold_standard(syn_gsf, syn_error_occurrence_2)
## encoding error to:  race_missing_flag
## encoding error to:  dob_trans_date_flag
## encoding error to:  firstname_variant_flag
## encoding error to:  lastname_typo_flag
head(syn_lf2$linkage_file)
##        nhsid    sex race        dob              firstname lastname
## 1 8192354679 female    5 1936-06-03                kaitlyn   sright
## 2 3689124751   male    1 1995-05-15 jumaana_lack_of_record     walk
## 3 1827306947 female    6 1979-10-27                  emily   flores
## 4 0759186340 female    3 1942-30-03                 jasmin     kang
## 5  295734601 female    1 1982-03-03                  kayla   romero
## 6 0826359140   male    3 1945-01-11                  kevin     reed
##   race_missing_flag dob_trans_date_flag firstname_variant_flag
## 1                 0                   1                      0
## 2                 0                   0                      1
## 3                 0                   0                      0
## 4                 0                   0                      0
## 5                 0                   0                      0
## 6                 0                   1                      1
##   lastname_typo_flag
## 1                  0
## 2                  0
## 3                  1
## 4                  0
## 5                  0
## 6                  0

5 Use syn_lf1 and syn_lf2 for Linkage Methods Evaluation

Here we give an example of how the generated linkage files can be used for linkage evaluation.

library(reclin)
library(dplyr)

linked_data_set <- pair_blocking(syn_lf1$linkage_file, syn_lf2$linkage_file, "dob") %>%
  compare_pairs(by = c("lastname", "firstname", "sex", "race"),
                default_comparator = jaro_winkler(0.8)) %>%
  score_problink(var = "weight") %>%
  select_n_to_m("weight", var = "ntom", threshold = 0) %>%
  link()

Out of 100 individuals, only 54 are linked using the method from reclin. This is because the blocking variable ‘dob’ is itself unreliable: the transposed-date error was introduced with a probability of 45%.

Among the 54 linked records, 53 are true matches and 1 is a mismatch:

table(linked_data_set$nhsid.x == linked_data_set$nhsid.y)
## 
## FALSE  TRUE 
##     1    53
head(linked_data_set[linked_data_set$nhsid.x != linked_data_set$nhsid.y,],3)
##         nhsid.x  sex.x race.x      dob.x firstname.x lastname.x
## 22   9521047364 female      5 1936-01-15        alex  el-baccus
## NA         <NA>   <NA>   <NA>       <NA>        <NA>       <NA>
## NA.1       <NA>   <NA>   <NA>       <NA>        <NA>       <NA>
##      race_missing_flag.x dob_trans_date_flag.x firstname_variant_flag.x
## 22                     0                     0                        0
## NA                  <NA>                  <NA>                     <NA>
## NA.1                <NA>                  <NA>                     <NA>
##      lastname_typo_flag.x    nhsid.y sex.y race.y      dob.y firstname.y
## 22                      0 7241956087  male      5 1936-01-15      steven
## NA                   <NA>       <NA>  <NA>   <NA>       <NA>        <NA>
## NA.1                 <NA>       <NA>  <NA>   <NA>       <NA>        <NA>
##      lastname.y race_missing_flag.y dob_trans_date_flag.y
## 22        watts                   0                     0
## NA         <NA>                <NA>                  <NA>
## NA.1       <NA>                <NA>                  <NA>
##      firstname_variant_flag.y lastname_typo_flag.y
## 22                          0                    0
## NA                       <NA>                 <NA>
## NA.1                     <NA>                 <NA>
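The counts in the table above can be turned into standard linkage quality measures. The base R sketch below uses a hypothetical helper, `linkage_quality`, and assumes that every one of the 100 individuals is present in both linkage files, so that 100 is the number of true matches that could have been found.

```r
# Hypothetical helper: precision and recall of a set of linked pairs.
# id_x, id_y: true identifiers on each side of every linked pair
# n_expected: number of true matches that exist (assumed known)
linkage_quality <- function(id_x, id_y, n_expected) {
  correct <- sum(id_x == id_y, na.rm = TRUE)
  c(precision = correct / length(id_x),
    recall    = correct / n_expected)
}

# Stand-in identifiers reproducing the counts above: 53 correct links, 1 wrong
id_x <- c(sprintf("id%02d", 1:53), "idA")
id_y <- c(sprintf("id%02d", 1:53), "idB")

quality <- linkage_quality(id_x, id_y, n_expected = 100)
quality
```

With these counts the precision is 53/54 and the recall is 53/100, which shows that the links that are made are mostly correct but that many true matches are lost by blocking on dob.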