Linkage of administrative data sources is an efficient approach for conducting research on large populations, avoiding the time and cost of traditional data collection methods. Careful development of methods for linking data where unique identifiers are not available is key to avoiding bias resulting from linkage errors. However, the development and evaluation of new methods are limited by restricted access to identifier data for these purposes. Generating synthetic datasets of personal identifiers, which replicate the frequencies and errors of identifiers observed in administrative data, could facilitate the development of new methods.
We aimed to develop the sdglinkage package for generating synthetic datasets for linkage method development. Each dataset consists of i) a gold standard file with complete and accurate information and ii) linkage files that are corrupted in the ways we often see in raw datasets.
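To make the distinction between the two kinds of file concrete, here is a minimal conceptual sketch in Python. It is not the sdglinkage API (the package itself is written in R), and every name in it (`make_gold_standard`, `corrupt_record`, `make_linkage_file`, the name pools, the `error_rate` parameter) is a hypothetical chosen for illustration. It assumes only two error types, missing values and single-character deletions, whereas real administrative data shows many more.

```python
import random

# Hypothetical name pools; a realistic generator would sample from
# frequencies observed in real administrative data.
FIRST_NAMES = ["Anna", "James", "Maria", "Oliver", "Sophie"]
LAST_NAMES = ["Brown", "Jones", "Patel", "Smith", "Taylor"]


def make_gold_standard(n, rng):
    """Return n complete, accurate records, each with a unique id."""
    return [
        {
            "id": i,
            "first_name": rng.choice(FIRST_NAMES),
            "last_name": rng.choice(LAST_NAMES),
            "birth_year": rng.randint(1940, 2010),
        }
        for i in range(n)
    ]


def corrupt_record(record, rng, error_rate):
    """Copy a record and damage each name field with probability error_rate."""
    corrupted = dict(record)
    for field in ("first_name", "last_name"):
        if rng.random() < error_rate:
            value = corrupted[field]
            if rng.random() < 0.5:
                # Missing value, as when a field was never filled in.
                corrupted[field] = None
            else:
                # Typo: delete one character at a random position.
                pos = rng.randrange(len(value))
                corrupted[field] = value[:pos] + value[pos + 1:]
    return corrupted


def make_linkage_file(gold, rng, error_rate=0.2):
    """Return a corrupted copy of the gold standard; ids are kept as ground truth."""
    return [corrupt_record(record, rng, error_rate) for record in gold]


rng = random.Random(42)  # fixed seed so the run is reproducible
gold = make_gold_standard(100, rng)
linkage = make_linkage_file(gold, rng, error_rate=0.2)
changed = sum(g != l for g, l in zip(gold, linkage))
print(f"{changed} of {len(gold)} records differ from the gold standard")
```

Because the `id` field is carried over unchanged, the true match status of every record pair is known, which is what allows a linkage method run on the corrupted file to be scored against the gold standard.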
The package has several main types of functions:
These functions can be organised as:
Workflow of Synthetic Data Generation.
We also provide three vignettes to show how we can use the package: