library(sdglinkage)
set.seed(1234)

In this vignette, we show how to use sdglinkage to generate synthetic datasets and how to compare the quality of the synthetic data produced by different generators.

  • Assumption:
    • We have a real dataset and we would like to generate a synthetic version of it.
  • Aim:
    • To generate synthetic datasets using different approaches.
    • To give a visual comparison of the generated synthetic data and real dataset.
    • To compare the predictive performance of the generated synthetic data with the real dataset.

Here we use the ‘Adult’ dataset as an example. The Adult dataset was extracted from the US Census database in 1994; it contains 48,842 individual records with 13 personal variables. It is often used for the prediction task of determining whether a person makes over $50,000 a year given personal information. We use 70% of the data for training and the rest for evaluation.

adult_data <- split_data(adult, 70)
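As a rough illustration of what a 70/30 split involves, the sketch below partitions a toy data frame at random using base R only. The helper `split_sketch` and the element name `testing_set` are assumptions made for this example; they are not taken from sdglinkage, whose `split_data` may work differently.

```r
# Hypothetical helper (not part of sdglinkage): split a data frame into
# training and testing sets. `train_pct` is the percentage of rows that go
# to the training set.
split_sketch <- function(df, train_pct) {
  n_train   <- floor(nrow(df) * train_pct / 100)
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  test_idx  <- setdiff(seq_len(nrow(df)), train_idx)
  list(training_set = df[train_idx, , drop = FALSE],
       testing_set  = df[test_idx, , drop = FALSE])
}

set.seed(1234)
toy <- data.frame(id  = 1:10,
                  age = c(23, 35, 47, 51, 19, 62, 38, 44, 29, 56))
parts <- split_sketch(toy, 70)
nrow(parts$training_set)  # 7 rows for training
nrow(parts$testing_set)   # 3 rows for evaluation
```

The two parts never share a row, so the evaluation set stays unseen by whatever model or generator is fitted on the training set.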

1 Generator 1: Bayesian Networks (BN) Learned by Structure Learning Algorithm

First, we need to define some constraints/evidence so that we only generate synthetic data that is plausible in real life. For example, as the name of the dataset ‘Adult’ suggests, the age of all individuals should be >= 18, capital_gain and capital_loss should be non-negative, and hours_per_week should lie between 0 and 100.

bn_evidence <- "age >=18 & capital_gain>=0 & capital_loss >=0 & hours_per_week>=0 & hours_per_week<=100"
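To make the effect of such an evidence string concrete, the following base R sketch evaluates it as a logical expression against a toy data frame and keeps only the rows that satisfy it. This is an illustration of constraint filtering under our own assumptions; how sdglinkage applies the evidence internally is not shown in this vignette, and the toy values are made up.

```r
# Toy candidate samples; three of the four rows violate a constraint on purpose.
candidates <- data.frame(
  age            = c(41.4, 12.3, 28.2, 67.7),
  capital_gain   = c(793.2, 150.0, -20.5, 222.8),
  capital_loss   = c(0, 286.5, 407.7, 145.1),
  hours_per_week = c(26.6, 37.9, 37.3, 120.0)
)

evidence <- "age >= 18 & capital_gain >= 0 & capital_loss >= 0 & hours_per_week >= 0 & hours_per_week <= 100"

# Evaluate the string as an R expression, using the data frame columns as
# variables, and keep only the rows for which it is TRUE.
keep     <- eval(parse(text = evidence), envir = candidates)
accepted <- candidates[keep, , drop = FALSE]
nrow(accepted)  # 1: only the first row meets every constraint
```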

We use hill-climbing (hc) as our structure learning algorithm; a single call to gen_bn_learn learns both the structure and the parameters of our BN.

bn_learn <- gen_bn_learn(adult_data$training_set, "hc", bn_evidence)

This is the structure of the learned BN:

plot_bn(bn_learn$structure)

This is the synthetic data sampled from the learned BN:

head(bn_learn$gen_data)
##        age workclass    education     marital_status       occupation
## 1 41.38689   Private    Doctorate Married-civ-spouse   Prof-specialty
## 2 28.15339   Private Some-college      Never-married     Adm-clerical
## 3 31.00409   Private      HS-grad      Never-married            Sales
## 4 43.59130   Private   Assoc-acdm Married-civ-spouse     Adm-clerical
## 5 27.44870   Private    Bachelors      Never-married            Sales
## 6 47.81161   Private         12th      Never-married Transport-moving
##    relationship  race    sex capital_gain capital_loss hours_per_week
## 1       Husband Black   Male     793.2070    752.19398       26.56382
## 2     Own-child Black Female     149.9903    286.48863       37.94588
## 3 Not-in-family White Female     144.8146    407.72358       37.28652
## 4       Husband White   Male    5950.9384    128.99863       48.86927
## 5 Not-in-family White Female    1351.5810    196.01637       52.06804
## 6     Own-child White   Male     773.6279     69.48008       42.23591
##   native_country income
## 1  United-States  <=50K
## 2  United-States  <=50K
## 3  United-States  <=50K
## 4  United-States   >50K
## 5  United-States  <=50K
## 6         Mexico  <=50K

2 Generator 2: BNs Learned from Expert Knowledge and Data

Here we elicit the dependencies among the variables in the dataset from an expert (in this example the author plays the expert, purely for illustration).

bn_structure <- "[native_country][income][age|marital_status:education][sex][race|native_country][marital_status|race:sex][relationship|marital_status][education|sex:race][occupation|education][workclass|occupation][hours_per_week|occupation:workclass][capital_gain|occupation:workclass:income][capital_loss|occupation:workclass:income]"
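The structure string lists one bracketed term per variable, with the variable name followed by `|` and its parents separated by `:`. The sketch below parses a shortened toy string of the same shape into a list of parents using base R. The helper `parse_model_string` is hypothetical and only meant to show how the notation reads; it is not how sdglinkage processes the string.

```r
# Hypothetical helper: turn a "[node|parent1:parent2]" string into a named
# list that maps each node to a character vector of its parents.
parse_model_string <- function(model_string) {
  # Pull out every "[...]" term, then drop the surrounding brackets.
  terms  <- regmatches(model_string, gregexpr("\\[[^]]+\\]", model_string))[[1]]
  terms  <- gsub("^\\[|\\]$", "", terms)
  # Split each term into the node name and, if present, its parent list.
  pieces <- strsplit(terms, "|", fixed = TRUE)
  parents <- lapply(pieces, function(p) {
    if (length(p) < 2) character(0) else strsplit(p[2], ":", fixed = TRUE)[[1]]
  })
  names(parents) <- vapply(pieces, function(p) p[1], character(1))
  parents
}

toy_structure <- "[sex][race][education|sex:race][occupation|education]"
parents <- parse_model_string(toy_structure)
parents$education  # "sex" "race"
parents$sex        # character(0): a root node with no parents
```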

We learn the parameters of the elicited BN using maximum likelihood estimation and sample synthetic data subject to the previously defined constraints/evidence.

bn_elicit <- gen_bn_elicit(adult_data$training_set, bn_structure, bn_evidence)
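For discrete variables, maximum likelihood estimation reduces to counting: each conditional probability is the observed relative frequency of a child value given its parent values. The base R sketch below fits a two-node toy network (sex -> relationship) this way and then samples from it parent-first. The data and the structure are made up for illustration, and continuous variables are left out.

```r
set.seed(1234)

# Toy training data with the assumed structure sex -> relationship.
train <- data.frame(
  sex = c("Male", "Male", "Male", "Female", "Female", "Male", "Female", "Male"),
  relationship = c("Husband", "Husband", "Own-child", "Wife",
                   "Not-in-family", "Husband", "Wife", "Not-in-family")
)

# Maximum likelihood estimates are the observed relative frequencies.
p_sex           <- prop.table(table(train$sex))
p_rel_given_sex <- prop.table(table(train$sex, train$relationship), margin = 1)

# Ancestral sampling: draw the parent first, then the child given the parent.
n <- 1000
syn_sex <- sample(names(p_sex), n, replace = TRUE, prob = p_sex)
syn_rel <- vapply(syn_sex, function(s) {
  sample(colnames(p_rel_given_sex), 1, prob = p_rel_given_sex[s, ])
}, character(1), USE.NAMES = FALSE)

synthetic <- data.frame(sex = syn_sex, relationship = syn_rel)
head(synthetic)
```

Because the combination Female/Husband never occurs in the toy training data, its estimated probability is zero and it can never be sampled. This is one reason an elicited structure can keep synthetic records internally consistent.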

This is the structure of the elicited BN:

plot_bn(bn_elicit$structure)

This is the synthetic data sampled from the elicited BN:

head(bn_elicit$gen_data)
##        age workclass  education     marital_status      occupation
## 1 30.45752   Private       11th Married-civ-spouse Exec-managerial
## 2 28.22494 State-gov    HS-grad Married-civ-spouse   Other-service
## 3 67.72185   Private    HS-grad Married-civ-spouse   Other-service
## 4 46.07752   Private  Bachelors           Divorced Exec-managerial
## 5 48.88847 State-gov Assoc-acdm Married-civ-spouse    Adm-clerical
## 6 39.32429   Private    HS-grad Married-civ-spouse           Sales
##    relationship  race    sex capital_gain capital_loss hours_per_week
## 1       Husband White Female   25360.5930     75.55795       36.40986
## 2       Husband White   Male    7589.4676      0.00000       50.40334
## 3       Husband White Female     222.8207    145.14794       14.92066
## 4 Not-in-family White Female    1184.4227    616.23637       30.93981
## 5       Husband White   Male    3134.6799    374.28999       39.51819
## 6       Husband White   Male     872.3263    529.97861       50.57796
##   native_country income
## 1  United-States   >50K
## 2  United-States   >50K
## 3  United-States  <=50K
## 4  United-States  <=50K
## 5  United-States   >50K
## 6  United-States  <=50K

3 Generator 3: Classification and Regression Tree (CART)

Here we use the previously elicited structure to define the sequence in which a classification or regression tree is fitted for each variable.

cart_elicit <- gen_cart(adult_data$training_set, bn_structure)
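One common way to synthesise data with trees is sequential: the first variable is resampled from its observed values, and each later variable is drawn from a tree fitted on the variables that precede it. The sketch below shows this idea for two toy variables using the rpart package. It is an assumption-laden illustration of the general technique, not a description of what `gen_cart` does internally.

```r
library(rpart)  # recommended package, normally installed alongside R
set.seed(1234)

# Toy training data in which income depends on education.
n <- 400
education <- sample(c("HS-grad", "Bachelors", "Doctorate"), n, replace = TRUE)
p_high <- c("HS-grad" = 0.1, "Bachelors" = 0.4, "Doctorate" = 0.8)[education]
income <- ifelse(runif(n) < p_high, ">50K", "<=50K")
train  <- data.frame(education = factor(education), income = factor(income))

# Step 1: the first variable in the sequence has no predictors, so it is
# resampled from its observed values.
syn <- data.frame(education = sample(train$education, n, replace = TRUE))

# Step 2: fit a classification tree for the next variable given the earlier
# ones, then draw each synthetic value from the predicted class probabilities.
fit   <- rpart(income ~ education, data = train, method = "class")
probs <- predict(fit, newdata = syn, type = "prob")
syn$income <- factor(
  apply(probs, 1, function(p) sample(colnames(probs), 1, prob = p)),
  levels = levels(train$income)
)

table(syn$education, syn$income)
```

Drawing from the class probabilities, rather than always taking the most likely class, keeps the natural variation of the data within each leaf.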

This is the synthetic data generated from the elicited CART:

head(cart_elicit$gen_data)
##   age   workclass education marital_status        occupation
## 1  30     Private Bachelors       Divorced      Adm-clerical
## 2  69     Private   HS-grad       Divorced             Sales
## 3  36     Private Doctorate       Divorced             Sales
## 4  45 Federal-gov Bachelors  Never-married    Prof-specialty
## 5  78     Private   HS-grad        Widowed Machine-op-inspct
## 6  35     Private   HS-grad  Never-married Machine-op-inspct
##     relationship               race    sex capital_gain capital_loss
## 1      Unmarried Asian-Pac-Islander Female            0            0
## 2  Not-in-family              White   Male            0            0
## 3        Husband              White   Male            0            0
## 4 Other-relative              White Female            0            0
## 5  Not-in-family              White Female            0            0
## 6        Husband              White   Male            0            0
##   hours_per_week native_country income
## 1             40  United-States  <=50K
## 2             50  United-States  <=50K
## 3             60  United-States  <=50K
## 4             55  United-States  <=50K
## 5             50  United-States  <=50K
## 6             40  United-States  <=50K

This gives a comparison of the synthetic data with the real data from the training set.

compare_cart(adult_data$training_set, cart_elicit$fit_model, c("age", "workclass", "sex"))

4 Evaluation of the Synthetic Data Generated by These Generators

We compare the synthetic data generated by these three generators with the real data from the training set.

Here is a discrete variable:

plot_compared_sdg(target_var = "race", training_set = adult_data$training_set,
                   syn_data_names = c("CART_elicit", "BN_learn", "BN_elicit"),
                   generated_data1 = cart_elicit$gen_data,
                   generated_data2 = bn_learn$gen_data,
                   generated_data3 = bn_elicit$gen_data)

Here is a continuous variable:

plot_compared_sdg(target_var = "age", training_set = adult_data$training_set,
                   syn_data_names = c("CART_elicit", "BN_learn", "BN_elicit"),
                   generated_data1 = cart_elicit$gen_data,
                   generated_data2 = bn_learn$gen_data,
                   generated_data3 = bn_elicit$gen_data)
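A plot gives a visual impression of how well the distributions agree; a single number can complement it. As one possible summary for a discrete variable, the sketch below computes the total variation distance between the category proportions of a real and a synthetic sample. The helper `tv_distance` and the toy vectors are our own and are not part of sdglinkage.

```r
# Hypothetical helper: total variation distance between the category
# proportions of two samples (0 = identical proportions, 1 = no overlap).
tv_distance <- function(real, synthetic) {
  lv <- union(unique(real), unique(synthetic))
  p  <- prop.table(table(factor(real, levels = lv)))
  q  <- prop.table(table(factor(synthetic, levels = lv)))
  0.5 * sum(abs(p - q))
}

real_race <- c("White", "White", "White", "Black",
               "White", "Black", "White", "White")
syn_race  <- c("White", "White", "Black", "Black",
               "White", "Black", "White", "Black")

tv_distance(real_race, syn_race)   # 0.25
tv_distance(real_race, real_race)  # 0
```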

We assume that good-quality synthetic data allows us to draw the same analytic conclusions as the real data. Hence, we compare the predictive performance of several machine learning algorithms trained on synthetic data and tested on real data with that of the same algorithms trained and tested on real data. The prediction target is the variable ‘income’: whether a person makes over $50,000 a year given personal information.

library(mlr)
lrns <- makeLearners(c("rpart", "logreg"), type = "classif",
                     predict.type = "prob")
# lrns <- makeLearners(c("rpart", "logreg", "randomForest"), type = "classif",
#                      predict.type = "prob")
measurements <- list(acc, ber, f1, auc)
bmr <- compare_sdg(lrns, measurement = measurements, target_var = "income",
                      real_dataset = adult_data,
                      generated_data1 = cart_elicit$gen_data,
                      generated_data2 = bn_learn$gen_data,
                      generated_data3 = bn_elicit$gen_data)
names(bmr$results) <- c("Real_dataset", "CART_elicit", "BN_learn", "BN_elicit")
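The comparison follows a train-on-synthetic, test-on-real pattern. The base R sketch below shows the pattern on toy data with a logistic regression: one model is fitted on real training data, another on synthetic data, and both are scored on the same real test set. The generator `make_toy` and all of its parameters are invented for this example.

```r
set.seed(1234)

# Hypothetical toy generator standing in for both real and synthetic data.
make_toy <- function(n, slope) {
  age <- runif(n, 18, 80)
  p   <- plogis(slope * (age - 45))
  data.frame(age = age, high_income = rbinom(n, 1, p))
}
real_train <- make_toy(500, 0.10)
real_test  <- make_toy(300, 0.10)
synthetic  <- make_toy(500, 0.08)  # a slightly different relationship

# Accuracy on `test` of a logistic regression fitted on `train`.
test_accuracy <- function(train, test) {
  fit  <- glm(high_income ~ age, data = train, family = binomial)
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$high_income)
}

acc_real <- test_accuracy(real_train, real_test)  # trained and tested on real
acc_syn  <- test_accuracy(synthetic, real_test)   # trained on synthetic
c(real = acc_real, synthetic = acc_syn)
```

The smaller the gap between the two scores, the better the synthetic data preserves the relationship the model relies on.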

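In this example, models trained on data from CART and BN_learn reach accuracy and F1 scores close to those of models trained on the real dataset (accuracy within about 0.06, F1 within about 0.03). The AUC is less forgiving: only BN_learn with logistic regression (0.84) stays near the real-data models (0.84 to 0.90), while the other synthetic-data models fall to roughly 0.55 to 0.63.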

bmr
##        task.id     learner.id acc.test.mean ber.test.mean f1.test.mean
## 1 Real_dataset  classif.rpart     0.8401386     0.2742090    0.8997365
## 2 Real_dataset classif.logreg     0.8479511     0.2366056    0.9022321
## 3  CART_elicit  classif.rpart     0.8092571     0.3724296    0.8862117
## 4  CART_elicit classif.logreg     0.7984965     0.3762002    0.8786615
## 5     BN_learn  classif.rpart     0.7863355     0.3759166    0.8694321
## 6     BN_learn classif.logreg     0.8051297     0.3690590    0.8828430
## 7    BN_elicit  classif.rpart     0.7845666     0.4103314    0.8719835
## 8    BN_elicit classif.logreg     0.7888414     0.3887949    0.8727967
##   auc.test.mean
## 1     0.8430715
## 2     0.9026087
## 3     0.6285774
## 4     0.6223223
## 5     0.6259237
## 6     0.8447065
## 7     0.5896686
## 8     0.5469897
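To help read the columns above, the sketch below computes the four measures (accuracy, balanced error rate, F1 and AUC) from a toy vector of labels and predicted probabilities using their standard textbook definitions. The numbers are made up, and mlr's own implementations may differ in details such as which class is treated as positive.

```r
# Toy true labels (1 = positive) and predicted probabilities of the positive class.
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob  <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7)
pred  <- as.integer(prob > 0.5)

# Confusion matrix counts.
tp <- sum(pred == 1 & truth == 1)
fn <- sum(pred == 0 & truth == 1)
tn <- sum(pred == 0 & truth == 0)
fp <- sum(pred == 1 & truth == 0)

accuracy  <- (tp + tn) / length(truth)
# Balanced error rate: the mean of the per-class error rates.
ber_value <- 0.5 * (fn / (tp + fn) + fp / (tn + fp))
f1_score  <- 2 * tp / (2 * tp + fp + fn)

# AUC via the rank-sum (Mann-Whitney) formulation.
r     <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc_value <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(acc = accuracy, ber = ber_value, f1 = f1_score, auc = auc_value)
# acc = 0.75, ber = 0.25, f1 = 0.75, auc = 0.9375
```

Accuracy, BER and F1 depend on the 0.5 threshold, whereas AUC depends only on how the probabilities rank the cases. A model can therefore score well on accuracy and still have a poor AUC.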