library(sdglinkage)
set.seed(1234)
In this vignette, we show how to use sdglinkage to generate synthetic datasets and compare the quality of the synthetic data produced by different generators.
Here we use the ‘Adult’ dataset as an example. The Adult dataset was extracted from the 1994 US Census database; it contains 48,842 individual records with 13 personal variables. It is often used for a prediction task: determine whether a person makes over $50,000 a year given their personal information. Here we set aside 70% of the data for training and keep the rest for evaluation.
adult_data <- split_data(adult, 70)
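For readers who want to see what such a split amounts to, here is a minimal base R sketch. It is not the sdglinkage implementation: split_sketch is a hypothetical helper, the element name testing_set is an assumption (only training_set appears in this vignette), and the built-in iris data stands in for adult.

```r
set.seed(1234)

# Hypothetical helper, not part of sdglinkage: put pct% of the rows,
# chosen at random, in a training set and the remainder in a testing set.
split_sketch <- function(data, pct) {
  n_train <- floor(nrow(data) * pct / 100)
  idx <- sample(seq_len(nrow(data)), n_train)
  list(
    training_set = data[idx, , drop = FALSE],
    testing_set  = data[-idx, , drop = FALSE]  # element name is an assumption
  )
}

parts <- split_sketch(iris, 70)
nrow(parts$training_set)
nrow(parts$testing_set)
```

Sampling row indices without replacement guarantees that no record appears in both sets.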
First, we need to define some constraints (evidence) so that we only generate synthetic records that are realistic. For example, as the name of the dataset ‘Adult’ suggests, the age of every individual should be >= 18, capital_gain and capital_loss should be non-negative, and hours_per_week should lie between 0 and 100.
bn_evidence <- "age >=18 & capital_gain>=0 & capital_loss >=0 & hours_per_week>=0 & hours_per_week<=100"
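To make the meaning of the constraint string concrete, the sketch below evaluates it as a logical expression over a small data frame and keeps only the rows that satisfy it. How sdglinkage applies the string internally is not shown in this vignette, so treating it as a row filter is an assumption, and the four records are made up for illustration.

```r
# Made-up records standing in for raw samples from a generator.
samples <- data.frame(
  age            = c(15.2, 34.0, 61.7, 22.4),
  capital_gain   = c(0, -120.5, 5000, 0),
  capital_loss   = c(0, 0, 0, 310),
  hours_per_week = c(20, 40, 105, 38)
)

bn_evidence <- "age >=18 & capital_gain>=0 & capital_loss >=0 & hours_per_week>=0 & hours_per_week<=100"

# Parse the string and evaluate it with the data frame columns in scope;
# the result is one TRUE/FALSE per row.
keep <- eval(parse(text = bn_evidence), envir = samples)

# Keep only the records that satisfy every constraint.
accepted <- samples[keep, , drop = FALSE]
accepted
```

Only the last record survives: the first is under 18, the second has a negative capital gain, and the third exceeds 100 hours per week.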
We use hill-climbing (hc) as our structure learning algorithm to learn the structure and parameters of our Bayesian network (BN) in one step.
bn_learn <- gen_bn_learn(adult_data$training_set, "hc", bn_evidence)
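The learn-then-sample idea can be sketched directly with the bnlearn package. This is an illustration under assumptions, not a description of what gen_bn_learn does internally: it uses learning.test, a small all-discrete dataset shipped with bnlearn, instead of the Adult data, and it applies no evidence.

```r
library(bnlearn)

data(learning.test)  # example data shipped with bnlearn (factors A to F)

# 1. Structure learning: score-based hill-climbing search over DAGs.
dag <- hc(learning.test)

# 2. Parameter learning: estimate the conditional distributions for that DAG.
fit <- bn.fit(dag, learning.test)

# 3. Sampling: draw synthetic records from the fitted network.
syn <- rbn(fit, n = 5)

modelstring(dag)  # the learned structure in bracket notation
syn
```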
This is the structure of the learned BN:
plot_bn(bn_learn$structure)
This is the synthetic data sampled from the learned BN:
head(bn_learn$gen_data)
## age workclass education marital_status occupation
## 1 41.38689 Private Doctorate Married-civ-spouse Prof-specialty
## 2 28.15339 Private Some-college Never-married Adm-clerical
## 3 31.00409 Private HS-grad Never-married Sales
## 4 43.59130 Private Assoc-acdm Married-civ-spouse Adm-clerical
## 5 27.44870 Private Bachelors Never-married Sales
## 6 47.81161 Private 12th Never-married Transport-moving
## relationship race sex capital_gain capital_loss hours_per_week
## 1 Husband Black Male 793.2070 752.19398 26.56382
## 2 Own-child Black Female 149.9903 286.48863 37.94588
## 3 Not-in-family White Female 144.8146 407.72358 37.28652
## 4 Husband White Male 5950.9384 128.99863 48.86927
## 5 Not-in-family White Female 1351.5810 196.01637 52.06804
## 6 Own-child White Male 773.6279 69.48008 42.23591
## native_country income
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States >50K
## 5 United-States <=50K
## 6 Mexico <=50K
Here we elicit the dependencies between the variables in the dataset from an expert (in this example the expert is the author, purely for illustration).
bn_structure <- "[native_country][income][age|marital_status:education][sex][race|native_country][marital_status|race:sex][relationship|marital_status][education|sex:race][occupation|education][workclass|occupation][hours_per_week|occupation:workclass][capital_gain|occupation:workclass:income][capital_loss|occupation:workclass:income]"
We learn the parameters of the elicited BN using maximum likelihood estimation and sample synthetic data subject to the previously defined constraints/evidence.
bn_elicit <- gen_bn_elicit(adult_data$training_set, bn_structure, bn_evidence)
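The structure string above appears to follow the bnlearn model-string notation, in which [node|parent1:parent2] declares a node together with its parents and [node] declares a node without parents. Assuming that reading, the sketch below builds a network from such a string and fits its parameters by maximum likelihood, using the learning.test data shipped with bnlearn rather than the Adult data.

```r
library(bnlearn)

data(learning.test)  # example data shipped with bnlearn (factors A to F)

# Each bracket declares one node; parents follow "|" and are separated by ":".
dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")

arcs(dag)          # the directed edges implied by the string
parents(dag, "D")  # D depends on A and C

# Maximum likelihood estimates of the conditional probability tables.
fit <- bn.fit(dag, learning.test, method = "mle")
fit$B              # P(B | A)
```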
This is the structure of the elicited BN:
plot_bn(bn_elicit$structure)
This is the synthetic data sampled from the elicited BN:
head(bn_elicit$gen_data)
## age workclass education marital_status occupation
## 1 30.45752 Private 11th Married-civ-spouse Exec-managerial
## 2 28.22494 State-gov HS-grad Married-civ-spouse Other-service
## 3 67.72185 Private HS-grad Married-civ-spouse Other-service
## 4 46.07752 Private Bachelors Divorced Exec-managerial
## 5 48.88847 State-gov Assoc-acdm Married-civ-spouse Adm-clerical
## 6 39.32429 Private HS-grad Married-civ-spouse Sales
## relationship race sex capital_gain capital_loss hours_per_week
## 1 Husband White Female 25360.5930 75.55795 36.40986
## 2 Husband White Male 7589.4676 0.00000 50.40334
## 3 Husband White Female 222.8207 145.14794 14.92066
## 4 Not-in-family White Female 1184.4227 616.23637 30.93981
## 5 Husband White Male 3134.6799 374.28999 39.51819
## 6 Husband White Male 872.3263 529.97861 50.57796
## native_country income
## 1 United-States >50K
## 2 United-States >50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 United-States >50K
## 6 United-States <=50K
Here we use the previously elicited structure as the sequence in which we generate a classification and regression tree (CART) for each variable.
cart_elicit <- gen_cart(adult_data$training_set, bn_structure)
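The general idea behind sequential CART synthesis can be sketched with the rpart package: the first variable in the sequence is resampled from the observed values, and each later variable is drawn from a tree fitted on the variables that precede it. The two-variable sequence, the iris data, and the sampling details below are assumptions made for illustration; the vignette does not show how gen_cart works internally.

```r
library(rpart)
set.seed(1)

real <- iris[, c("Sepal.Length", "Species")]
n <- nrow(real)

# Step 1: the first variable has no predecessors, so resample observed values.
syn <- data.frame(Sepal.Length = sample(real$Sepal.Length, n, replace = TRUE))

# Step 2: fit a classification tree for the next variable on the earlier one.
fit <- rpart(Species ~ Sepal.Length, data = real, method = "class")

# Step 3: for each synthetic row, look up the class probabilities of the leaf
# it falls into and draw one class at random from them.
probs <- predict(fit, newdata = syn, type = "prob")
syn$Species <- factor(
  apply(probs, 1, function(p) sample(colnames(probs), 1, prob = p)),
  levels = levels(real$Species)
)

head(syn)
```

Drawing from the leaf probabilities, rather than always taking the most likely class, keeps the variability of the original data in the synthetic records.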
This is the synthetic data generated from the elicited CART:
head(cart_elicit$gen_data)
## age workclass education marital_status occupation
## 1 30 Private Bachelors Divorced Adm-clerical
## 2 69 Private HS-grad Divorced Sales
## 3 36 Private Doctorate Divorced Sales
## 4 45 Federal-gov Bachelors Never-married Prof-specialty
## 5 78 Private HS-grad Widowed Machine-op-inspct
## 6 35 Private HS-grad Never-married Machine-op-inspct
## relationship race sex capital_gain capital_loss
## 1 Unmarried Asian-Pac-Islander Female 0 0
## 2 Not-in-family White Male 0 0
## 3 Husband White Male 0 0
## 4 Other-relative White Female 0 0
## 5 Not-in-family White Female 0 0
## 6 Husband White Male 0 0
## hours_per_week native_country income
## 1 40 United-States <=50K
## 2 50 United-States <=50K
## 3 60 United-States <=50K
## 4 55 United-States <=50K
## 5 50 United-States <=50K
## 6 40 United-States <=50K
This gives a comparison of the synthetic data and the real data from the training set.
compare_cart(adult_data$training_set, cart_elicit$fit_model, c("age", "workclass", "sex"))
We compare the synthetic data generated by these three generators with the real data from the training set.
Here is a discrete variable:
plot_compared_sdg(target_var = "race", training_set = adult_data$training_set,
syn_data_names = c("CART_elicit", "BN_learn", "BN_elicit"),
generated_data1 = cart_elicit$gen_data,
generated_data2 = bn_learn$gen_data,
generated_data3 = bn_elicit$gen_data)
Here is a continuous variable:
plot_compared_sdg(target_var = "age", training_set = adult_data$training_set,
syn_data_names = c("CART_elicit", "BN_learn", "BN_elicit"),
generated_data1 = cart_elicit$gen_data,
generated_data2 = bn_learn$gen_data,
generated_data3 = bn_elicit$gen_data)
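The visual comparison can be complemented with simple numbers. The base R sketch below is one possible way to do that, not something sdglinkage provides as far as this vignette shows: it computes the total variation distance between category proportions for a discrete variable and a two-sample Kolmogorov-Smirnov statistic for a continuous one. The real and syn data frames are simulated stand-ins.

```r
set.seed(1)

# Simulated stand-ins for a real training set and one synthetic dataset.
real <- data.frame(
  race = sample(c("White", "Black", "Other"), 500, replace = TRUE,
                prob = c(0.8, 0.1, 0.1)),
  age  = rnorm(500, mean = 40, sd = 12)
)
syn <- data.frame(
  race = sample(c("White", "Black", "Other"), 500, replace = TRUE,
                prob = c(0.7, 0.2, 0.1)),
  age  = rnorm(500, mean = 42, sd = 12)
)

# Discrete variable: compare category proportions on a common set of levels.
lv <- union(real$race, syn$race)
p_real <- prop.table(table(factor(real$race, levels = lv)))
p_syn  <- prop.table(table(factor(syn$race, levels = lv)))
round(cbind(real = p_real, synthetic = p_syn), 3)

# Total variation distance: 0 means identical proportions, 1 means disjoint.
tvd <- 0.5 * sum(abs(p_real - p_syn))

# Continuous variable: the KS statistic is the largest gap between the two
# empirical cumulative distribution functions.
ks <- ks.test(real$age, syn$age)

c(tvd = tvd, ks_statistic = unname(ks$statistic))
```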
We assume that good-quality synthetic data allows us to draw the same analytic conclusions as the real data. We therefore compare the predictive performance of several machine learning algorithms trained on synthetic data and tested on real data with that of the same algorithms trained and tested on real data. The prediction task uses the variable ‘income’: determine whether a person makes over $50,000 a year given their personal information.
library(mlr)
lrns <- makeLearners(c("rpart", "logreg"), type = "classif",
predict.type = "prob")
# lrns <- makeLearners(c("rpart", "logreg", "randomForest"), type = "classif",
# predict.type = "prob")
measurements <- list(acc, ber, f1, auc)
bmr <- compare_sdg(lrns, measurement = measurements, target_var = "income",
real_dataset = adult_data,
generated_data1 = cart_elicit$gen_data,
generated_data2 = bn_learn$gen_data,
generated_data3 = bn_elicit$gen_data)
names(bmr$results) <- c("Real_dataset", "CART_elicit", "BN_learn", "BN_elicit")
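The evaluation design, train on synthetic data and test on held-out real data, can be sketched in base R with a logistic regression. Everything below is simulated and hypothetical (make_data, the single predictor x, the noise level); it only illustrates the design and says nothing about how compare_sdg is implemented.

```r
set.seed(1)

# Hypothetical simulator: one predictor and a binary income label.
# `noise` blurs the predictor to mimic an imperfect synthetic generator.
make_data <- function(n, noise) {
  x <- rnorm(n)
  y <- factor(ifelse(runif(n) < plogis(1.5 * x), ">50K", "<=50K"),
              levels = c("<=50K", ">50K"))
  data.frame(x = x + rnorm(n, sd = noise), income = y)
}

real_train <- make_data(700, noise = 0)
real_test  <- make_data(300, noise = 0)
synthetic  <- make_data(700, noise = 0.5)  # stand-in for generator output

# Fit on the given training data, then always score on the real test set.
accuracy_on_real_test <- function(train) {
  fit <- glm(income ~ x, data = train, family = binomial)
  p <- predict(fit, newdata = real_test, type = "response")
  pred <- ifelse(p > 0.5, ">50K", "<=50K")
  mean(pred == real_test$income)
}

c(real      = accuracy_on_real_test(real_train),
  synthetic = accuracy_on_real_test(synthetic))
```

Holding the test set fixed is what makes the two numbers comparable: any gap is attributable to the training data alone.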
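In this example, models trained on data from CART_elicit and BN_learn come close to those trained on the real dataset in accuracy (within about 3 to 5 percentage points) and F1 (within about 1 to 3 points). AUC tells a different story: only BN_learn with logistic regression (0.84, against 0.90 for the real data) stays near the baseline, while the other synthetic-data models drop to roughly 0.55 to 0.63.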
bmr
## task.id learner.id acc.test.mean ber.test.mean f1.test.mean
## 1 Real_dataset classif.rpart 0.8401386 0.2742090 0.8997365
## 2 Real_dataset classif.logreg 0.8479511 0.2366056 0.9022321
## 3 CART_elicit classif.rpart 0.8092571 0.3724296 0.8862117
## 4 CART_elicit classif.logreg 0.7984965 0.3762002 0.8786615
## 5 BN_learn classif.rpart 0.7863355 0.3759166 0.8694321
## 6 BN_learn classif.logreg 0.8051297 0.3690590 0.8828430
## 7 BN_elicit classif.rpart 0.7845666 0.4103314 0.8719835
## 8 BN_elicit classif.logreg 0.7888414 0.3887949 0.8727967
## auc.test.mean
## 1 0.8430715
## 2 0.9026087
## 3 0.6285774
## 4 0.6223223
## 5 0.6259237
## 6 0.8447065
## 7 0.5896686
## 8 0.5469897
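For reference, the four measures in the table (accuracy, balanced error rate, F1, and area under the ROC curve) can be computed by hand from a vector of true labels and predicted probabilities. The base R sketch below uses eight made-up cases and the standard textbook definitions; which class mlr treats as positive in the benchmark above is not shown in this vignette, so the numbers here are not meant to reproduce it.

```r
# Made-up labels (1 = positive class) and predicted probabilities.
truth <- c(1, 1, 1, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05)
pred  <- as.integer(prob > 0.5)

# Confusion matrix counts.
tp <- sum(pred == 1 & truth == 1)
fn <- sum(pred == 0 & truth == 1)
fp <- sum(pred == 1 & truth == 0)
tn <- sum(pred == 0 & truth == 0)

# Accuracy: share of correct predictions.
acc <- (tp + tn) / length(truth)

# Balanced error rate: average of the error rates within each class.
ber <- mean(c(fn / (tp + fn), fp / (fp + tn)))

# F1: harmonic mean of precision and recall for the positive class.
f1 <- 2 * tp / (2 * tp + fp + fn)

# AUC via ranks: the share of positive/negative pairs in which the positive
# case receives the higher predicted probability.
r <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

round(c(acc = acc, ber = ber, f1 = f1, auc = auc), 3)
```

Accuracy and F1 depend on a single decision threshold, whereas AUC summarises how well the predicted probabilities rank the cases, which is one reason the measures can disagree about the same model.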