Data science applications need data for prototyping and for demonstrations to potential clients. Production data can serve this purpose, but using it is not always feasible because of legal or ethical considerations (Team Nuggets, n.d.). This creates a need for synthetic data, and that need is the key motivator for the conjurer package.
Data across many domains exhibit some form of seasonality, cyclicality, and trend. The synthetic data generation packages currently available focus primarily on synthetic versions of microdata that contain confidential information, or on machine learning use cases. There is a need for a more generic synthetic data generation package that serves multiple purposes such as forecasting, customer segmentation, and insight generation. The conjurer package helps generate such synthetic data.
Let us consider an example of generating transactional data for a retail store. The following steps will help in building such data.
Install the conjurer package, for example with install.packages("conjurer"). Because the package uses only base R functions, it has no dependencies.
A customer is identified by a unique customer identifier (ID). A customer ID is alphanumeric: the prefix “cust” followed by a number that ranges from 1 to the number of customers passed to the function. For example, if there are 100 customers, the customer IDs range from cust001 to cust100. The zero padding ensures that every customer ID has the same length. Customer IDs are built using the function buildCust, which takes one argument, “numOfCust”, specifying the number of customer IDs to build. For simplicity, let us assume that there are 100 customers and build their IDs using the following code.
library(conjurer)
customers <- buildCust(numOfCust = 100)
print(head(customers))
#> [1] "cust001" "cust002" "cust003" "cust004" "cust005" "cust006"
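The fixed-length property described above can be illustrated without the package. The following is a minimal base R sketch, not the implementation used by buildCust; the helper name `makeIds` is hypothetical.

```r
# Hypothetical helper (not part of conjurer): build zero-padded IDs so that
# every ID has the same number of characters.
makeIds <- function(prefix, n) {
  n <- as.integer(n)
  width <- nchar(sprintf("%d", n))  # number of digits in the largest ID
  paste0(prefix, formatC(seq_len(n), width = width, flag = "0"))
}

ids <- makeIds("cust", 100)
head(ids, 3)
#> [1] "cust001" "cust002" "cust003"
unique(nchar(ids))
#> [1] 7
```

The padding width is taken from the largest number, so the same helper yields sku01 to sku10 when called with the prefix "sku" and 10 items.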
A list of customer names for the 100 customer IDs can be generated in the following way.
Let us assign customer names to customer IDs. This is a random one-to-one mapping, built using the following code.
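As a package-independent illustration of these two steps, here is a base R sketch. It is a stand-in, not conjurer's name generator: the helper `randomName` and the name length of 5 to 7 letters are assumptions made for the example. The column names `customers` and `customerName` are chosen to match the merge used later in this document.

```r
set.seed(42)  # only to make the sketch reproducible

# Hypothetical stand-in for a name generator: random lowercase strings.
randomName <- function(minLength = 5, maxLength = 7) {
  len <- minLength + sample.int(maxLength - minLength + 1, 1) - 1
  paste(sample(letters, len, replace = TRUE), collapse = "")
}

customers <- sprintf("cust%03d", 1:100)
custNames <- replicate(100, randomName())

# Shuffle the names and pair them with the IDs: each customer ID is
# matched with exactly one name.
customer2name <- data.frame(
  customers = customers,
  customerName = sample(custNames),
  stringsAsFactors = FALSE
)
head(customer2name)
```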
The next step is building some products. A product is identified by a product ID. Like a customer ID, a product ID is alphanumeric, with the prefix “sku”, which stands for stock keeping unit. This prefix is followed by a number that ranges from 1 to the number of products passed to the function. For example, if there are 10 products, the product IDs range from sku01 to sku10. The zero padding ensures that every product ID has the same length. Besides the number of products, the product price range must be specified. Products are built using the function buildProd, which takes 3 arguments: the number of products, the minimum price, and the maximum price. For simplicity, let us assume that there are 10 products priced between 5 and 50 dollars and build them using the following code.
Now that the customer IDs and products are built, the next step is to build transactions. Transactions are built using the function genTrans. This function takes 5 arguments. The details are as follows.
Let us build transactions using the following code.
Visualize the generated transactions using the following code.
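To make the shape of such a product table concrete, here is a base R sketch under stated assumptions. It does not call buildProd; the variable names `nProd`, `priceMin`, `priceMax` and the column name `Price` are illustrative, while the column name `SKU` matches the one used later in this document.

```r
set.seed(1)  # only to make the sketch reproducible

nProd <- 10     # number of products (illustrative)
priceMin <- 5   # lowest price in dollars
priceMax <- 50  # highest price in dollars

products <- data.frame(
  SKU = sprintf("sku%02d", seq_len(nProd)),
  Price = round(runif(nProd, min = priceMin, max = priceMax), 2),
  stringsAsFactors = FALSE
)
head(products)
```

Prices here are drawn uniformly from the range, which is an assumption of the sketch rather than a statement about how the package assigns prices.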
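As a rough picture of what a transaction table with seasonality and trend contains, the following base R sketch builds one by hand. It is not the genTrans implementation: the baseline volume, the sine-shaped yearly cycle, the trend slope, and the non-leap reference year 2019 are all assumptions. The column names follow the output shown later in this document.

```r
set.seed(7)  # only to make the sketch reproducible

dayNum <- 1:365
# Assumed expected daily volume: baseline + yearly cycle + mild upward trend.
expected <- 25 + 8 * sin(2 * pi * dayNum / 365) + 0.01 * dayNum
dailyCount <- rpois(365, lambda = expected)

txnDay <- rep(dayNum, times = dailyCount)  # one entry per transaction
txnSeq <- sequence(dailyCount)             # 1, 2, ... within each day

transactions <- data.frame(
  transactionID = sprintf("txn-%d-%02d", txnDay, txnSeq),
  dayNum = txnDay,
  # Month number derived from the day number in a non-leap year.
  mthNum = as.integer(format(as.Date(txnDay - 1, origin = "2019-01-01"), "%m")),
  stringsAsFactors = FALSE
)
head(transactions)
```

Because each ID combines the day number with a within-day counter, the IDs are unique by construction.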
TxnAggregated <- aggregate(transactions$transactionID, by = list(transactions$dayNum), length)
plot(TxnAggregated, type = "l", ann = FALSE)
Bringing customers, products and transactions together is the final step of generating synthetic data. This process entails 3 steps as given below.
The allocation of transactions is achieved with the buildPareto function. This function takes 3 arguments as detailed below.
Let us now allocate transactions to customers first by using the following code.
Assign readable names to the output by using the following code.
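The idea behind a Pareto allocation can be sketched in base R. This is an illustration of the concept, not the buildPareto implementation: the 80/20 split, the weighting scheme, and the simplified transaction IDs are assumptions. The resulting column names match those used in the merge later in this document.

```r
set.seed(11)  # only to make the sketch reproducible

customers <- sprintf("cust%03d", 1:100)
transactionID <- sprintf("txn-%05d", 1:10000)  # simplified IDs for the sketch

# Assumed 80/20 split: the first 20% of customers share 80% of the sampling
# weight, and the remaining 80% of customers share the other 20%.
nTop <- round(0.2 * length(customers))
nRest <- length(customers) - nTop
weights <- c(rep(0.8 / nTop, nTop), rep(0.2 / nRest, nRest))

customer2transaction <- data.frame(
  transactionID = transactionID,
  customer = sample(customers, length(transactionID), replace = TRUE, prob = weights),
  stringsAsFactors = FALSE
)
head(customer2transaction)
```

Each transaction is assigned to exactly one customer, while a small group of customers receives most of the transactions.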
Now, following the same steps, allocate transactions to products using the following code.
product2transaction <- buildPareto(products$SKU, transactions$transactionID, pareto = c(70, 30))
names(product2transaction) <- c('transactionID', 'SKU')
#inspect the output
print(head(product2transaction))
#> transactionID SKU
#> 1 txn-4-15 sku05
#> 2 txn-115-10 sku07
#> 3 txn-310-11 sku03
#> 4 txn-358-38 sku03
#> 5 txn-18-11 sku05
#> 6 txn-246-18 sku07
The following code brings together transactions, products and customers into one data frame.
df1 <- merge(x = customer2transaction, y = product2transaction, by = "transactionID")
df2 <- merge(x = df1, y = transactions, by = "transactionID", all.x = TRUE)
#inspect the output
print(head(df2))
#> transactionID customer SKU dayNum mthNum
#> 1 txn-1-01 cust062 sku05 1 1
#> 2 txn-1-02 cust041 sku07 1 1
#> 3 txn-1-03 cust051 sku07 1 1
#> 4 txn-1-04 cust093 sku07 1 1
#> 5 txn-1-05 cust001 sku02 1 1
#> 6 txn-1-06 cust084 sku05 1 1
We can add additional data such as customer name using the code below.
dfFinal <- merge(x = df2, y = customer2name, by.x = "customer", by.y = "customers", all.x = TRUE)
#inspect the output
print(head(dfFinal))
#> customer transactionID SKU dayNum mthNum customerName
#> 1 cust001 txn-272-10 sku05 272 9 belarie
#> 2 cust001 txn-143-3 sku07 143 5 belarie
#> 3 cust001 txn-249-31 sku03 249 9 belarie
#> 4 cust001 txn-153-18 sku05 153 6 belarie
#> 5 cust001 txn-290-12 sku07 290 10 belarie
#> 6 cust001 txn-176-08 sku01 176 6 belarie
Thus, we have the final data set with transactions, customers and products.
The column names of the final data frame can be interpreted as follows.

+ Each row is a transaction, and the data frame has all the transactions for a year, i.e., 365 days.
+ transactionID is the unique identifier for that transaction.
+ customer is the unique customer identifier. This is the customer who made that transaction.
+ SKU is the product that was bought in that transaction.
+ dayNum is the day number in the year. There would be 365 unique dayNum values in the data frame.
+ mthNum is the month number. This ranges from 1 to 12 and represents January to December respectively.
+ customerName is the name of the customer.
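The interpretation above can be turned into a few sanity checks. The sketch below is one possible way to do this in base R; the helper name `checkTxnFrame` is hypothetical, and the two toy rows are copied from the output shown earlier.

```r
# Hypothetical helper: stop with an error if a data frame does not follow
# the column interpretation described above.
checkTxnFrame <- function(df) {
  required <- c("customer", "transactionID", "SKU", "dayNum", "mthNum", "customerName")
  stopifnot(
    all(required %in% names(df)),             # all expected columns exist
    anyDuplicated(df$transactionID) == 0,     # transaction IDs are unique
    all(df$dayNum >= 1 & df$dayNum <= 365),   # day number within one year
    all(df$mthNum %in% 1:12)                  # month number is 1 to 12
  )
  invisible(TRUE)
}

toy <- data.frame(
  customer = c("cust001", "cust001"),
  transactionID = c("txn-272-10", "txn-143-3"),
  SKU = c("sku05", "sku07"),
  dayNum = c(272, 143),
  mthNum = c(9, 5),
  customerName = c("belarie", "belarie"),
  stringsAsFactors = FALSE
)
checkTxnFrame(toy)  # passes silently
```

With the full data set, the same check could be run as checkTxnFrame(dfFinal).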
Let us visualize the results to understand the data distribution.
Below is a view of the number of transactions per day.
aggregatedDataDay <- aggregate(dfFinal$transactionID, by = list(dfFinal$dayNum), length)
plot(aggregatedDataDay, type = "l", ann = FALSE)
Below is a view of the number of transactions per month.
aggregatedDataMth <- aggregate(dfFinal$transactionID, by = list(dfFinal$mthNum), length)
aggregatedDataMthSorted <- aggregatedDataMth[order(aggregatedDataMth$Group.1),]
plot(aggregatedDataMthSorted, ann = FALSE)
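Beyond the plots, the concentration of transactions among customers can be checked numerically. The following base R sketch shows one way to do so; the helper name `topShare` and the toy data are hypothetical.

```r
# Hypothetical helper: fraction of rows that belong to the most active
# `topFrac` of distinct IDs.
topShare <- function(ids, topFrac = 0.2) {
  counts <- sort(table(ids), decreasing = TRUE)
  nTop <- max(1, round(topFrac * length(counts)))
  sum(counts[seq_len(nTop)]) / sum(counts)
}

# Toy data: 2 of 10 customers account for 80 of 100 transactions.
toyCustomers <- c(
  rep("cust01", 50),
  rep("cust02", 30),
  sprintf("cust%02d", rep(3:10, length.out = 20))
)
topShare(toyCustomers)
#> [1] 0.8
```

Applied to the final data set, topShare(dfFinal$customer) would report the share of transactions made by the most active 20% of customers.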
Team Nuggets. n.d. https://www.cbtnuggets.com/blog/certifications/microsoft/how-to-generate-dummy-data-for-your-database.