Simulating study data: introduction

Keith S. Goldfeld

2020-03-31

Simulation using simstudy has two primary steps. First, the user defines the data elements of a data set. Second, the user generates the data, using the definitions in the first step. Additional functionality exists to simulate observed or randomized treatment assignment/exposures, to generate survival data, to create longitudinal/panel data, to create multi-level/hierarchical data, to create datasets with correlated variables based on a specified covariance structure, to merge datasets, to create data sets with missing data, and to create non-linear relationships with underlying spline curves.

This vignette provides a brief introduction to the basics of generating data. For information on more elaborate data generating mechanisms, please visit https://www.rdatagen.net/page/simstudy/.

Defining the Data

The key to simulating data in simstudy is the creation of series of data definition tables that look like this:

varname formula variance dist link
nr 7 0e+00 nonrandom identity
x1 10;20 0e+00 uniform identity
y1 nr + x1 * 2 8e+00 normal identity
y2 nr - 0.2 * x1 0e+00 poisson log
xnb nr - 0.2 * x1 5e-02 negBinomial log
xCat 0.3;0.2;0.5 0e+00 categorical identity
g1 5+xCat 1e+00 gamma log
b1 1+0.3*xCat 1e+00 beta logit
a1 -3 + xCat 0e+00 binary logit
a2 -3 + xCat 1e+02 binomial logit

These definition tables can be generated two ways. One option is to to use any external editor that allows the creation of csv files, which can be read in with a call to defRead. An alternative is to make repeated calls to the function defData. Here, we illustrate the R code that builds this definition table internally:

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", 
    link = "log")
def <- defData(def, varname = "xnb", dist = "negBinomial", formula = "nr - 0.2 * x1", 
    variance = 0.05, link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, 
    link = "log")
def <- defData(def, varname = "b1", dist = "beta", formula = "1+0.3*xCat", variance = 1, 
    link = "logit")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", 
    link = "logit")
def <- defData(def, varname = "a2", dist = "binomial", formula = "-3 + xCat", 
    variance = 100, link = "logit")

The first call to defData without specifying a definition name (in this example the definition name is def) creates a new data.table with a single row. An additional row is added to the table def each time the function defData is called. Each of these calls is the definition of a new field in the data set that will be generated. In this example, the first data field is named ‘nr’, defined as a constant with a value to be 7. In each call to defData the user defines a variable name, a distribution (the default is ‘normal’), a mean formula (if applicable), a variance parameter (if applicable), and a link function for the mean (defaults to ‘identity’).

The possible distributions include normal, gamma, poisson, zero-truncated poisson, negative binomial, binary, binomial, beta, uniform, uniform integer, categorical, and deterministic/non-random. For all of these distributions, key parameters defining the distribution are entered in the formula, variance, and link fields.

In the case of the normal, gamma, beta, and negative binomial distributions, the formula specifies the mean. The formula can be a scalar value (number) or a string that represents a function of previously defined variables in the data set definition (or, as we will see later, in a previously generated data set). In the example, the mean of y1, a normally distributed value, is declared as a linear function of nr and x1, and the mean of g1 is a function of the category defined by xCat. The variance field is defined only for normal, gamma, beta, and negative binomial random variables, and can only be defined as a scalar value. In the case of gamma, beta, and negative binomial variables, the value entered in variance field is really a dispersion value \(d\). The variance of a gamma distributed variable will be \(d \times mean^2\), for a beta distributed variable will be \(mean \times (1- mean)/(1 + d)\), and for a negative binomial distributed variable, the variance will be \(mean + d*mean^2\).

In the case of the poisson, zero-truncated poisson, and binary distributions, the formula also specifies the mean. The variance is not a valid parameter in these cases, but the link field is. The default link is ‘identity’ but a ‘log’ link is available for the Poisson distributions and a “logit” link is available for the binary outcomes. In this example, y2 is defined as Poisson random variable with a mean that is function of nr and x1 on the log scale. For binary variables, which take a value of 0 or 1, the formula represents probability (with the ‘identity’ link) or log odds (with the ‘logit’ link) of the variable having a value of 1. In the example, a1 has been defined as a binary random variable with a log odds that is a function of xCat.

In the case of the binomial distribution, the formula specifies the probability of success \(p\), and the variance field is used to specify the number of trials \(n\). The mean of this distribution is \(n*p\), and the variance is \(n*p*(1-p)\).

Variables defined with a uniform, uniform integer, categorical, or deterministic/non-random distribution are specified using the formula only. The variance and link fields are not used in these cases.

For a uniformly distributed variable, The formula is a string with the format “a;b”, where a and b are scalars or functions of previously defined variables. The uniform distribution has two parameters - the minimum and the maximum. In this case, a represents the minimum and b represents the maximum.

For a categorical variable with \(k\) categories, the formula is a string of probabilities that sum to 1: “\(p_1 ; p_2 ; ... ; p_k\)”. \(p_1\) is the probability of the random variable falling category 1, \(p_2\) is the probability of category 2, etc. The probabilities can be specified as functions of other variables previously defined. In the example, xCat has three possibilities with probabilities 0.3, 0.2, and 0.5, respectively.

Non-random variables are defined by the formula. Since these variables are deterministic, variance is not relevant. They can be functions of previously defined variables or a scalar, as we see in the sample for variable defined as nr.

Generating the Data

After the data set definitions have been created, a new data set with \(n\) observations can be created with a call to function genData. In this example, 1,000 observations are generated using the data set definitions in def, and then stored in the object dt:

dt <- genData(1000, def)
dt
##       idnum nr       x1       y1  y2 xnb xCat        g1        b1 a1 a2
##    1:     1  7 18.71470 48.13110  25  36    3  882.3611 0.9707256  1 48
##    2:     2  7 12.63977 34.82680  87  97    3 1986.9499 0.8497208  1 54
##    3:     3  7 13.21247 34.96022  80  71    2 1460.2205 0.7136439  0 26
##    4:     4  7 19.21613 38.93975  17  16    1   77.6381 0.9997095  0 14
##    5:     5  7 10.70988 24.16021 148 110    2  696.4741 0.9546023  0 32
##   ---                                                                  
##  996:   996  7 12.69114 34.43474  88 117    1  480.9245 0.3714678  0 10
##  997:   997  7 11.48129 31.34903 108 125    1  235.4808 0.9427933  0 13
##  998:   998  7 16.88184 41.60436  45  50    3 2425.0456 0.9999994  1 51
##  999:   999  7 10.24263 25.36589 151 107    3 2537.5048 0.9978473  0 49
## 1000:  1000  7 12.72076 33.53079  78  70    1  605.9685 0.4719441  0 13

New data can be added to an existing data set with a call to function addColumns. The new data definitions are created with a call to defData and then included as an argument in the call to addColumns:

addef <- defDataAdd(varname = "zExtra", dist = "normal", formula = "3 + y1", 
    variance = 2)

dt <- addColumns(addef, dt)
dt
##       idnum nr       x1       y1  y2 xnb xCat        g1        b1 a1 a2
##    1:     1  7 18.71470 48.13110  25  36    3  882.3611 0.9707256  1 48
##    2:     2  7 12.63977 34.82680  87  97    3 1986.9499 0.8497208  1 54
##    3:     3  7 13.21247 34.96022  80  71    2 1460.2205 0.7136439  0 26
##    4:     4  7 19.21613 38.93975  17  16    1   77.6381 0.9997095  0 14
##    5:     5  7 10.70988 24.16021 148 110    2  696.4741 0.9546023  0 32
##   ---                                                                  
##  996:   996  7 12.69114 34.43474  88 117    1  480.9245 0.3714678  0 10
##  997:   997  7 11.48129 31.34903 108 125    1  235.4808 0.9427933  0 13
##  998:   998  7 16.88184 41.60436  45  50    3 2425.0456 0.9999994  1 51
##  999:   999  7 10.24263 25.36589 151 107    3 2537.5048 0.9978473  0 49
## 1000:  1000  7 12.72076 33.53079  78  70    1  605.9685 0.4719441  0 13
##         zExtra
##    1: 53.34158
##    2: 35.55258
##    3: 39.74581
##    4: 38.61562
##    5: 27.66564
##   ---         
##  996: 39.48614
##  997: 35.11438
##  998: 43.95011
##  999: 27.75513
## 1000: 38.88947

Generating the Treatment/Exposure

Treatment assignment can be accomplished through the original data generation process, using defData and genData. However, the functions trtAssign and trtObserve provide more options to generate treatment assignment.

Assigned treatment

Treatment assignment can simulate how treatment is made in a randomized study. Assignment to treatment groups can be (close to) balanced (as would occur in a block randomized trial); this balancing can be done without or without strata. Alternatively, the assignment can be left to chance without blocking; in this case, balance across treatment groups is not guaranteed, particularly with small sample sizes.

First, create the data definition:

Balanced treatment assignment, stratified by gender and age category (not blood pressure)

##      male over65 cid  baseDBP rxGrp
##   1:    1      0   1 69.71811     3
##   2:    0      0   2 68.21481     1
##   3:    1      0   3 63.64589     1
##   4:    0      0   4 67.40492     2
##   5:    0      0   5 72.96366     2
##  ---                               
## 326:    1      1 326 67.99205     1
## 327:    1      0 327 55.73555     2
## 328:    1      0 328 74.85385     2
## 329:    1      0 329 75.20651     2
## 330:    1      0 330 66.24796     1

Balanced treatment assignment (without stratification)

Random (unbalanced) treatment assignment