Creating and visualizing spatial simulations of tumor growth using SITH

Introduction

The cells within a cancerous tumor are usually highly diverse. An average tumor contains hundreds of thousands of mutations spread throughout billions of cancer cells, although it is thought that only a small percentage of these mutations are “drivers” which facilitate the progression of cancer into later stages (Greaves and Maley 2012). A lack of understanding about the evolutionary process which results in the observed intratumor heterogeneity is a major obstacle preventing the development of effective cancer therapies (Stanta and Bonin 2018).

Tumor growth is inherently a three-dimensional process. For this reason, the most accurate models of intratumor heterogeneity will factor in the effects that spatial growth has on the diversity of the tumor. One of the most popular models was introduced by Waclaw et. al. (2015), which models spatial tumor growth as a multi-type branching process where cells occupy sites on a 3D lattice. Similar models have been studied in more recent literature (See Chkhaidze et. al. (2019), Opasic et. al. (2018)). However, the software is either unavailable or designed to run at the command line, which is inconvenient for R users.

This package, a Spatial model of Intra-tumor Heterogeneity (SITH), implements a 3D lattice-based model of tumor growth (see Model details) that can be used entirely from the R console. The package allows for 3D interactive visualizations of the tumor, as well as a comprehensive summary of the spatial distribution of mutants within the tumor. SITH also contains functions allowing users to create simulated datasets from the generated tumor.

Model details

Tumor growth is modeled as an exponential birth-death process where cells occupy sites on the three-dimensional integer lattice. Each cell is given a birth rate \(b\) and a death rate \(d\) such that the time until cell replication or death is exponentially distributed with parameters \(b\) and \(d\), respectively. A cell can replicate if at least one of the six sites adjacent to it is unoccupied. Each time cell replication occurs, both daughter cells acquire \(Pois(u)\) mutations, for some chosen (usually small) \(u\). We follow the “infinite alleles model” (Kimura and Crow 1964) in assuming that each alteration occurs only once, so that each mutation event induces a unique allele in the population.

An alteration is a driver mutation with probability \(du\) and is a passenger otherwise. To model the selective advantage conferred to driver mutations, the birth rate of a cell depends on the number of driver mutations present in its genotype. Following a common assumption (Waclaw et. al. 2015), a cell with \(k\) driver mutations is given a birth rate \(bs^k\), where \(s \geq 1\) and \(b\) is the initial, or wild-type, birth rate. The death rate remains the same as the dynamics of the model are completely determined by \(b/d\). Since cancerous tumors are believed to often begin with a single mutated cell (Cooper 2000), the initial state of the model is a single cell at the origin with \(b > d\).

For more information on how the model is simulated, see the appendix.

Simulating spatial tumor growth

As always, we begin by loading the package and setting a seed.

set.seed(1126490984)
library(SITH)

The function simulateTumor() implements the lattice based model of tumor growth and mutation. \(N\) (population size) \(b,d,u,du,s\) are all inputs to the function, although they all have default values (see user manual).

out <- simulateTumor(N = 250000, verbose = FALSE)

simulateTumor() returns a list containing useful information for analyzing the results of the model.

names(out)
#> [1] "cell_ids"     "alleles"      "muts"         "phylo_tree"   "color_scheme" "drivers"      "time"         "params"
head(out$cell_ids)
#>     x   y   z allele nmuts distance
#> 1 -32  19  -4  41731     5 37.42993
#> 2  16  31   6      0     1 35.39774
#> 3 -30   8   7  33143     4 31.82766
#> 4  29 -23 -11  44432     6 38.61347
#> 5   5  22  21  41706     3 30.82207
#> 6 -14  10  19  41271     4 25.63201
head(out$muts)
#>   id  count      MAF
#> 1  0 250000 1.000000
#> 2  1      0 0.000000
#> 3  2 107680 0.430720
#> 4  3      0 0.000000
#> 5  4      0 0.000000
#> 6  5  25341 0.101364

As we can see, the cell_ids data frame contains the \((x,y,z)\) coordinates of the cells, the allele ID number, the number of mutations in each cell, and the Euclidean distance from the origin (computed for convenience). The muts data frame contains the ID number for each mutation and its corresponding mutation allele frequency (MAF) in the total population.

There is other useful information contained in out such as a phylogenetic tree so that an “order of mutations” can be defined. See the user manual for the details on all the returned components.

Visualization

SITH provides functions to visualize the simulated tumor. It is strongly recommended that the user has installed the rgl package. With rgl, the function visualizeTumor() will generate an interactive 3D plot of the simulated tumor. If rgl is not installed, then a static plot will be made with scatterplot3d().

visualizeTumor(out, background = "white")

Another option is to use plot.type = "heat", which colors cells on a scale from blue to red, depending on the number of mutations within the cell. This allows for the user to observe regions of high mutation counts.

visualizeTumor(out, background = "white", plot.type = "heat")

One can easily look inside the tumor by plotting a (static) cross-section with the plotSlice() function. One can slice in any coordinate direction and level with the slice.dim and level arguments (see the user manual).

par(mfrow = c(1,2))
plotSlice(tumor = out)
plotSlice(tumor = out, plot.type = "heat")

Spatial distribution of mutants

One of the main reasons for using a spatial simulation of tumor growth is to investigate biases in the distribution of mutants throughout the tumor. There are countless questions that can be asked, and hopefully the return value of simulateTumor() will give enough information to answer most of these. To get a quick summary of the spatial distribution of mutations, SITH includes spatialDistribution(). This includes a plot of the number of mutations as a function of Euclidean distance, a histogram of the number of mutations, and the mutations with the largest MAF. We can compare the similarity of two cells \(A\) and \(B\) by viewing their genotypes as a binary vector (with component \(i\) on if mutation \(i\) is present in the cell) and computing the jaccard index \(J(A,B) = |A \cap B|/|A \cup B|\). spatialDistribution() also estimates the average Jaccard index of cells as a function of the Euclidean distance between them.

sp <- spatialDistribution(tumor = out)

Note that the list sp also contains all of the raw data needed to generate the plots.

names(sp)
#> [1] "mean_mutant" "mean_driver" "jaccard"

Simulating sequencing data

An exciting application of the spatial models of intratumor heterogeneity is to use them to generate synthetic sequencing data sets.

Recently, there have been a myriad of algorithms designed to infer clonal composition or tumor phylogeny from single cell sequencing or bulk sequencing data sets (for example, Kuipers and Beerenwinkel (2016)). It is not well understood what the impact of spatially biased sampling methods will have on these methods. Using simulated data sets from spatial models could be helpful in determining which inference algorithms are likely to be useful in practice.

SITH contains several functions that simulate single cell sequencing and bulk sampling, allowing the user to get synthetic sequencing data sets from the tumor.

Single cell sequencing

randomSingleCells() will take random cells from the tumor and report the list of mutations present in each cell. Due to artifacts of sequencing technology, it is expected that there are a large number of false negatives. To account for this, the fnr parameter introduces false negatives into the data set at the specified rate (there is also a fpr parameter for false positives). The function returns a binary matrix, where a \(1\) indicates that a mutation is present.

Scs <- randomSingleCells(tumor = out, ncells = 5, fnr = 0.1)
head(Scs)
#>      mutID-0 mutID-2 mutID-27304 mutID-72723 mutID-5483 mutID-49311
#> SC-1       1       1           0           0          0           0
#> SC-2       1       0           0           0          0           0
#> SC-3       1       0           1           1          0           0
#> SC-4       0       0           0           0          0           0
#> SC-5       1       1           0           0          1           1

The user can also sequence a single cell at a specified \((x,y,z)\) location using the singleCell() function with argument pos = (x,y,z). See the user manual for details.

Bulk sampling

One can also simulate bulk sampling by taking a collection of cells and calculating the variant allele frequency (VAF) of each mutation. One way to define a collection of cells is to take a \(n \times n \times n\) cube. Function randomBulkSamples will choose random locations for the cubes. Note that argument cube.length must be odd in order for the cube to have a well-defined center. In practice, mutations that fall a certain VAF (\(\approx 0.05\)) are either ignored or undetected due to technological artifacts. To account for this, the threshold argument will ignore mutations below a certain VAF. The return of the function is a matrix of VAFs.

Bulks <- randomBulkSamples(tumor = out, nsamples = 5, cube.length = 7, threshold = 0.05)
head(Bulks)
#>          mutID-0   mutID-2 mutID-6845 mutID-8724 mutID-19947 mutID-30325    mutID-5  mutID-633 mutID-3339 mutID-414 mutID-5117 mutID-30149 mutID-2217 mutID-5142 mutID-7104 mutID-21006 mutID-22691 mutID-27447 mutID-941 mutID-1840 mutID-28010 mutID-5655 mutID-6297 mutID-17606 mutID-4164 mutID-7355 mutID-738  mutID-187  mutID-597 mutID-5970 mutID-3072 mutID-6096 mutID-2507 mutID-3952 mutID-15681 mutID-17503 mutID-9068
#> Bulk-1 0.8279883 0.6822157 0.09912536 0.09912536  0.09329446  0.06705539 0.09037901 0.08163265 0.08163265  0.148688 0.05247813  0.05830904  0.0845481  0.0000000  0.0000000   0.0000000   0.0000000   0.0000000 0.0000000  0.0000000  0.00000000  0.0000000  0.0000000   0.0000000   0.000000  0.0000000  0.000000 0.00000000 0.00000000 0.00000000  0.0000000 0.00000000   0.000000  0.0000000   0.0000000  0.00000000 0.00000000
#> Bulk-2 0.5568513 0.1924198 0.00000000 0.00000000  0.00000000  0.00000000 0.10787172 0.00000000 0.00000000  0.000000 0.00000000  0.00000000  0.0000000  0.2565598  0.2565598   0.1428571   0.1428571   0.1137026 0.1865889  0.1749271  0.08746356  0.0000000  0.0000000   0.0000000   0.000000  0.0000000  0.000000 0.00000000 0.00000000 0.00000000  0.0000000 0.00000000   0.000000  0.0000000   0.0000000  0.00000000 0.00000000
#> Bulk-3 0.7259475 0.0000000 0.00000000 0.00000000  0.00000000  0.00000000 0.00000000 0.00000000 0.00000000  0.000000 0.00000000  0.00000000  0.0000000  0.0000000  0.0000000   0.0000000   0.0000000   0.0000000 0.0000000  0.0000000  0.00000000  0.1282799  0.1282799   0.1253644   0.148688  0.1457726  0.000000 0.00000000 0.00000000 0.00000000  0.0000000 0.00000000   0.000000  0.0000000   0.0000000  0.00000000 0.00000000
#> Bulk-4 0.7696793 0.3411079 0.00000000 0.00000000  0.00000000  0.00000000 0.34110787 0.00000000 0.00000000  0.000000 0.00000000  0.00000000  0.0000000  0.0000000  0.0000000   0.0000000   0.0000000   0.0000000 0.0000000  0.0000000  0.00000000  0.0000000  0.0000000   0.0000000   0.000000  0.0000000  0.244898 0.08746356 0.08746356 0.05247813  0.0000000 0.00000000   0.000000  0.0000000   0.0000000  0.00000000 0.00000000
#> Bulk-5 0.7725948 0.2973761 0.00000000 0.00000000  0.00000000  0.00000000 0.00000000 0.00000000 0.00000000  0.000000 0.00000000  0.00000000  0.0000000  0.0000000  0.0000000   0.0000000   0.0000000   0.0000000 0.0000000  0.0000000  0.00000000  0.0000000  0.0000000   0.0000000   0.000000  0.0000000  0.000000 0.00000000 0.00000000 0.00000000  0.2390671 0.09912536   0.244898  0.1632653   0.1253644  0.06705539 0.09620991

If instead one would like to choose the location of the cube, bulkSample() can be used and pos can be set to the cube center.

One can also simulate a long, thin needle passing through the tumor to collect a sample, as described in Chkhaidze et. al. (2019). This is implemented as randomNeedles() and takes the same arguments as randomBulkSamples (minus cube.length).

Needs <- randomNeedles(tumor = out, nsamples = 5, threshold = 0.05)
head(Needs)
#>        mutID-0   mutID-2 mutID-289 mutID-1783    mutID-5 mutID-415 mutID-12435   mutID-16 mutID-36810  mutID-187  mutID-597 mutID-7543 mutID-9743 mutID-1579 mutID-2066 mutID-414  mutID-82 mutID-317 mutID-6845 mutID-8724 mutID-4538 mutID-331 mutID-2723 mutID-969 mutID-224 mutID-2847 mutID-9608 mutID-10538 mutID-6855 mutID-183 mutID-1975 mutID-11576 mutID-21976  mutID-941 mutID-13546
#> Bulk-1       1 0.9361702 0.7872340   0.106383 0.06382979 0.4255319    0.106383 0.06382979    0.106383 0.00000000 0.00000000 0.00000000 0.00000000   0.000000   0.000000  0.000000 0.0000000 0.0000000   0.000000   0.000000  0.0000000 0.0000000  0.0000000    0.0000    0.0000     0.0000 0.00000000  0.00000000     0.0000 0.0000000  0.0000000  0.00000000  0.00000000 0.00000000           0
#> Bulk-2       1 0.6326531 0.0000000   0.000000 0.30612245 0.0000000    0.000000 0.00000000    0.000000 0.06122449 0.06122449 0.06122449 0.06122449   0.122449   0.122449  0.122449 0.1836735 0.1836735   0.122449   0.122449  0.0000000 0.0000000  0.0000000    0.0000    0.0000     0.0000 0.00000000  0.00000000     0.0000 0.0000000  0.0000000  0.00000000  0.00000000 0.00000000           0
#> Bulk-3       1 0.6666667 0.1041667   0.000000 0.22916667 0.0625000    0.000000 0.08333333    0.000000 0.00000000 0.00000000 0.00000000 0.00000000   0.000000   0.000000  0.000000 0.0000000 0.0000000   0.000000   0.000000  0.1458333 0.1041667  0.1041667    0.1875    0.0625     0.0625 0.08333333  0.08333333     0.0625 0.0000000  0.0000000  0.00000000  0.00000000 0.00000000           0
#> Bulk-4       1 0.7894737 0.0000000   0.000000 0.00000000 0.0000000    0.000000 0.00000000    0.000000 0.00000000 0.00000000 0.00000000 0.00000000   0.000000   0.000000  0.000000 0.0000000 0.0000000   0.000000   0.000000  0.0000000 0.0000000  0.0000000    0.0000    0.0000     0.0000 0.00000000  0.00000000     0.0000 0.2236842  0.1052632  0.09210526  0.09210526 0.09210526           0
#> Bulk-5       1 0.0000000 0.0000000   0.000000 1.00000000 0.0000000    0.000000 0.00000000    0.000000 0.00000000 0.00000000 0.00000000 0.00000000   0.000000   0.000000  0.000000 0.0000000 0.0000000   0.000000   0.000000  1.0000000 0.0000000  0.0000000    0.0000    0.0000     0.0000 0.00000000  0.00000000     0.0000 0.0000000  0.0000000  0.00000000  0.00000000 0.00000000           1

There is currently no function allowing the user to select the location of the user, although this could be included in future versions.

Appendix

The simulation algorithm

The model is simulated using a Gillespie algorithm (Gillespie 1977). Given a population of \(N\) cells at time \(t\), cell \(i\) is chosen to replicate with probability \(b_i/\sum_{j=1}^N (b_j + d_j)\) (provided a free space is available) and die with probability \(d_i/\sum_{j=1}^N(b_j + d_j)\). After an event is selected, the time is updated to be \(t + X\), where \[ X \sim Expo \left( \sum_{j=1}^N b_j + d_j \right)\] Our simulation algorithm approximates the parameter of the exponential distribution with \(p_{max} = N \max_j (b_j + d_j)\).

A standard approach is to use inverse transform sampling to select a cell for replication or death. However, this requires computing a cumulative sum over the birth and death rates for all unique alleles in the population, which is likely to scale linearly with \(N\). We circumvent this issue by using rejection sampling. On each iteration, a random cell \(i\) is selected uniformly from the population. Next, we obtain a sample \(u\) from the distribution over \([0, p_{max}]\), where \(p_{max}\) is as above. If \(u < b_i + d_i\), then cell \(i\) is selected to replicate or die. Otherwise, we proceed to the next iteration. Although there are contrived examples where rejection sampling is inefficient, the expected run-time is nearly constant for reasonable parameter values.

Session information

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SITH_1.0.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6            knitr_1.28              magrittr_1.5            scatterplot3d_0.3-41    xtable_1.8-4            R6_2.4.1                rlang_0.4.6             fastmap_1.0.1           stringr_1.4.0           tools_4.0.0             webshot_0.5.2           xfun_0.14               miniUI_0.1.1.1          htmltools_0.4.0         crosstalk_1.1.0.1       yaml_2.2.1              digest_0.6.25           rgl_0.100.54            shiny_1.4.0.2           later_1.1.0.1           htmlwidgets_1.5.1       promises_1.1.0          manipulateWidget_0.10.1 evaluate_0.14           mime_0.9                rmarkdown_2.2           stringi_1.4.6           compiler_4.0.0          jsonlite_1.6.1          httpuv_1.5.4