ActiveDriverWGS

Helen Zhu, Dr. Juri Reimand

2019-03-18

Introduction

Cancer is driven by somatic mutations. A few of these mutations confer a survival advantage to the tumour (drivers) while most mutations play a passive role in tumour development (passengers). Most driver mutations up to this point have been discovered in protein coding genes in whole exome sequencing datasets. Yet, as whole genome sequencing datasets grow and become the standard, novel methods are required to detect driver mutations in the vast noncoding regulatory genome. Here, we present our solution to this challenge, ActiveDriverWGS.

The ActiveDriverWGS Model

ActiveDriverWGS is a recurrence-based method which builds on the idea that driver mutations are subject to positive selection. This method analyzes the mutational burden of SNVs and short indels (less than 50bps) in functionally defined elements. These elements can include the coding regions of protein coding genes and noncoding regulatory regions such as promoters, enhancers and untranslated regions.

Optionally, the tested genomic elements may also include active sites of interest which reside within the elements themselves. Examples of such active sites include post-translational modification sites in protein coding genes, miRNA binding sites in mRNA sequences and transcription factor binding sites in enhancers. If sites are specified, enrichment of site mutations are estimated in elements with enriched mutational burden using the same model comparing sites to element.

ActiveDriverWGS uses a Poisson generalized linear regression to compare the mutational burden of elements against the expected mutational burden of a narrow background window (Figure 1). In our work, we have optimized the background to be 50,000 bps upstream and downstream of elements. ActiveDriverWGS also incorporates the effect of mutational signatures, specifically the probability distribution of SNVs arising across trinucleotide contexts which vary with mutational processes. If an element is segmented, such as the exons of a protein coding gene, the inter-segment sequences are also used to calculate the expected background rate, up to +/- 50 kbps around every segment. Users also have the option to exclude hyper-mutated samples which decrease the accuracy of driver discovery.

Publication

For a more detailed reference on ActiveDriverWGS, please refer to the preprint.

Please Note

The genome build for ActiveDriverWGS is hg19. We are working on adding additional options for hg38.

The ActiveDriverWGS Model

The ActiveDriverWGS Model

Input Data

ActiveDriverWGS requires a file for somatic mutations and a file for genomic elements. A third optional file for sites can be specified by the user. Elements may contain multiple segments, each represented in a separate row. Segments belonging to the same element must share a common id unique to the element. Sites will only be incorporated in the model if they reside within elements but not all elements need to contain sites. Elements may contain multiple sites. Site ids are independent of element ids.

Mutations

The mutations data must be in a data frame containing the columns with the correct column names chr, pos1, pos2, ref, alt, and patient. Additional columns may be included but will not be analyzed.

  1. chr: autosomal chromosomes as chr1 to chr22 and sex chromosomes as chrX and chrY

  2. pos1: the start position of the mutation in base 1 coordinates

  3. pos2: the end position of the mutation in base 1 coordinates

  4. ref: the reference allele as a string containing the bases A, T, C or G

  5. alt: the alternate allele as a string containing the bases A, T, C or G

  6. patient: the patient identifier as a string

Elements and Sites

The elements and sites data must be in data frames containing the columns with the correct column names chr, start, end, and id. Additional columns may be included but will not be analyzed.

  1. chr: autosomal chromosomes as chr1 to chr22 and sex chromosomes as chrX and chrY

  2. start: the start position of the element or site in base 0 coordinates (BED format)

  3. end: the end position of the element or site in base 0 coordinates (BED format)

  4. id: the element identifier - if the element contains multiple segments such as exons, each segment should be a separate row with the segment coordinates and the element identifier as id. Elements can be coding or noncoding such as exons of protein coding genes or active enhancers.

Example

library(ActiveDriverWGS)

data("cll_mutations")
head(cll_mutations)
##          chr      pos1      pos2 ref alt       patient
## 779701  chr6  96651182  96651183  AA   T 001-0002-03TD
## 779702 chr10 106556005 106556005   C   T 001-0002-03TD
## 779703 chr10 107457352 107457352   C   T 001-0002-03TD
## 779704 chr10 108111334 108111334   C   A 001-0002-03TD
## 779705 chr10 109605024 109605024   T   A 001-0002-03TD
##  [ reached getOption("max.print") -- omitted 1 row ]
data("cancer_genes")
head(cancer_genes)
##      chr   start     end       id
## 648 chr1 2488103 2488172 TNFRSF14
## 649 chr1 2489164 2489273 TNFRSF14
## 650 chr1 2489781 2489907 TNFRSF14
## 651 chr1 2491261 2491417 TNFRSF14
## 652 chr1 2492062 2492153 TNFRSF14
## 653 chr1 2492932 2492963 TNFRSF14
data("cancer_gene_sites")
head(cancer_gene_sites)
##         chr    start      end                  id
## 337831 chr1 11169412 11169412 MTOR NM_004958 2488
## 337834 chr1 11169413 11169413 MTOR NM_004958 2488
## 337837 chr1 11169414 11169414 MTOR NM_004958 2487
## 337839 chr1 11169415 11169415 MTOR NM_004958 2487
## 337841 chr1 11169416 11169416 MTOR NM_004958 2487
## 337843 chr1 11169418 11169418 MTOR NM_004958 2486

In the above example, sites have the same start and end coordinates. While this is true for these active sites, this is not a necessity. The only requirement is that sites need to be contained within elements.

Importing BED12 Files as Input Regions

For elements and sites written in BED12 files, the prepare_elements_from_BED12 function can be used to read the file and adapt it to fulfill the format requirements for the elements and sites parameters of the ActiveDriverWGS function. For more information on the BED12 format, please refer to the UCSC guidelines. In this example, elements are adapted from annotations for protein coding genes from GENCODE.v19 for chromosome 17.

elements = prepare_elements_from_BED12(
  system.file(
    "extdata", 
    "chr17.coding_regions.bed", 
    package = "ActiveDriverWGS", 
    mustWork = TRUE))
## 
##  1189  Rows :: Processing row 100  200  300  400  500  600  700  800  900  1000  1100  
##  Preparing Elements Complete 
##  RM 0 lines
head(elements)
##     chr starts  ends                                             id
## 1 chr17   6006  6168 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 2 chr17  11205 11332 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 3 chr17  11871 11981 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 4 chr17  13920 13995 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 5 chr17  22327 22407 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 6 chr17  30897 31270 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1

Basic Use

ActiveDriverWGS can be run simply with the mutations file, the elements file and an optional sites file. In this example, mutations are adapted from the Alexandrov et al, 2013 dataset for chronic lymphocytic leukemia (CLL) patients. Regions are adapted from the cancer gene census and annotations for protein coding genes are adapted from GENCODE.v19.

some_genes = c("ATM", "MYD88", "NOTCH1")

results = ActiveDriverWGS(mutations = cll_mutations,
                          elements = cancer_genes[cancer_genes$id %in% some_genes,],
                          sites = cancer_gene_sites)
## 0 remove hypermut, n= 0 ,  0 %
## hypermuted samples:   
## 
## reversing 0 positions
## Removing  0  invalid SNVs & indels
## 
## Tests to do:  3 
## ...

Parameter Interpretation

ActiveDriverWGS has several adjustable parameters:

  1. window_size: A narrow background window both upstream and downstream of the element in which mutation rates are assumed to remain constant. We have optimized this parameter on the PCAWG dataset to be 50,000 bps for SNVs and indels.

  2. filter_hyper_MB: The threshold for the number of simple somatic mutations per megabase above which a sample is considered hypermutated. We define the default to be 30 mutations/megabase according to published literature.

  3. recovery.dir: The directory for writing recovery files for ActiveDriverWGS. If the directory does not exist, it will be created. If the parameter is unspecified, recovery files will not be saved. As an ActiveDriverWGS query for large datasets may be computationally heavy, specifying a recovery directory will recover previously computed results if a query is interrupted.

  4. mc.cores: The number of cores that the user wishes to allocate to running ActiveDriverWGS. For more information, refer to the R package parallel.

Interpreting the Results

results
##       id  pp_element element_muts_obs element_muts_exp element_enriched
## 1    ATM 0.001807467                2                0             TRUE
## 2  MYD88 0.004976353                1                0             TRUE
##   pp_site site_muts_obs site_muts_exp site_enriched fdr_element fdr_site
## 1       1            NA            NA            NA 0.005422402        1
## 2       1            NA            NA            NA 0.005641867        1
##   has_site_mutations
## 1                   
## 2                   
##  [ reached getOption("max.print") -- omitted 1 row ]

ActiveDriverWGS will return results in a data frame format with the following columns.

  1. id: The identifier for the element.

  2. pp_element: The p-value associated with enrichment of mutations in the element. A value of NA indicates that no mutations fall within the element.

  3. element_muts_obs: Number of patients with mutations in the element. A value of NA indicates that no mutations fall within the element.

  4. element_muts_exp: Number of expected patients with mutations in the element. A value of NA indicates that no mutations fall within the element.

  5. element_enriched: Boolean indicating whether an enrichment of mutations in the element is observed. A value of NA indicates that no mutations fall within the element.

  6. pp_site: The p-value associated with enrichment of mutations in the sites.

  7. site_muts_obs: Number of patients with mutations in the sites. A value of 0 means that sites exist but are unaffected by mutations. A value of NA indicates that no site resides within the element.

  8. site_muts_exp: Number of expected patients with mutations in the sites. A value of 0 means that sites exist but are unaffected by mutations. A value of NA indicates that no site resides within the element.

  9. site_enriched: Boolean indicating whether an enrichment of mutations in the sites is observed. A value of NA indicates that no site resides within the element.

  10. fdr_element: FDR corrected p-value associated with the element.

  11. fdr_site: FDR corrected p-value associated with the site.

  12. has_site_mutations: “V” indicating the presence of site mutations. An empty string indicates that no site mutations are present.

Multiple Testing Corrections

In ActiveDriverWGS, multiple testing is performed for all given regions. We encourage the users to filter by FDR corrected values rather than p-values to eliminate false positives. FDR correction makes the conservative assumption that genes/regions with 0 mutations have p=1. FDR correction for sites is conducted over a restricted set of hypotheses, comprising genes/regions that have a significant enrichment of mutations (FDR<0.05) at the genes/regions levels.

Adapting ActiveDriverWGS to High Performance Computing Clusters

Compute time increases with the number of samples, mutations and regions. Hence, the two main functions integral to ActiveDriverWGS have also been made available in the public domain and can be adapted by the user to their local high performance computing clusters (HPCCs).

1. format_muts

This function formats the mutations data frame, removes hyper-mutated samples and removes non-mitochondrial mutations in extrachromosomal regions. It adds an additional column to the mutations data frame that provides the trinucleotide context of the given mutation which will be later used to estimate the mutational distribution across signatures.

2. ADWGS_test

This function calculates the enrichment of mutations for a particular region id. It applies a Poisson generalized linear regression model across mutation signatures to identify enriched regions.

Example

The following example demonstrates how to build an ActiveDriverWGS pipeline which can be adapted to HPCCs. The idea is that the list of ids can be split into manageable pieces and run in parallel jobs. Note that the creation of GRanges objects is part of the ActiveDriverWGS wrapper function and must be completed manually by users wishing to create personalized pipelines. Also, note that multiple testing corrections will need to be recalculated by the user.

library(GenomicRanges)

# Loading elements & creating a GRanges object
data(cancer_genes)
gr_element_coords = GRanges(seqnames = cancer_genes$chr,
                            IRanges(start = cancer_genes$start,
                                    end = cancer_genes$end),
                            mcols = cancer_genes$id)

# Loading sites & creating a GRanges object
data(cancer_gene_sites)
gr_site_coords = GRanges(seqnames = cancer_gene_sites$chr,
                         IRanges(start = cancer_gene_sites$start,
                                 end = cancer_gene_sites$end),
                         mocols = cancer_gene_sites$id)

# Loading mutations, format muts & creating a GRanges object
data(cll_mutations)

# format_muts
cll_mutations = format_muts(cll_mutations,
                            filter_hyper_MB = 30)
## 0 remove hypermut, n= 0 ,  0 %
## hypermuted samples:   
## 
## reversing 0 positions
## Removing  0  invalid SNVs & indels
gr_maf = GRanges(cll_mutations$chr,
                 IRanges(start = cll_mutations$pos1,
                         end = cll_mutations$pos2),
                 mcols=cll_mutations[,c("patient", "tag")])

# Examplifying the ATM Element
id = "ATM"

Note that when splitting tasks using the ADWGS_test function, only the parameter id needs to modified for each element while gr_element_coords, gr_sites and gr_maf can be complete datasets.

# Result of 1 input element
result = ADWGS_test(id = id,
                    gr_element_coords = gr_element_coords,
                    gr_site_coords = gr_site_coords,
                    gr_maf = gr_maf,
                    win_size = 50000)
## .
result
##      id  pp_element element_muts_obs element_muts_exp element_enriched
## 95% ATM 0.001807467                2                0             TRUE
##     pp_site site_muts_obs site_muts_exp site_enriched
## 95%      NA            NA            NA            NA

Technical Support

For questions, technical support or to report bugs and errors, please use our GitHub.