Getting Started with Capture Hi-C Analysis Engine

2020-07-27

Quickstart

The main function for running the CHiCANE method is chicane(). This takes a bam file and bed files with all restriction fragments and targeted restriction fragments as input, and produces a table with fragment interactions and associated p-values.

The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.

library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#> 
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#> 
#>     sleep
#> Loading required package: nlme
#> Loading required package: parallel
#>  **********   GAMLSS Version 5.1-6  **********
#> For more on GAMLSS look at http://www.gamlss.org/
#> Type gamlssNews() to see new features/changes/bug fixes.
#> Loading required package: data.table

# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2

if( bedtools.installed() ) {
  chicane.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments
    );

    print( chicane.results[ 1:10 ] );
}

The chicane function returns a data.table object sorted by q-value. Each row contains information about the target and bait fragments, and the observed and expected number of reads linking the two fragments.

Pre-Processing Data

The chicane function can operate directly on BAM files. To convert the BAM file into a text file, R calls the shell script prepare_bam.sh. This script relies on bedtools to select reads overlapping the captured fragments and identify the restriction fragments corresponding to each read. The resulting data is read into a data.table object, and the number of reads linking any two fragments is calculated.

For users that have a digested genome output from HiCUP, this needs to be converted to BED format to be compatible with bedtools. CHiCANE includes a helper function, convert.hicup.digest.bed(), which can be used to convert a digested genome in HiCUP format to a BED file of the restiction digest fragments.

This processing can take a while (~45 minutes for a 19GB BAM file on a commodity computer with 13GB RAM), and only needs to be done once for each BAM. To speed up model fitting, it is possible to pre-process the data with the prepare.data() function and run chicane directly on the resulting data table.

if( bedtools.installed() ) {
  interaction.data <- prepare.data(
    bam = bam,
    baits = baits,
    fragments = fragments
    );
}

The interaction data object contains details of the fragments, and the number of reads linking the two fragments.

if( bedtools.installed() ) print(interaction.data);

By default, only combinations of fragments that are detected at least once are included in the data. To also include zero counts, use the include.zeros argument. Setting include.zeros = 'cis' includes all zero counts for bait/ target combinations on the same chromosome, while include.zeros = 'all' includes all possible combinations.

if( bedtools.installed() ) {
  cis.zero.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments,
    include.zeros = 'cis'
    ); 

    print( cis.zero.results[ 1:10 ] );
}

For most experiments including zeros will result in impractically large files. Another option for accounting for zero counts is to specify a truncated distribution through the distribution argument. However, this will slow down model fitting.

The interactions argument of the chicane() function can take either a data.table object from prepare.data or the path to such a table.

if( bedtools.installed() ) {
  file.name <- tempfile('interaction_data');
  write.table(interaction.data, file.name, row.names = FALSE);

  chicane.results <- chicane(interactions = file.name); 
}

Statistical Model

The CHiCANE method models the expected number of reads linking two restriction fragments as a function of the distance between the loci and the “interactibility” of the bait fragment, that is, its inherent propensity to interact with other fragments.

Let \(Y_{ij}\) denote the number of reads linking bait \(i\) and target fragment \(j\), \(d_{ij}\) be the distance between the two fragments, and \(t_i\) denote the number of reads linking the bait \(i\) with another fragment in trans. The CHiCANE model assumes that \(Y_{ij}\) follows a distribution with mean parameter \(\mu_{ij}\), where

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) \end{equation}\]

If the other end fragment \(j\) is also a bait, terms are included to adjust for the trans counts of the fragment \(j\), as well as the product of the trans counts of the two fragments as shown below

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) + \beta_3\log(t_j + 1) + \beta_4\log(t_i + 1)\log(t_j + 1) \end{equation}\]

Once an expected value \(\mu_{ij}\) is obtained for each combination of fragments, a \(p\)-value for the observed counts \(y_{ij}\) exceeding what’s expected by random chance is calculated as

\[\begin{equation} p = P(Y_{ij} \geq y_{ij}) \end{equation}\]

The probability \(P(Y_{ij} \geq y_{ij})\) depends on how the distribution \(Y_{ij} \mid \mu_{ij}\) of the counts conditional on the expected value is modelled.

Distribution of the Counts

The CHiCANE method supports several different distributions of the counts \(Y_{ij}\) conditional on the mean \(\mu_{ij}\). By default the counts are assumed to follow a negative binomial distribution with \(E(Y_{ij}) = exp (\mu_{ij})\). The negative binomial is similar to the Poisson model for counts, but includes an overdispersion term that makes it more appropriate for datasets with high variance. After fitting \(\beta_0, \beta_1, \beta_2\) and the overdispersion term by a maximum likelihood over all pairs of fragments, we use the fitted model to estimate a \(p\)-value for each pair. The Poisson, truncated Poisson, or truncated negative binomial distribution can be specified through the distribution argument.

Negative Binomial

The negative binomial distribution models the counts as \(Y \sim NB(\mu, \theta)\), where \(\mu\) is the mean and \(\theta\) is a dispersion parameter. Under this model, the variance of the counts conditional on the mean \(\mu\) is given by

\[\begin{equation} Var(Y) = \mu + \frac{\mu^2}{\theta} \end{equation}\]

The model is fit using the glm.nb() function in the MASS package. The maximum likelihood estimates (MLEs) of the coefficients \(\hat{\beta_i}\) are used to provide an estimate of the expected number of counts for each combination of restriction fragments. The MLE of the dispersion parameter \(\theta\) is used as the dispersion parameter (size) in the negative binomial model.

Occasionally, the model fitting procedure will result in numerical errors as the result of lack of convergence. This is most often due to lack of overdispersion in the data leading to an unstable estimate of \(\theta\). If the variance is approximately equal to the mean, the \(\theta\) term goes to infinity and the model fails to converge.

When the model throws a numerical warning or fails to converge, the chicane() function will attempt to diagnose if the problem was due to lack of overdispersion. This is done by fitting a negative binomial regression model with 25 iterations, and using the resulting fitted values and dispersion parameter to examine the variance of the negative binomial. If the estimated variance at the point with the largest fitted value is less than 0.1% greater than the corresponding estimate of the Poisson variance (i.e. the fitted value at that point), a Poisson distribution is fit instead.

Poisson

Unlike the negative binomial, the Poisson distribution does not allow for the variance of the counts to exceed their mean. This model assumes that \(Y \sim \text{Poisson}(\mu)\), giving \(Var(Y) = \mu\). In practice, capture Hi-C counts often show more variability than can be explained by the Poisson model, leading to false positive interaction calls.

Truncated Negative Binomial

When fitting the model without including unobserved combinations of fragments (see Pre-processing Data), there are no counts of zero in the data. To accommodate this in the statistical model, a truncated Poisson distribution can be fit using the GAMLSS package (Stasinopoulos and Rigby 2016). This model assigns \(P(Y = 0) = 0\), and assumes the probabilities are proportional to the negative binomial probabilities for values \(y > 0\).

Truncated Poisson

Similarly to the truncated negative binomial, the truncated Poisson distribution is supported by setting distribution = 'truncated-poisson'. This model assumes the probabilities are proportional to the Poisson probabilities for values \(y>0\).

Adjusting for Covariates

Adjusting for distance is done in a piecewise linear fashion. Cis-data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.

By default, CHiCANE tries to split the data based on distance quantiles, i.e. into 100 equally sized groups. In some cases, particularly when dealing with smaller datasets, the resulting datasets are too small for model fitting. When this occurs, the number of distance bins are halved until all resulting datasets are large enough for the model to be fit. To override the default behaviour, pass the desired number of distance bins to distance.bins. For example, setting distance.bins = 4 results in the count model being fit separately in each quartile. Trans interactions are fit separately from cis interactions.

Filtering Baits and Targets

By default, all baits and targets are included when fitting the CHiCANE model. An alternative approach is to filter out baits and fragments with low (Dryden et al. 2014) or high (Cairns et al. 2016) degree of “interactibility”, as inferred through the trans counts. CHiCANE also supports this model through the bait.filters and target.filters arguments.

Each of these take a vector of length two, where each element corresponds to the proportion of fragments that should be considered to fall below the “lower limit” and “upper limit.” For example, passing in bait.filters = c(0.3, 0.9) means that the baits with the lowest 30% or highest 10% of trans counts will be removed before fitting the model.

Filtering fragments may affect multiple testing correction by changing the number of tests performed.

Multiple Testing Correction

By default, CHiCANE performs multiple testing correction separately for each bait. To change this to a global multiple testing correction, set multiple.testing.correction = 'global' in the main chicane() function.

Merging Biological Replicates

CHiCANE can merge replicates at the data processing step. If more than one BAM file is available, these can be passed as vector to the bam argument of the chicane() and prepare.data() functions. The replicate.merging.method determines how replicates are merged. Available options are ‘sum’ and ‘weighted-sum’.

Adjusting for Custom Covariates

CHiCANE allows users to specify additional covariates that should be added to the model with the adjustment.terms argument. For example, we can add a term to adjust for the chromosome of the target fragment.

data(bre80);
adjusted.results <- chicane(
  interactions = bre80, 
  adjustment.terms = 'target.chr'
  );

print( adjusted.results[ 1:5 ] );
#>                   target.id                  bait.id target.chr target.start
#> 1:    chr19:1155128-1228414 chr2:217035649-217042355      chr19      1155128
#> 2: chr1:117125276-117135137 chr2:217035649-217042355       chr1    117125276
#> 3:   chr1:14950583-14951791 chr2:217035649-217042355       chr1     14950583
#> 4:   chr1:49076960-49080684 chr2:217035649-217042355       chr1     49076960
#> 5: chr3:145680459-145682483 chr2:217035649-217042355       chr3    145680459
#>    target.end bait.chr bait.start  bait.end bait.to.bait bait.trans.count
#> 1:    1228414     chr2  217035649 217042355        FALSE             9494
#> 2:  117135137     chr2  217035649 217042355        FALSE             9494
#> 3:   14951791     chr2  217035649 217042355        FALSE             9494
#> 4:   49080684     chr2  217035649 217042355        FALSE             9494
#> 5:  145682483     chr2  217035649 217042355        FALSE             9494
#>    target.trans.count distance count expected   p.value   q.value
#> 1:                  2       NA     2 1.003963 0.2656989 0.6417435
#> 2:                  2       NA     2 1.003992 0.2657096 0.6417435
#> 3:                  2       NA     2 1.003992 0.2657096 0.6417435
#> 4:                  2       NA     2 1.003992 0.2657096 0.6417435
#> 5:                  2       NA     2 1.004014 0.2657179 0.6417435

The adjustment.terms argument also supports more complex expressions such as log-transformations. Multiple adjustment terms can be added by passing a vector.

Any adjustment term must correspond to a column already present in the data. If you would like to adjust for anything that is not already present in the CHiCANE output (e.g. GC content), there’s a three step approach:

  1. Run prepare.data() on your BAM file
  2. Add extra columns to the resulting data frame
  3. Run chicane() with the interactions argument

Session Info

sessionInfo();
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] parallel  splines   stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] chicane_0.1.2     data.table_1.12.8 gamlss.tr_5.1-0   gamlss_5.1-6     
#> [5] nlme_3.1-148      gamlss.data_5.1-4 gamlss.dist_5.1-6 MASS_7.3-51.6    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6         compiler_3.6.1       formatR_1.7         
#>  [4] bedr_1.0.7           futile.logger_1.4.3  iterators_1.0.12    
#>  [7] R.methodsS3_1.8.0    futile.options_1.0.1 R.utils_2.9.2       
#> [10] tools_3.6.1          testthat_2.3.2       digest_0.6.25       
#> [13] evaluate_0.14        lattice_0.20-41      rlang_0.4.6         
#> [16] Matrix_1.2-18        foreach_1.5.0        yaml_2.2.1          
#> [19] VennDiagram_1.6.20   xfun_0.14            stringr_1.4.0       
#> [22] knitr_1.28           grid_3.6.1           R6_2.4.1            
#> [25] survival_3.1-12      rmarkdown_2.1        lambda.r_1.2.4      
#> [28] magrittr_1.5         codetools_0.2-16     htmltools_0.4.0     
#> [31] stringi_1.4.6        R.oo_1.23.0

Acknowledgements

Development of CHiCANE was supported by Breast Cancer Now.

References

Baxter, Joseph S, Olivia C Leavy, Nicola H Dryden, Sarah Maguire, Nichola Johnson, Vita Fedele, Nikiana Simigdala, et al. 2018. “Capture Hi-c Identifies Putative Target Genes at 33 Breast Cancer Risk Loci.” Nature Communications 9 (1): 1028.

Cairns, Jonathan, Paula Freire-Pritchett, Steven W Wingett, Csilla Várnai, Andrew Dimond, Vincent Plagnol, Daniel Zerbino, et al. 2016. “CHiCAGO: Robust Detection of Dna Looping Interactions in Capture Hi-c Data.” Genome Biology 17 (1): 127.

Dryden, Nicola H, Laura R Broome, Frank Dudbridge, Nichola Johnson, Nick Orr, Stefan Schoenfelder, Takashi Nagano, et al. 2014. “Unbiased Analysis of Potential Targets of Breast Cancer Susceptibility Loci by Capture Hi-c.” Genome Research 24 (11): 1854–68.

Stasinopoulos, Mikis, and Bob Rigby. 2016. Gamlss.tr: Generating and Fitting Truncated “Gamlss.family” Distributions. https://CRAN.R-project.org/package=gamlss.tr.