Introduction to rolypoly

Diego Calderon

2017-03-15

Our package, rolypoly, implements the RolyPoly model for identifying trait-associated functional annotations. Specifically, we use rolypoly to find enrichment of signal from SNP association summary statistics released by genome-wide association studies (GWAS) in cellular function annotations derived from RNA-seq (often cell types defined from single-cell data). In this vignette I will walk the reader through a typical run of the rolypoly pipeline to identify tissues relevant for determining an individual’s total cholesterol level.

# load the libraries used in this vignette (library() fails early if one is missing)
library(rolypoly); library(dplyr); library(ggplot2)

Requisite data

To run rolypoly we need GWAS summary statistics, expression data, an expression data annotation file, and LD information. A simulated version of each is included with the rolypoly installation so you can follow along.

GWAS

For this vignette we will use previously simulated GWAS summary statistics. Each column is required and should be self-explanatory. We only use autosomes (chrom), and pos refers to the base pair position of the SNP whose rsid is included. The beta column is the univariate standardized regression coefficient and is a standard summary statistic released by GWAS. When the GWAS is of a case-control trait, this column can hold the log of the odds ratio. A GWAS usually also releases a standard error term (se). Lastly, we require a column for minor allele frequency (maf) to filter out rare variants. This column can also be used to model the effect of allele frequency on effect size, but that is outside the scope of this vignette.
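As an illustration of this layout, the sketch below builds a small GWAS table by hand, converts odds ratios to log odds ratios for a case-control trait, and drops rare variants. The raw column name odds_ratio, the toy values and SNP labels, and the 1% MAF cutoff are assumptions made for this example; they are not rolypoly defaults.

```r
# Hypothetical raw case-control summary statistics (toy values).
raw_gwas <- data.frame(
  chrom = c(1, 1, 2, 23),
  pos = c(501, 1126, 3000, 4500),
  rsid = c("snp_a", "snp_b", "snp_c", "snp_d"),
  odds_ratio = c(1.10, 0.95, 1.02, 1.30),
  se = c(0.02, 0.03, 0.02, 0.05),
  maf = c(0.089, 0.004, 0.473, 0.212),
  stringsAsFactors = FALSE
)

# For a case-control trait, beta is the log of the odds ratio.
raw_gwas$beta <- log(raw_gwas$odds_ratio)

# Keep autosomes only and drop rare variants (the 1% cutoff is an assumption).
maf_cutoff <- 0.01
keep <- raw_gwas$chrom %in% 1:22 & raw_gwas$maf >= maf_cutoff

# Arrange the columns in the order used by the simulated data.
gwas_data <- raw_gwas[keep, c("chrom", "pos", "rsid", "beta", "se", "maf")]
```

Here snp_b is removed for its low allele frequency and snp_d because it is not on an autosome.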

sim_gwas_data %>% head
##   chrom  pos       rsid        beta    se        maf
## 1     1  501 rs11185553 -0.02675251 1e-04 0.08927857
## 2     1 1126 rs11654695  0.20745602 1e-04 0.12183933
## 3     1 1751  rs8072649  0.01443821 1e-04 0.47337875
## 4     1 2376  rs8072804 -0.17526228 1e-04 0.21194827
## 5     1 3000 rs12718064 -0.02771406 1e-04 0.04092157
## 6     1 3625 rs11649979 -0.13679634 1e-04 0.13397018

Expression data

Gene expression data should be a data frame with row names set to gene labels (we usually use ENSG gene IDs) and column names corresponding to the annotations we want to test for association with the GWAS trait. Our simulated expression data set has 5 tissues and 1000 genes.

sim_expression_data_normalized %>% head
##          Liver Adrenal.Gland     Blood      Heart        Lung
## g1 2.390297459     2.2519542 0.1062938 0.03554189 1.617819131
## g2 0.006039999     0.8129411 0.8210912 0.05205207 0.208679629
## g3 3.221239161     9.8011018 0.9057806 0.44776858 2.841035714
## g4 0.061449654     0.1067156 2.0547203 0.42114277 0.015974171
## g5 0.011341290     3.5957175 1.6927527 0.07999294 0.003177456
## g6 0.840609509     2.2360014 0.3164782 1.83555862 0.545934668

Of note, we require that genes are comparable across rows. So, if you start from a raw expression matrix, we suggest first performing quantile normalization to ensure the columns have the same distribution, and then using a function like scale to normalize the rows. Furthermore, rolypoly does not do well with negative expression values, so take the absolute value (abs) or square the expression values. Such a procedure is consistent with the hypothesis that deviations from mean gene expression lead to GWAS effects with larger variance.
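The normalization steps described above could be sketched in base R as follows. This is a minimal illustration on a toy matrix, not the code used to produce sim_expression_data_normalized; the helper quantile_normalize is hypothetical, and breaking ties by order of appearance is a simplification.

```r
# Toy raw expression matrix: 4 genes (rows) by 3 tissues (columns).
raw_expr <- matrix(
  c(5, 2, 3, 4,
    4, 1, 4, 2,
    3, 4, 6, 8),
  nrow = 4,
  dimnames = list(paste0("g", 1:4), c("TissueA", "TissueB", "TissueC"))
)

# Quantile normalization: give every column the same distribution by
# replacing each value with the mean of the values that share its rank.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

qn_expr <- quantile_normalize(raw_expr)

# scale() works on columns, so transpose to center and scale each gene (row).
row_scaled <- t(scale(t(qn_expr)))

# Remove negative values by squaring (abs() is the other option).
expr_normalized <- as.data.frame(row_scaled^2)
```

A gene with identical values in every tissue has zero variance and would produce NaN after scale, so such rows would need to be dropped beforehand.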

Gene annotation

To link gene expression with the location of GWAS variants we require a block annotation data frame. It consists of the chromosome, the start and end of the block, and a block label that should match the corresponding row name in the expression data set. For our work, we defined a block as a 10 kb window centered on each gene’s TSS. Feel free to change the block annotation start and end points to increase or decrease this window size.
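A block annotation of this kind could be derived from a table of TSS positions as in the sketch below. The genes table, its tss column, and the toy coordinates are assumptions for illustration; only the output columns (chrom, start, end, label) come from the format described above.

```r
# Hypothetical gene table: one TSS position per gene (toy values).
genes <- data.frame(
  label = c("g1", "g2", "g3"),
  chrom = c(1, 1, 2),
  tss = c(5500, 55545, 3000),
  stringsAsFactors = FALSE
)

window_size <- 10000            # total window width in base pairs (10 kb)
half_window <- window_size / 2

block_annotation <- data.frame(
  chrom = genes$chrom,
  # Clamp the start at 1 so windows near the chromosome start stay valid.
  start = pmax(genes$tss - half_window, 1),
  end = genes$tss + half_window,
  label = genes$label,
  stringsAsFactors = FALSE
)
```

Changing window_size is all that is needed to widen or narrow the blocks.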

sim_block_annotation %>% head
##   chrom  start    end label
## 1     1    500  10500    g1
## 2     1  50545  60545    g2
## 3     1 100589 110589    g3
## 4     1 150634 160634    g4
## 5     1 200678 210678    g5
## 6     1 250723 260723    g6

Linkage disequilibrium (LD)

Our model accounts for the effects of LD by using pairwise Pearson’s r correlation values between SNPs. If these data are not available for the actual GWAS population, you may substitute LD information from a similar reference population. For many studies, we found that calculating these values from the 1000 Genomes Phase 3 European populations works well.

We have included a simulated LD dataset to illustrate the format we require, which was constructed from a sample of LD data from chromosome one. The main rolypoly function call takes a path to a folder containing one LD file per chromosome, named by chromosome number (1-22) with the .Rds suffix. Each of these Rds objects contains columns for the chromosome, base pair position, and (optionally) minor allele frequency of each SNP in a pair. In addition, the column labeled R contains the Pearson’s r correlation between the two variants. This format follows the output of PLINK.

ld_path <- system.file("extdata", "example_ld", package = "rolypoly")
ld_data <- readRDS(paste(ld_path, '/1.Rds', sep = ''))
ld_data %>% head
##    CHR_A      BP_A       SNP_A     MAF_A CHR_B      BP_B      SNP_B
## 1:     1 166770696 rs146302744 0.2743540     1 166770940  rs1311572
## 2:     1  18868193  rs11261030 0.3091450     1  18868264 rs12026820
## 3:     1 229724897   rs6693684 0.2087480     1 229725172 rs56381870
## 4:     1 160407417 rs193258399 0.0129225     1 160407840 rs10908780
## 5:     1 175824630   rs7548697 0.1948310     1 175824730 rs17351332
## 6:     1 248511928  rs28419186 0.4761430     1 248512064 rs28372410
##        MAF_B          R
## 1: 0.2743540  1.0000000
## 2: 0.0248509  0.1990370
## 3: 0.2554670 -0.3015770
## 4: 0.2485090 -0.0510221
## 5: 0.0318091 -0.0770507
## 6: 0.4761430  1.0000000

Rather than calculating your own LD statistics, you can use the previously formatted LD data we provide at the following URL: https://drive.google.com/file/d/0B_X6s0BThq9ZXzNNbjk3V2hNdlk/view?usp=sharing

In these files we include LD calculated using PLINK for the 1000 Genomes Phase 3 genomes, filtered for values of \(R^2 > 0.2\).
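If you do prepare your own LD files, the filtering and file layout described above could look roughly like the following sketch. The toy pairs and SNP labels are made up, the output folder name is hypothetical, and a plain data frame is used for simplicity; whether rolypoly expects a data.table (as the printed example suggests) is not verified here.

```r
# Hypothetical pairwise LD table for chromosome 1 (toy values), using the
# column names from the bundled example.
ld_chr1 <- data.frame(
  CHR_A = c(1, 1, 1),
  BP_A = c(1000, 2000, 3000),
  SNP_A = c("snp_a", "snp_b", "snp_c"),
  CHR_B = c(1, 1, 1),
  BP_B = c(1500, 2600, 3900),
  SNP_B = c("snp_d", "snp_e", "snp_f"),
  R = c(0.95, -0.60, 0.30),
  stringsAsFactors = FALSE
)

# Keep pairs with R^2 > 0.2; squaring retains strong negative correlations.
ld_filtered <- ld_chr1[ld_chr1$R^2 > 0.2, ]

# Write one file per chromosome, named <chromosome>.Rds, into a folder.
ld_folder <- file.path(tempdir(), "example_ld_sketch")
dir.create(ld_folder, showWarnings = FALSE)
saveRDS(ld_filtered, file.path(ld_folder, "1.Rds"))

# Read it back the same way the bundled example is read.
ld_check <- readRDS(file.path(ld_folder, "1.Rds"))
```

The path in ld_folder is what would then be supplied as the LD folder argument.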

Rolling rolypoly

We pass all the previously described data to the main rolypoly function call. This function has many parameters to tinker with; however, it should run fine with the defaults. Most importantly, consider the number of bootstrap iterations needed to get accurate standard errors. We usually use at least 200.

rp <- rolypoly_roll(
  gwas_data = sim_gwas_data,
  block_annotation = sim_block_annotation,
  block_data = sim_expression_data_normalized,
  ld_folder = ld_path
)
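To see why the number of bootstrap iterations matters for standard errors, consider the generic sketch below. It is an ordinary nonparametric bootstrap of a sample mean on simulated numbers, not rolypoly’s internal bootstrap; the helper bootstrap_se and all values are assumptions for illustration.

```r
set.seed(1)

# Toy data: 200 draws whose mean we want to estimate.
x <- rnorm(200, mean = 0.02, sd = 0.1)

# Bootstrap standard error of the mean from n_iter resamples.
bootstrap_se <- function(values, n_iter) {
  estimates <- replicate(n_iter, mean(sample(values, replace = TRUE)))
  sd(estimates)
}

# Repeat the whole procedure to see how much the standard error itself
# varies from run to run at each iteration count.
se_few <- replicate(50, bootstrap_se(x, n_iter = 20))
se_many <- replicate(50, bootstrap_se(x, n_iter = 200))

# The spread across runs is typically smaller with more iterations.
spread <- c(iters_20 = sd(se_few), iters_200 = sd(se_many))
spread
```

With few iterations the standard error changes noticeably between runs, which in turn makes the derived p-values and confidence intervals unstable.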

Results

Once rolypoly is finished we can access all the results within the returned rolypoly object. We generated the GWAS under the model with effects of 0.02 and 0.01 for the Liver and Blood tissues, respectively. To take a look at the inferred parameters use:

rp$full_results$parameters %>% sort
##     intercept         Heart Adrenal.Gland          Lung         Blood 
## -3.423119e-04 -6.889941e-05  3.951817e-05  5.611162e-05  1.016319e-02 
##         Liver 
##  1.965506e-02

Better yet, there’s a data frame with results from the bootstrap runs that includes standard errors for these parameter estimates, p-values, and 95% confidence intervals.

rp$bootstrap_results %>% arrange(-bt_value) %>% head
##      annotation bootstrap_estimate bootstrap_error    bt_value
## 1         Blood       1.016393e-02    0.0004117827 24.68274426
## 2         Liver       1.988451e-02    0.0009349661 21.26762218
## 3          Lung       1.085798e-04    0.0003384737  0.32079234
## 4 Adrenal.Gland       2.191802e-05    0.0002327923  0.09415267
## 5         Heart      -8.080214e-05    0.0002140410 -0.37750783
## 6     intercept      -5.640717e-04    0.0009861980 -0.57196597
##        bp_value bias_corrected_estimate         CI_lo        CI_hi
## 1 8.193800e-135            1.016246e-02  0.0094234901 0.0109550317
## 2 1.132366e-100            1.942561e-02  0.0179163188 0.0215008450
## 3  3.741839e-01            3.643459e-06 -0.0006551660 0.0006466112
## 4  4.624939e-01            5.711831e-05 -0.0005284865 0.0003677855
## 5  6.471019e-01           -5.699668e-05 -0.0004492553 0.0003715537
## 6  7.163275e-01           -1.205521e-04 -0.0018678198 0.0016170568
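The printed columns appear to be related in a simple way: bt_value matches the bootstrap estimate divided by the bootstrap error, and bp_value matches a one-sided upper-tail probability under a standard normal. This relationship is inferred from the numbers in the table, not taken from the rolypoly source, so treat the sketch below as a consistency check only.

```r
# Values copied from four rows of the bootstrap table.
boot <- data.frame(
  annotation = c("Blood", "Liver", "Lung", "Heart"),
  bootstrap_estimate = c(1.016393e-02, 1.988451e-02, 1.085798e-04, -8.080214e-05),
  bootstrap_error = c(0.0004117827, 0.0009349661, 0.0003384737, 0.0002140410),
  stringsAsFactors = FALSE
)

# Test statistic: estimate divided by its bootstrap standard error.
boot$bt_value <- boot$bootstrap_estimate / boot$bootstrap_error

# One-sided p-value for a positive effect under a standard normal reference.
boot$bp_value <- pnorm(boot$bt_value, lower.tail = FALSE)

# -log10(p) gives a convenient scale for ranking annotations.
boot$neg_log10_p <- -log10(boot$bp_value)
boot[order(-boot$neg_log10_p), ]
```

Because the test is one-sided, an annotation with a negative estimate such as Heart gets a p-value above 0.5.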

For visualization, one can use the following function, which plots the estimates with 95% confidence intervals:

plot_rolypoly_annotation_estimates(rp)

Additionally, we plot \(-\log_{10}(p)\) to rank tissues by the strength of their association:

plot_rolypoly_annotation_ranking(rp)

These functions return ggplot2 objects, so feel free to manipulate them as such.

Inference on another expression data set

Without any parallelization, rolypoly could take a couple of hours to run. Much of the computation is spent linking SNPs to genes and reading in LD information. If one provides a rolypoly object to the rolypoly parameter of the rolypoly_roll function call, along with new block data, then only inference is performed. For example, with the previous rolypoly object we would run inference on a new expression data set with the following call (assuming that the gene labels did not change between data sets).

rp <- rolypoly_roll(
  # reuse the precomputed rolypoly object
  rolypoly = rp,
  # some new set of expression data
  block_data = new_sim_expression_data_normalized
)

Thus, one can run the slow precomputation once and then rerun the faster inference step on various expression matrices.
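When several expression matrices are to be tested, this pattern could be wrapped in a small helper such as the one sketched below. The function rerun_inference and its roll argument are hypothetical; roll defaults to rolypoly_roll and is exposed only so the sketch can be tried without the package. It assumes, as above, that gene labels are the same across data sets.

```r
# Rerun inference for each expression matrix, reusing one precomputed object.
# `roll` is the function that performs the run (rolypoly_roll by default).
rerun_inference <- function(rp, expression_list, roll = rolypoly::rolypoly_roll) {
  lapply(expression_list, function(expression_data) {
    roll(rolypoly = rp, block_data = expression_data)
  })
}

# Usage (data set names are placeholders):
# fits <- rerun_inference(rp, list(
#   tissues = sim_expression_data_normalized,
#   cell_types = new_sim_expression_data_normalized
# ))
```

The result is a named list with one rolypoly object per expression matrix, so each set of bootstrap results can be inspected separately.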