Our package, rolypoly, is an implementation of the RolyPoly model for identifying trait-associated functional annotations. Specifically, we use rolypoly to find enrichment of signal from the SNP association summary statistics released by genome-wide association studies (GWAS) in cellular function annotations derived from RNA-seq (often cell types defined from single-cell data). In this vignette we walk through a typical run of the rolypoly pipeline to identify tissues relevant for determining an individual’s total cholesterol level.
# load up libraries for vignette
library(rolypoly); library(dplyr); library(ggplot2)
To run rolypoly we need GWAS summary statistics, expression data, an expression data annotation file, and LD information. A simulated version of each is included with the rolypoly installation so that you can follow along.
For this vignette we will use previously simulated GWAS summary statistics. Every column shown is required. We only use autosomes (chrom), and pos is the base pair position of the SNP identified by rsid. The beta column is the univariate standardized regression coefficient, a standard summary statistic released by GWAS. For a case-control trait this column can hold the log of the odds ratio instead. A GWAS usually also releases a standard error term (se). Lastly, we require a column for minor allele frequency (maf) to filter out rare variants. This column can also be used to model the effect of allele frequency on effect size, but that’s outside the scope of this vignette.
sim_gwas_data %>% head
## chrom pos rsid beta se maf
## 1 1 501 rs11185553 -0.02675251 1e-04 0.08927857
## 2 1 1126 rs11654695 0.20745602 1e-04 0.12183933
## 3 1 1751 rs8072649 0.01443821 1e-04 0.47337875
## 4 1 2376 rs8072804 -0.17526228 1e-04 0.21194827
## 5 1 3000 rs12718064 -0.02771406 1e-04 0.04092157
## 6 1 3625 rs11649979 -0.13679634 1e-04 0.13397018
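As a rough illustration of the column requirements above, the following sketch prepares a case-control data set by taking the log of the odds ratio and dropping non-autosomal SNPs. The input data frame `raw_gwas`, its `odds_ratio` column, and all values are hypothetical. Only the output column names come from the example above.

```r
# Hypothetical case-control summary statistics (all values are made up).
raw_gwas <- data.frame(
  chrom      = c("1", "1", "X", "2"),
  pos        = c(501L, 1126L, 2000L, 3000L),
  rsid       = c("rs_a", "rs_b", "rs_c", "rs_d"),
  odds_ratio = c(1.10, 0.95, 1.20, 1.02),
  se         = c(0.02, 0.03, 0.05, 0.02),
  maf        = c(0.09, 0.12, 0.30, 0.25),
  stringsAsFactors = FALSE
)

gwas_data <- raw_gwas

# For a case-control trait, beta holds the log of the odds ratio.
gwas_data$beta <- log(gwas_data$odds_ratio)

# Keep autosomes only (chromosomes 1-22) and store them as integers.
gwas_data <- gwas_data[gwas_data$chrom %in% as.character(1:22), ]
gwas_data$chrom <- as.integer(gwas_data$chrom)

# Arrange the columns in the order shown in the example above.
gwas_data <- gwas_data[, c("chrom", "pos", "rsid", "beta", "se", "maf")]
head(gwas_data)
```

The maf column is kept as is so that rare variants can still be filtered later.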
Gene expression data should be a data frame with rownames labeled by the gene (usually we use ENSG gene ids) and column names corresponding to annotations we want to test for associations with the GWAS trait. Our simulated expression data set has 5 tissues and 1000 genes.
sim_expression_data_normalized %>% head
## Liver Adrenal.Gland Blood Heart Lung
## g1 2.390297459 2.2519542 0.1062938 0.03554189 1.617819131
## g2 0.006039999 0.8129411 0.8210912 0.05205207 0.208679629
## g3 3.221239161 9.8011018 0.9057806 0.44776858 2.841035714
## g4 0.061449654 0.1067156 2.0547203 0.42114277 0.015974171
## g5 0.011341290 3.5957175 1.6927527 0.07999294 0.003177456
## g6 0.840609509 2.2360014 0.3164782 1.83555862 0.545934668
Of note, we require that expression values are comparable across rows (genes). So, if you start from a raw expression matrix we suggest first performing quantile normalization to ensure the columns have the same distribution, and then using a function like scale to normalize the rows. Furthermore, rolypoly does not do well with negative expression values, so take the absolute value (abs) or square the expression values. Such a procedure is consistent with the hypothesis that deviations from mean gene expression lead to GWAS effects with larger variance.
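The normalization steps described above could be sketched in base R as follows. The helper `quantile_normalize` and the toy matrix `raw_expr` are hypothetical and are not part of rolypoly. Note that scale works on columns, so the matrix is transposed to normalize each gene.

```r
# Hypothetical raw expression matrix: 4 genes (rows) x 3 tissues (columns).
raw_expr <- matrix(
  c(5, 2, 3, 4,
    4, 1, 5, 2,
    3, 4, 6, 8),
  nrow = 4,
  dimnames = list(paste0("g", 1:4), c("Liver", "Blood", "Lung"))
)

# Quantile normalization: every column receives the same set of values,
# the mean of the sorted columns, assigned by within-column rank.
quantile_normalize <- function(x) {
  reference <- unname(rowMeans(apply(x, 2, sort)))
  normalized <- apply(x, 2, function(col) reference[rank(col, ties.method = "first")])
  dimnames(normalized) <- dimnames(x)
  normalized
}

qn_expr <- quantile_normalize(raw_expr)

# scale() centers and scales columns, so transpose to normalize each gene (row).
scaled_expr <- t(scale(t(qn_expr)))

# Remove negative values by squaring; abs(scaled_expr) is the alternative.
expr_normalized <- as.data.frame(scaled_expr^2)
expr_normalized
```

A gene whose normalized values are identical in every tissue has zero variance and would produce NaN after scaling, so such rows are worth checking for before running the model.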
To link gene expression with the location of GWAS variants we require a block annotation data frame. It consists of the chromosome, the start and end of the block, and a block label that should match the corresponding row name in the expression data set. For our work, we defined a block as a 10 kb window centered around each gene’s TSS. Feel free to change the block annotation start and end points to increase or decrease this window size.
sim_block_annotation %>% head
## chrom start end label
## 1 1 500 10500 g1
## 2 1 50545 60545 g2
## 3 1 100589 110589 g3
## 4 1 150634 160634 g4
## 5 1 200678 210678 g5
## 6 1 250723 260723 g6
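One way to build such a block annotation from a table of TSS positions is sketched below. The `gene_tss` data frame and the `window_size` variable are hypothetical. Only the output column names (chrom, start, end, label) come from the example above.

```r
# Hypothetical TSS table; labels must match the expression row names.
gene_tss <- data.frame(
  chrom = c(1L, 1L, 2L),
  tss   = c(5500L, 55545L, 2000L),
  label = c("g1", "g2", "g3"),
  stringsAsFactors = FALSE
)

# Total window width: 10 kb, so 5 kb on either side of the TSS.
window_size <- 10000L
half_window <- window_size %/% 2L

block_annotation <- data.frame(
  chrom = gene_tss$chrom,
  # Clamp the start so a window near the chromosome start never goes below 1.
  start = pmax(gene_tss$tss - half_window, 1L),
  end   = gene_tss$tss + half_window,
  label = gene_tss$label,
  stringsAsFactors = FALSE
)
block_annotation
```

Changing `window_size` widens or narrows every block at once.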
Our model accounts for the effects of LD by using pairwise Pearson’s r correlation values between SNPs. If these data are not available for the actual GWAS population, you may substitute LD information from a similar reference population. For many studies, we found that calculating these values from the 1000 Genomes Phase 3 European populations works well.
We have included a simulated LD data set to illustrate the format we require, which was constructed from a sample of LD data from chromosome one. The main rolypoly function call takes a path to a folder of LD data files, one per chromosome, each named by its chromosome number (1-22) with the .Rds suffix. Each of these Rds objects contains, for each SNP in a pair, columns for the chromosome, base pair position, and minor allele frequency (optional). In addition, the column labeled R contains the Pearson’s r correlation between the two variants. The format is based on the output from PLINK.
ld_path <- system.file("extdata", "example_ld", package = "rolypoly")
ld_data <- readRDS(file.path(ld_path, "1.Rds"))
ld_data %>% head
## CHR_A BP_A SNP_A MAF_A CHR_B BP_B SNP_B
## 1: 1 166770696 rs146302744 0.2743540 1 166770940 rs1311572
## 2: 1 18868193 rs11261030 0.3091450 1 18868264 rs12026820
## 3: 1 229724897 rs6693684 0.2087480 1 229725172 rs56381870
## 4: 1 160407417 rs193258399 0.0129225 1 160407840 rs10908780
## 5: 1 175824630 rs7548697 0.1948310 1 175824730 rs17351332
## 6: 1 248511928 rs28419186 0.4761430 1 248512064 rs28372410
## MAF_B R
## 1: 0.2743540 1.0000000
## 2: 0.0248509 0.1990370
## 3: 0.2554670 -0.3015770
## 4: 0.2485090 -0.0510221
## 5: 0.0318091 -0.0770507
## 6: 0.4761430 1.0000000
If you would rather not calculate your own LD statistics, we provide previously formatted LD data at the following url: https://drive.google.com/file/d/0B_X6s0BThq9ZXzNNbjk3V2hNdlk/view?usp=sharing
These files contain LD calculated using PLINK for the 1000 Genomes Phase 3 genomes, filtered for values of \(R^2 > 0.2\).
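If you compute LD yourself, a sketch along the following lines would apply the same \(R^2 > 0.2\) filter and write one Rds file per chromosome. The table `ld_raw`, its values, and the output folder name are hypothetical. The example LD object above prints like a data.table, so whether a plain data frame is accepted is an assumption to verify against your installation.

```r
# Hypothetical pairwise LD table using the column names shown above
# (all values are made up).
ld_raw <- data.frame(
  CHR_A = c(1L, 1L, 2L),
  BP_A  = c(1000L, 2000L, 500L),
  SNP_A = c("rs_a", "rs_b", "rs_c"),
  MAF_A = c(0.20, 0.31, 0.15),
  CHR_B = c(1L, 1L, 2L),
  BP_B  = c(1200L, 2400L, 900L),
  SNP_B = c("rs_d", "rs_e", "rs_f"),
  MAF_B = c(0.22, 0.05, 0.18),
  R     = c(0.90, -0.30, 0.60),
  stringsAsFactors = FALSE
)

# Keep pairs with R^2 > 0.2; the sign of R is preserved in the output.
ld_kept <- ld_raw[ld_raw$R^2 > 0.2, ]

# Write one file per chromosome, named <chromosome>.Rds.
ld_folder <- file.path(tempdir(), "my_ld")
dir.create(ld_folder, showWarnings = FALSE)
for (chrom in unique(ld_kept$CHR_A)) {
  saveRDS(ld_kept[ld_kept$CHR_A == chrom, ],
          file.path(ld_folder, paste0(chrom, ".Rds")))
}
list.files(ld_folder)
```

The resulting folder path is what would be passed as the LD folder in the main function call.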
We pass all of the previously described data to the main rolypoly function call. This function has many parameters to tinker with, but it should run fine with the defaults. Most importantly, consider the number of bootstrap iterations needed to get accurate standard errors. We usually use at least 200.
rp <- rolypoly_roll(
  gwas_data = sim_gwas_data,
  block_annotation = sim_block_annotation,
  block_data = sim_expression_data_normalized,
  ld_folder = ld_path
)
Once rolypoly is finished we can access all the results within the returned rolypoly object. We generated the GWAS under the model with effects of 0.02 and 0.01 for the Liver and Blood tissues, respectively. To take a look at the inferred parameters use:
rp$full_results$parameters %>% sort
## intercept Heart Adrenal.Gland Lung Blood
## -3.423119e-04 -6.889941e-05 3.951817e-05 5.611162e-05 1.016319e-02
## Liver
## 1.965506e-02
Better yet, there’s a data frame with results from the bootstrap runs that provides standard errors, p-values, and 95% confidence intervals for these parameter estimates.
rp$bootstrap_results %>% arrange(-bt_value) %>% head
## annotation bootstrap_estimate bootstrap_error bt_value
## 1 Blood 1.016393e-02 0.0004117827 24.68274426
## 2 Liver 1.988451e-02 0.0009349661 21.26762218
## 3 Lung 1.085798e-04 0.0003384737 0.32079234
## 4 Adrenal.Gland 2.191802e-05 0.0002327923 0.09415267
## 5 Heart -8.080214e-05 0.0002140410 -0.37750783
## 6 intercept -5.640717e-04 0.0009861980 -0.57196597
## bp_value bias_corrected_estimate CI_lo CI_hi
## 1 8.193800e-135 1.016246e-02 0.0094234901 0.0109550317
## 2 1.132366e-100 1.942561e-02 0.0179163188 0.0215008450
## 3 3.741839e-01 3.643459e-06 -0.0006551660 0.0006466112
## 4 4.624939e-01 5.711831e-05 -0.0005284865 0.0003677855
## 5 6.471019e-01 -5.699668e-05 -0.0004492553 0.0003715537
## 6 7.163275e-01 -1.205521e-04 -0.0018678198 0.0016170568
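The printed columns appear to be related in a simple way. This is an observation from the numbers in the table above, not a statement about rolypoly internals: bt_value matches the bootstrap estimate divided by the bootstrap error, and bp_value matches a one-sided upper-tail normal p-value of bt_value. The sketch below checks this for the Lung row, using values copied from the table.

```r
# Values copied from the Lung row of the bootstrap table above.
bootstrap_estimate <- 1.085798e-04
bootstrap_error    <- 0.0003384737

# Ratio of estimate to its bootstrap standard error.
bt_value <- bootstrap_estimate / bootstrap_error

# One-sided (upper-tail) p-value under a standard normal.
bp_value <- pnorm(bt_value, lower.tail = FALSE)

# The quantity used to rank annotations by strength of association.
neg_log10_p <- -log10(bp_value)

c(bt_value = bt_value, bp_value = bp_value, neg_log10_p = neg_log10_p)
```

Because the p-value is one-sided, an annotation with a negative estimate (such as Heart) receives a p-value above 0.5.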
For visualization, one can use the following function, which plots the estimates with 95% confidence intervals:
plot_rolypoly_annotation_estimates(rp)
Additionally, we plot \(-\log_{10}(p)\) to rank tissues by the strength of their association:
plot_rolypoly_annotation_ranking(rp)
These functions return ggplot2 objects, so feel free to manipulate them as such.
Without any parallelization rolypoly could take a couple of hours to run. Much of the computation is spent linking SNPs to genes and reading in LD information. If one provides a rolypoly object to the rolypoly parameter of the rolypoly_roll function call, along with a new block data object, then only inference is performed. For example, with the previous rolypoly object we would run inference on a new expression data set with the following call (assuming that the gene labels did not change between data sets).
rp <- rolypoly_roll(
  # reuse the precomputed SNP-gene links and LD from the previous run
  rolypoly = rp,
  # some new set of expression data
  block_data = new_sim_expression_data_normalized
)
Thus, one can run the slow precomputation once and then rerun the faster inference step on various expression matrices.