GPrank vignette

Hande Topa and Antti Honkela

2018-08-17

Introduction

GPrank package has been built upon the gptk package implemented by Kalaitzis et al., 2011 with the addition of “fixed variance”" kernel which allows to incorporate additional variance information from pre-processing of the observations into the Gaussian process (GP) regression models.

GPs are an ideal model for short and irregularly sampled time series and they can be used to model and rank multiple time series, each generated by different items within an experiment.

In our papers Topa et al., 2015 and Topa and Honkela, 2016, we have proposed variance estimation methods for Pool-seq and RNA-seq data, and evaluated the performance of our GP-based ranking method by comparing the precision and recall under different scenarios with and without variance usage. Simulation results have shown that variance usage leads to a higher average precision, which means less false positives appearing in the top of the ranked list. Motivated by these results, here we will explain how to use GPrank package and then provide examples for two different applications we had in our papers.

Citing GPrank

To cite GPrank in publications, please cite relevant of the two methodology papers Topa et al., 2015 and Topa and Honkela, 2016 that the software is based on.

Methods overview

Our GP-based ranking method uses Bayes factors to rank muliple time series where the Bayes factors are computed for each item by the ratio of its maximum marginal likelihood under two alternative GP models, namely time-dependent and time-independent. Time-independent model (which is referred as the null model in the package) assumes no temporal dependency between observations and uses only a white noise kernel to model the noise. Time-dependent model (which is referred as the model in the package) on the other hand, assumes a smooth temporal behavior, and in addition to the white noise kernel, it also includes a radial basis function (RBF) kernel to capture the temporal dependency. Furthermore, we use a fixed variance kernel in both models in order to incorporate variance information which could be obtained by appropriate estimation methods during pre-processing. For more technical details about the GP models, please refer to Rasmussen and Williams, 2006 as well as the papers mentioned in section Citing GPrank.

General usage of GPrank

Installing the package

In order to install GPrank package from the GitHub repository, start R and run the following command:

In order to install from CRAN, simply use the following command:

To load the package, run:

Necessary data

In order to construct a GP model, three vectors must be provided for each item. These vectors are:

Once we have obtained these vectors, we can construct a GP model with constructModel function using different kernels such as ‘rbf’, ‘white’, and ‘fixedvariance’.

Example:

Please make sure that the three vectors have the same length with each other. If the data is replicated, please remember to adjust the input vector accordingly. For example, if there are two replicates observed at n time points from time t1 to tn, vector t must be defined as: t=[t1, t1, t2, t2, … , tn, tn].

Fitting the models

apply_gpTest function takes t, y, and v vectors as input arguments and fits two alternative GP models to the data, and computes the log Bayes factors:

Note: By default, apply_gpTest function constructs the null model with kernelTypes=c('white','fixedvariance') and the alternative model with kernelTypes=c('rbf','white','fixedvariance').

Visualizing the models

In order to visualize the fitted GP model, one can use the plotGP function. One can also specify the color of the plot with the second argument. One can optionally specify the limits of the y axis as the third argument. This helps to adjust the plotting area when GP models of multiple items are displayed in a single figure. For example, y-axis limits can be determined by using getYlimits function in such cases. This function adjusts the plotting area between the minimum and maximum values of multiple models also taking into account two standard deviation confidence intervals.

In addition, a color palette containing the distinctive colors from RColorBrewer package can be obtained with the function getColorVector. The generated plot displays \(\pm 2\) standard deviations confidence region around the fitted model and errorbars denoting \(\pm 2\) standard deviations (provided from fixed variances) around the observations.

Building SQLite database

Once we have had all the results ready, and saved the figures in png format, we can use createDatabase function to create a database which can be used to view the results in the web browser with the help of tigreBrowser. tigreBrowser can be used to display the GP profiles on a browser and to filter them according to provided parameters such as Bayes factors. tigreBrowser is available on GitHub and can be installed with the command:

python setup.py install –user.

Following is a simplified example to create a database containing the provided parameters (Bayes factors and fold changes) and the figures for the genes geneA and geneB:

Note: Please name the figures starting with their corresponding identifiers followed by an underscore and the type of the figure. Specifying the type of the figures allows to display multiple figures for each item.

Once the database is created, it can be viewed with the command:

python tigreServer.py -d database_name.sqlite.

Applications

RNA-seq transcript expression analysis using BitSeq

Sample data

For demonstrating the usage of the functions with examples, we will be using a small sample data from an RNA-seq time series experiment which was introduced by Honkela et al., 2015 and also analysed by Topa and Honkela, 2016. The sample data set, named RNAseqDATA, contains mean and standard deviation information on the expression levels of 5 transcripts (which were originated from 2 genes) at 10 time points (0, 5, 10, 20, 40, 80, 160, 320, 640, 1280 mins) for three settings: “gene”, “abstr”" (absolute transcript), and “reltr”" (relative transcript) expression levels. In addition, the fields “gene_mapping”" and “time_mapping”" include information which is useful to match the genes with transcripts and the time points with data files, respectively. In order to load the data set, type:

If one is interested in getting this data structure from raw BitSeq output files himself, he may use the bitseq_rnaSeqData function:

Assuming BitSeq output files at different time points are named as “t0000.rpkm”, “t0005.rpkm”, “t0010.rpkm”, “t0020.rpkm”, “t0040.rpkm”, “t0080.rpkm”, “t0160.rpkm”, “t0320.rpkm”, “t0640.rpkm”, “t1280.rpkm” and the transcript information file is named as “example_tr”, the following should be able to produce the data structure we have presented here.

Fitting the models

From now on, let us continue with the gene-level data although one can simply perform the same with reltr and abstr levels as well. The function bitseq_fitGPs can be used to fit two GP models to each gene and compute the log Bayes factors:

If one is interested in saving the results into files, one should remember to specify the file names for fileName_logBF, fileName_ModelParams, fileName_NullModelParams and input them as arguments in the bitseq_fitGPs function:

Visualizing the models

Having the GP models fitted to the genes, one can plot the GP profile of a specified gene with the function bitseq_plotGP. For example, the GP profile of the gene ARAP2 shown below can be obtained by the following codes:

Fitted GP model for the overall gene expression levels

Fitted GP model for the overall gene expression levels

In order to save the figure, one can also specify the figure name in plotName option:

The input multi determines whether multiple plots (multi=1) or only a single plot (multi=0) will be plotted on the same figure. For example, if we would like to plot the GP profiles of all the transcripts of ARAP2 gene, we can display all on the same plot by settingmulti` to 1. Let’s try this for absolute transcript expression levels of ARAP2 gene and obtain the figure:

Fitted GP model for the absolute transcript expression levels

Fitted GP model for the absolute transcript expression levels

Let us also do the same for relative transcript expression levels and obtain the corresponding figure:

Fitted GP model for the relative transcript expression levels

Fitted GP model for the relative transcript expression levels

Quantitative analysis of population sequencing data

In Topa et al., 2015, we developed a GP-based method BBGP - beta binomial Gaussian process - for modeling the SNP frequencies in fruit fly populations across several generations in an experimental evolution study. The same method can in principle be used to analyse any population sequencing data where one is interested in the proportion of reads aligning to a specific location that contain a specific feature (such as a SNP).

Sample data

Here we will use a small sample data set named snpData. This sample data set contains 5 replicates of counts and sequencing depth information for 5 SNPs at the generations (0, 10, 20, 30, 40, 50, 60). In order to load the data set, run the command:

If one is interested in getting this data structure from raw data files which have the same format as the synchronized files used by PoPoolation2, (s)he may use the bbgp_snpData function.

Fitting the models

Given the counts and sequencing depth levels, we can use get_bbgpMeanStd function in order to get the posterior means and standard deviations of the frequencies using a beta binomial model with parameters \(\alpha\) and \(\beta\) set to 1.

Then, we can perform our GP-based test with apply_gpTest function:

Visualizing the models

Once we have fitted the GP model, we can visualize it using the plotGP function and obtain the figure:

Fitted GP model of the SNP frequencies across generations

Fitted GP model of the SNP frequencies across generations

jitterx option can be used to jitter the observations measured at same time points for clearer visualisation.

Session info

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/UTF-8/C/C/C/C
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] GPrank_0.1.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.17             knitr_1.20              
#>  [3] magrittr_1.5             maps_3.3.0              
#>  [5] gptk_1.08                bit_1.1-14              
#>  [7] lattice_0.20-35          highr_0.6               
#>  [9] blob_1.1.1               stringr_1.3.1           
#> [11] fields_9.6               tools_3.5.1             
#> [13] dotCall64_0.9-5.2        grid_3.5.1              
#> [15] spam_2.1-4               DBI_1.0.0               
#> [17] htmltools_0.3.6          matrixStats_0.53.1      
#> [19] bit64_0.9-7              yaml_2.1.19             
#> [21] rprojroot_1.3-2          digest_0.6.15           
#> [23] Matrix_1.2-14            RColorBrewer_1.1-2      
#> [25] tigreBrowserWriter_0.1.5 memoise_1.1.0           
#> [27] evaluate_0.10.1          RSQLite_2.1.1           
#> [29] rmarkdown_1.10           stringi_1.2.2           
#> [31] compiler_3.5.1           backports_1.1.2

References