The Biopeak R package enables the user to systematically identify and visualize impulse-like gene expression changes within short genomic series experiments. On a systems level, this tool allows to characterize dynamic gene expression signatures underlying biological processes. In order to detect such activation peaks, the gene expression is treated as a signal that propagates along the experimental axis (time, temperature or other series conditions).
To demonstrate the functionality of this package, we are using a dataset comprising transcriptional profiles of human epithelial cells in response to heat available on GEO (GSE7458) (J. M. Laramie 2008). In vitro cultured HEp2 cells were heated at 37 to 43 °C for 60 min and microarray gene expression profiles were acquired at 37, 40, 41, 42 and 43 °C.
In a first step, we load the heat-shock data:
library(Biopeak)
# load the heat-shock data
exprmat <- heat
# Show first rows and columns of the expression matrix
head(exprmat)
## Hep2_37 Hep2_40 Hep2_41 Hep2_42 Hep2_43
## DAZL 25.64188 26.44370 48.65872 47.51223 17.54344
## RAD23A 776.89917 609.15550 602.97183 823.58900 502.01850
## RAP1B 1033.05450 1232.86550 1028.10150 900.49950 1074.90267
## ARHGAP6 35.05837 53.72763 26.60629 31.21045 29.21352
## GYPB 157.53467 167.29190 159.62667 157.61000 196.08050
## UBE2N 464.30233 517.65315 549.37597 682.46653 521.46451
One of the strengths of the peak detection algorithm lays in its great flexibility to process data source-independently. While the main focus of this package is aimed at the characterization of genomic data, in theory, any type of series experiment that can be described as a signal could be analyzed. Since peaks are assessed for each gene relative to its expression over time or over the course of different stimuli strenghts, prior normalization steps are usualy not required. Therefore, we can directly subject the Rna-Seq or microarray count expression matrix to the peak detection algorithm: the peakDetection function.
The peakDetection function takes as input the expression matrix and the series description, a numeric vector which defines the conditions/time-points of sample acquisition. In order to identify biomarkers which are specific to a distinct phase of the observed biological process, we provide a set of tunable filters that define the characterstics of a peak. Peak characteristics that have to be defined are:
Furthermore, the function supports optional parameters for more complex operations:
In this case study, we are primarily interested in finding genes that peak at a single condition, in order to identify gene activation thresholds under heat-shock conditions. Thus, we set the prominence parameter to 1.3, to select for genes that feature a peak that is at least 30 % higher in magnitude than any other peak. Moreover, we want to filter out peaks with low heights relative to the mean expression. Therefore the actstrength parameter is set to 1.5, selecting for peaks that feature at least 50 % higher expression than the mean expression of that gene. Lastly, we selected only the top 50 % expressed genes, to reduce the computational load.
# define condition series
series <- c(37,40,41,42,43)
# Find the expression limit for the 50 % highest expressing genes and selec them
exprlim <- quantile(exprmat, probs <- seq(0,1,0.01))[51]
exprmat <- exprmat[which(rowMeans(exprmat) > exprlim),]
# run the peak detection algorithm
peakdet <- peakDetection(exprmat, series, type ='rnaseq', actstrength = 1.5, prominence = 1.3,
bgcorr = F)
The peakDetection function returns a set of vectors and matrices:
The output of the peakDetection function provides a format that facilitates queries on a gene-level and systems-level.
## [1] 205
## [1] "CNNM2" "SFI1"
## [3] "PNN" "KRT7"
## [5] "BASP1" "RAF1"
## [7] "GBA /// GBAP1 /// LOC100510710" "GADD45B"
## [9] "MKL1" "CDK13"
## [11] "NRG1" "XKRY /// XKRY2"
## [13] "H3F3A /// H3F3B /// MIR4738" "RCAN2"
## [15] "LOC100507577 /// LONP2" "SLC7A5"
## [17] "RPL23 /// RPL23P8 /// SNORA21" "AKR7A2"
## [19] "ZNF787" "NUP188"
## [21] "ACTA2" "APOF"
## [23] "HOXD13" "BSN"
## [25] "PPAP2B" "CAPN1"
## [27] "CNKSR2" "USP6NL /// USP6NL-IT1"
## [29] "FLRT2 /// LOC100506718" "TNFRSF8"
## [31] "MAGEC1" "ARSF"
## [33] "BRS3" "C2CD2L"
## [35] "KRT18" "DLEC1"
## [37] "SYT17" "ANXA11"
## [39] "KIR3DL1 /// KIR3DL2" "ARHGAP29"
## [41] "MXRA5" "IFI44L"
## [43] "HAAO" "CKMT2"
## [45] "DBI" "CUL3"
## [47] "DOPEY2" "CYTH2"
## [49] "TUBB2B" "TNFRSF14"
## [51] "C5orf45" "CAPRIN1"
## [53] "ITGA5" "GBF1"
## [55] "HEXIM1" "P2RX5"
## [57] "ID2" "CHD9"
## [59] "MT1F /// MT1G /// MT1M /// MT1P3" "GATA6"
## [61] "MSX2P1" "Fas"
# return the peakheight and location of the gene CDK13
c(peakdet$peakheight[which(peakdet$peakgenes == 'CDK13')],
peakdet$peakloc[which(peakdet$peakgenes == 'CDK13')])
## [1] 4114.631 43.000
# assess the main peak neighborhood for sustained activation
peakdet$neighbors[which(peakdet$peakgenes == 'CDK13'),]
## [1] 0 4 0 0 1
The plotExpression function allows to plot the expression signal of individual genes:
Gene co-expression analyses are a widely used tool in genomic studies. In this package we provide the getCormat function which calculates a gene-wise correlation matrix and plots a bi-clustered heatmap. The function returns both the heatmap object and the re-ordered correlation matrix:
## null device
## 1
corobjects <- getCormat(peakdet, exprmat, method = 'spearman')
# extract heatmap object
corheatmap <- corobjects$hm
# extract re-ordered correlation matrix
cormatrix <- corobjects$hm_cormat
# inspect the first 5 x 5 gene-wise correlation of the correlation matrix returned by plotCormat
cormatrix[1:5,1:5]
## TNNT1 ST3GAL5 CHD2
## RIOK3 -0.6 -0.6 -0.6
## JMJD6 -0.6 -0.6 -0.6
## IL23A /// TRBC2 /// TRBC2 -0.6 -0.6 -0.6
## IL1R1 -0.6 -0.6 -0.6
## COPS5 -0.6 -0.6 -0.6
## LOC101929087 /// SUMO2 /// SUMO3 HHLA1
## RIOK3 -0.6 -0.7
## JMJD6 -0.6 -0.7
## IL23A /// TRBC2 /// TRBC2 -0.6 -0.7
## IL1R1 -0.6 -0.7
## COPS5 -0.6 -0.7
The peakDetection algorithm represents a powerful tool for the exhaustive identification of condition or time-point specific biomarkers. For the analysis of the temporal gene expression and given enough time-points, those time-points can be further grouped together to reflect waves of gene activation, such as an immediate, early, mid and late response. Since in most explorative studies the number of such waves underlying a given biological process is unknown, the Biopeak package provides a cluster exploration function: findClusters. The findClusters function estimates the number of genes with similar temporal regulation and supports three different clustering algorithms: kmeans, dbscan and hierarchical clustering. Clustering is based on a PCA projection of the input data and cluster assignment and parameters are returned by the function.
# display all plots in one graph
par(mfrow = c(3,1))
# cluster exploration using hierarchical clustering based on 3 clusters
clusters <-findClusters(peakdet, exprmat, method = 'hclust', clusters = 3)
# cluster exploration using dbscan (density-based) with an epsilon parameter of 0.02
clusters <- findClusters(peakdet, exprmat, method = 'dbscan', eps = 0.1)
# cluster exploration using kmeans with a maximum of 5 clusters to be assigned
clusters <- findClusters(peakdet, exprmat, maxclusters = 3, method = 'kmeans')
The plotHeatmap function represents a simple wrapper function that takes the output of the peakDetection function and normalizes the expression to log2 values for improved visualization before subjecting it to the heatmap.2 function. The clustermembers parameter is optional and allows to select for specific genes. Here, we take the output of the findClusters function and plot a heatmap for one of the clusters:
The function saveOutput saves the output of the peakDetetection funtion (peakgenes, peaklocation and peakheight) to a text file.
J. M. Laramie, B. Brownstein, T. P. Chung. 2008. Transcriptional Profiles of Human Epithelial Cells in Response to Heat. Shock.