Case Studies

Our example data set comes from a breast cancer study (TCGA 2012), which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate a diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.

Preparation

We first load necessary R/Bioconductor packages. Additional packages such as msigdbr are useful if you would like to import pathways from MSigDB.

library(glassoFast)
library(graphite)
library(igraph)
library(msigdbr) 
library(netgsa)
library(Rgraphviz)

Step 1: loading RNA-seq data

NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. It is also important to label the rows and columns of the data matrix. Rows of the data matrix correspond to genes, whereas columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways.

data(breastcancer2012, package = "netgsa")
ls()

## [1] "edgelist"    "g"           "group"       "nonedgelist" "pathways"   
## [6] "x"

The variables in this data object include

the data matrix (x), with rows for genes and columns for samples,
the group labels (group) for ER status,
a directed acyclic graph from KEGG (g),
the data frame consisting of known edges (edgelist),
the data frame consisting of (a subset of) the non-edges (nonedgelist),
a list of KEGG pathways indexed with Entrez gene IDs (pathways).

head(rownames(x))

## [1] "127550" "53947"  "65985"  "51166"  "15"     "60496"

We can find out the ER status by looking at the group labels.

table(group)

## group
##   1   2 
## 403 117

The two data frames nonedgelist and edgelist consist of edges (nonedges) extracted from user provided sources. Each row represents one edge. The first column represents the starting node, with the second column being the end node. The last column indicates whether the edge is directed. As of now, NetGSA assumes that all edges are either simultaneously undirected or simultaneously directed, as determined by the first edge.

head(edgelist)

##   src  dest  direction
## 1  15  7534 undirected
## 2  31 57016 undirected
## 3  31  5563 undirected
## 4  60    71 undirected
## 5  60   867 undirected
## 6  60  4609 undirected

Step 2: pathway enrichment analysis

Enrichment analysis with undirected networks

We consider pathways from KEGG (Kanehisa and Goto 2000). KEGG pathways can be accessed in R using the graphite package.

paths <- pathways('hsapiens','kegg')
paths[[1]]

## "Glycolysis / Gluconeogenesis" pathway
## Native ID       = hsa:00010
## Database        = KEGG
## Species         = hsapiens
## Number of nodes = 94
## Number of edges = 1191
## Retrieved on    = 18-11-2018
## URL             = http://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00010

head(nodes(paths[[1]]))

## [1] "ENTREZID:10327" "ENTREZID:124"   "ENTREZID:125"   "ENTREZID:126"  
## [5] "ENTREZID:127"   "ENTREZID:128"

Alternatively, one can also use the function preparePathways to import pathways from KEGG.

pathwayList <- preparePathways('kegg')
head(pathwayList[[1]])

## [1] "10327" "124"   "125"   "126"   "127"   "128"

Note if one chooses to import pathways from MSigDB, genes in the resulting pathways will be labeled using gene symbols. As a result, the data matrix should also use the same set of gene names.

pathwayList <- preparePathways('MSigDB')
head(pathwayList[[1]])

## [1] "19"    "11194" "10449" "33"    "34"    "35"

For the purpose of this tutorial, we use the pathways stored in pathways:

pathways[1:2]

## $`Adipocytokine signaling pathway`
##  [1] "10645"  "1147"   "126129" "1374"   "1375"   "2475"   "32"    
##  [8] "3551"   "3717"   "3952"   "3953"   "4790"   "51094"  "51422" 
## [15] "53632"  "5465"   "5562"   "5563"   "5564"   "5565"   "5571"  
## [22] "5588"   "5599"   "5601"   "5602"   "5970"   "6774"   "6794"  
## [29] "7124"   "7132"   "7133"   "7186"   "79602"  "84254"  "8517"  
## [36] "8717"   "9021"   "9370"   "3667"   "4792"   "4793"   "4794"  
## [43] "8471"   "8660"   "2180"   "2181"   "2182"   "23205"  "23305" 
## [50] "51703"  "81616"  "5781"   "181"    "2538"   "4852"   "5105"  
## [57] "5106"   "57818"  "6513"   "6517"   "92579"  "10891"  "5443"  
## 
## $`Adrenergic signaling in cardiomyocytes`
##   [1] "10411"  "107"    "108"    "109"    "11069"  "111"    "112"   
##   [8] "113"    "114"    "115"    "1390"   "146"    "147"    "148"   
##  [15] "153"    "154"    "163688" "185"    "186"    "196883" "23236" 
##  [22] "23533"  "2770"   "2771"   "2773"   "2776"   "2778"   "28227" 
##  [29] "51806"  "5290"   "5291"   "5293"   "5294"   "5295"   "5296"  
##  [36] "5330"   "5331"   "5332"   "5350"   "5499"   "5500"   "5501"  
##  [43] "55012"  "5502"   "5515"   "5516"   "5518"   "5519"   "5520"  
##  [50] "5521"   "5522"   "5523"   "5525"   "5526"   "5527"   "5528"  
##  [57] "5529"   "5566"   "5567"   "5568"   "5578"   "55844"  "5594"  
##  [64] "5595"   "70"     "7134"   "7137"   "7139"   "7168"   "7169"  
##  [71] "7170"   "7171"   "801"    "805"    "808"    "810"    "815"   
##  [78] "816"    "817"    "818"    "8503"   "91860"  "9252"   "1432"  
##  [85] "5600"   "5603"   "6300"   "10488"  "1385"   "1386"   "1388"  
##  [92] "148327" "468"    "596"    "64764"  "84699"  "90993"  "9586"  
##  [99] "10000"  "207"    "208"    "10368"  "10369"  "27091"  "27092" 
## [106] "55799"  "59283"  "59284"  "59285"  "6262"   "775"    "776"   
## [113] "778"    "779"    "781"    "782"    "783"    "784"    "785"   
## [120] "786"    "9254"   "93589"  "488"    "23439"  "3753"   "3784"  
## [127] "476"    "477"    "478"    "480"    "481"    "482"    "483"   
## [134] "486"    "490"    "491"    "492"    "493"    "6324"   "6330"  
## [141] "6331"   "6332"   "6546"   "6548"   "4624"   "4625"   "4633"  
## [148] "4634"   "4635"

To prepare the adjacency matrices needed for the function NetGSA, one can read existing edges from file. For example, suppose the edges in edgelist are stored in a csv file. The function prepareAdjacencyMatrix can input edges from external files or import edges from KEGG data base.

write.csv(edgelist,file='edgelist.txt',row.names = FALSE)
out <- prepareAdjacencyMatrix(x, group, pathways, FALSE, 'edgelist.txt', NULL)

The object pathways has in total 100 pathways. To illustrate the use of NetGSA, consider only genes from ErbB signaling pathway and Jak-STAT signaling pathway. Given the pathways, one first estimates the underlying networks (represented as weighted adjacency matrices), while incorporating available network information.

genenames <- unique(c(pathways[[24]], pathways[[52]]))
genenames <- intersect(genenames, rownames(x))
p <- length(genenames)
p

## [1] 192

sx <- x[match(genenames, rownames(x)),]
sout <- prepareAdjacencyMatrix(sx, group, pathways, FALSE, 'edgelist.txt', NULL)

prepareAdjacencyMatrix returns the pathway indicator matrix that will be used in the NetGSA function. In this example, due to the high overlapping among pathways, the selected genes actually cover 41 pathways.

# pathway indicator matrix
dim(sout$B)

## [1]  41 192

The returned 0-1 adjacency matrices from prepareAdjacencyMatrix can be used as network information for estimating the partial correlation network under each condition. We recommend using carefully chosen tuning parameters as this yields better estimates of the networks.

ncond <- length(unique(group))
Amat <- vector("list",ncond)
sx <- sx[match(colnames(sout$B), rownames(sx)),]

for (k in 1:ncond){
  data_c <- sx[,(group==k)]
  # select the tuning parameter
  fitBIC <- bic.netEst.undir(data_c,one=sout$Adj,
                             lambda=seq(1,10)*sqrt(log(p)/ncol(data_c)),eta=0.1)
  # refit the network
  fit <- netEst.undir(data_c,one=sout$Adj,
                      lambda=which.min(fitBIC$BIC)*sqrt(log(p)/ncol(data_c)),eta=0.1)
  Amat[[k]] <- fit$Adj
}

Given the networks, one can test for pathway enrichment as follows:

test1 <- NetGSA(Amat, sx, group, pathways = sout$B, lklMethod = 'REHE')
head(test1$results)

##                             pathway pSize         pval         pFdr
## 1         Calcium signaling pathway    12 9.152277e-21 3.752434e-19
## 2           Hippo signaling pathway     6 3.709051e-19 7.603555e-18
## 3       Chemokine signaling pathway    38 1.015732e-18 1.388167e-17
## 4    Neurotrophin signaling pathway    45 1.386456e-17 1.421117e-16
## 5         Insulin signaling pathway    44 6.438692e-13 4.399773e-12
## 6 Phospholipase D signaling pathway    33 6.300883e-13 4.399773e-12

Note prepareAdjacencyMatrix can also estimate the networks if the argument estimate_network=TRUE. In such cases, prepareAdjacencyMatrix returns the weighted adjacency matrices estimated under a fixed set of tuning parameters, which may not be optimal. The resulting adjacency matrices can be directly used in NetGSA for pathway enrichment.

sout <- prepareAdjacencyMatrix(sx, group, pathways, FALSE, 'edgelist.txt', NULL, estimate_network=TRUE, lambda_c = 9, eta=0.1)
test2 <- NetGSA(sout$Amat, sx, group, pathways = sout$B, lklMethod = 'REHE')
head(test2$results)

##                                 pathway pSize         pval         pFdr
## 1           Chemokine signaling pathway    38 1.847896e-22 7.576374e-21
## 2             Calcium signaling pathway    12 7.942856e-20 1.628285e-18
## 3               Hippo signaling pathway     6 1.983777e-19 2.711163e-18
## 4        Neurotrophin signaling pathway    45 1.417895e-18 1.453343e-17
## 5 Phosphatidylinositol signaling system    10 1.609605e-13 1.319876e-12
## 6     Phospholipase D signaling pathway    33 8.362044e-13 5.714064e-12

Enrichment analysis with directed networks

NetGSA can also handle directed networks. For example,

# e.g. the "Adrenergic signaling in cardiomyocytes" pathway from KEGG is a DAG.
print(is_dag(g))

## [1] TRUE

genenames <- V(g)$name
p <- length(genenames)

# reorder the variables and get the adjacency matrix
reOrder <- topo_sort(g,"in")
Adj <- as.matrix(get.adjacency(g))
Adj <- Adj[reOrder,reOrder]

B <- matrix(rep(1,p),nrow=1)
rownames(B) <- "Adrenergic signaling in cardiomyocytes"
colnames(B) <- rownames(Adj)
gx <- x[match(rownames(Adj), rownames(x)),]

Amat <- vector("list", 2)
for (k in 1:2){
  data_c <- gx[,which(group==k)]
  Amat[[k]] <- netEst.dir(data_c, one = Adj)$Adj
}
test <- NetGSA(Amat, gx, group, pathways = B, lklMethod = 'REHE')

References

Kanehisa, Minoru, and Susumu Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Research 28 (1). Oxford University Press: 27–30.

Ma, Jing, Ali Shojaie, and George Michailidis. 2016. “Network-Based Pathway Enrichment Analysis with Incomplete Network Information.” Bioinformatics 32 (20). Oxford University Press: 3165–74.

TCGA. 2012. “Comprehensive Molecular Portraits of Human Breast Tumours.” Nature 490 (7418). Nature Publishing Group: 61.

netgsa: Network-based Gene Set Analysis

Jing Ma

2019-03-18

Introduction