In this vignette, we demonstrate the NetGSA workflow using a breast cancer data example downloaded from the Cancer Genome Atlas (TCGA 2012). In particular, we illustrate how to incorporate existing network information (e.g. curated edges from KEGG) to improve the power of pathway enrichment analysis with NetGSA. Details of the method are available in Ma, Shojaie, and Michailidis (2016).
Our example data set comes from a breast cancer study (TCGA 2012), which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate a diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.
We first load necessary R/Bioconductor packages. Additional packages such as msigdbr are useful if you would like to import pathways from MSigDB.
NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. It is also important to label the rows and columns of the data matrix. Rows of the data matrix correspond to genes, whereas columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways.
## [1] "edgelist" "g" "group" "nonedgelist" "pathways"
## [6] "x"
The variables in this data object include
x
), with rows for genes and columns for samples,group
) for ER status,g
),edgelist
),nonedgelist
),pathways
).## [1] "127550" "53947" "65985" "51166" "15" "60496"
We can find out the ER status by looking at the group labels.
## group
## 1 2
## 403 117
The two data frames nonedgelist
and edgelist
consist of edges (nonedges) extracted from user provided sources. Each row represents one edge. The first column represents the starting node, with the second column being the end node. The last column indicates whether the edge is directed. As of now, NetGSA assumes that all edges are either simultaneously undirected or simultaneously directed, as determined by the first edge.
## src dest direction
## 1 15 7534 undirected
## 2 31 57016 undirected
## 3 31 5563 undirected
## 4 60 71 undirected
## 5 60 867 undirected
## 6 60 4609 undirected
We consider pathways from KEGG (Kanehisa and Goto 2000). KEGG pathways can be accessed in R using the graphite package.
## "Glycolysis / Gluconeogenesis" pathway
## Native ID = hsa:00010
## Database = KEGG
## Species = hsapiens
## Number of nodes = 94
## Number of edges = 1191
## Retrieved on = 18-11-2018
## URL = http://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00010
## [1] "ENTREZID:10327" "ENTREZID:124" "ENTREZID:125" "ENTREZID:126"
## [5] "ENTREZID:127" "ENTREZID:128"
Alternatively, one can also use the function preparePathways
to import pathways from KEGG.
## [1] "10327" "124" "125" "126" "127" "128"
Note if one chooses to import pathways from MSigDB, genes in the resulting pathways will be labeled using gene symbols. As a result, the data matrix should also use the same set of gene names.
## [1] "19" "11194" "10449" "33" "34" "35"
For the purpose of this tutorial, we use the pathways stored in pathways
:
## $`Adipocytokine signaling pathway`
## [1] "10645" "1147" "126129" "1374" "1375" "2475" "32"
## [8] "3551" "3717" "3952" "3953" "4790" "51094" "51422"
## [15] "53632" "5465" "5562" "5563" "5564" "5565" "5571"
## [22] "5588" "5599" "5601" "5602" "5970" "6774" "6794"
## [29] "7124" "7132" "7133" "7186" "79602" "84254" "8517"
## [36] "8717" "9021" "9370" "3667" "4792" "4793" "4794"
## [43] "8471" "8660" "2180" "2181" "2182" "23205" "23305"
## [50] "51703" "81616" "5781" "181" "2538" "4852" "5105"
## [57] "5106" "57818" "6513" "6517" "92579" "10891" "5443"
##
## $`Adrenergic signaling in cardiomyocytes`
## [1] "10411" "107" "108" "109" "11069" "111" "112"
## [8] "113" "114" "115" "1390" "146" "147" "148"
## [15] "153" "154" "163688" "185" "186" "196883" "23236"
## [22] "23533" "2770" "2771" "2773" "2776" "2778" "28227"
## [29] "51806" "5290" "5291" "5293" "5294" "5295" "5296"
## [36] "5330" "5331" "5332" "5350" "5499" "5500" "5501"
## [43] "55012" "5502" "5515" "5516" "5518" "5519" "5520"
## [50] "5521" "5522" "5523" "5525" "5526" "5527" "5528"
## [57] "5529" "5566" "5567" "5568" "5578" "55844" "5594"
## [64] "5595" "70" "7134" "7137" "7139" "7168" "7169"
## [71] "7170" "7171" "801" "805" "808" "810" "815"
## [78] "816" "817" "818" "8503" "91860" "9252" "1432"
## [85] "5600" "5603" "6300" "10488" "1385" "1386" "1388"
## [92] "148327" "468" "596" "64764" "84699" "90993" "9586"
## [99] "10000" "207" "208" "10368" "10369" "27091" "27092"
## [106] "55799" "59283" "59284" "59285" "6262" "775" "776"
## [113] "778" "779" "781" "782" "783" "784" "785"
## [120] "786" "9254" "93589" "488" "23439" "3753" "3784"
## [127] "476" "477" "478" "480" "481" "482" "483"
## [134] "486" "490" "491" "492" "493" "6324" "6330"
## [141] "6331" "6332" "6546" "6548" "4624" "4625" "4633"
## [148] "4634" "4635"
To prepare the adjacency matrices needed for the function NetGSA
, one can read existing edges from file. For example, suppose the edges in edgelist
are stored in a csv file. The function prepareAdjacencyMatrix
can input edges from external files or import edges from KEGG data base.
write.csv(edgelist,file='edgelist.txt',row.names = FALSE)
out <- prepareAdjacencyMatrix(x, group, pathways, FALSE, 'edgelist.txt', NULL)
The object pathways
has in total 100 pathways. To illustrate the use of NetGSA
, consider only genes from ErbB signaling pathway and Jak-STAT signaling pathway. Given the pathways, one first estimates the underlying networks (represented as weighted adjacency matrices), while incorporating available network information.
genenames <- unique(c(pathways[[24]], pathways[[52]]))
genenames <- intersect(genenames, rownames(x))
p <- length(genenames)
p
## [1] 192
sx <- x[match(genenames, rownames(x)),]
sout <- prepareAdjacencyMatrix(sx, group, pathways, FALSE, 'edgelist.txt', NULL)
prepareAdjacencyMatrix
returns the pathway indicator matrix that will be used in the NetGSA
function. In this example, due to the high overlapping among pathways, the selected genes actually cover 41 pathways.
## [1] 41 192
The returned 0-1 adjacency matrices from prepareAdjacencyMatrix
can be used as network information for estimating the partial correlation network under each condition. We recommend using carefully chosen tuning parameters as this yields better estimates of the networks.
ncond <- length(unique(group))
Amat <- vector("list",ncond)
sx <- sx[match(colnames(sout$B), rownames(sx)),]
for (k in 1:ncond){
data_c <- sx[,(group==k)]
# select the tuning parameter
fitBIC <- bic.netEst.undir(data_c,one=sout$Adj,
lambda=seq(1,10)*sqrt(log(p)/ncol(data_c)),eta=0.1)
# refit the network
fit <- netEst.undir(data_c,one=sout$Adj,
lambda=which.min(fitBIC$BIC)*sqrt(log(p)/ncol(data_c)),eta=0.1)
Amat[[k]] <- fit$Adj
}
Given the networks, one can test for pathway enrichment as follows:
## pathway pSize pval pFdr
## 1 Calcium signaling pathway 12 9.152277e-21 3.752434e-19
## 2 Hippo signaling pathway 6 3.709051e-19 7.603555e-18
## 3 Chemokine signaling pathway 38 1.015732e-18 1.388167e-17
## 4 Neurotrophin signaling pathway 45 1.386456e-17 1.421117e-16
## 5 Insulin signaling pathway 44 6.438692e-13 4.399773e-12
## 6 Phospholipase D signaling pathway 33 6.300883e-13 4.399773e-12
Note prepareAdjacencyMatrix
can also estimate the networks if the argument estimate_network=TRUE
. In such cases, prepareAdjacencyMatrix
returns the weighted adjacency matrices estimated under a fixed set of tuning parameters, which may not be optimal. The resulting adjacency matrices can be directly used in NetGSA
for pathway enrichment.
sout <- prepareAdjacencyMatrix(sx, group, pathways, FALSE, 'edgelist.txt', NULL, estimate_network=TRUE, lambda_c = 9, eta=0.1)
test2 <- NetGSA(sout$Amat, sx, group, pathways = sout$B, lklMethod = 'REHE')
head(test2$results)
## pathway pSize pval pFdr
## 1 Chemokine signaling pathway 38 1.847896e-22 7.576374e-21
## 2 Calcium signaling pathway 12 7.942856e-20 1.628285e-18
## 3 Hippo signaling pathway 6 1.983777e-19 2.711163e-18
## 4 Neurotrophin signaling pathway 45 1.417895e-18 1.453343e-17
## 5 Phosphatidylinositol signaling system 10 1.609605e-13 1.319876e-12
## 6 Phospholipase D signaling pathway 33 8.362044e-13 5.714064e-12
NetGSA
can also handle directed networks. For example,
## [1] TRUE
genenames <- V(g)$name
p <- length(genenames)
# reorder the variables and get the adjacency matrix
reOrder <- topo_sort(g,"in")
Adj <- as.matrix(get.adjacency(g))
Adj <- Adj[reOrder,reOrder]
B <- matrix(rep(1,p),nrow=1)
rownames(B) <- "Adrenergic signaling in cardiomyocytes"
colnames(B) <- rownames(Adj)
gx <- x[match(rownames(Adj), rownames(x)),]
Amat <- vector("list", 2)
for (k in 1:2){
data_c <- gx[,which(group==k)]
Amat[[k]] <- netEst.dir(data_c, one = Adj)$Adj
}
test <- NetGSA(Amat, gx, group, pathways = B, lklMethod = 'REHE')
Kanehisa, Minoru, and Susumu Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Research 28 (1). Oxford University Press: 27–30.
Ma, Jing, Ali Shojaie, and George Michailidis. 2016. “Network-Based Pathway Enrichment Analysis with Incomplete Network Information.” Bioinformatics 32 (20). Oxford University Press: 3165–74.
TCGA. 2012. “Comprehensive Molecular Portraits of Human Breast Tumours.” Nature 490 (7418). Nature Publishing Group: 61.