Network Fingerprint: a knowledge-based characterization of biomedical networks in R

Introduction

The network fingerprint method (Cui et al. 2015), which is based on GO knowledge and affinity propagation clustering, provides a novel representation of network differences known as a biological spectrum, or network fingerprint. The method has been used to describe the relationship between multiple disease networks and their related pathways, and to compare and parse different diseases visually by overlaying their fingerprints. We implemented fingerprint-based comparison of complex networks on the open scientific computing platform R and present the NFP package for fingerprint-based network analysis and comparison. Driven by the research needs of users, NFP provides a unified interface to three similarity clustering algorithms. In addition, NFP provides multiscale statistical analysis and visualization, and gives access to the specific attributes of different disease networks.

This manual is a brief introduction to the structure, functions and usage of the NFP package. The package provides a set of functions that support the knowledge-based network fingerprint (NFP) framework. A biomedical network is characterized as a spectrum-like vector called a “network fingerprint”, which contains its similarities to a set of basic reference networks. This framework provides a more intuitive way to decipher molecular networks, especially for large-scale network comparisons and clustering analyses.
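As a rough illustration of the "vector of similarities" idea, the sketch below builds a toy fingerprint in base R. It is not the NFP algorithm: the Jaccard overlap of gene sets is used here only as a stand-in for the GO-based, clustering-driven similarity, and the gene sets and the names `jaccard`, `query` and `reference_nets` are made up for the example.

```r
## Toy fingerprint: one similarity value per reference network.
## Jaccard overlap is a stand-in for the NFP similarity score (assumption).
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

## Hypothetical query network and reference networks, given as gene sets
query <- c("TP53", "MDM2", "CDKN1A")
reference_nets <- list(
  ref1 = c("TP53", "MDM2", "ATM"),
  ref2 = c("INS", "INSR"),
  ref3 = c("CDKN1A", "TP53", "MDM2")
)

## The fingerprint is the named vector of similarities to each reference
fingerprint <- vapply(reference_nets, jaccard, numeric(1), b = query)
print(fingerprint)
```

Two query networks can then be compared by comparing their fingerprint vectors rather than their raw topologies.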

The three main features of NFP:

Basic reference networks generation.
Network comparison, which encompasses network merging, annotation and similarity scoring.
Network standardization.

Installation

NFP requires these packages: magrittr, igraph, plyr, ggplot2, apcluster, dplyr, stringr, graph and KEGGgraph. Note that two of these dependencies, graph and KEGGgraph, are only available from Bioconductor, and install.packages() does not install Bioconductor packages by default. Bioconductor provides biocLite(), a wrapper around install.packages() that installs both CRAN and Bioconductor packages. More details on biocLite() are available from https://www.bioconductor.org/install/. Users can therefore install the latest released version of NFP directly with biocLite():

## install release version of NFP
source("http://bioconductor.org/biocLite.R")
biocLite("NFP")

or install the Bioconductor dependencies first:

## install the Bioconductor dependencies, then the release version of NFP
source("http://bioconductor.org/biocLite.R")
biocLite(c("graph", "KEGGgraph"))
install.packages("NFP")
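On recent Bioconductor releases, biocLite() has been superseded by the BiocManager package, so the commands above may no longer work as written. The following is a minimal sketch of an equivalent workflow: it lists which of the dependencies named above are missing and installs them with BiocManager::install(). The variable names are illustrative, and the install step is guarded by interactive() so that nothing is installed when the code is merely sourced.

```r
## Dependencies listed in this manual
nfp_deps <- c("magrittr", "igraph", "plyr", "ggplot2", "apcluster",
              "dplyr", "stringr", "graph", "KEGGgraph")

## Which of them are not installed yet?
is_installed <- vapply(nfp_deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- nfp_deps[!is_installed]
print(missing_deps)

## Install the missing ones; BiocManager::install() handles both CRAN and
## Bioconductor packages
if (interactive() && length(missing_deps) > 0) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(missing_deps)
}
```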

Users can also install the latest development version from GitHub. This requires the devtools package, which can be installed with install.packages("devtools"). Note that devtools sometimes needs extra non-R software on your system: Rtools on Windows or Xcode on OS X. More information is available in the devtools documentation.

## install NFP from github; requires the Bioconductor dependency
## packages to be pre-installed
if (!require(devtools)) install.packages("devtools")
devtools::install_github("yiluheihei/NFP")

After installation, NFP can be loaded into the current workspace with library(NFP).

Moreover, because of its large size (about 16 MB), the gene similarity data used by NFP is stored in an external data repository, NFPdata (https://github.com/yiluheihei/datarepo). For more details on how to construct external data repositories using the Additional_repositories field, see The Coatless Professor blog post. Users must therefore install NFPdata before running a network fingerprint analysis, using the following code.

if (!require("NFPdata")) {
    install_data_package()
}

Analysis Pipeline: from Basic Reference Network Generation to Network Fingerprint Visualization

We will go through an analysis pipeline to illustrate some of the main functions in NFP. This pipeline consists of several steps:

Basic reference network generation: prepare well-known biomedical networks as the reference networks of the NFP framework. Several pathway databases have been developed for biological network research, e.g. KEGG and Reactome (https://reactome.org). These pathway databases are well studied and can be used as the basic reference networks of NFP.
Network fingerprint calculation: the similarity between two biomedical networks is calculated based on the following intuition: the nodes in the merged network are grouped into strongly inter-connected communities whose members, drawn from the different networks, have high functional similarity scores. Functional similarity is measured based on GO (Ashburner et al. 2000), and the affinity propagation (AP) clustering algorithm (Frey and Dueck 2007) is employed to detect the aligned functional modules between the two networks being compared.
Network fingerprint visualization: show the network fingerprint across all the reference networks. Differences among the fingerprints of biological networks can be observed intuitively from the visualization.

Generating well-studied basic reference networks

The basic idea of calculating the network fingerprint is to map biomedical networks onto well-studied basic networks. KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge of molecular interaction and reaction networks. Since its introduction in 1995, KEGG PATHWAY has been widely used as a reference knowledge base for understanding biological pathways and the functions of cellular processes. The knowledge in KEGG has proven of great value in numerous studies across a wide range of fields (Kanehisa et al. 2007). NFP therefore takes KEGG pathways as its default basic reference networks.

The function load_KEGG_refnet() retrieves the KEGG pathway maps through the KEGG API http://www.kegg.jp/kegg/rest/keggapi.html. In KEGG, only the reference pathway maps are manually drawn; the organism-specific pathways are generated automatically from these reference maps. The organism parameter (e.g. organism = "hsa") specifies the organism of the KEGG pathway maps to retrieve.
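To give a feel for what a request to the KEGG REST service looks like, the sketch below assembles two request URLs in base R: one listing the pathways of an organism and one fetching a single pathway as KGML. The helper `kegg_rest_url()` is a hypothetical name, and this is not a description of how load_KEGG_refnet() is implemented internally. The actual download is guarded by interactive() because it needs network access.

```r
## Build a KEGG REST request URL from its path components (hypothetical helper)
kegg_rest_url <- function(...) {
  paste(c("https://rest.kegg.jp", ...), collapse = "/")
}

## List all pathways of an organism, here human ("hsa")
list_url <- kegg_rest_url("list", "pathway", "hsa")
## Fetch one pathway (hsa04110) in KGML format
kgml_url <- kegg_rest_url("get", "hsa04110", "kgml")
print(c(list_url, kgml_url))

## Only contact the server in an interactive session
if (interactive()) {
  print(head(readLines(list_url)))
}
```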

## do not run: retrieving pathway maps from the KEGG database may take
## several minutes, so this data is pre-stored in the package
## kegg_refnet <- load_KEGG_refnet(organism = 'hsa')
data(kegg_refnet)
# show the kegg reference networks
show(kegg_refnet)
#> Basic networks of organism hsa 
#> 
#> 134 basic networks; classification into 4 groups: 
#>                                  group_name net_num
#> group1       Genetic Information Processing      22
#> group2 Environmental Information Processing      28
#> group3                   Cellular Processes      15
#> group4                   Organismal Systems      69
#>                                                      net_name
#> group1             RNA polymerase - Homo sapiens (human), ...
#> group2           ABC transporters - Homo sapiens (human), ...
#> group3                Endocytosis - Homo sapiens (human), ...
#> group4 Hematopoietic cell lineage - Homo sapiens (human), ...

We defined a new S4 class, NFPRefnet, to store the NFP reference networks. NFP also provides five methods for this S4 class:

net(): Extract the basic reference networks of NFPRefnet.
group(): Obtain the group information: the group names, the number of groups and the size of each group, e.g. the KEGG pathway database classifies its pathway maps into seven groups.
subnet(): Extract or replace parts of the NFP basic reference networks.
show(): Display of NFPRefnet.
name(): Extract the names of reference networks.

## group information of kegg reference networks
refnet_group <- group(kegg_refnet)
show(refnet_group)
#> $name
#> [1] "Genetic Information Processing"      
#> [2] "Environmental Information Processing"
#> [3] "Cellular Processes"                  
#> [4] "Organismal Systems"                  
#> 
#> $num
#> [1] 4
#> 
#> $size
#>       Genetic Information Processing Environmental Information Processing 
#>                                   22                                   28 
#>                   Cellular Processes                   Organismal Systems 
#>                                   15                                   69
## select groups 1 and 2, and subset these two groups
selected_group <- refnet_group$name[c(1, 2)]
NFPnet <- subnet(kegg_refnet, selected_group)
NFPnet
#> Basic networks of organism hsa 
#> 
#> 50 basic networks; classification into 2 groups: 
#>                                  group_name net_num
#> group1       Genetic Information Processing      22
#> group2 Environmental Information Processing      28
#>                                            net_name
#> group1   RNA polymerase - Homo sapiens (human), ...
#> group2 ABC transporters - Homo sapiens (human), ...

For detailed instructions on these five methods, refer to the package function help.
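For readers less familiar with S4, the accessor pattern used above can be sketched with a toy class. Everything in this block is hypothetical: `ToyRefnet` and `toy_group()` are not part of NFP, the slots only mirror the arguments that appear in this manual (network, name, group, organism), and plain character strings stand in for graph objects.

```r
library(methods)

## A toy container: one list of networks per group (hypothetical class)
setClass("ToyRefnet", representation(
  network  = "list",
  name     = "list",
  group    = "character",
  organism = "character"
))

## A generic and method that summarise the group information
setGeneric("toy_group", function(object) standardGeneric("toy_group"))
setMethod("toy_group", "ToyRefnet", function(object) {
  list(
    name = object@group,             # group names
    num  = length(object@group),     # number of groups
    size = lengths(object@network)   # networks per group
  )
})

## One group holding two placeholder networks
toy <- new("ToyRefnet",
  network  = list(list("net A", "net B")),
  name     = list(c("net A", "net B")),
  group    = "test group",
  organism = "hsa"
)
info <- toy_group(toy)
print(info)
```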

Users can also build a customized NFPRefnet as the reference for computing network fingerprints. Refer to the documentation of NFPRefnet for details on the composition of this class. The graphite package (Sales et al. 2012) allows users to build graphNEL objects from several pathway databases.

## KEGG human pathway maps retrieved with graphite
require(graphite)
human_pathway <- pathways("hsapiens", "kegg")
## just choose first two pathway maps for testing
p <- human_pathway[1:2]
show(p)
#> KEGG pathways for hsapiens
#> 2 entries, retrieved on 17-04-2019
g <- lapply(p, pathwayGraph)
show(g)
#> $`Glycolysis / Gluconeogenesis`
#> A graphNEL graph with directed edges
#> Number of Nodes = 68 
#> Number of Edges = 728 
#> 
#> $`Citrate cycle (TCA cycle)`
#> A graphNEL graph with directed edges
#> Number of Nodes = 30 
#> Number of Edges = 196

Users can then create their own customized object as follows:

## here, just take the above two pathway maps as NFP
## basic reference networks as an example
g_names <- names(human_pathway)[1:2]
## only one group and two reference networks
customized_refnet <- new("NFPRefnet", network = list(g), name = list(g_names), 
    group = "test group", organism = "hsa")
## methods of NFPRefnet
show(customized_refnet)
#> Basic networks of organism hsa 
#> 
#> 2 basic networks; classification into 1 groups: 
#>        group_name net_num                          net_name
#> group1 test group       2 Glycolysis / Gluconeogenesis, ...
group(customized_refnet)
#> $name
#> [1] "test group"
#> 
#> $num
#> [1] 1
#> 
#> $size
#> [1] 2
subnet(customized_refnet, "test group", 1)
#> Basic networks of organism hsa 
#> 
#> 1 basic networks; classification into 1 groups: 
#>        group_name net_num                          net_name
#> group1 test group       1 Glycolysis / Gluconeogenesis, ...

Network fingerprint calculation

The NFP algorithm consists of three steps: network merging, node clustering and similarity scoring.

Network merging. The two networks to be compared are first merged into one. Given two networks G1 and G2, the merged network Gm is constructed by connecting the nodes of G1 with the nodes of G2. Two nodes that correspond to the same protein are replaced in the merged network by a single node, which inherits all the interactions of the two individual nodes in the subsequent process.
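The node-collapsing part of the merging step can be sketched with plain edge lists in base R. This is a simplified illustration under stated assumptions, not the NFP implementation: the networks and the helper `merge_networks()` are made up, nodes are identified by name, and the cross-network links between G1 and G2 are left out.

```r
## Two toy networks as edge lists; node "B" occurs in both
g1 <- data.frame(from = c("A", "B"), to = c("B", "C"), stringsAsFactors = FALSE)
g2 <- data.frame(from = c("B", "D"), to = c("D", "E"), stringsAsFactors = FALSE)

## Hypothetical helper: pool the edges so that a node shared by both
## networks appears once and keeps the interactions from both
merge_networks <- function(x, y) {
  edges <- unique(rbind(x, y))
  list(
    nodes  = sort(unique(c(edges$from, edges$to))),
    edges  = edges,
    shared = intersect(c(x$from, x$to), c(y$from, y$to))
  )
}

gm <- merge_networks(g1, g2)
print(gm$nodes)
print(gm$shared)
```

Because nodes are matched by name, the shared node "B" ends up with its interactions from g1 (A, C) and from g2 (D) in the merged edge list.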

Clustering in the merged network. The nodes in the merged network are grouped into strongly inter-connected communities whose members, drawn from the different networks, have high functional similarity scores. We employ the affinity propagation (AP) clustering algorithm to detect the aligned functional modules between the two networks being compared. Nodes are assigned to clusters based on nearest neighbor analysis.

Similarity scoring. The similarity score is calculated in two steps. First, a local similarity is computed for each cluster, and these are combined across clusters into a network similarity. Second, the score is standardized: the raw similarity score depends to some extent on the topological properties of the query network, which introduces an implicit bias into the network fingerprint, because outliers can greatly distort the pattern it presents. To eliminate these topological differences, the similarity calculation is standardized, and the final network fingerprint presented to users is fully standardized. The standardization is based on the distribution of similarity scores obtained from random networks. The random networks used for this estimate match the original network in three topological properties: the number of nodes, the number of edges and the node degrees.

To improve the efficiency of fingerprint calculation without affecting the results of standardization, we limit the number of permutations (the default is 100) used for background network randomization in the standardization process. Users can adjust the number of randomizations according to the precision they require for the network fingerprint.
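One common way to standardize an observed score against scores from randomized networks is a z-score. The sketch below assumes that form purely for illustration; the text above only states that standardization is based on the random score distribution, so the exact formula used by NFP may differ. The function name `standardize_score()` and the simulated scores are hypothetical.

```r
## z-score of an observed similarity against permutation scores (assumed form)
standardize_score <- function(observed, permuted) {
  (observed - mean(permuted)) / sd(permuted)
}

## Stand-in for the similarity scores of 100 randomized background networks
set.seed(1)
permuted_scores <- rnorm(100, mean = 0.3, sd = 0.05)

z <- standardize_score(0.45, permuted_scores)
print(z)
```

A larger number of permutations gives a more stable estimate of the mean and standard deviation of the background distribution, at the cost of a longer computation.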

NFP provides the calc_sim_score() function for calculating the similarity score between two networks. As the similarity score depends on the size of the network, we use the method of Maslov and Sneppen (2002) to randomize a network while preserving its degree distribution. The nperm parameter of calc_sim_score() sets the number of random network permutations (the default is 100) used when calculating the standardized similarity score described above. We define an S4 class, NFP, in our package to store the calculated network fingerprints.
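The idea behind degree-preserving randomization can be sketched as repeated edge swaps: two edges (u1, v1) and (u2, v2) are replaced by (u1, v2) and (u2, v1) whenever this creates neither a self-loop nor a duplicate edge. The base R code below is a minimal illustration for a small undirected simple graph with integer node ids; `rewire_preserving_degree()` and `n_swaps` are hypothetical names, and this is not the code NFP runs internally.

```r
## Edge-swap randomization that keeps every node's degree unchanged
rewire_preserving_degree <- function(edges, n_swaps = 100) {
  edges <- as.matrix(edges)
  ## Orientation-independent label for an undirected edge
  edge_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "|")
  for (i in seq_len(n_swaps)) {
    pick <- sample(nrow(edges), 2)
    u1 <- edges[pick[1], 1]; v1 <- edges[pick[1], 2]
    u2 <- edges[pick[2], 1]; v2 <- edges[pick[2], 2]
    ## Skip swaps that would create a self-loop
    if (u1 == v2 || u2 == v1) next
    ## Skip swaps that would duplicate an existing edge
    existing <- edge_key(edges[, 1], edges[, 2])
    if (edge_key(u1, v2) %in% existing || edge_key(u2, v1) %in% existing) next
    edges[pick[1], 2] <- v2
    edges[pick[2], 2] <- v1
  }
  edges
}

## Toy undirected network on five nodes
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 1), c(1, 3))
set.seed(42)
rewired <- rewire_preserving_degree(edges, n_swaps = 50)

## Degrees before and after rewiring
print(tabulate(as.vector(edges), nbins = 5))
print(tabulate(as.vector(rewired), nbins = 5))
```

Each accepted swap removes one edge end from v1 and v2 and gives one back to each, so the degree of every node, and with it the number of nodes and edges, stays the same.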

As a simple example, we take the two pathway maps in g as query networks and a subset of kegg_refnet as the reference networks. The NFP can then be calculated as follows:

## set g as the query network
query_net <- g
## a subset of kegg_refnet: select the first five networks of
## groups 1 and 2
group_names <- group(kegg_refnet)$name
sample_NFPRefnet <- subnet(kegg_refnet, group_names[1:2], list(1:5, 
    1:5))
## In order to save calculating time, we take nperm = 10
NFP_score <- lapply(query_net, calc_sim_score, NFPnet = sample_NFPRefnet, 
    nperm = 10)
## methods of NFP class
show(NFP_score[[1]])
randomized_score <- perm_score(NFP_score[[1]])
cluster <- cluster_info(NFP_score[[1]])

Network fingerprint visualization

NFP provides the plot_NFP() function to visualize the network fingerprint of a single query network.

plot_NFP(NFP_score[[1]])

knitr::include_graphics("nfp_plot.png")
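For orientation, a fingerprint plot is essentially one value per reference network, usually coloured by reference group. The base R sketch below draws such a plot for made-up values; it does not reproduce the output of plot_NFP(), and the scores, group labels and colours are all hypothetical.

```r
## Made-up standardized similarity scores for four reference networks
toy_fingerprint <- c(ref1 = 1.8, ref2 = -0.4, ref3 = 0.6, ref4 = 2.3)
ref_group <- factor(c("group1", "group1", "group2", "group2"))
group_cols <- c("steelblue", "tomato")

## One bar per reference network, coloured by group
bar_mid <- barplot(toy_fingerprint,
  col  = group_cols[ref_group],
  xlab = "reference network",
  ylab = "standardized similarity",
  main = "Toy network fingerprint"
)
legend("topleft", legend = levels(ref_group), fill = group_cols, bty = "n")
```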

Session Information

The versions of R and of the packages loaded when generating this vignette were:

#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] graphite_1.30.0     NFP_0.99.3          graph_1.62.0       
#> [4] BiocGenerics_0.30.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_0.2.5     xfun_0.9             purrr_0.3.2         
#>  [4] lattice_0.20-38      colorspace_1.4-1     vctrs_0.2.0         
#>  [7] htmltools_0.3.6      stats4_3.6.1         yaml_2.2.0          
#> [10] blob_1.2.0           XML_3.98-1.20        rlang_0.4.0         
#> [13] pillar_1.4.2         glue_1.3.1           DBI_1.0.0           
#> [16] rappdirs_0.3.1       bit64_0.9-7          lifecycle_0.1.0     
#> [19] plyr_1.8.4           stringr_1.4.0        munsell_0.5.0       
#> [22] gtable_0.3.0         memoise_1.1.0.9000   evaluate_0.14       
#> [25] Biobase_2.44.0       knitr_1.24           IRanges_2.18.2      
#> [28] AnnotationDbi_1.46.1 Rcpp_1.0.2           scales_1.0.0        
#> [31] backports_1.1.4      checkmate_1.9.4      formatR_1.7         
#> [34] S4Vectors_0.22.0     apcluster_1.4.8      bit_1.1-14          
#> [37] png_0.1-7            ggplot2_3.2.1        digest_0.6.21       
#> [40] stringi_1.4.3        dplyr_0.8.3          grid_3.6.1          
#> [43] tools_3.6.1          bitops_1.0-6         magrittr_1.5        
#> [46] RCurl_1.95-4.12      lazyeval_0.2.2       tibble_2.1.3        
#> [49] RSQLite_2.1.2        crayon_1.3.4         tidyr_1.0.0         
#> [52] pkgconfig_2.0.3      zeallot_0.1.0        Matrix_1.2-17       
#> [55] KEGGgraph_1.44.0     assertthat_0.2.1     rmarkdown_1.15      
#> [58] httr_1.4.1           R6_2.4.0             igraph_1.2.4.1      
#> [61] compiler_3.6.1

References

Ashburner, Michael, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology.” Nature Genetics 25 (1). Nature Publishing Group: 25.

Cui, Xiuliang, Haochen He, Fuchu He, Shengqi Wang, Fei Li, and Xiaochen Bo. 2015. “Network Fingerprint: A Knowledge-Based Characterization of Biomedical Networks.” Scientific Reports 5. Nature Publishing Group: 13286.

Frey, Brendan J, and Delbert Dueck. 2007. “Clustering by Passing Messages Between Data Points.” Science 315 (5814). American Association for the Advancement of Science: 972–76.

Kanehisa, Minoru, Michihiro Araki, Susumu Goto, Masahiro Hattori, Mika Hirakawa, Masumi Itoh, Toshiaki Katayama, et al. 2007. “KEGG for Linking Genomes to Life and the Environment.” Nucleic Acids Research 36 (suppl_1). Oxford University Press: D480–D484.

Maslov, Sergei, and Kim Sneppen. 2002. “Specificity and Stability in Topology of Protein Networks.” Science 296 (5569). American Association for the Advancement of Science: 910–13.

Sales, Gabriele, Enrica Calura, Duccio Cavalieri, and Chiara Romualdi. 2012. “Graphite - a Bioconductor Package to Convert Pathway Topology to Gene Network.” BMC Bioinformatics 13 (1). BioMed Central: 20.