Network Fingerprint: a knowledge-based characterization of biomedical networks in R

Introduction

The network fingerprint method of Cui et al. (2015), which is based on GO knowledge and affinity propagation clustering, provides a novel representation of network differences known as a biological spectrum, or network fingerprint. The method has been used to describe the relationship between multiple disease networks and their related pathways, and to visually compare and parse different diseases by overlaying their fingerprints. We implemented fingerprint-based comparison of complex networks on the open scientific computing platform R, and present the NFP package for fingerprint-based network analysis and comparison. Driven by the research needs of users, NFP provides a unified interface to three similarity clustering algorithms. In addition, NFP provides multiscale statistical analysis and visualization, and access to the specific attributes of different disease networks.

This manual is a brief introduction to the structure, functions and usage of the NFP package. The package provides a set of functions that support the knowledge-based network fingerprint (NFP) framework. A biomedical network is characterized as a spectrum-like vector called a “network fingerprint”, which contains its similarities to a set of basic reference networks. This framework provides a more intuitive way to decipher molecular networks, especially for large-scale network comparisons and clustering analyses.

The three main features of NFP:

Installation

NFP requires these packages: magrittr, igraph, plyr, ggplot2, apcluster, dplyr, stringr, graph and KEGGgraph. Note that two of these dependencies, graph and KEGGgraph, are only available from Bioconductor, and install.packages() cannot install Bioconductor packages. Bioconductor provides biocLite(), a wrapper around install.packages() that installs both CRAN and Bioconductor packages. More details on biocLite() are available from https://www.bioconductor.org/install/. Users can therefore either install the latest released version of NFP directly with biocLite(), or install the Bioconductor dependencies first and then install NFP.
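A minimal sketch of the dependency-first route follows. It rests on two assumptions that are not part of the original text: it calls BiocManager::install(), which has superseded biocLite() on recent Bioconductor releases, and the names missing_pkgs() and do_install are hypothetical helpers introduced only for this illustration. The package lists are the ones given above.

```r
# Dependencies named in this section, split by repository.
cran_deps <- c("magrittr", "igraph", "plyr", "ggplot2",
               "apcluster", "dplyr", "stringr")
bioc_deps <- c("graph", "KEGGgraph")

# Hypothetical helper: which of the given packages are not installed yet?
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Kept FALSE so that sourcing this sketch never touches the network;
# set to TRUE to actually install the missing dependencies.
do_install <- FALSE

if (do_install) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  bioc_needed <- missing_pkgs(bioc_deps)
  if (length(bioc_needed) > 0) BiocManager::install(bioc_needed)
  cran_needed <- missing_pkgs(cran_deps)
  if (length(cran_needed) > 0) install.packages(cran_needed)
}
```

Checking for missing packages first avoids reinstalling dependencies that are already present.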

Users can also install the latest development version from GitHub. This requires the devtools package, which can be installed with install.packages("devtools") if it is not already on your system. Note that devtools sometimes needs extra non-R software: Rtools on Windows, or Xcode on OS X. More information is available in the devtools documentation.

After installation, NFP can be loaded into the current workspace with library(NFP).

Moreover, the gene similarity data used by NFP are stored in an external data repository, NFPdata (https://github.com/yiluheihei/datarepo), because of their large size (about 16 MB). For details on how to construct external data repositories using the Additional_repositories field, see The Coatless Professor blog post. Users must therefore install NFPdata before running a network fingerprint analysis.

Analysis Pipeline: from Basic Reference Network Generation to Network Fingerprint Visualization

We will go through an analysis pipeline to illustrate some of the main functions in NFP. This pipeline consists of several steps:

  1. Basic reference network generation: prepare well-known biomedical networks as the reference networks of the NFP framework. Several pathway databases have been developed for biological network research, e.g. KEGG and Reactome (https://reactome.org). All of these pathway databases are well studied and can be used as the basic reference networks of NFP.
  2. Network fingerprint calculation: the similarity between two biomedical networks is calculated based on the following intuition: the nodes of the merged network are grouped into strongly inter-connected communities whose member nodes, drawn from the different networks, have high functional similarity scores. Functional similarity is measured based on GO (Ashburner et al. 2000), and the affinity propagation (AP) clustering algorithm (Frey and Dueck 2007) is used to detect the aligned functional modules between the two networks being compared.
  3. Network fingerprint visualization: show the network fingerprint along all the reference networks. The visualization makes the differences among the fingerprints of biological networks easy to observe.

Generating well-studied basic reference networks

The basic idea of calculating the network fingerprint is to map biomedical networks onto well-studied basic networks. KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge of molecular interaction and reaction networks. Since its introduction in 1995, KEGG PATHWAY has been widely used as a reference knowledge base for understanding biological pathways and the functions of cellular processes. The knowledge in KEGG has proven to be of great value in numerous studies across a wide range of fields (Kanehisa et al. 2007). NFP therefore takes KEGG pathways as its basic reference networks by default.

The function load_KEGG_refnet() retrieves the KEGG pathway maps through the KEGG API (http://www.kegg.jp/kegg/rest/keggapi.html). In KEGG, only the reference pathway maps are manually drawn; the organism-specific pathways are generated automatically from these reference maps. The organism parameter (e.g. organism = "hsa") specifies the organism whose KEGG pathway maps are retrieved.

We defined a new S4 class, NFPRefnet, to store the NFP reference networks. NFP also provides five methods for this S4 class:

  1. net(): Extract the basic reference networks of an NFPRefnet object.
  2. group(): Obtain the group information: group names, the number of groups and the size of each group. For example, the KEGG pathway database contains seven groups of pathway maps.
  3. subnet(): Extract or replace parts of the NFP basic reference networks.
  4. show(): Display an NFPRefnet object.
  5. name(): Extract the names of the reference networks.

For detailed instructions on these five methods, refer to the help pages of the package functions.

Users can also customize an NFPRefnet object as the reference for computing network fingerprints. The documentation of NFPRefnet describes the composition of this class in detail. The graphite package (Sales et al. 2012) allows users to build graphNEL objects from several pathway databases.

Users can then create their own customized NFPRefnet object from such networks.

Network fingerprint calculation

The NFP algorithm consists of three steps: network merging, node clustering and similarity scoring.

Network merging. The two networks to be compared are first merged into one. Given two networks G1 and G2, the merged network Gm is constructed by connecting the nodes of G1 with the nodes of G2. In subsequent processing, two nodes that correspond to the same protein are replaced by a single node that inherits all the interactions of the two individual nodes.
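The collapsing of shared nodes can be illustrated with a small base R sketch. It is a simplification under stated assumptions, not NFP's implementation: networks are plain edge lists, node names stand in for protein identifiers, the links connecting G1 nodes with G2 nodes are not modelled, and merge_networks() and neighbours() are hypothetical helper names.

```r
# Toy edge lists; node "C" occurs in both networks.
g1 <- data.frame(from = c("A", "B"), to = c("B", "C"), stringsAsFactors = FALSE)
g2 <- data.frame(from = c("C", "D"), to = c("D", "E"), stringsAsFactors = FALSE)

# Hypothetical helper: union of the two edge lists. Nodes with the same
# identifier collapse into one node because they share a name.
merge_networks <- function(e1, e2) {
  edges <- unique(rbind(e1, e2))
  nodes <- sort(unique(c(edges$from, edges$to)))
  list(nodes = nodes, edges = edges)
}

# Hypothetical helper: neighbours of a node, ignoring edge direction.
neighbours <- function(net, node) {
  e <- net$edges
  sort(unique(c(e$to[e$from == node], e$from[e$to == node])))
}

gm <- merge_networks(g1, g2)
gm$nodes
neighbours(gm, "C")   # interactions inherited from both g1 and g2
```

In the merged toy network the shared node keeps its neighbour from g1 as well as its neighbour from g2.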

Clustering in the merged network. The nodes of the merged network are grouped into strongly inter-connected communities whose member nodes, drawn from the different networks, have high functional similarity scores. We employ the affinity propagation (AP) clustering algorithm to detect the aligned functional modules between the two networks being compared. Nodes are assigned to clusters based on nearest-neighbor analysis.
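To make the message passing behind AP concrete, here is a compact base R sketch of the responsibility and availability updates described by Frey and Dueck (2007). It is an illustration only: NFP lists the apcluster package among its dependencies, whereas affinity_propagation() below is a hypothetical stand-alone function, and the toy similarities are negative squared distances between points on a line rather than GO-based functional similarities.

```r
# Toy data: two well-separated groups of points on a line.
x <- c(1, 2, 3, 10, 11, 12)
S <- -outer(x, x, function(a, b) (a - b)^2)  # similarity = -squared distance
diag(S) <- median(S[upper.tri(S)])           # shared preference for all nodes

affinity_propagation <- function(S, damping = 0.9, iterations = 200) {
  n <- nrow(S)
  rows <- seq_len(n)
  R <- matrix(0, n, n)  # responsibilities r(i, k)
  A <- matrix(0, n, n)  # availabilities  a(i, k)
  for (iter in rows[1]:iterations) {
    # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
    AS <- A + S
    best <- max.col(AS, ties.method = "first")
    best_val <- AS[cbind(rows, best)]
    AS[cbind(rows, best)] <- -Inf
    second_val <- apply(AS, 1, max)
    R_new <- S - best_val  # subtracts best_val[i] from every entry of row i
    R_new[cbind(rows, best)] <- S[cbind(rows, best)] - second_val
    R <- damping * R + (1 - damping) * R_new

    # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k})
    # a(k,k) = sum of positive r(i',k) for i' != k
    Rp <- pmax(R, 0)
    diag(Rp) <- diag(R)
    A_new <- matrix(colSums(Rp), n, n, byrow = TRUE) - Rp
    self <- diag(A_new)
    A_new <- pmin(A_new, 0)
    diag(A_new) <- self
    A <- damping * A + (1 - damping) * A_new
  }
  # Each node chooses the exemplar k that maximizes a(i,k) + r(i,k).
  max.col(R + A, ties.method = "first")
}

labels <- affinity_propagation(S)
split(x, labels)  # points grouped by the exemplar they chose
```

The damping factor slows the updates to avoid oscillation; the diagonal of S (the preference) controls how many exemplars, and therefore clusters, emerge.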

Similarity scoring. The similarity score is calculated in two steps. First, a local similarity is computed for each cluster, and these are combined into a network similarity across clusters. Second, the score is standardized: the original similarity score depends to some extent on the topological properties of the query network, which introduces an implicit bias into the network fingerprint, because outliers could greatly distort the pattern it presents. To eliminate possible differences in topological weight, the similarity calculation is standardized, and the final network fingerprint presented to users is fully standardized. The standardization is based on the distribution of similarity scores obtained from random networks. The random networks used for this estimate match the original network in three topological properties: the number of nodes, the number of edges and the node degrees.

To improve the efficiency of fingerprint calculation without affecting the standardization results, we limit the number of permutations used for background network randomization (the default is 100). Users can adjust the number of randomizations according to the precision they require from the network fingerprint.
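One common way to standardize a score against a random background is a z-score. The sketch below assumes that form for illustration only; the exact formula used by NFP is not stated in this section, standardize_score() is a hypothetical helper, and the numbers are made up rather than results.

```r
# Hypothetical helper: z-score of an observed similarity against the
# similarity scores obtained from randomized background networks.
standardize_score <- function(observed, random_scores) {
  (observed - mean(random_scores)) / sd(random_scores)
}

# Made-up scores from five randomizations (NFP's default is 100 permutations).
random_scores <- c(0.10, 0.12, 0.08, 0.11, 0.09)
observed <- 0.25

z <- standardize_score(observed, random_scores)
z
```

More permutations give a more stable estimate of the background mean and standard deviation, at the cost of a longer computation.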

NFP provides the calc_sim_score() function for calculating the similarity score between two networks. Because the similarity score depends on the size of the network, we use the method of Maslov and Sneppen (2002) to randomize a network while preserving its degree distribution. The nperm parameter of calc_sim_score() sets the number of random-network permutations (the default is 100) used when calculating the similarity score described above. We define an S4 class, NFP, in our package to store the results of network fingerprint calculations.

As a simple example, one pathway map can be taken as the query network and a subset of the networks in kegg_refnet as the reference networks; the network fingerprint is then calculated with calc_sim_score().
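Degree-preserving randomization in the style of Maslov and Sneppen can be sketched as repeated edge swaps: two edges (u1, v1) and (u2, v2) are rewired to (u1, v2) and (u2, v1), unless that would create a self-loop or a duplicate edge. The base R code below is an illustrative sketch, not the routine used inside calc_sim_score(); rewire_preserving_degree() and node_degree() are hypothetical names, and the graph is a toy undirected ring.

```r
# Hypothetical helper: degree-preserving rewiring of an undirected simple
# graph given as a two-column integer matrix of edges.
rewire_preserving_degree <- function(edges, n_swaps = 100) {
  edge_key <- function(u, v) paste(pmin(u, v), pmax(u, v))
  for (i in seq_len(n_swaps)) {
    pick <- sample(nrow(edges), 2)
    u1 <- edges[pick[1], 1]; v1 <- edges[pick[1], 2]
    u2 <- edges[pick[2], 1]; v2 <- edges[pick[2], 2]
    # Skip swaps that would create a self-loop.
    if (u1 == v2 || u2 == v1) next
    # Skip swaps that would duplicate an existing edge.
    existing <- edge_key(edges[, 1], edges[, 2])
    if (edge_key(u1, v2) %in% existing || edge_key(u2, v1) %in% existing) next
    edges[pick[1], ] <- c(u1, v2)
    edges[pick[2], ] <- c(u2, v1)
  }
  edges
}

# Hypothetical helper: degree of nodes 1..n from the edge matrix.
node_degree <- function(edges, n) tabulate(as.vector(edges), nbins = n)

set.seed(1)
ring <- cbind(1:8, c(2:8, 1L))  # toy network: a ring of 8 nodes
randomized <- rewire_preserving_degree(ring, n_swaps = 50)

rbind(original = node_degree(ring, 8),
      randomized = node_degree(randomized, 8))
```

Each accepted swap removes one edge end and adds one edge end at every node involved, so the number of nodes, the number of edges and every node degree stay unchanged, which matches the three properties the standardization requires.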

Network fingerprint visualization

NFP provides the plot_NFP() function to visualize the network fingerprint of a single query network.

Session Information

The versions of R and of the packages loaded to generate this vignette were:

#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] graphite_1.30.0     NFP_0.99.3          graph_1.62.0       
#> [4] BiocGenerics_0.30.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_0.2.5     xfun_0.9             purrr_0.3.2         
#>  [4] lattice_0.20-38      colorspace_1.4-1     vctrs_0.2.0         
#>  [7] htmltools_0.3.6      stats4_3.6.1         yaml_2.2.0          
#> [10] blob_1.2.0           XML_3.98-1.20        rlang_0.4.0         
#> [13] pillar_1.4.2         glue_1.3.1           DBI_1.0.0           
#> [16] rappdirs_0.3.1       bit64_0.9-7          lifecycle_0.1.0     
#> [19] plyr_1.8.4           stringr_1.4.0        munsell_0.5.0       
#> [22] gtable_0.3.0         memoise_1.1.0.9000   evaluate_0.14       
#> [25] Biobase_2.44.0       knitr_1.24           IRanges_2.18.2      
#> [28] AnnotationDbi_1.46.1 Rcpp_1.0.2           scales_1.0.0        
#> [31] backports_1.1.4      checkmate_1.9.4      formatR_1.7         
#> [34] S4Vectors_0.22.0     apcluster_1.4.8      bit_1.1-14          
#> [37] png_0.1-7            ggplot2_3.2.1        digest_0.6.21       
#> [40] stringi_1.4.3        dplyr_0.8.3          grid_3.6.1          
#> [43] tools_3.6.1          bitops_1.0-6         magrittr_1.5        
#> [46] RCurl_1.95-4.12      lazyeval_0.2.2       tibble_2.1.3        
#> [49] RSQLite_2.1.2        crayon_1.3.4         tidyr_1.0.0         
#> [52] pkgconfig_2.0.3      zeallot_0.1.0        Matrix_1.2-17       
#> [55] KEGGgraph_1.44.0     assertthat_0.2.1     rmarkdown_1.15      
#> [58] httr_1.4.1           R6_2.4.0             igraph_1.2.4.1      
#> [61] compiler_3.6.1

References

Ashburner, Michael, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology.” Nature Genetics 25 (1). Nature Publishing Group: 25.

Cui, Xiuliang, Haochen He, Fuchu He, Shengqi Wang, Fei Li, and Xiaochen Bo. 2015. “Network Fingerprint: A Knowledge-Based Characterization of Biomedical Networks.” Scientific Reports 5. Nature Publishing Group: 13286.

Frey, Brendan J, and Delbert Dueck. 2007. “Clustering by Passing Messages Between Data Points.” Science 315 (5814). American Association for the Advancement of Science: 972–76.

Kanehisa, Minoru, Michihiro Araki, Susumu Goto, Masahiro Hattori, Mika Hirakawa, Masumi Itoh, Toshiaki Katayama, et al. 2007. “KEGG for Linking Genomes to Life and the Environment.” Nucleic Acids Research 36 (suppl_1). Oxford University Press: D480–D484.

Maslov, Sergei, and Kim Sneppen. 2002. “Specificity and Stability in Topology of Protein Networks.” Science 296 (5569). American Association for the Advancement of Science: 910–13.

Sales, Gabriele, Enrica Calura, Duccio Cavalieri, and Chiara Romualdi. 2012. “Graphite - a Bioconductor Package to Convert Pathway Topology to Gene Network.” BMC Bioinformatics 13 (1). BioMed Central: 20.