Extending cooccurNet

This vignette documents the official extension mechanism provided in cooccurNet 0.1.4. Most of this information is available scattered throughout the R documentation. This vignette brings it all together in one place.

As you read this document, you’ll learn to what the cooccurNet is, how to run the methods, and how to interpret, plot and report outputs.

Introduction

cooccurNet is an R package including functionalities of construction and analysis of residue (amino acid, nucleotide, SNP and so on) co-occurrence network. Besides, a new method for measuring residues coevolution, defined as residue co-occurrence score (RCOS), was proposed and implemented in it based on the co-occurrence network.

The co-occurrence network method was originally developed to capture the nucleotide co-occurrence pattern over the entire genome of human influenza H3N2 viruses [1]. It could effectively capture the evolutionary antigenic features of H3N2 virus at the whole-genome level and accurately describe the complex evolutionary patterns between individual gene segments. Besides influenza viruses, the co-occurrence network method was also used to Zaire ebolavirus (EBOV). In this work [2], we found that the features of co-occurrence networks (edges and nodes) built over the EBOV genome could accurately reflect the case fatality rate (CFR) of EBOV, which provided a novel and rapid way of assessing the CFR of EBOV even at the very beginning of the outbreak once its genome sequence is available. This suggests the potential usage of the co-occurrence network method in modeling the evolution of genomes.

As we know, the genomes contain all the genetic information of an organism, the evolution of which is very complex. Currently, there is a lack of computational models for capturing the complex evolutionary features over the entire genome. Coevolution is prevalent in biology and it is an important driving force in evolution. The co-occurrence network method could capture the coevolution linkage between residues within the genome. It is unique in that it could transform a genome into a residue co-occurrence network, the features of which, such as the nodes, edges, connectivity and so on, may reflect important features of organisms, such as the antigenicity [1] and CFR [2] in the examples mentioned above.

Actually, the co-occurrence network method is a general method which could be used to capture the co-occurrence pattern within samples. Besides genomes, it could be directly used in gene, protein, proteome, and so on. Due to a lack of a user-friendly software, the method is rarely used in other data type except for genomes. Based on the package “cooccurNet”, we believe the method could be widely used in biology.

The RCOS method was newly proposed in the package. It is a method for measuring the association between two discrete variables. It could be used to measure the extent of residue coevolution in protein/DNA/RNA/SNP. Residue coevolution reflects the functional and structural constraints within a protein or gene. It could help for protein structure or function prediction [3-6]. Many computational methods have been developed in measure residue coevolution. Our tests show that the RCOS method achieved a similar performance with a state-of-the-art method PSICOV on a benchmark dataset [7]. Therefore, we believe the RCOS method could become an important method in measuring residue coevolution.

There are two important parameters for both the methods of co-occurrence network and RCOS in the package “cooccurNet”. The first parameter, called as “conservativeFilter”, is used to filter the highly conservative sites for which the ratio of some residue is larger than the “conservativeFilter”. By default, it is set to be 0.95 for all data types. The second parameter, called as “cooccurFilter”, is used to determine whether a pair of residues has co-occurrence. According to the methods of co-occurrence network and RCOS, the smaller the “cooccurFilter” is, more pairs of residues would have co-occurrence and more false positives would be observed. To investigate the influence of this cutoff on the RCOS method, we used the RCOS method with cutoffs ranging from 0.8 to 1 to identify the structural constraints between residues based on the benchmark dataset derived from Jones’s work [5]. As shown in Table 1 below, as the cutoff increased, the number of residue pairs with significant co-occurrence decreased, while the ratio of residue pairs in contact among the top L/10, L/5, L/2 and L ranked pairs of residues increased. When the cutoff was set to be 1, only a few residue pairs with significant co-occurrence could be identified for most Pfam families. For balance of these two measures, we defined this cutoff to be 0.9 in default for the protein in the package. While for the other data types, such as DNA, RNA and SNP, “cooccurFilter” was set to be 1 in default. Both the parameters “conservativeFilter” and “cooccurFilter” could be re-set in the package.

Table 1: Performance of the RCOS method with the cutoff of “cooccurFilter” ranging from 0.8 to 1. For each value of the cutoff, it lists the average number of pairs of residues with significant co-occurrence (p-value < 0.05) identified by the RCOS method, and the ratios of residue pairs in contact among the top L/10, L/5, L/2 and L ranked pairs of residues for 150 Pfam families by the method of RCOS. “*“, not applicable due to fewer than L/10 pairs of residues with significant co-occurrence were identified.

Co-occurFilter	Average number of co-occurrence pairs	L/10	L/5	L/2	L
0.8	221	0.59	0.56	0.47	0.36
0.85	153	0.66	0.62	0.49	0.37
0.9	95	0.76	0.69	0.54	0.42
0.95	51	0.85	0.75	0.57	0.45
1	6	-*	-	-	-

References:

[1] Du X, Wang Z, Wu A, et al. Networks of genomic co-occurrence capture characteristics of human influenza A (H3N2) evolution[J]. Genome research, 2008, 18(1): 178-187.

[2] Deng L, Liu M, Hua S, et al. Network of co-mutations in Ebola virus genome predicts the disease lethality[J]. Cell research, 2015, 25(6): 753.

[3] Lee B C, Park K, Kim D. Analysis of the residue–residue coevolution network and the functionally important residues in proteins[J]. Proteins: Structure, Function, and Bioinformatics, 2008, 72(3): 863-872.

[4] Aurora R, Donlin M J, Cannon N A, et al. Genome-wide hepatitis C virus amino acid covariance networks can predict response to antiviral therapy in humans[J]. The Journal of clinical investigation, 2009, 119(1): 225-236.

[5] Jones D T, Buchan D W A, Cozzetto D, et al. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments[J]. Bioinformatics, 2012, 28(2): 184-190.

[6] Wang S, Li W, Zhang R, et al. CoinFold: a web server for protein contact prediction and contact-assisted protein folding[J]. Nucleic acids research, 2016: gkw307.

[7] Zou Y, Wu Z, et al. cooccurNet: an R package for co-occurrence network construction and analysis. Bioinformatics (In submission).

Preliminaries

OS/Memory

cooccurNet compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
System memory should ≥ 2Gb. See section “Optimization” for more details.

How to get help

Most questions about cooccurNet will hopefully be answered by the documentation or references.

cooccurNet is updated frequently. Once you have installed cooccurNet, You can get help from the change-log.

logs <- changeLog(n=20)
#> 
logs
#>  [1] "2016-12-16 Yuanqiang Zou  <jerrytsou2001@gmail.com>"           
#>  [2] ""                                                              
#>  [3] "\tDocument synchronization and test outputs."                  
#>  [4] ""                                                              
#>  [5] "2017-01-10 Yuanqiang Zou  <jerrytsou2001@gmail.com>"           
#>  [6] ""                                                              
#>  [7] "\tModifiy documentaion and import a new package \"doParallel\""
#>  [8] ""                                                              
#>  [9] "2017-01-27 Yuanqiang Zou  <jerrytsou2001@gmail.com>"           
#> [10] ""                                                              
#> [11] "\tOptimized the fuction \"toigraph\""

If you’ve run into a problem which could not be addressed by the documentation, or if you’ve found a conflict between the documentation and software itself, then please write to us: Yuanqiang Zou jerrytsou2001@gmail.com or Yousong Peng pys2013@hnu.edu.cn.

Quick Start

In this section, you’ll learn to how to run the methods by using the sample data.

Get sample sequence file

The parameter dataType could be ‘DNA’, ‘protein’, ‘SNP’, ‘RNA’. It’s ‘protein’ by default.

Get protein sample data file name with full path

dataFile=getexample(dataType="protein")
#> 
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/protein"

Get DNA sample data file name with full path

dataFile=getexample(dataType="DNA")
#> 
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/DNA"

Get SNP sample data file name with full path

dataFile=getexample(dataType="SNP")
#> 
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/SNP"

Get RNA sample data file name with full path

dataFile=getexample(dataType="RNA")
#> 
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/RNA"

Get sample RCOS protein sequence file names (sample human H3N2 HA protein sequence file names with full path)

dataFiles=getexample_forRCOS()
#> 
dataFiles
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/RCOS/HA1protein_humanH3N2"

Read FASTA format sequence file

To save the RAM usage by the package, we have designed the method of using prime number to stand for the nucleotides or amino acid in the data inputted. As is known to us, in the R environment an integer vector uses 4 bytes per number, while a character vector uses 8 bytes per character. Using prime number to stand for character could save half of RAM usage for storing the data inputted. Besides, when calculating the frequency of a pair of characters (nucleotide or amino acid), it is necessary to concatenate the characters in R. After replacing the characters with prime numbers, for a pair of prime numbers A and B, we found that unique number could be derived using the following formula for prime numbers (starting from 2):

           (round(A/B,5)+B)*100000                    (1)

The resulting values have a one-to-one correspondence with the pairs of nucleotide or amino acid. Therefore, the concentration of character vectors could be transformed to algebraic operation of prime number vectors, which could significantly reduce the time consumed in this process.

The parameter dataFile is a FASTA data file name with full path.

Read protein sample data

data = readseq(dataFile=getexample(dataType="protein"), dataType="protein")
#> Reading file......
#> 
#> completed. (10,329)
#> 
data$original[1:10,1:10]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] "Q"  "K"  "I"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [2,] "Q"  "K"  "L"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [3,] "Q"  "K"  "I"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [4,] "Q"  "K"  "I"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [5,] "Q"  "K"  "L"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [6,] "Q"  "K"  "V"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [7,] "Q"  "K"  "L"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [8,] "Q"  "K"  "I"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#>  [9,] "Q"  "K"  "L"  "P"  "G"  "N"  "D"  "N"  "S"  "T"  
#> [10,] "Q"  "K"  "L"  "P"  "G"  "N"  "D"  "N"  "S"  "T"
data$matrix[1:10,1:10]
#>        1  2  3  4  5  6 7  8  9 10
#>  [1,] 43 23 19 41 13 37 5 37 53 59
#>  [2,] 43 23 29 41 13 37 5 37 53 59
#>  [3,] 43 23 19 41 13 37 5 37 53 59
#>  [4,] 43 23 19 41 13 37 5 37 53 59
#>  [5,] 43 23 29 41 13 37 5 37 53 59
#>  [6,] 43 23 61 41 13 37 5 37 53 59
#>  [7,] 43 23 29 41 13 37 5 37 53 59
#>  [8,] 43 23 19 41 13 37 5 37 53 59
#>  [9,] 43 23 29 41 13 37 5 37 53 59
#> [10,] 43 23 29 41 13 37 5 37 53 59

Read DNA sample data

data = readseq(dataFile=getexample(dataType="DNA"), dataType="DNA")
#> Reading file......
#> 
#> completed. (213,987)
#> 
data$original[1:10,1:10]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [2,] "C"  "A"  "A"  "A"  "A"  "C"  "C"  "T"  "T"  "C"  
#>  [3,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [4,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [5,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [6,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [7,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [8,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#>  [9,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"  
#> [10,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "T"  "T"  "C"
data$matrix[1:10,1:10]
#>       1 2 3 4 5 6 7 8 9 10
#>  [1,] 3 2 2 2 2 2 3 7 7  3
#>  [2,] 3 2 2 2 2 3 3 7 7  3
#>  [3,] 3 2 2 2 2 2 3 7 7  3
#>  [4,] 3 2 2 2 2 2 3 7 7  3
#>  [5,] 3 2 2 2 2 2 3 7 7  3
#>  [6,] 3 2 2 2 2 2 3 7 7  3
#>  [7,] 3 2 2 2 2 2 3 7 7  3
#>  [8,] 3 2 2 2 2 2 3 7 7  3
#>  [9,] 3 2 2 2 2 2 3 7 7  3
#> [10,] 3 2 2 2 2 2 3 7 7  3

Read SNP sample data

data = readseq(dataFile=getexample(dataType="SNP"), dataType="SNP")
#> Reading file......
#> 
#> completed. (100,50)
#> 
data$original[1:10,1:10]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] "1"  "0"  "1"  "1"  "0"  "1"  "0"  "0"  "0"  "0"  
#>  [2,] "0"  "1"  "0"  "0"  "1"  "1"  "0"  "1"  "1"  "0"  
#>  [3,] "1"  "0"  "1"  "1"  "1"  "0"  "0"  "0"  "1"  "0"  
#>  [4,] "1"  "1"  "0"  "0"  "1"  "1"  "0"  "1"  "0"  "1"  
#>  [5,] "0"  "0"  "0"  "1"  "0"  "1"  "0"  "0"  "1"  "0"  
#>  [6,] "0"  "0"  "1"  "1"  "1"  "1"  "0"  "0"  "0"  "0"  
#>  [7,] "0"  "1"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "1"  
#>  [8,] "1"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "1"  "0"  
#>  [9,] "1"  "1"  "0"  "1"  "0"  "1"  "1"  "0"  "1"  "0"  
#> [10,] "0"  "1"  "1"  "0"  "1"  "0"  "0"  "1"  "0"  "1"
data$matrix[1:10,1:10]
#>       1 2 3 4 5 6 7 8 9 10
#>  [1,] 3 2 3 3 2 3 2 2 2  2
#>  [2,] 2 3 2 2 3 3 2 3 3  2
#>  [3,] 3 2 3 3 3 2 2 2 3  2
#>  [4,] 3 3 2 2 3 3 2 3 2  3
#>  [5,] 2 2 2 3 2 3 2 2 3  2
#>  [6,] 2 2 3 3 3 3 2 2 2  2
#>  [7,] 2 3 2 2 2 2 2 2 2  3
#>  [8,] 3 2 2 2 2 2 2 2 3  2
#>  [9,] 3 3 2 3 2 3 3 2 3  2
#> [10,] 2 3 3 2 3 2 2 3 2  3

Read RNA sample data

data = readseq(dataFile=getexample(dataType="RNA"), dataType="RNA")
#> Reading file......
#> 
#> completed. (45,987)
#> 
data$original[1:10,1:10]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [2,] "C"  "A"  "A"  "A"  "A"  "C"  "C"  "U"  "U"  "C"  
#>  [3,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [4,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [5,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [6,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [7,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [8,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#>  [9,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"  
#> [10,] "C"  "A"  "A"  "A"  "A"  "A"  "C"  "U"  "U"  "C"
data$matrix[1:10,1:10]
#>       1 2 3 4 5 6 7 8 9 10
#>  [1,] 3 2 2 2 2 2 3 7 7  3
#>  [2,] 3 2 2 2 2 3 3 7 7  3
#>  [3,] 3 2 2 2 2 2 3 7 7  3
#>  [4,] 3 2 2 2 2 2 3 7 7  3
#>  [5,] 3 2 2 2 2 2 3 7 7  3
#>  [6,] 3 2 2 2 2 2 3 7 7  3
#>  [7,] 3 2 2 2 2 2 3 7 7  3
#>  [8,] 3 2 2 2 2 2 3 7 7  3
#>  [9,] 3 2 2 2 2 2 3 7 7  3
#> [10,] 3 2 2 2 2 2 3 7 7  3

Pre-process the data inputted

The parameter conservativeFilter is used to filter the highly conservative columns which the ratio of some residue is larger than the conservationFilter. By default, it is set to be 0.95.

data = readseq(dataFile=getexample(dataType="protein"), dataType="protein")
#> Reading file......
#> 
#> completed. (10,329)
#> 
data_process = pprocess(data=data,conservativeFilter=0.95)
#> Pre-processing......
#> 
#> Calculating bigram frequency ......
#> 
#> Pre-process......completed. (10,42)
#>

Generate co-occurrence network and save it into ‘networkFile’

For each sequence, a residue co-occurrence network would be generated. It measures the covariations between residues in the sequence. The features and properties of the network may capture important features of species, such as the antigenicity and CFR mentioned above in the section of introduction.

The parameter cooccurFilter is used to determine whether a pair of residues have perfect co-occurrence. By default, it is set to be 0.90 for the data type of protein, while it is set to be 1 for the data type of DNA, RNA, SNP and so on.

The parameter networkFile is a file name with full path for storing the co-occurrence network for each row. In default, it is set to be ‘cooccurNetwork’, which is in your current workspace (getwd())

cooccurNetwork = gencooccur(data=data_process, cooccurFilter=0.9, networkFile='cooccurNetwork')
#> 
#> 1. calculating networks ......
#> 
#> 2. creating network file (cooccurNetwork) ......completed
#> 
workspace = getwd()
workspace
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rbuild146832c62740/cooccurNet/vignettes"
readLines(cooccurNetwork$networkFile)
#>  [1] "EPI823725 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                                                
#>  [2] "EPI775768 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261 267-312"                                                                                  
#>  [3] "EPI814365 3-144 3-159 3-311 25-121 25-216 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                          
#>  [4] "EPI787403 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                                        
#>  [5] "EPI787049 25-208 25-238 25-261 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"                                                                     
#>  [6] "EPI775712 3-140 3-158 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"                                                                              
#>  [7] "EPI777683 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"                                                                                                  
#>  [8] "EPI787395 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                                                
#>  [9] "EPI731085 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"                                                                                          
#> [10] "EPI748184 43-45 43-48 43-144 43-186 43-219 43-267 43-278 45-48 45-144 45-186 45-219 45-267 45-278 48-144 48-186 48-219 48-267 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-186 144-219 144-267 144-278 159-225 186-219 186-267 186-278 208-238 208-261 219-267 219-278 238-261 267-278"

Generate co-occurrence network file by one command

cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, networkFile='cooccurNetwork')
#> reading file......
#> 
#> completed. (10,329)
#> 
#> preprocess file......
#> 
#> Calculating bigram frequency ......
#> 
#> preprocess file......completed. (10,42)
#> 
#> 1. calculating networks ......
#> 
#> 2. creating network file (cooccurNetwork) ......completed
#> 
readLines(cooccurNetwork$networkFile)
#>  [1] "EPI823725 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                                                
#>  [2] "EPI775768 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261 267-312"                                                                                  
#>  [3] "EPI814365 3-144 3-159 3-311 25-121 25-216 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                          
#>  [4] "EPI787403 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                                        
#>  [5] "EPI787049 25-208 25-238 25-261 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"                                                                     
#>  [6] "EPI775712 3-140 3-158 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"                                                                              
#>  [7] "EPI777683 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"                                                                                                  
#>  [8] "EPI787395 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"                                                                
#>  [9] "EPI731085 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"                                                                                          
#> [10] "EPI748184 43-45 43-48 43-144 43-186 43-219 43-267 43-278 45-48 45-144 45-186 45-219 45-267 45-278 48-144 48-186 48-219 48-267 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-186 144-219 144-267 144-278 159-225 186-219 186-267 186-278 208-238 208-261 219-267 219-278 238-261 267-278"

Generate co-occurrence network module file by one command

This would calculate all the modules (sub-graphs) included in each residue co-occurrence network.

The parameter module is a boolean flag to check whether the modules in each network of the networkFile would be calculated.

The parameter moduleFile is a file name with full path for storing the modules for co-occurrence network. It’s ‘cooccurNetworkModule’ by default.

cooccurNetwork  = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, module = TRUE, moduleFile='cooccurNetworkModule')
#> reading file......
#> 
#> completed. (10,329)
#> 
#> preprocess file......
#> 
#> Calculating bigram frequency ......
#> 
#> preprocess file......completed. (10,42)
#> 
#> 1. calculating networks ......
#> 
#> 2. creating network file (cooccurNetwork) ......completed
#> 
#> 3. creating module file (cooccurNetworkModule) ......completed
#> 
readLines(cooccurNetwork$moduleFile)
#>  [1] "EPI823725 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (208,238,261)"              
#>  [2] "EPI775768 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326) (208,238,261) (267,312)"
#>  [3] "EPI814365 (3,144,159,311) (25,121,216) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (140,158) (142,197) (208,238,261)" 
#>  [4] "EPI787403 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (142,197) (208,238,261)"    
#>  [5] "EPI787049 (25,208,238,261) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326)"       
#>  [6] "EPI775712 (3,140,158) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (159,225) (208,238,261)"        
#>  [7] "EPI777683 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (159,225) (208,238,261)"                    
#>  [8] "EPI787395 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (208,238,261)"              
#>  [9] "EPI731085 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326) (208,238,261)"          
#> [10] "EPI748184 (43,45,48,144,186,219,267,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (159,225) (208,238,261)"

Generate co-occurrence network property file by one command

The parameter property is a boolean flag to check whether the properties for each network of the networkFile, including the network diameter, connectivity, ConnectionEffcient and so on, would be calculated.

The parameter propertyFile is file name with full path storing the properties for each network of the networkFile. It’s ‘cooccurNetworkProperty’ by default.

cooccurNetwork  = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, property = TRUE, propertyFile='cooccurNetworkProperty')
#> reading file......
#> 
#> completed. (10,329)
#> 
#> preprocess file......
#> 
#> Calculating bigram frequency ......
#> 
#> preprocess file......completed. (10,42)
#> 
#> 1. calculating networks ......
#> 
#> 2. creating network file (cooccurNetwork) ......completed
#> 
#> 3. creating property file (cooccurNetworkProperty) ......completed
#> 
readLines(cooccurNetwork$propertyFile)[1:5]
#> [1] "name  Connectivity  Diameter  Radius  ConnectionEffcient"
#> [2] "EPI823725 0 1 1 1"                                       
#> [3] "EPI775768 0 1 1 1"                                       
#> [4] "EPI814365 0 1 1 1"                                       
#> [5] "EPI787403 0 1 1 1"

Generate the residue co-occurence file by one command

This is to calculate the pairwise residue co-occurrence score (RCOS) for all pairs of positions within the sequences. To evaluate the statistical significance of RCOS, the data would be permutated N times (N=100 in default). This process would take a long time. The RCOS was calculated in the permutated data and sorted decreasingly. The rank of the RCOS derived from the original data in the RCOSs derived from the permutated data divided by N was considered as the p-value for the ROCS.

The parameter siteCoFile is file name with full path for storing the RCOS between all pairs of columns, and the related p-values. It’s ‘siteCooccurr’ by default.

The parameter sampleTimes is an integer of permutations in the simulation when calculating the p-values. It should be greater than 100.

In the example listed below, we assign 10 to the parameter sampleTimes only for testing purpose. Normally, It should be greater than 100.

pairwiseCooccur = siteco(dataFile=getexample(dataType="protein"), dataType="protein", conservativeFilter=0.95, cooccurFilter=0.9, siteCoFile='siteCooccurr', sampleTimes=10)
#> reading file......
#> 
#> completed. (10,329)
#> 
#> preprocess file......
#> 
#> Calculating bigram frequency ......
#> 
#> preprocess file......completed. (10,42)
#> 
#> calculating cooccurrence..........
#> 
#> 1. creating siteCoFile (siteCooccurr), sampleTimes: 10 ......
#> 
#> [1] 1 3
#> 
#> [1] 4 6
#> 
#> [1]  7 10
#> completed
#> 
readLines(pairwiseCooccur$siteCoFile)[1:5]
#> [1] "Site_i Site_j Co-occur p-value" "3 25 0 1"                      
#> [3] "3 33 0 1"                       "3 43 0 1"                      
#> [5] "3 45 0 1"

Generate the co-occurrence network, network module, network property and RCOS file by one command

The parameter siteCo is a boolean flag to check whether the residue co-occurence file would be calculated.