This vignette documents the official extension mechanism provided in cooccurNet 0.1.4. Most of this information is available scattered throughout the R documentation. This vignette brings it all together in one place.
As you read this document, you’ll learn to what the cooccurNet is, how to run the methods, and how to interpret, plot and report outputs.
cooccurNet is an R package including functionalities of construction and analysis of residue (amino acid, nucleotide, SNP and so on) co-occurrence network. Besides, a new method for measuring residues coevolution, defined as residue co-occurrence score (RCOS), was proposed and implemented in it based on the co-occurrence network.
The co-occurrence network method was originally developed to capture the nucleotide co-occurrence pattern over the entire genome of human influenza H3N2 viruses [1]. It could effectively capture the evolutionary antigenic features of H3N2 virus at the whole-genome level and accurately describe the complex evolutionary patterns between individual gene segments. Besides influenza viruses, the co-occurrence network method was also used to Zaire ebolavirus (EBOV). In this work [2], we found that the features of co-occurrence networks (edges and nodes) built over the EBOV genome could accurately reflect the case fatality rate (CFR) of EBOV, which provided a novel and rapid way of assessing the CFR of EBOV even at the very beginning of the outbreak once its genome sequence is available. This suggests the potential usage of the co-occurrence network method in modeling the evolution of genomes.
As we know, the genomes contain all the genetic information of an organism, the evolution of which is very complex. Currently, there is a lack of computational models for capturing the complex evolutionary features over the entire genome. Coevolution is prevalent in biology and it is an important driving force in evolution. The co-occurrence network method could capture the coevolution linkage between residues within the genome. It is unique in that it could transform a genome into a residue co-occurrence network, the features of which, such as the nodes, edges, connectivity and so on, may reflect important features of organisms, such as the antigenicity [1] and CFR [2] in the examples mentioned above.
Actually, the co-occurrence network method is a general method which could be used to capture the co-occurrence pattern within samples. Besides genomes, it could be directly used in gene, protein, proteome, and so on. Due to a lack of a user-friendly software, the method is rarely used in other data type except for genomes. Based on the package “cooccurNet”, we believe the method could be widely used in biology.
The RCOS method was newly proposed in the package. It is a method for measuring the association between two discrete variables. It could be used to measure the extent of residue coevolution in protein/DNA/RNA/SNP. Residue coevolution reflects the functional and structural constraints within a protein or gene. It could help for protein structure or function prediction [3-6]. Many computational methods have been developed in measure residue coevolution. Our tests show that the RCOS method achieved a similar performance with a state-of-the-art method PSICOV on a benchmark dataset [7]. Therefore, we believe the RCOS method could become an important method in measuring residue coevolution.
There are two important parameters for both the methods of co-occurrence network and RCOS in the package “cooccurNet”. The first parameter, called as “conservativeFilter”, is used to filter the highly conservative sites for which the ratio of some residue is larger than the “conservativeFilter”. By default, it is set to be 0.95 for all data types. The second parameter, called as “cooccurFilter”, is used to determine whether a pair of residues has co-occurrence. According to the methods of co-occurrence network and RCOS, the smaller the “cooccurFilter” is, more pairs of residues would have co-occurrence and more false positives would be observed. To investigate the influence of this cutoff on the RCOS method, we used the RCOS method with cutoffs ranging from 0.8 to 1 to identify the structural constraints between residues based on the benchmark dataset derived from Jones’s work [5]. As shown in Table 1 below, as the cutoff increased, the number of residue pairs with significant co-occurrence decreased, while the ratio of residue pairs in contact among the top L/10, L/5, L/2 and L ranked pairs of residues increased. When the cutoff was set to be 1, only a few residue pairs with significant co-occurrence could be identified for most Pfam families. For balance of these two measures, we defined this cutoff to be 0.9 in default for the protein in the package. While for the other data types, such as DNA, RNA and SNP, “cooccurFilter” was set to be 1 in default. Both the parameters “conservativeFilter” and “cooccurFilter” could be re-set in the package.
Table 1: Performance of the RCOS method with the cutoff of “cooccurFilter” ranging from 0.8 to 1. For each value of the cutoff, it lists the average number of pairs of residues with significant co-occurrence (p-value < 0.05) identified by the RCOS method, and the ratios of residue pairs in contact among the top L/10, L/5, L/2 and L ranked pairs of residues for 150 Pfam families by the method of RCOS. “*“, not applicable due to fewer than L/10 pairs of residues with significant co-occurrence were identified.
Co-occurFilter | Average number of co-occurrence pairs | L/10 | L/5 | L/2 | L |
---|---|---|---|---|---|
0.8 | 221 | 0.59 | 0.56 | 0.47 | 0.36 |
0.85 | 153 | 0.66 | 0.62 | 0.49 | 0.37 |
0.9 | 95 | 0.76 | 0.69 | 0.54 | 0.42 |
0.95 | 51 | 0.85 | 0.75 | 0.57 | 0.45 |
1 | 6 | -* | - | - | - |
References:
[1] Du X, Wang Z, Wu A, et al. Networks of genomic co-occurrence capture characteristics of human influenza A (H3N2) evolution[J]. Genome research, 2008, 18(1): 178-187.
[2] Deng L, Liu M, Hua S, et al. Network of co-mutations in Ebola virus genome predicts the disease lethality[J]. Cell research, 2015, 25(6): 753.
[3] Lee B C, Park K, Kim D. Analysis of the residue–residue coevolution network and the functionally important residues in proteins[J]. Proteins: Structure, Function, and Bioinformatics, 2008, 72(3): 863-872.
[4] Aurora R, Donlin M J, Cannon N A, et al. Genome-wide hepatitis C virus amino acid covariance networks can predict response to antiviral therapy in humans[J]. The Journal of clinical investigation, 2009, 119(1): 225-236.
[5] Jones D T, Buchan D W A, Cozzetto D, et al. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments[J]. Bioinformatics, 2012, 28(2): 184-190.
[6] Wang S, Li W, Zhang R, et al. CoinFold: a web server for protein contact prediction and contact-assisted protein folding[J]. Nucleic acids research, 2016: gkw307.
[7] Zou Y, Wu Z, et al. cooccurNet: an R package for co-occurrence network construction and analysis. Bioinformatics (In submission).
Most questions about cooccurNet will hopefully be answered by the documentation or references.
cooccurNet is updated frequently. Once you have installed cooccurNet, You can get help from the change-log.
logs <- changeLog(n=20)
#>
logs
#> [1] "2016-12-16 Yuanqiang Zou <jerrytsou2001@gmail.com>"
#> [2] ""
#> [3] "\tDocument synchronization and test outputs."
#> [4] ""
#> [5] "2017-01-10 Yuanqiang Zou <jerrytsou2001@gmail.com>"
#> [6] ""
#> [7] "\tModifiy documentaion and import a new package \"doParallel\""
#> [8] ""
#> [9] "2017-01-27 Yuanqiang Zou <jerrytsou2001@gmail.com>"
#> [10] ""
#> [11] "\tOptimized the fuction \"toigraph\""
If you’ve run into a problem which could not be addressed by the documentation, or if you’ve found a conflict between the documentation and software itself, then please write to us: Yuanqiang Zou jerrytsou2001@gmail.com or Yousong Peng pys2013@hnu.edu.cn.
In this section, you’ll learn to how to run the methods by using the sample data.
The parameter dataType
could be ‘DNA’, ‘protein’, ‘SNP’, ‘RNA’. It’s ‘protein’ by default.
dataFile=getexample(dataType="protein")
#>
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/protein"
dataFile=getexample(dataType="DNA")
#>
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/DNA"
dataFile=getexample(dataType="SNP")
#>
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/SNP"
dataFile=getexample(dataType="RNA")
#>
dataFile
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/RNA"
dataFiles=getexample_forRCOS()
#>
dataFiles
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rinst146815772caf/cooccurNet/extdata/RCOS/HA1protein_humanH3N2"
To save the RAM usage by the package, we have designed the method of using prime number to stand for the nucleotides or amino acid in the data inputted. As is known to us, in the R environment an integer vector uses 4 bytes per number, while a character vector uses 8 bytes per character. Using prime number to stand for character could save half of RAM usage for storing the data inputted. Besides, when calculating the frequency of a pair of characters (nucleotide or amino acid), it is necessary to concatenate the characters in R. After replacing the characters with prime numbers, for a pair of prime numbers A and B, we found that unique number could be derived using the following formula for prime numbers (starting from 2):
(round(A/B,5)+B)*100000 (1)
The resulting values have a one-to-one correspondence with the pairs of nucleotide or amino acid. Therefore, the concentration of character vectors could be transformed to algebraic operation of prime number vectors, which could significantly reduce the time consumed in this process.
The parameter dataFile
is a FASTA data file name with full path.
data = readseq(dataFile=getexample(dataType="protein"), dataType="protein")
#> Reading file......
#>
#> completed. (10,329)
#>
data$original[1:10,1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "Q" "K" "I" "P" "G" "N" "D" "N" "S" "T"
#> [2,] "Q" "K" "L" "P" "G" "N" "D" "N" "S" "T"
#> [3,] "Q" "K" "I" "P" "G" "N" "D" "N" "S" "T"
#> [4,] "Q" "K" "I" "P" "G" "N" "D" "N" "S" "T"
#> [5,] "Q" "K" "L" "P" "G" "N" "D" "N" "S" "T"
#> [6,] "Q" "K" "V" "P" "G" "N" "D" "N" "S" "T"
#> [7,] "Q" "K" "L" "P" "G" "N" "D" "N" "S" "T"
#> [8,] "Q" "K" "I" "P" "G" "N" "D" "N" "S" "T"
#> [9,] "Q" "K" "L" "P" "G" "N" "D" "N" "S" "T"
#> [10,] "Q" "K" "L" "P" "G" "N" "D" "N" "S" "T"
data$matrix[1:10,1:10]
#> 1 2 3 4 5 6 7 8 9 10
#> [1,] 43 23 19 41 13 37 5 37 53 59
#> [2,] 43 23 29 41 13 37 5 37 53 59
#> [3,] 43 23 19 41 13 37 5 37 53 59
#> [4,] 43 23 19 41 13 37 5 37 53 59
#> [5,] 43 23 29 41 13 37 5 37 53 59
#> [6,] 43 23 61 41 13 37 5 37 53 59
#> [7,] 43 23 29 41 13 37 5 37 53 59
#> [8,] 43 23 19 41 13 37 5 37 53 59
#> [9,] 43 23 29 41 13 37 5 37 53 59
#> [10,] 43 23 29 41 13 37 5 37 53 59
data = readseq(dataFile=getexample(dataType="DNA"), dataType="DNA")
#> Reading file......
#>
#> completed. (213,987)
#>
data$original[1:10,1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [2,] "C" "A" "A" "A" "A" "C" "C" "T" "T" "C"
#> [3,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [4,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [5,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [6,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [7,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [8,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [9,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
#> [10,] "C" "A" "A" "A" "A" "A" "C" "T" "T" "C"
data$matrix[1:10,1:10]
#> 1 2 3 4 5 6 7 8 9 10
#> [1,] 3 2 2 2 2 2 3 7 7 3
#> [2,] 3 2 2 2 2 3 3 7 7 3
#> [3,] 3 2 2 2 2 2 3 7 7 3
#> [4,] 3 2 2 2 2 2 3 7 7 3
#> [5,] 3 2 2 2 2 2 3 7 7 3
#> [6,] 3 2 2 2 2 2 3 7 7 3
#> [7,] 3 2 2 2 2 2 3 7 7 3
#> [8,] 3 2 2 2 2 2 3 7 7 3
#> [9,] 3 2 2 2 2 2 3 7 7 3
#> [10,] 3 2 2 2 2 2 3 7 7 3
data = readseq(dataFile=getexample(dataType="SNP"), dataType="SNP")
#> Reading file......
#>
#> completed. (100,50)
#>
data$original[1:10,1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "1" "0" "1" "1" "0" "1" "0" "0" "0" "0"
#> [2,] "0" "1" "0" "0" "1" "1" "0" "1" "1" "0"
#> [3,] "1" "0" "1" "1" "1" "0" "0" "0" "1" "0"
#> [4,] "1" "1" "0" "0" "1" "1" "0" "1" "0" "1"
#> [5,] "0" "0" "0" "1" "0" "1" "0" "0" "1" "0"
#> [6,] "0" "0" "1" "1" "1" "1" "0" "0" "0" "0"
#> [7,] "0" "1" "0" "0" "0" "0" "0" "0" "0" "1"
#> [8,] "1" "0" "0" "0" "0" "0" "0" "0" "1" "0"
#> [9,] "1" "1" "0" "1" "0" "1" "1" "0" "1" "0"
#> [10,] "0" "1" "1" "0" "1" "0" "0" "1" "0" "1"
data$matrix[1:10,1:10]
#> 1 2 3 4 5 6 7 8 9 10
#> [1,] 3 2 3 3 2 3 2 2 2 2
#> [2,] 2 3 2 2 3 3 2 3 3 2
#> [3,] 3 2 3 3 3 2 2 2 3 2
#> [4,] 3 3 2 2 3 3 2 3 2 3
#> [5,] 2 2 2 3 2 3 2 2 3 2
#> [6,] 2 2 3 3 3 3 2 2 2 2
#> [7,] 2 3 2 2 2 2 2 2 2 3
#> [8,] 3 2 2 2 2 2 2 2 3 2
#> [9,] 3 3 2 3 2 3 3 2 3 2
#> [10,] 2 3 3 2 3 2 2 3 2 3
data = readseq(dataFile=getexample(dataType="RNA"), dataType="RNA")
#> Reading file......
#>
#> completed. (45,987)
#>
data$original[1:10,1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [2,] "C" "A" "A" "A" "A" "C" "C" "U" "U" "C"
#> [3,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [4,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [5,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [6,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [7,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [8,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [9,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
#> [10,] "C" "A" "A" "A" "A" "A" "C" "U" "U" "C"
data$matrix[1:10,1:10]
#> 1 2 3 4 5 6 7 8 9 10
#> [1,] 3 2 2 2 2 2 3 7 7 3
#> [2,] 3 2 2 2 2 3 3 7 7 3
#> [3,] 3 2 2 2 2 2 3 7 7 3
#> [4,] 3 2 2 2 2 2 3 7 7 3
#> [5,] 3 2 2 2 2 2 3 7 7 3
#> [6,] 3 2 2 2 2 2 3 7 7 3
#> [7,] 3 2 2 2 2 2 3 7 7 3
#> [8,] 3 2 2 2 2 2 3 7 7 3
#> [9,] 3 2 2 2 2 2 3 7 7 3
#> [10,] 3 2 2 2 2 2 3 7 7 3
The parameter conservativeFilter
is used to filter the highly conservative columns which the ratio of some residue is larger than the conservationFilter. By default, it is set to be 0.95.
data = readseq(dataFile=getexample(dataType="protein"), dataType="protein")
#> Reading file......
#>
#> completed. (10,329)
#>
data_process = pprocess(data=data,conservativeFilter=0.95)
#> Pre-processing......
#>
#> Calculating bigram frequency ......
#>
#> Pre-process......completed. (10,42)
#>
For each sequence, a residue co-occurrence network would be generated. It measures the covariations between residues in the sequence. The features and properties of the network may capture important features of species, such as the antigenicity and CFR mentioned above in the section of introduction.
The parameter cooccurFilter
is used to determine whether a pair of residues have perfect co-occurrence. By default, it is set to be 0.90 for the data type of protein, while it is set to be 1 for the data type of DNA, RNA, SNP and so on.
The parameter networkFile
is a file name with full path for storing the co-occurrence network for each row. In default, it is set to be ‘cooccurNetwork’, which is in your current workspace (getwd()
)
cooccurNetwork = gencooccur(data=data_process, cooccurFilter=0.9, networkFile='cooccurNetwork')
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
workspace = getwd()
workspace
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rbuild146832c62740/cooccurNet/vignettes"
readLines(cooccurNetwork$networkFile)
#> [1] "EPI823725 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [2] "EPI775768 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261 267-312"
#> [3] "EPI814365 3-144 3-159 3-311 25-121 25-216 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [4] "EPI787403 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [5] "EPI787049 25-208 25-238 25-261 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"
#> [6] "EPI775712 3-140 3-158 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"
#> [7] "EPI777683 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"
#> [8] "EPI787395 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [9] "EPI731085 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"
#> [10] "EPI748184 43-45 43-48 43-144 43-186 43-219 43-267 43-278 45-48 45-144 45-186 45-219 45-267 45-278 48-144 48-186 48-219 48-267 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-186 144-219 144-267 144-278 159-225 186-219 186-267 186-278 208-238 208-261 219-267 219-278 238-261 267-278"
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, networkFile='cooccurNetwork')
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
readLines(cooccurNetwork$networkFile)
#> [1] "EPI823725 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [2] "EPI775768 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261 267-312"
#> [3] "EPI814365 3-144 3-159 3-311 25-121 25-216 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [4] "EPI787403 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [5] "EPI787049 25-208 25-238 25-261 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"
#> [6] "EPI775712 3-140 3-158 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"
#> [7] "EPI777683 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"
#> [8] "EPI787395 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [9] "EPI731085 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"
#> [10] "EPI748184 43-45 43-48 43-144 43-186 43-219 43-267 43-278 45-48 45-144 45-186 45-219 45-267 45-278 48-144 48-186 48-219 48-267 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-186 144-219 144-267 144-278 159-225 186-219 186-267 186-278 208-238 208-261 219-267 219-278 238-261 267-278"
This would calculate all the modules (sub-graphs) included in each residue co-occurrence network.
The parameter module
is a boolean flag to check whether the modules in each network of the networkFile would be calculated.
The parameter moduleFile
is a file name with full path for storing the modules for co-occurrence network. It’s ‘cooccurNetworkModule’ by default.
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, module = TRUE, moduleFile='cooccurNetworkModule')
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
#> 3. creating module file (cooccurNetworkModule) ......completed
#>
readLines(cooccurNetwork$moduleFile)
#> [1] "EPI823725 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (208,238,261)"
#> [2] "EPI775768 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326) (208,238,261) (267,312)"
#> [3] "EPI814365 (3,144,159,311) (25,121,216) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (140,158) (142,197) (208,238,261)"
#> [4] "EPI787403 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (142,197) (208,238,261)"
#> [5] "EPI787049 (25,208,238,261) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326)"
#> [6] "EPI775712 (3,140,158) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (159,225) (208,238,261)"
#> [7] "EPI777683 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (159,225) (208,238,261)"
#> [8] "EPI787395 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (208,238,261)"
#> [9] "EPI731085 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326) (208,238,261)"
#> [10] "EPI748184 (43,45,48,144,186,219,267,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (159,225) (208,238,261)"
The parameter property
is a boolean flag to check whether the properties for each network of the networkFile, including the network diameter, connectivity, ConnectionEffcient and so on, would be calculated.
The parameter propertyFile
is file name with full path storing the properties for each network of the networkFile. It’s ‘cooccurNetworkProperty’ by default.
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, property = TRUE, propertyFile='cooccurNetworkProperty')
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
#> 3. creating property file (cooccurNetworkProperty) ......completed
#>
readLines(cooccurNetwork$propertyFile)[1:5]
#> [1] "name Connectivity Diameter Radius ConnectionEffcient"
#> [2] "EPI823725 0 1 1 1"
#> [3] "EPI775768 0 1 1 1"
#> [4] "EPI814365 0 1 1 1"
#> [5] "EPI787403 0 1 1 1"
This is to calculate the pairwise residue co-occurrence score (RCOS) for all pairs of positions within the sequences. To evaluate the statistical significance of RCOS, the data would be permutated N times (N=100 in default). This process would take a long time. The RCOS was calculated in the permutated data and sorted decreasingly. The rank of the RCOS derived from the original data in the RCOSs derived from the permutated data divided by N was considered as the p-value for the ROCS.
The parameter siteCoFile
is file name with full path for storing the RCOS between all pairs of columns, and the related p-values. It’s ‘siteCooccurr’ by default.
The parameter sampleTimes
is an integer of permutations in the simulation when calculating the p-values. It should be greater than 100.
In the example listed below, we assign 10 to the parameter sampleTimes
only for testing purpose. Normally, It should be greater than 100.
pairwiseCooccur = siteco(dataFile=getexample(dataType="protein"), dataType="protein", conservativeFilter=0.95, cooccurFilter=0.9, siteCoFile='siteCooccurr', sampleTimes=10)
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> calculating cooccurrence..........
#>
#> 1. creating siteCoFile (siteCooccurr), sampleTimes: 10 ......
#>
#> [1] 1 3
#>
#> [1] 4 6
#>
#> [1] 7 10
#> completed
#>
readLines(pairwiseCooccur$siteCoFile)[1:5]
#> [1] "Site_i Site_j Co-occur p-value" "3 25 0 1"
#> [3] "3 33 0 1" "3 43 0 1"
#> [5] "3 45 0 1"
The parameter siteCo
is a boolean flag to check whether the residue co-occurence file would be calculated.
In the example listed below, we assign 10 to the parameter sampleTimes
only for testing purpose. Normally, It should be greater than 100.
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, networkFile='cooccurNetwork', module = TRUE, moduleFile='cooccurNetworkModule', property = TRUE, propertyFile='cooccurNetworkProperty', siteCo=TRUE, siteCoFile='siteCooccurr', sampleTimes=10 )
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
#> 3. creating module file (cooccurNetworkModule) ......completed
#>
#> 4. creating property file (cooccurNetworkProperty) ......completed
#>
#> calculating cooccurrence..........
#>
#> 5. creating siteCoFile (siteCooccurr), sampleTimes: 10 ......
#>
#> [1] 1 3
#>
#> [1] 4 6
#>
#> [1] 7 10
#> completed
#>
readLines(cooccurNetwork$networkFile)
#> [1] "EPI823725 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [2] "EPI775768 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261 267-312"
#> [3] "EPI814365 3-144 3-159 3-311 25-121 25-216 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [4] "EPI787403 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 142-197 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [5] "EPI787049 25-208 25-238 25-261 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"
#> [6] "EPI775712 3-140 3-158 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"
#> [7] "EPI777683 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 159-225 186-219 186-278 208-238 208-261 219-278 238-261"
#> [8] "EPI787395 3-144 3-159 3-311 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-159 144-311 159-311 186-219 186-278 208-238 208-261 219-278 238-261"
#> [9] "EPI731085 43-45 43-48 43-186 43-219 43-278 45-48 45-186 45-219 45-278 48-186 48-219 48-278 53-114 53-203 62-259 92-194 114-203 121-216 128-142 140-158 159-326 186-219 186-278 208-238 208-261 219-278 238-261"
#> [10] "EPI748184 43-45 43-48 43-144 43-186 43-219 43-267 43-278 45-48 45-144 45-186 45-219 45-267 45-278 48-144 48-186 48-219 48-267 48-278 53-114 53-203 62-259 92-194 114-203 121-216 140-158 144-186 144-219 144-267 144-278 159-225 186-219 186-267 186-278 208-238 208-261 219-267 219-278 238-261 267-278"
readLines(cooccurNetwork$moduleFile)
#> [1] "EPI823725 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (208,238,261)"
#> [2] "EPI775768 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326) (208,238,261) (267,312)"
#> [3] "EPI814365 (3,144,159,311) (25,121,216) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (140,158) (142,197) (208,238,261)"
#> [4] "EPI787403 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (142,197) (208,238,261)"
#> [5] "EPI787049 (25,208,238,261) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326)"
#> [6] "EPI775712 (3,140,158) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (159,225) (208,238,261)"
#> [7] "EPI777683 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (159,225) (208,238,261)"
#> [8] "EPI787395 (3,144,159,311) (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (208,238,261)"
#> [9] "EPI731085 (43,45,48,186,219,278) (53,114,203) (62,259) (92,194) (121,216) (128,142) (140,158) (159,326) (208,238,261)"
#> [10] "EPI748184 (43,45,48,144,186,219,267,278) (53,114,203) (62,259) (92,194) (121,216) (140,158) (159,225) (208,238,261)"
readLines(cooccurNetwork$propertyFile)
#> [1] "name Connectivity Diameter Radius ConnectionEffcient"
#> [2] "EPI823725 0 1 1 1"
#> [3] "EPI775768 0 1 1 1"
#> [4] "EPI814365 0 1 1 1"
#> [5] "EPI787403 0 1 1 1"
#> [6] "EPI787049 0 1 1 1"
#> [7] "EPI775712 0 1 1 1"
#> [8] "EPI777683 0 1 1 1"
#> [9] "EPI787395 0 1 1 1"
#> [10] "EPI731085 0 1 1 1"
#> [11] "EPI748184 0 1 1 1"
readLines(cooccurNetwork$siteCoFile)[1:5]
#> [1] "Site_i Site_j Co-occur p-value" "3 25 0 1"
#> [3] "3 33 0 1" "3 43 0 1"
#> [5] "3 45 0 1"
The parameter networkFile
is file name with full path for storing the co-occurrence network for each row.
The parameter networkName
is a vector of network names.
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"),dataType="protein")
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
network_igraph = toigraph(networkFile=cooccurNetwork$networkFile, networkName=c("EPI823725"))
#>
plot(network_igraph[[1]])
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"),dataType="protein")
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
network_igraph = toigraph(networkFile=cooccurNetwork$networkFile, networkName=cooccurNetwork$xnames)
#>
plot(network_igraph[[2]])
workspace = getwd()
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"),dataType="protein")
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......
#>
#> Calculating bigram frequency ......
#>
#> preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
network_igraph = toigraph(networkFile=cooccurNetwork$networkFile, networkName=cooccurNetwork$xnames)
#>
#save as PDF
pdf(file="sample1.pdf")
plot(network_igraph[[1]])
dev.off()
#> png
#> 2
#save as jpeg
jpeg(file="sample1.jpeg")
plot(network_igraph[[2]])
dev.off()
#> png
#> 2
workspace
#> [1] "C:/Users/Jerry/AppData/Local/Temp/Rtmpgvbts6/Rbuild146832c62740/cooccurNet/vignettes"
The outputs for the package cooccurNet are listed as follows:
A co-occurrence network (networkFile
) was given for each sequence, which could be easily transformed into the format of igraph by the function toigraph();
The inherent modules(moduleFile
) in each co-occurrence network and some basic network attributes(propertyFile
) such as connectivity
and clustering coefficient
for each network were given;
The extent of co-occurrence between residues(siteCoFile
), defined as residue co-occurrence score (RCOS), were given for all pairs of residues.
During the development of cooccurNet, we found that cooccurNet consumed lots of RAM, which may lead to crash of computer when the data inputted is large. In order to solve this RAM shortage problem, several approaches were taken into account.
The parameter memory
indicates the type of matrix used in cooccurNet. It could be ‘memory’ or ‘sparse’. If it’s set to be ‘memory’, all data would be manipulated in the RAM by using normal matrix and package ‘bigmemory’. If it’s set to be ‘sparse’, the package “Matrix” would be used to manipulate massive matrices in memory and initialize huge sparse matrix, which could significantly reduce the RAM consumed. In default, it is set to be NULL, so that cooccurNet would determine automatically whether all data is manipulated in the RAM or not, according to the size of data inputted and the RAM available for R.
If your system has not enough RAM, you can set memory='sparse'
to keep this program working properly.
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, networkFile='cooccurNetwork', memory='sparse')
#> reading file......
#>
#> completed. (10,329)
#>
#> preprocess file......preprocess file......completed. (10,42)
#>
#> 1. calculating networks ......
#>
#> 2. creating network file (cooccurNetwork) ......completed
#>
Dimension of data inputted | Size of data inputted | RAM requirement without optimization | RAM usage after optimization |
---|---|---|---|
10 x 10 | 400 b | 1.8 Kb | 1.6 Kb |
100 x 100 | 39 Kb | 1933 Kb | 20.7 Kb |
1000x 1000 | 3.8 Mb | 1905 Mb | 1.9 Mb |
10000x 10000 | 381.5 Mb | 1862 Gb | 190.7 Mb |
The parameter parallel
is a boolean flag to indicate whether the parallel function is in use or not.
In the current version of cooccurNet (0.1.4) , the package parallel
from R core is in use. For better memory handling purpose, we use FORK
, but it only supports Unix/Mac (not Windows) system.
Only if the parameter memory = 'sparse'
, ‘parallel’ would take effect.
Parallel uses max(1, detectCores()-1)
cores. If you want to specify yours, plesae create a text file named cores
in your workspace and put a number in it.
cooccurNetwork = coocnet(dataFile=getexample(dataType="protein"), dataType="protein",conservativeFilter=0.95, cooccurFilter=0.9, networkFile='cooccurNetwork', memory='sparse', parallel=TRUE)
The parameter debug
is a boolean flag to indicate whether the debug message will be displayed or not, it’s FALSE by default. You can add it into each function.
data = readseq(dataFile=getexample(dataType="protein"), dataType="protein",debug=TRUE)
#> Reading file......
#>
#> reading sequence from file time cost:0.12 secs;memory applied:93.1 MB
#> character level:A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y
#>
#> split string to character time cost:0.47 secs;memory applied:93.4 MB
#> completed. (10,329)
#>
#> total co-occurrence.readsequence time cost:0.68 secs;memory applied:93.4 MB
#>