The goal of GeoTcgaData is to process RNA-seq, DNA methylation, and copy number variation data from GEO and TCGA.
Erqiang Hu
College of Bioinformatics Science and Technology, Harbin Medical University
Get the development version from GitHub:
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("huerqiang/GeoTcgaData")
Or install the released version from CRAN:
install.packages("GeoTcgaData")
GEO and TCGA provide a wealth of data, such as RNA-seq, DNA methylation, and copy number variation data. Downloading data from TCGA with the gdc tool is easy, but processing these data into a format suitable for bioinformatics analysis requires more work. This R package was developed to handle that processing.
This is a basic example which shows you how to solve a common problem:
The functions classify_sample and diff_gene identify differentially expressed genes using the DESeq2 package. For example:
The parameter kegg_liver is a matrix or data.frame of gene expression data (counts) from TCGA.
The function Merge_methy_tcga merges methylation data downloaded from TCGA. This makes it easier to extract differentially methylated genes in downstream analysis. For example:
dirr = system.file(file.path("extdata","methy"),package="GeoTcgaData")
merge_result <- Merge_methy_tcga(dirr)
The function ann_merge merges the copy number variation data downloaded from TCGA using gdc. For example:
metadatafile_name <- "metadata.cart.2018-11-09.json"
jieguo2 <- ann_merge(dirr = system.file(file.path("extdata","cnv"),package="GeoTcgaData"),metadatafile=metadatafile_name)
The parameter dirr is a string giving the directory of the copy number variation data downloaded from TCGA. The parameter metadatafile is the metadata file downloaded from TCGA.
The functions prepare_chi and differential_cnv perform a chi-square test to find genes with differential copy number variation. For example:
jieguo3 <- matrix(c(-1.09150,-1.47120,-0.87050,-0.50880,
-0.50880,2.0,2.0,2.0,2.0,2.0,2.601962,2.621332,2.621332,
2.621332,2.621332,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,
2.0,2.0,2.0,2.0,2.0,2.0,2.0),nrow=5)
rownames(jieguo3) <- c("AJAP1", "FHAD1", "CLCNKB", "CROCCP2", "AL137798.3")
colnames(jieguo3) <- c("TCGA-DD-A4NS-10A-01D-A30U-01", "TCGA-ED-A82E-01A-11D-A34Y-01",
"TCGA-WQ-A9G7-01A-11D-A36W-01", "TCGA-DD-AADN-01A-11D-A40Q-01",
"TCGA-ZS-A9CD-10A-01D-A36Z-01", "TCGA-DD-A1EB-11A-11D-A12Y-01")
rt <- prepare_chi(jieguo3)
chiResult <- differential_cnv(rt)
The parameter of prepare_chi is the result of ann_merge, and the parameter of differential_cnv is the result of prepare_chi.
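To make the idea behind the chi-square step concrete, here is a minimal base R sketch. It is not the internal implementation of prepare_chi or differential_cnv. It assumes the usual TCGA barcode convention (characters 14-15 are the sample type code, 01-09 for tumor and 10-19 for normal), and the contingency table cnv_counts is made-up data for a single hypothetical gene.

```r
# TCGA barcodes taken from the example above.
barcodes <- c("TCGA-DD-A4NS-10A-01D-A30U-01", "TCGA-ED-A82E-01A-11D-A34Y-01",
              "TCGA-WQ-A9G7-01A-11D-A36W-01", "TCGA-DD-AADN-01A-11D-A40Q-01",
              "TCGA-ZS-A9CD-10A-01D-A36Z-01", "TCGA-DD-A1EB-11A-11D-A12Y-01")

# Characters 14-15 hold the sample type code: 01-09 tumor, 10-19 normal.
type_code <- as.integer(substr(barcodes, 14, 15))
group <- ifelse(type_code < 10, "tumor", "normal")
table(group)

# Hypothetical counts for one gene (illustrative only, not real data):
# rows are sample groups, columns are the CNV status of that gene.
cnv_counts <- matrix(c(30, 70,
                       10, 90),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(group = c("tumor", "normal"),
                                     status = c("altered", "unaltered")))

# Chi-square test of independence between group and CNV status.
test_result <- chisq.test(cnv_counts)
test_result$p.value
```

A small p-value suggests that the CNV status of the gene differs between tumor and normal samples. With only a handful of samples per group, as in the six-sample example above, the chi-square approximation is unreliable.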
The function gene_ave averages the expression values of different IDs that map to the same gene in GEO chip data. For example:
aa <- c("Gene Symbol", "MARCH1", "MARC1", "MARCH1", "MARCH1", "MARCH1")
bb <- c("GSM1629982", "2.969058399", "4.722410064", "8.165514853", "8.24243893", "8.60815086")
cc <- c("GSM1629982", "3.969058399", "5.722410064", "7.165514853", "6.24243893", "7.60815086")
file1 <- data.frame(aa=aa,bb=bb,cc=cc)
result <- gene_ave(file1)
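The averaging idea can be sketched in base R with aggregate. This is not necessarily how gene_ave is implemented. The column names symbol, GSM_A, and GSM_B are placeholders chosen for this sketch, and the values are reused from the example above as numeric columns.

```r
# Expression table with several probe rows for the same gene symbol.
expr <- data.frame(
  symbol = c("MARCH1", "MARC1", "MARCH1", "MARCH1", "MARCH1"),
  GSM_A  = c(2.969058399, 4.722410064, 8.165514853, 8.24243893, 8.60815086),
  GSM_B  = c(3.969058399, 5.722410064, 7.165514853, 6.24243893, 7.60815086)
)

# Average every sample column within each gene symbol.
averaged <- aggregate(. ~ symbol, data = expr, FUN = mean)
averaged
```

The result has one row per gene symbol, with each sample column holding the mean over all rows that shared that symbol.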
Multiple gene symbols may correspond to the same chip ID. The function rep1 assigns the expression of that ID to each gene, whereas rep2 deletes the expression. For example:
aa <- c("MARCH1 /// MMA","MARC1","MARCH2 /// MARCH3",
"MARCH3 /// MARCH4","MARCH1")
bb <- c("2.969058399","4.722410064","8.165514853","8.24243893","8.60815086")
cc <- c("3.969058399","5.722410064","7.165514853","6.24243893","7.60815086")
input_fil <- data.frame(aa=aa,bb=bb,cc=cc)
rep1_result <- rep1(input_fil," /// ")
rep2_result <- rep2(input_fil," /// ")
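The two strategies can be illustrated with a short base R sketch. It is written from the description above rather than from the source of rep1 and rep2, so details such as row order or row names may differ from the package. The data frame probes and its column names are placeholders.

```r
probes <- data.frame(
  symbol = c("MARCH1 /// MMA", "MARC1", "MARCH2 /// MARCH3"),
  s1     = c(2.97, 4.72, 8.17),
  stringsAsFactors = FALSE
)
sep <- " /// "

# Strategy 1 (like rep1): give each gene its own copy of the row.
parts <- strsplit(probes$symbol, sep, fixed = TRUE)
expanded <- probes[rep(seq_len(nrow(probes)), lengths(parts)), ]
expanded$symbol <- unlist(parts)
rownames(expanded) <- NULL

# Strategy 2 (like rep2): drop rows that map to more than one gene.
dropped <- probes[!grepl(sep, probes$symbol, fixed = TRUE), ]
```

Here expanded has five rows, one per gene symbol, while dropped keeps only the MARC1 row.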
id_conversion_vector converts gene IDs from one type, such as symbol, RefSeq_ID, Ensembl_ID, NCBI_Gene_ID, UCSC_ID, or UniProt_ID, to another. Use id_ava() to list all convertible ID types. For example:
id_conversion_vector("symbol", "ensembl_gene_id", c("A2ML1", "A2ML1-AS1", "A4GALT", "A12M1", "AAAS"))
When converting Ensembl IDs to other ID types, remove the version number first. For example, “ENSG00000186092.4” does not work; change it to “ENSG00000186092”.
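One way to strip the version suffix is with base R's sub. This is a small sketch; the variable names are placeholders.

```r
ensembl_ids <- c("ENSG00000186092.4", "ENSG00000186092")

# Remove a trailing ".<digits>" version suffix if present.
stripped <- sub("\\.[0-9]+$", "", ensembl_ids)
stripped
# both elements are now "ENSG00000186092"
```

IDs without a version suffix pass through unchanged, so the call is safe to apply to a mixed vector.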
In particular, the function id_conversion converts Ensembl gene IDs to gene symbols in TCGA data. For example:
The parameter profile is a data.frame or matrix of gene expression data in TCGA.
countToFpkm_matrix and countToTpm_matrix convert count data to FPKM or TPM data. For example:
lung_squ_count2 <- matrix(c(1,2,3,4,5,6,7,8,9),ncol=3)
rownames(lung_squ_count2) <- c("DISC1","TCOF1","SPPL3")
colnames(lung_squ_count2) <- c("sample1","sample2","sample3")
jieguo <- countToFpkm_matrix(lung_squ_count2)
lung_squ_count2 <- matrix(c(11,22,23,14,15,6,17,18,29),ncol=3)
rownames(lung_squ_count2) <- c("DISC1","TCOF1","SPPL3")
colnames(lung_squ_count2) <- c("sample1","sample2","sample3")
jieguo <- countToTpm_matrix(lung_squ_count2)
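For readers unfamiliar with the two units, the standard FPKM and TPM formulas can be written out in base R. This sketch is not the package's implementation. The gene lengths below are invented for illustration, whereas the package functions take no length argument in the examples above and so presumably rely on their own gene length annotation.

```r
counts <- matrix(c(11, 22, 23, 14, 15, 6, 17, 18, 29), ncol = 3,
                 dimnames = list(c("DISC1", "TCOF1", "SPPL3"),
                                 c("sample1", "sample2", "sample3")))

# Hypothetical gene lengths in base pairs (illustrative only, not real annotations).
gene_length <- c(DISC1 = 4000, TCOF1 = 5000, SPPL3 = 3000)

# FPKM: count * 1e9 / (gene length * total counts in the sample).
# Dividing a matrix by a vector of length nrow() applies it row by row.
fpkm <- sweep(counts, 2, colSums(counts), "/") / gene_length * 1e9

# TPM: length-normalised rate, rescaled so each sample sums to one million.
rate <- counts / gene_length
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

colSums(tpm)
```

Because every TPM column sums to the same total, TPM values are easier to compare across samples than FPKM values.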
tcga_cli_deal combines clinical information obtained from TCGA and extracts survival data. For example: