Performing pathway analysis is a common task in genomics and there are many available software tools, many of which are R-based. Depending on the tool, it may be necessary to import the pathways into R, translate genes to the appropriate species, convert between symbols and IDs, and format the object in the required way.
The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software.
Load the package.
library(msigdbr)
Check the available species.
msigdbr_show_species()
#> [1] "Bos taurus" "Caenorhabditis elegans" "Canis lupus familiaris"
#> [4] "Danio rerio" "Drosophila melanogaster" "Gallus gallus"
#> [7] "Homo sapiens" "Mus musculus" "Rattus norvegicus"
#> [10] "Saccharomyces cerevisiae" "Sus scrofa"
Retrieve human genes for all gene sets in the database.
m_df = msigdbr(species = "Homo sapiens")
head(m_df)
#> # A tibble: 6 x 9
#> gs_id gs_name gs_cat gs_subcat human_gene_… species_n… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 M12609 AAACCAC_M… C3 MIR:MIR_Le… ABCC4 Homo sapi… 10257 ABCC4 <NA>
#> 2 M12609 AAACCAC_M… C3 MIR:MIR_Le… ABRAXAS2 Homo sapi… 23172 ABRAXAS2 <NA>
#> 3 M12609 AAACCAC_M… C3 MIR:MIR_Le… ACTN4 Homo sapi… 81 ACTN4 <NA>
#> 4 M12609 AAACCAC_M… C3 MIR:MIR_Le… ACVR1 Homo sapi… 90 ACVR1 <NA>
#> 5 M12609 AAACCAC_M… C3 MIR:MIR_Le… ADAM9 Homo sapi… 8754 ADAM9 <NA>
#> 6 M12609 AAACCAC_M… C3 MIR:MIR_Le… ADAMTS5 Homo sapi… 11096 ADAMTS5 <NA>
Check the available collections and sub-collections.
m_df %>% dplyr::distinct(gs_cat, gs_subcat) %>% dplyr::arrange(gs_cat, gs_subcat)
#> # A tibble: 19 x 2
#> gs_cat gs_subcat
#> <chr> <chr>
#> 1 C1 ""
#> 2 C2 "CGP"
#> 3 C2 "CP"
#> 4 C2 "CP:BIOCARTA"
#> 5 C2 "CP:KEGG"
#> 6 C2 "CP:PID"
#> 7 C2 "CP:REACTOME"
#> 8 C3 "MIR:MIRDB"
#> 9 C3 "MIR:MIR_Legacy"
#> 10 C3 "TFT:GTRD"
#> 11 C3 "TFT:TFT_Legacy"
#> 12 C4 "CGN"
#> 13 C4 "CM"
#> 14 C5 "BP"
#> 15 C5 "CC"
#> 16 C5 "MF"
#> 17 C6 ""
#> 18 C7 ""
#> 19 H ""
Retrieve mouse genes for just the hallmark collection gene sets.
m_df = msigdbr(species = "Mus musculus", category = "H")
head(m_df)
#> # A tibble: 6 x 9
#> gs_id gs_name gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 M5905 HALLMARK… H "" ABCA1 Mus musc… 11303 Abca1 Inparanoid,Ho…
#> 2 M5905 HALLMARK… H "" ABCB8 Mus musc… 74610 Abcb8 Inparanoid,Ho…
#> 3 M5905 HALLMARK… H "" ACAA2 Mus musc… 52538 Acaa2 Inparanoid,Ho…
#> 4 M5905 HALLMARK… H "" ACADL Mus musc… 11363 Acadl Inparanoid,Ho…
#> 5 M5905 HALLMARK… H "" ACADM Mus musc… 11364 Acadm Inparanoid,Ho…
#> 6 M5905 HALLMARK… H "" ACADS Mus musc… 11409 Acads Inparanoid,Ho…
Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.
m_df = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
head(m_df)
#> # A tibble: 6 x 9
#> gs_id gs_name gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 M1423 ABBUD_LI… C2 CGP AHNAK Mus musc… 66395 Ahnak Inparanoid,Ho…
#> 2 M1423 ABBUD_LI… C2 CGP ANKRD40 Mus musc… 71452 Ankrd40 Inparanoid,Ho…
#> 3 M1423 ABBUD_LI… C2 CGP ARID1A Mus musc… 93760 Arid1a Inparanoid,Ho…
#> 4 M1423 ABBUD_LI… C2 CGP BCKDHB Mus musc… 12040 Bckdhb Inparanoid,Ho…
#> 5 M1423 ABBUD_LI… C2 CGP C16orf89 Mus musc… 239691 AU021092 Inparanoid,Ho…
#> 6 M1423 ABBUD_LI… C2 CGP CAPN9 Mus musc… 73647 Capn9 Inparanoid,Ho…
The msigdbr() function output can also be manipulated like any standard data frame.
m_df = msigdbr(species = "Mus musculus") %>% dplyr::filter(gs_cat == "H")
head(m_df)
#> # A tibble: 6 x 9
#> gs_id gs_name gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 M5905 HALLMARK… H "" ABCA1 Mus musc… 11303 Abca1 Inparanoid,Ho…
#> 2 M5905 HALLMARK… H "" ABCB8 Mus musc… 74610 Abcb8 Inparanoid,Ho…
#> 3 M5905 HALLMARK… H "" ACAA2 Mus musc… 52538 Acaa2 Inparanoid,Ho…
#> 4 M5905 HALLMARK… H "" ACADL Mus musc… 11363 Acadl Inparanoid,Ho…
#> 5 M5905 HALLMARK… H "" ACADM Mus musc… 11364 Acadm Inparanoid,Ho…
#> 6 M5905 HALLMARK… H "" ACADS Mus musc… 11409 Acads Inparanoid,Ho…
Use the gene sets data frame for clusterProfiler (for genes as Entrez Gene IDs).
m_t2g = m_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = m_t2g, ...)
Use the gene sets data frame for clusterProfiler (for genes as gene symbols).
m_t2g = m_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = m_t2g, ...)
Use the gene sets data frame for fgsea.
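fgsea takes its pathways as a named list of gene vectors rather than as a data frame. The following is a minimal sketch of that conversion using base R split(); toy_df and its contents are made-up stand-ins for the msigdbr() output, and ranked_stats_vector in the final comment is a hypothetical name for your own ranked gene statistics.

```r
# Toy stand-in for the msigdbr() output (hypothetical gene set names and genes).
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Acaa2"),
  stringsAsFactors = FALSE
)

# One list element per gene set, each holding that set's gene symbols.
m_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)

# With real data, the list would then be passed on along these lines:
# fgsea::fgsea(pathways = m_list, stats = ranked_stats_vector, ...)
```

The same pattern should apply to Entrez Gene IDs by splitting the entrez_gene column instead of gene_symbol.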
Which version of MSigDB was used?
This package was generated with MSigDB v7.1 (released March 2020). The MSigDB version is used as the base of the package version. You can check the installed version with packageVersion("msigdbr").
Can I download the gene sets directly from MSigDB instead of using this package?
Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse data. If you are not working with human data, you then have to convert the MSigDB genes to your organism or your genes to human.
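To make the GMT layout concrete, here is a rough base R sketch that parses such a file by hand. It is only an illustration of the file structure, not a replacement for getGmt(); the two gene sets written to the temporary file are hypothetical.

```r
# Write a two-line toy GMT file (hypothetical gene sets) to a temporary path.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(
  c("SET_A\tdescription_a\tABCA1\tABCB8", "SET_B\tdescription_b\tACAA2"),
  gmt_path
)

# Each line is: set name, description, then the member genes, tab-separated.
fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)

# Drop the first two fields to keep only the genes, and name each set.
gene_sets <- lapply(fields, function(x) x[-(1:2)])
names(gene_sets) <- vapply(fields, function(x) x[1], character(1))
```

The result is a named list of human gene symbols, which would still need converting for other organisms.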
Can I convert between human and mouse genes just by adjusting gene capitalization?
That will work for most genes, but not all.
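A small sketch of why capitalization alone is not reliable, using three human/mouse symbol pairs taken from the msigdbr output shown earlier. The helper to_mouse_case() is a hypothetical function written for this illustration, not part of any package.

```r
# Naive conversion: first character upper case, the rest lower case.
to_mouse_case <- function(symbols) {
  paste0(toupper(substr(symbols, 1, 1)), tolower(substring(symbols, 2)))
}

# Human symbols and the mouse symbols reported for them in the output above.
human <- c("ABCA1", "ARID1A", "C16orf89")
mouse <- c("Abca1", "Arid1a", "AU021092")

# Compare the naive guess with the actual mouse symbol.
guessed <- to_mouse_case(human)
matches <- guessed == mouse
```

Here the first two guesses agree with the mouse symbols, while C16orf89 does not, because its mouse ortholog (AU021092) has an unrelated name.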
Can I convert human genes to any organism myself instead of using this package?
Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.
Aren’t there already other similar tools?
There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file. MSigDF is based on the WEHI resource, so it provides the same data, but converted to a more tidyverse-friendly data frame. When msigdbr was initially released, all of them were multiple releases behind the latest version of MSigDB, so they are possibly no longer maintained.
The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.
Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human gene, only the ortholog in each species that is supported by the largest number of databases is used.
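The selection rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's actual implementation: the candidates table, its gene names, and its sources are hypothetical, and ties are broken here simply by row order.

```r
# Toy candidate table (hypothetical genes and supporting databases).
candidates <- data.frame(
  human_gene = c("GENE1", "GENE1", "GENE2"),
  ortholog = c("Gene1a", "Gene1b", "Gene2"),
  sources = c("Inparanoid,HomoloGene,OMA", "Panther", "Ensembl,OMA"),
  stringsAsFactors = FALSE
)

# Count the databases supporting each candidate ortholog.
candidates$num_sources <- lengths(strsplit(candidates$sources, ",", fixed = TRUE))

# Sort by gene, then by descending support, and keep the first row per gene.
ordered <- candidates[order(candidates$human_gene, -candidates$num_sources), ]
best <- ordered[!duplicated(ordered$human_gene), ]
```

In this toy table, Gene1a is kept for GENE1 because three databases support it, against one for Gene1b.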