Introduction to the msigdbr package

Overview

Pathway analysis is a common task in genomics, and many software tools are available for it, many of them R-based. Depending on the tool, it may be necessary to import the pathways into R, translate genes to the appropriate species, convert between symbols and IDs, and format the object in the required way.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

in an R-friendly format (a data frame in a “long” format with one gene per row)
for multiple frequently studied model organisms (human, mouse, rat, pig, fly, yeast, etc.)
as both gene symbols and NCBI/Entrez Gene IDs (for better compatibility with pathway enrichment tools)
that can be used in a script without requiring additional external files

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

Check the available species.

msigdbr_show_species()
#>  [1] "Bos taurus"               "Caenorhabditis elegans"   "Canis lupus familiaris"  
#>  [4] "Danio rerio"              "Drosophila melanogaster"  "Gallus gallus"           
#>  [7] "Homo sapiens"             "Mus musculus"             "Rattus norvegicus"       
#> [10] "Saccharomyces cerevisiae" "Sus scrofa"

Retrieve human genes for all gene sets in the database.

m_df = msigdbr(species = "Homo sapiens")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_id  gs_name    gs_cat gs_subcat   human_gene_… species_n… entrez_g… gene_sym… sources
#>   <chr>  <chr>      <chr>  <chr>       <chr>        <chr>          <int> <chr>     <chr>  
#> 1 M12609 AAACCAC_M… C3     MIR:MIR_Le… ABCC4        Homo sapi…     10257 ABCC4     <NA>   
#> 2 M12609 AAACCAC_M… C3     MIR:MIR_Le… ABRAXAS2     Homo sapi…     23172 ABRAXAS2  <NA>   
#> 3 M12609 AAACCAC_M… C3     MIR:MIR_Le… ACTN4        Homo sapi…        81 ACTN4     <NA>   
#> 4 M12609 AAACCAC_M… C3     MIR:MIR_Le… ACVR1        Homo sapi…        90 ACVR1     <NA>   
#> 5 M12609 AAACCAC_M… C3     MIR:MIR_Le… ADAM9        Homo sapi…      8754 ADAM9     <NA>   
#> 6 M12609 AAACCAC_M… C3     MIR:MIR_Le… ADAMTS5      Homo sapi…     11096 ADAMTS5   <NA>

Check the available collections and sub-collections.

library(dplyr)  # provides the %>% pipe used below
m_df %>% dplyr::distinct(gs_cat, gs_subcat) %>% dplyr::arrange(gs_cat, gs_subcat)
#> # A tibble: 19 x 2
#>    gs_cat gs_subcat       
#>    <chr>  <chr>           
#>  1 C1     ""              
#>  2 C2     "CGP"           
#>  3 C2     "CP"            
#>  4 C2     "CP:BIOCARTA"   
#>  5 C2     "CP:KEGG"       
#>  6 C2     "CP:PID"        
#>  7 C2     "CP:REACTOME"   
#>  8 C3     "MIR:MIRDB"     
#>  9 C3     "MIR:MIR_Legacy"
#> 10 C3     "TFT:GTRD"      
#> 11 C3     "TFT:TFT_Legacy"
#> 12 C4     "CGN"           
#> 13 C4     "CM"            
#> 14 C5     "BP"            
#> 15 C5     "CC"            
#> 16 C5     "MF"            
#> 17 C6     ""              
#> 18 C7     ""              
#> 19 H      ""

Retrieve mouse genes for just the hallmark collection gene sets.

m_df = msigdbr(species = "Mus musculus", category = "H")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_id gs_name   gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources       
#>   <chr> <chr>     <chr>  <chr>     <chr>      <chr>         <int> <chr>     <chr>         
#> 1 M5905 HALLMARK… H      ""        ABCA1      Mus musc…     11303 Abca1     Inparanoid,Ho…
#> 2 M5905 HALLMARK… H      ""        ABCB8      Mus musc…     74610 Abcb8     Inparanoid,Ho…
#> 3 M5905 HALLMARK… H      ""        ACAA2      Mus musc…     52538 Acaa2     Inparanoid,Ho…
#> 4 M5905 HALLMARK… H      ""        ACADL      Mus musc…     11363 Acadl     Inparanoid,Ho…
#> 5 M5905 HALLMARK… H      ""        ACADM      Mus musc…     11364 Acadm     Inparanoid,Ho…
#> 6 M5905 HALLMARK… H      ""        ACADS      Mus musc…     11409 Acads     Inparanoid,Ho…

Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.

m_df = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_id gs_name   gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources       
#>   <chr> <chr>     <chr>  <chr>     <chr>      <chr>         <int> <chr>     <chr>         
#> 1 M1423 ABBUD_LI… C2     CGP       AHNAK      Mus musc…     66395 Ahnak     Inparanoid,Ho…
#> 2 M1423 ABBUD_LI… C2     CGP       ANKRD40    Mus musc…     71452 Ankrd40   Inparanoid,Ho…
#> 3 M1423 ABBUD_LI… C2     CGP       ARID1A     Mus musc…     93760 Arid1a    Inparanoid,Ho…
#> 4 M1423 ABBUD_LI… C2     CGP       BCKDHB     Mus musc…     12040 Bckdhb    Inparanoid,Ho…
#> 5 M1423 ABBUD_LI… C2     CGP       C16orf89   Mus musc…    239691 AU021092  Inparanoid,Ho…
#> 6 M1423 ABBUD_LI… C2     CGP       CAPN9      Mus musc…     73647 Capn9     Inparanoid,Ho…

The msigdbr() function output can also be manipulated like any standard data frame.

m_df = msigdbr(species = "Mus musculus") %>% dplyr::filter(gs_cat == "H")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_id gs_name   gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources       
#>   <chr> <chr>     <chr>  <chr>     <chr>      <chr>         <int> <chr>     <chr>         
#> 1 M5905 HALLMARK… H      ""        ABCA1      Mus musc…     11303 Abca1     Inparanoid,Ho…
#> 2 M5905 HALLMARK… H      ""        ABCB8      Mus musc…     74610 Abcb8     Inparanoid,Ho…
#> 3 M5905 HALLMARK… H      ""        ACAA2      Mus musc…     52538 Acaa2     Inparanoid,Ho…
#> 4 M5905 HALLMARK… H      ""        ACADL      Mus musc…     11363 Acadl     Inparanoid,Ho…
#> 5 M5905 HALLMARK… H      ""        ACADM      Mus musc…     11364 Acadm     Inparanoid,Ho…
#> 6 M5905 HALLMARK… H      ""        ACADS      Mus musc…     11409 Acads     Inparanoid,Ho…
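
Because the output is an ordinary data frame, summaries such as gene set sizes need nothing beyond standard tools. The following is a minimal sketch that uses a small hand-made data frame (`toy_df`, a hypothetical stand-in with made-up gene set names) instead of real msigdbr output, so it runs without the package installed.

```r
# Toy stand-in for msigdbr() output; set names and memberships are made up.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Acaa2", "Abca1", "Acadl"),
  stringsAsFactors = FALSE
)

# Number of genes in each gene set.
set_sizes <- table(toy_df$gs_name)

# Keep only the gene sets with at least 3 genes.
large_sets <- names(set_sizes)[set_sizes >= 3]
filtered_df <- toy_df[toy_df$gs_name %in% large_sets, ]
```

The same filtering should apply unchanged to a real msigdbr() result, since it relies only on the gs_name column.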

Integrating with Pathway Analysis Packages

Use the gene sets data frame for clusterProfiler (for genes as Entrez Gene IDs).

m_t2g = m_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = m_t2g, ...)

Use the gene sets data frame for clusterProfiler (for genes as gene symbols).

m_t2g = m_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = m_t2g, ...)
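
The TERM2GENE argument expects a two-column data frame with the term first and the gene second. As a rough sketch, the shape of that table and a quick sanity check on the ID type can be tried on toy data. Here `toy_df`, `toy_t2g`, and `query_ids` are hypothetical names, the gene set names are made up, and 99999 is only a placeholder ID that is absent from the toy sets.

```r
# Toy stand-in for msigdbr() output; gene set names are made up.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B"),
  entrez_gene = c(11303L, 74610L, 11303L, 11363L),
  gene_symbol = c("Abca1", "Abcb8", "Abca1", "Acadl"),
  stringsAsFactors = FALSE
)

# Term in the first column, gene in the second, duplicate rows dropped.
toy_t2g <- unique(toy_df[, c("gs_name", "entrez_gene")])

# Hypothetical query genes, given as Entrez Gene IDs.
query_ids <- c(11303L, 11363L, 99999L)

# Fraction of query genes found in any gene set. A value near zero
# usually points to an ID type mismatch (symbols vs. Entrez Gene IDs).
overlap <- mean(query_ids %in% toy_t2g$entrez_gene)
```

Checking the overlap before running the enrichment is optional, but it catches the common case where the query vector and the gene set table use different identifier types.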

Use the gene sets data frame for fgsea.

m_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
fgsea(pathways = m_list, ...)
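
The pathways argument takes a named list with one character vector of genes per gene set. A minimal sketch of that conversion on toy data follows; `toy_df` and `toy_list` are hypothetical names and the gene set names are made up.

```r
# Toy stand-in for msigdbr() output; gene set names are made up.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Abca1", "Acadl"),
  stringsAsFactors = FALSE
)

# Split the gene column by gene set name.
# The result is a list named by gene set, each element a vector of genes.
toy_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)

# Gene set sizes, useful for choosing size limits before running the analysis.
set_sizes <- lengths(toy_list)
```

The gene-level statistics passed to fgsea should be named with the same identifier type as the list elements (gene symbols here).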

Questions and Concerns

Which version of MSigDB was used?

This package was generated with MSigDB v7.1 (released March 2020). The MSigDB version is used as the base of the package version. You can check the installed version with packageVersion("msigdbr").

Can I download the gene sets directly from MSigDB instead of using this package?

Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse data. If you are not working with human data, you then have to convert the MSigDB genes to your organism or your genes to human.
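
For illustration, a GMT file is tab-separated text with one gene set per line: the set name, a description, and then the genes. The sketch below writes a two-line toy file (made-up sets, not MSigDB content) and reads it into the long one-gene-per-row layout using base R only. It is an alternative to getGmt(), not a description of how that function works.

```r
# Write a toy GMT file: name, description, then genes, all tab-separated.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(
  c("SET_A\tdescription of set A\tABCA1\tABCB8\tACAA2",
    "SET_B\tdescription of set B\tABCA1\tACADL"),
  gmt_path
)

# Split each line on tabs.
fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)

# Build a long data frame with one gene per row.
# The first field is the set name, the second (description) is dropped.
gmt_df <- do.call(rbind, lapply(fields, function(x) {
  data.frame(gs_name = x[1], gene_symbol = x[-(1:2)],
             stringsAsFactors = FALSE)
}))
```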

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most genes, but not all.
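
The human and mouse symbol pairs shown in the C2 CGP output earlier illustrate the point. The sketch below applies a simple capitalization rule (the helper `to_mouse_case()` is hypothetical, written only for this example) and lists the symbols where the guess disagrees with the mouse symbol in that output.

```r
# Human symbols and the mouse symbols paired with them in the
# C2 CGP output shown earlier in this document.
human <- c("AHNAK", "ANKRD40", "ARID1A", "BCKDHB", "C16orf89", "CAPN9")
mouse <- c("Ahnak", "Ankrd40", "Arid1a", "Bckdhb", "AU021092", "Capn9")

# Capitalization rule: first character upper case, the rest lower case.
to_mouse_case <- function(x) {
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

guessed <- to_mouse_case(human)

# Human symbols for which the guess differs from the listed mouse symbol.
mismatched <- human[guessed != mouse]
```

Here five of the six symbols convert correctly, while C16orf89 does not, because its mouse ortholog is listed as AU021092.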

Can I convert human genes to any organism myself instead of using this package?

Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren’t there already other similar tools?

There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file. MSigDF is based on the WEHI resource, so it provides the same data, but converted to a more tidyverse-friendly data frame. When msigdbr was initially released, all of them were multiple releases behind the latest version of MSigDB, so they are possibly no longer maintained.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam, and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
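
The selection rule can be sketched on toy data. This is an illustration of the idea only, not the package's actual code: the `candidates` table, its column names, and the gene names are made up, and the handling of ties (keeping the first candidate) is an assumption, since tie-breaking is not described here.

```r
# Made-up ortholog candidates with the databases supporting each one.
candidates <- data.frame(
  human_gene = c("GENE1", "GENE1", "GENE2"),
  ortholog = c("Gene1a", "Gene1b", "Gene2"),
  sources = c("eggNOG,Inparanoid,OMA", "Panther", "HGNC,TreeFam"),
  stringsAsFactors = FALSE
)

# Count the supporting databases in the comma-separated sources string.
candidates$n_sources <- lengths(strsplit(candidates$sources, ",", fixed = TRUE))

# Sort so the best-supported ortholog comes first within each human gene,
# then keep the first row per human gene (ties: first candidate wins).
sorted <- candidates[order(candidates$human_gene, -candidates$n_sources), ]
best <- sorted[!duplicated(sorted$human_gene), ]
```

In this example GENE1 keeps Gene1a (three databases) over Gene1b (one database), and GENE2 keeps its only candidate.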