Overview
A typical computational pipeline to process single-cell RNA sequencing (scRNA-seq) data includes clustering of cells as one of the steps. Assignment of cell type labels to those clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search. This is especially challenging when unexpected or poorly described populations are present. The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.
The clustermole package provides three primary features:
- cell type prediction based on marker genes (clustermole_overlaps)
- cell type prediction based on a full expression matrix (clustermole_enrichment)
- a database of cell type markers (clustermole_markers)
Usage
Install clustermole if it is not yet available on your system.
Load clustermole.
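The two steps above can be sketched as follows. This assumes the package is installed from CRAN; the `requireNamespace()` guard is an addition here so that the installation is skipped when the package is already present.

```r
# Install clustermole from CRAN only if it is not already available.
if (!requireNamespace("clustermole", quietly = TRUE)) {
  install.packages("clustermole")
}

# Attach the package for the current session.
library(clustermole)
```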
clustermole_overlaps(): cell type prediction based on marker genes
If you have a set of genes, such as cluster markers, you can compare them to known cell type markers to find the marker sets they overlap significantly (overrepresentation analysis).
my_genes = c("CD2", "CD3D", "CD3E", "IL7R", "IL32", "LTB", "LDHB", "CCR7")
my_overlaps = clustermole_overlaps(genes = my_genes, species = "hs")
my_overlaps
#> # A tibble: 389 x 9
#> db species organ celltype celltype_full n_genes overlap p_value fdr
#> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Pangl… Human Immu… T memor… T memory cel… 54 6 3.90e-15 5.51e-12
#> 2 SCSig Human Cent… Fan_Emb… Fan_Embryoni… 150 7 4.30e-15 5.51e-12
#> 3 Pangl… Mouse Immu… T memor… T memory cel… 57 6 7.56e-15 6.46e-12
#> 4 CellM… Human Peri… T cell T cell | Per… 19 5 1.58e-14 1.01e-11
#> 5 CellM… Human Kidn… T helpe… T helper cel… 5 4 4.06e-14 2.08e-11
#> 6 Pangl… Human Immu… T cells T cells | Im… 95 6 1.31e-13 5.12e-11
#> 7 Pangl… Mouse Immu… T cells T cells | Im… 93 6 1.40e-13 5.12e-11
#> 8 SaVanT "" "" CD3plus… CD3plus_T-ce… 50 5 2.87e-12 5.65e-10
#> 9 SaVanT Human "" HPCA_T_… HPCA_T_cells… 50 5 2.87e-12 5.65e-10
#> 10 SaVanT Mouse "" IMGN_T_… IMGN_T_4Nve_… 50 5 2.87e-12 5.65e-10
#> # … with 379 more rows
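To make the idea of overrepresentation analysis concrete, here is a minimal base R sketch that tests a single marker set with a hypergeometric test. The reference marker set and the gene universe size are hypothetical values chosen for illustration, and the hypergeometric test is one common choice for this kind of analysis; it is not claimed to match what clustermole_overlaps() computes internally.

```r
# Query genes, e.g. the markers of one cluster.
my_genes = c("CD2", "CD3D", "CD3E", "IL7R", "IL32", "LTB", "LDHB", "CCR7")

# Hypothetical reference marker set for a T cell population (illustration only).
ref_markers = c("CD2", "CD3D", "CD3E", "CD3G", "IL7R", "TRAC", "CCR7", "CD5")

# Assumed number of genes in the background universe (illustration only).
n_universe = 20000

# Number of query genes that are also reference markers.
overlap = length(intersect(my_genes, ref_markers))

# Probability of observing at least this many shared genes by chance.
p_value = phyper(
  q = overlap - 1,
  m = length(ref_markers),
  n = n_universe - length(ref_markers),
  k = length(my_genes),
  lower.tail = FALSE
)
```

When many marker sets are tested this way, the resulting p-values are usually adjusted for multiple testing, for example with `p.adjust(p_values, method = "fdr")`.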
clustermole_enrichment(): cell type enrichment in the full expression matrix
If you have a table of expression values, such as average expression across clusters, you can perform cell type enrichment based on a given gene expression matrix (log-transformed CPM/TPM/FPKM values). Genes are rows and clusters/samples are columns.
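As a rough illustration of the expected input shape, the sketch below builds a small log-transformed CPM matrix with genes as rows and clusters as columns. The counts, gene names, and cluster names are made up, and the final commented call assumes that the function takes `expr_mat` and `species` arguments; check the package documentation for the exact signature.

```r
# Hypothetical summed counts: genes in rows, clusters in columns.
counts = matrix(
  c( 10,   0,  5,
    200, 150,  0,
      0,  30, 60,
     90,  20, 35),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("CD3D", "LDHB", "MS4A1", "LYZ"),
    c("cluster_1", "cluster_2", "cluster_3")
  )
)

# Scale each column (cluster) to counts per million.
cpm = t(t(counts) / colSums(counts)) * 1e6

# Log-transform with a pseudocount of 1.
my_expr_mat = log2(cpm + 1)

# Assumed call shape (argument names are an assumption, not verified here):
# my_enrichment = clustermole_enrichment(expr_mat = my_expr_mat, species = "hs")
```

A real matrix would contain thousands of genes; a toy matrix like this one only shows the layout and is too small for a meaningful enrichment result.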
clustermole_markers(): retrieve cell type markers
You can use clustermole as a simple database and get a data frame of all cell type markers.
markers = clustermole_markers(species = "hs")
markers
#> # A tibble: 163,509 x 8
#> db species organ celltype celltype_full n_genes gene_original gene
#> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ACCSL ACCSL
#> 2 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ACVR1B ACVR…
#> 3 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ARHGEF16 ARHG…
#> 4 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ASF1B ASF1B
#> 5 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BCL2L10 BCL2…
#> 6 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BLCAP BLCAP
#> 7 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BNIP1 BNIP1
#> 8 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 C1orf210 C1or…
#> 9 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 C1orf226 C1or…
#> 10 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 CASC3 CASC3
#> # … with 163,499 more rows
Each row contains a gene and a cell type associated with it. The gene column is the gene symbol (human or mouse versions can be retrieved) and the celltype_full column contains the full cell type string, including the species and the original database.
If you need to convert the markers from a data frame to a list format for other applications, you can use gene as the values and celltype_full as the grouping variable.
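One way to do this conversion is with base R `split()`. The sketch below uses a small made-up data frame (`toy_markers`, with invented cell type labels) in place of the real output of clustermole_markers(), so it runs without the package.

```r
# Toy stand-in for the markers data frame (values are illustrative only).
toy_markers = data.frame(
  gene = c("CD3D", "CD3E", "MS4A1", "CD79A"),
  celltype_full = c("T cells (toy)", "T cells (toy)", "B cells (toy)", "B cells (toy)"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of genes per cell type.
markers_list = split(x = toy_markers$gene, f = toy_markers$celltype_full)
```

With the real data, the same call would be applied to `markers$gene` and `markers$celltype_full`.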
Database details
We will load dplyr to help with the summary statistics.
You can use clustermole_markers() to retrieve a data frame of all cell type markers in the collection.
markers = clustermole_markers(species = "hs")
markers
#> # A tibble: 163,509 x 8
#> db species organ celltype celltype_full n_genes gene_original gene
#> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ACCSL ACCSL
#> 2 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ACVR1B ACVR…
#> 3 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ARHGEF16 ARHG…
#> 4 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 ASF1B ASF1B
#> 5 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BCL2L10 BCL2…
#> 6 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BLCAP BLCAP
#> 7 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 BNIP1 BNIP1
#> 8 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 C1orf210 C1or…
#> 9 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 C1orf226 C1or…
#> 10 CellM… Human Embryo 1-cell s… 1-cell stage cel… 45 CASC3 CASC3
#> # … with 163,499 more rows
Check the number of available cell types.
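A minimal sketch of this count, assuming each distinct celltype_full value represents one cell type. It uses base R on a made-up data frame (`toy_markers`) so that it runs without the package; with the real data, the same expression would be applied to `markers`.

```r
# Toy stand-in for the markers data frame (values are illustrative only).
toy_markers = data.frame(
  gene = c("CD3D", "CD3E", "MS4A1", "CD79A", "LYZ"),
  celltype_full = c("T cells (toy)", "T cells (toy)", "B cells (toy)",
                    "B cells (toy)", "Monocytes (toy)"),
  stringsAsFactors = FALSE
)

# Count distinct cell types.
n_celltypes = length(unique(toy_markers$celltype_full))

# dplyr equivalent, if dplyr is loaded:
# toy_markers %>% distinct(celltype_full) %>% nrow()
```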
Check the number of available cell types per species (not available for every cell type).
markers %>% distinct(celltype_full, species) %>% count(species, sort = TRUE)
#> # A tibble: 3 x 2
#> species n
#> <chr> <int>
#> 1 Human 1618
#> 2 Mouse 730
#> 3 "" 215
Check the number of available cell types per organ (not available for every cell type).
markers %>% distinct(celltype_full, organ) %>% count(organ, sort = TRUE)
#> # A tibble: 117 x 2
#> organ n
#> <chr> <int>
#> 1 "" 1282
#> 2 Brain 127
#> 3 Central Nervous System 88
#> 4 Digestive System 63
#> 5 Kidney 56
#> 6 Lung 52
#> 7 Bone marrow 51
#> 8 Immune system 50
#> 9 Peripheral blood 46
#> 10 Hematopoietic system 44
#> # … with 107 more rows
Check package version.
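The installed version can be queried with `packageVersion()` from the utils package. The guard below is an addition so that the snippet also runs on a system where clustermole is not installed, in which case `pkg_version` is `NA`.

```r
# Report the installed clustermole version, or NA if the package is missing.
pkg_version = if (requireNamespace("clustermole", quietly = TRUE)) {
  packageVersion("clustermole")
} else {
  NA
}
print(pkg_version)
```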