Overview
gprofiler2 provides an R interface to the widely used web toolset g:Profiler (https://biit.cs.ut.ee/gprofiler) [1].
The toolset performs functional enrichment analysis and visualization of gene lists, converts gene/protein/SNP identifiers to numerous namespaces, and maps orthologous genes across species. g:Profiler relies on Ensembl databases as the primary data source and follows their release cycle for updates.
The main tools in g:Profiler are:
- g:GOSt - functional enrichment analysis of gene lists
- g:Convert - gene/protein/transcript identifier conversion across various namespaces
- g:Orth - orthology search across species
The input for any of the tools can consist of mixed types of gene identifiers, SNP rs-IDs, chromosomal intervals or term IDs. The gene IDs from chromosomal regions are retrieved automatically. The gene doesn’t need to fit the region fully. The format for chromosome regions is chr:region_start:region_end, e.g. X:1:2000000. In case of term IDs like GO:0007507 (heart development), g:Profiler uses all the genes annotated to that term as an input (in this case about six hundred human genes associated to heart development). Fully numeric identifiers need to be prefixed with the corresponding namespace. g:Profiler will automatically prefix all the detected numeric IDs using the prefix determined by the selected numeric namespace parameter.
- g:SNPense - mapping SNP rs-identifiers to chromosome positions, protein coding genes and variant effects
Corresponding functions in the gprofiler2 R package are:
gprofiler2 uses the publicly available APIs of the g:Profiler web tool which ensures that the results from all of the interfaces are consistent.
The package corresponds to the 2019 update of g:Profiler and provides access for versions e94_eg41_p11 and higher. The older versions are available from the previous R package gProfileR.
Gene list functional enrichment analysis with gost
Enrichment analysis
gost
enables to perform functional profiling of gene lists. The function performs statistical enrichment analysis to find over-representation of functions from Gene Ontology, biological pathways like KEGG and Reactome, human disease annotations, etc. This is done with the hypergeometric test followed by correction for multiple testing.
A standard input of the gost
function is a (named) list of gene identifiers. The list can consist of mixed types of identifiers (proteins, transcripts, microarray IDs, etc), SNP IDs, chromosomal intervals or functional term IDs.
The parameter organism
enables to define the corresponding source organism for the gene list. The organism names are usually constructed by concatenating the first letter of the name and the family name, e.g human - hsapiens. If some of the input gene identifiers are fully numeric, the parameter numeric_ns
enables to define the corresponding namespace. See section Supported organisms and identifier namespaces for links to supported organisms and namespaces.
If the input genes are decreasingly ordered based on some biological importance, then ordered_query = TRUE
will take this into account. For instance, the genes can be ordered according to differential expression or absolute expression values. In this case, incremental enrichment testing is performed with increasingly larger numbers of genes starting from the top of the list. Note that with this parameter, the query size might be different for every functional term.
The parameter significant = TRUE
is an indicator whether all or only statistically significant results should be returned.
In case of Gene Ontology (GO), the exclude_iea = TRUE
would exclude the electronic GO annotations from the data source before testing. These are the terms with the IEA evidence code indicating that these annotations are assigned to genes using in silico curation methods.
In order to measure under-representation instead of over-representation set measure_underrepresentation = TRUE
.
By default, the user_threshold = 0.05
which defines a custom p-value significance threshold for the results. Results with smaller p-value are tagged as significant. We don’t recommend to set it higher than 0.05.
In order to reduce the amount of false positives, a multiple testing correction method is applied to the enrichment p-values. By default, our tailor-made algorithm g:SCS is used (correction_method = "gSCS"
with synonyms g_SCS
and analytical
), but there are also options to apply the Bonferroni correction (correction_method = "bonferroni"
) or FDR (correction_method = "fdr"
). The adjusted p-values are reported in the results.
The parameter domain_scope
defines how the statistical domain size is calculated. This is one of the parameters in the hypergeometric probability function. If domain_scope = "annotated"
then only the genes with at least one annotation are considered to be part of the full domain. In case if domain_scope = "known"
then all the genes of the given organism are considered to be part of the domain.
Depending on the research question, in some occasions it is advisable to limit the domain/background set. For example, one may use the custom background when they want to compare a gene list with a custom list of expressed genes. gost
provides the means to define a custom background as a (mixed) list of gene identifiers with the parameter custom_bg
. If this parameter is used, then the domain scope is set to domain_scope = "custom"
. It is also possible to set this parameter to domain_scope = "custom_annotated"
which will use the set of genes that are annotated in the data source and are also included in the user provided background list.
The parameter sources
enables to choose the data sources of interest. By default, all the sources are analysed. The available data sources and their abbreviations are listed under section Data sources. For example, if sources = c("GO:MF", "REAC")
then only the results from molecular functions branch of Gene Ontology and the pathways from Reactome are returned. One can also upload their own annotation data which is further described in the section Custom data sources with upload_GMT_file
.
gostres <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = FALSE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", sources = NULL, as_short_link = FALSE)
The result is a named list
where “result” is a data.frame
with the enrichment analysis results and “meta” containing a named list
with all the metadata for the query.
head(gostres$result)
#> query significant p_value term_size query_size intersection_size
#> 1 query_1 TRUE 8.590516e-26 87 20 14
#> 2 query_1 TRUE 2.744982e-18 284 20 14
#> 3 query_1 TRUE 3.033447e-18 286 20 14
#> 4 query_1 TRUE 5.608704e-15 487 20 14
#> 5 query_1 TRUE 1.885775e-14 531 20 14
#> 6 query_1 TRUE 1.341072e-13 611 20 14
#> precision recall term_id source
#> 1 0.7 0.16091954 GO:0048013 GO:BP
#> 2 0.7 0.04929577 GO:0007411 GO:BP
#> 3 0.7 0.04895105 GO:0097485 GO:BP
#> 4 0.7 0.02874743 GO:0007409 GO:BP
#> 5 0.7 0.02636535 GO:0061564 GO:BP
#> 6 0.7 0.02291326 GO:0048667 GO:BP
#> term_name effective_domain_size
#> 1 ephrin receptor signaling pathway 17906
#> 2 axon guidance 17906
#> 3 neuron projection guidance 17906
#> 4 axonogenesis 17906
#> 5 axon development 17906
#> 6 cell morphogenesis involved in neuron differentiation 17906
#> source_order parents
#> 1 14784 GO:0007169
#> 2 3330 GO:0007409, GO:0097485
#> 3 22319 GO:0006928, GO:0006935, GO:0048812
#> 4 3329 GO:0048667, GO:0048812, GO:0061564
#> 5 18618 GO:0031175
#> 6 15267 GO:0000904, GO:0048666
The result data.frame
contains the following columns:
- query - the name of the input query which by default is the order of query with the prefix “query_”. This can be changed by using a named
list
input. - significant - indicator for statistically significant results
- p_value - hypergeometric p-value after correction for multiple testing
- term_size - number of genes that are annotated to the term
- query_size - number of genes that were included in the query. This might be different from the size of the original list if:
- any genes were mapped to multiple Ensembl gene IDs
- any genes failed to be mapped to Ensembl gene IDs
- the parameter
ordered_query = TRUE
and the optimal cutoff for the term was found before the end of the query - the
domain_scope
was set to “annotated” or “custom”
- intersection_size - the number of genes in the input query that are annotated to the corresponding term
- precision - the proportion of genes in the input list that are annotated to the function (defined as intersection_size/query_size)
- recall - the proportion of functionally annotated genes that the query recovers (defined as intersection_size/term_size)
- term_id - unique term identifier (e.g GO:0005005)
- source - the abbreviation of the data source for the term (e.g. GO:BP)
- term_name - the short name of the function
- effective_domain_size - the total number of genes “in the universe” used for the hypergeometric test
- source_order - numeric order for the term within its data source (this is important for drawing the results)
- parents - list of term IDs that are hierarchically directly above the term. For non-hierarchical data sources this points to an artificial root node.
names(gostres$meta)
#> [1] "query_metadata" "result_metadata" "genes_metadata" "timestamp"
#> [5] "version"
The query parameters are listed in the “query_metadata” part of the metadata object. The “result_metadata” includes the statistics of data sources that are used in the enrichment testing. This includes the “domain_size” showing the number of genes annotated to this domain. The “number_of_terms” indicating the number of terms g:Profiler has in the database for this source and the nominal significance “threshold” for this source. The “genes_metadata” shows the specifics of the query genes (failed, ambiguous or duplicate inputs) and their mappings to the ENSG namespace. In addition, the query time and the used g:Profiler data version are shown in the metadata.
The parameter evcodes = TRUE
includes the evidence codes to the results. In addition, a column “intersection” will appear to the results showing the input gene IDs that intersect with the corresponding functional term. Note that his parameter can decrease the performance and make the query slower.
gostres2 <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", sources = NULL)
head(gostres2$result)
#> query significant p_value term_size query_size intersection_size
#> 1 query_1 TRUE 8.590516e-26 87 20 14
#> 2 query_1 TRUE 2.744982e-18 284 20 14
#> 3 query_1 TRUE 3.033447e-18 286 20 14
#> 4 query_1 TRUE 5.608704e-15 487 20 14
#> 5 query_1 TRUE 1.885775e-14 531 20 14
#> 6 query_1 TRUE 1.341072e-13 611 20 14
#> precision recall term_id source
#> 1 0.7 0.16091954 GO:0048013 GO:BP
#> 2 0.7 0.04929577 GO:0007411 GO:BP
#> 3 0.7 0.04895105 GO:0097485 GO:BP
#> 4 0.7 0.02874743 GO:0007409 GO:BP
#> 5 0.7 0.02636535 GO:0061564 GO:BP
#> 6 0.7 0.02291326 GO:0048667 GO:BP
#> term_name effective_domain_size
#> 1 ephrin receptor signaling pathway 17906
#> 2 axon guidance 17906
#> 3 neuron projection guidance 17906
#> 4 axonogenesis 17906
#> 5 axon development 17906
#> 6 cell morphogenesis involved in neuron differentiation 17906
#> source_order parents
#> 1 14784 GO:0007169
#> 2 3330 GO:0007409, GO:0097485
#> 3 22319 GO:0006928, GO:0006935, GO:0048812
#> 4 3329 GO:0048667, GO:0048812, GO:0061564
#> 5 18618 GO:0031175
#> 6 15267 GO:0000904, GO:0048666
#> evidence_codes
#> 1 IDA TAS IEA,ISS TAS IEA,TAS IEA,IDA IBA TAS,IGI TAS IEA,ISS TAS IEA,IDA TAS IEA,IDA TAS IEA,IGI ISS IBA TAS IEA,ISS TAS IEA,TAS,IDA TAS IEA,TAS IEA,IBA TAS IEA
#> 2 IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IEA,IBA,IBA
#> 3 IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IEA,IBA,IBA
#> 4 IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IEA,IBA,IBA
#> 5 ISS IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IEA,IBA,IBA
#> 6 IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IEA,IBA,IBA
#> intersection
#> 1 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 2 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 3 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 4 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 5 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 6 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
The result data.frame
will include additional columns:
- evidence_codes - a lists of all evidence codes for the intersecting genes between input and the term. The evidences are separated by comma for each gene.
- intersection - a comma separated list of genes from the query that are annotated to the corresponding term
The query results can also be gathered into a short-link to the g:Profiler web tool. For that, set the parameter as_short_link = TRUE
. In this case, the function gost()
returns only the web tool link to the results as a character string. For example, this is useful when you discover an interesting result you want to instantly share with your colleagues. Then you can just programmatically generate the short-link and copy it to your colleagues.
gostres_link <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
as_short_link = TRUE)
This query returns a short-link of form https://biit.cs.ut.ee/gplink/l/HfapQyB5TJ.
Multiple queries
The function gost
also allows to perform enrichment on multiple input gene lists. Multiple queries are automatically detected if the input query
is a list
of vectors with gene identifiers and the results are combined into identical data.frame
as in case of single query.
multi_gostres1 <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
"GO:0005005", "ENSG00000156103", "NLRP1"),
"chromY" = c("Y:1:10000000", "rs17396340",
"GO:0005005", "ENSG00000156103", "NLRP1")),
multi_query = FALSE)
head(multi_gostres1$result)
#> query significant p_value term_size query_size intersection_size
#> 1 chromX TRUE 8.590516e-26 87 20 14
#> 2 chromX TRUE 2.744982e-18 284 20 14
#> 3 chromX TRUE 3.033447e-18 286 20 14
#> 4 chromX TRUE 5.608704e-15 487 20 14
#> 5 chromX TRUE 1.885775e-14 531 20 14
#> 6 chromX TRUE 1.341072e-13 611 20 14
#> precision recall term_id source
#> 1 0.7 0.16091954 GO:0048013 GO:BP
#> 2 0.7 0.04929577 GO:0007411 GO:BP
#> 3 0.7 0.04895105 GO:0097485 GO:BP
#> 4 0.7 0.02874743 GO:0007409 GO:BP
#> 5 0.7 0.02636535 GO:0061564 GO:BP
#> 6 0.7 0.02291326 GO:0048667 GO:BP
#> term_name effective_domain_size
#> 1 ephrin receptor signaling pathway 17906
#> 2 axon guidance 17906
#> 3 neuron projection guidance 17906
#> 4 axonogenesis 17906
#> 5 axon development 17906
#> 6 cell morphogenesis involved in neuron differentiation 17906
#> source_order parents
#> 1 14784 GO:0007169
#> 2 3330 GO:0007409, GO:0097485
#> 3 22319 GO:0006928, GO:0006935, GO:0048812
#> 4 3329 GO:0048667, GO:0048812, GO:0061564
#> 5 18618 GO:0031175
#> 6 15267 GO:0000904, GO:0048666
The column “query” in the result data.frame
will now contain the corresponding name for the query. If no name is specified, then the query name is defined as the order of query with the prefix “query_”.
Another option for multiple gene lists is setting the parameter multiquery = TRUE
. Then the results from all of the input queries are grouped according to term IDs for better comparison.
multi_gostres2 <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
"GO:0005005", "ENSG00000156103", "NLRP1"),
"chromY" = c("Y:1:10000000", "rs17396340",
"GO:0005005", "ENSG00000156103", "NLRP1")),
multi_query = TRUE)
head(multi_gostres2$result)
#> term_id p_values significant term_size
#> 1 GO:0005005 8.109480e-42, 6.691352e-40 TRUE, TRUE 14
#> 2 GO:0005003 9.412707e-38, 7.756674e-36 TRUE, TRUE 19
#> 3 REAC:R-HSA-3928665 2.234758e-28, 9.074178e-28 TRUE, TRUE 49
#> 4 GO:0004714 3.811384e-28, 3.104555e-26 TRUE, TRUE 64
#> 5 GO:0019199 1.739726e-26, 1.410512e-24 TRUE, TRUE 82
#> 6 GO:0048013 8.590516e-26, 1.069374e-21 TRUE, TRUE 87
#> query_sizes intersection_sizes source
#> 1 21, 26 14, 14 GO:MF
#> 2 21, 26 14, 14 GO:MF
#> 3 18, 19 14, 14 REAC
#> 4 21, 26 14, 14 GO:MF
#> 5 21, 26 14, 14 GO:MF
#> 6 20, 32 14, 14 GO:BP
#> term_name effective_domain_size
#> 1 transmembrane-ephrin receptor activity 18126
#> 2 ephrin receptor activity 18126
#> 3 EPH-ephrin mediated repulsion of cells 10594
#> 4 transmembrane receptor protein tyrosine kinase activity 18126
#> 5 transmembrane receptor protein kinase activity 18126
#> 6 ephrin receptor signaling pathway 17906
#> source_order parents
#> 1 1558 GO:0005003
#> 2 1556 GO:0004714
#> 3 697 REAC:R-HSA-2682334
#> 4 1305 GO:0004713, GO:0019199
#> 5 4569 GO:0004672, GO:0004888
#> 6 14784 GO:0007169
The result data.frame
contains the following columns:
- term_id - unique term identifier (e.g GO:0005005)
- p_values - hypergeometric p-values after correction for multiple testing in the order of input queries
- significant - indicators in the order of input queries for statistically significant results
- term_size - number of genes that are annotated to the term
- query_sizes - number of genes that were included in the query in the order of input queries
- intersection_sizes - the number of genes in the input query that are annotated to the corresponding term in the order of input queries
- source - the abbreviation of the data source for the term (e.g. GO:BP)
- term_name - the short name of the function
- effective_domain_size - the total number of genes “in the universe” used for the hypergeometric test
- source_order - numeric order for the term within its data source (this is important for drawing the results)
- source_order - numeric order for the term within its data source (this is important for drawing the results)
- parents - list of term IDs that are hierarchically directly above the term. For non-hierarchical data sources this points to an artificial root node.
Visualization
A major update in this package is providing the functionality to produce similar visualizations as are now available from the web tool.
The enrichment results are visualized with a Manhattan-like-plot using the function gostplot
and the previously found gost
results gostres
:
The x-axis represents the functional terms that are grouped and color-coded according to data sources and positioned according to the fixed “source_order”. The order is defined in a way that terms that are closer to each other in the source hierarchy are also next to each other in the Manhattan plot. The source colors are adjustable with the parameter pal
that defines the color map with a named list
.
The y-axis shows the adjusted p-values in negative log10 scale. Every circle is one term and is sized according to the term size, i.e larger terms have larger circles. If interactive = TRUE
, then an interactive plot is returned using the plotly
package. Hovering over the circle will show the corresponding information.
The parameter capped = TRUE
is an indicator whether the -log10(p-values) would be capped at 16 if bigger than 16. This fixes the scale of y-axis to keep Manhattan plots from different queries comparable and is also intuitive as, statistically, p-values smaller than that can all be summarised as highly significant.
If interactive = FALSE
, then the function returns a static ggplot
object.
The function publish_gostplot
takes the static plot object as an input and enables to highlight a selection of interesting terms from the results with numbers and table of results. These can be set with parameter highlight_terms
listing the term IDs in a vector
or as a data.frame
with column “term_id” such as a subset of the result data.frame
.
pp <- publish_gostplot(p, highlight_terms = c("GO:0048013", "REAC:R-HSA-3928663"),
width = NA, height = NA, filename = NULL )
The function also allows to save the result into an image file into PNG, PDF, JPEG, TIFF or BMP with parameter filename
. The plot width and height can be adjusted with corresponding parameters width
and height
.
If additional graphical parameters like increased resolution are needed, then the plot objects can easily be saved using the function ggsave
from the package ggplot2
.
The gost
results can also be visualized with a table. The publish_gosttable
will create a nice-looking table with the result statistics for the highlight_terms
from the result data.frame
. The highlight_terms
can be a vector of term IDs or a subset of the results.
publish_gosttable(gostres, highlight_terms = gostres$result[c(1:2,10,100:102,120,124,125),],
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size", "intersection_size"),
filename = NULL)
#> The input 'highlight_terms' is a data.frame. The column 'term_id' will be used.
The parameter use_colors = FALSE
indicates that the p-values column should not be highlighted with background colors. The show_columns
is used to list the names of additional columns to show in the table in addition to the “term_id” and “p_value”.
The same functions work also in case of multiquery results showing multiple Manhattan plots on top of each other:
Note that if a term is clicked on one of the Manhattan plots, it is also highlighted in the others (if it is present) enabling to compare the multiple queries. The insignificant terms are shown with lighter color.
Data sources
Available data sources and their abbreviations are:
- Gene Ontology (GO or by branch GO:MF, GO:BP, GO:CC)
- KEGG (KEGG)
- Reactome (REAC)
- WikiPathways (WP)
- TRANSFAC (TF)
- miRTarBase (MIRNA)
- Human Protein Atlas (HPA)
- CORUM (CORUM)
- Human phenotype ontology (HP)
Custom data sources with upload_GMT_file
In addition to the available GO, KEGG, etc data sources, users can upload their own custom data source using the Gene Matrix Transposed file format (GMT). The file format is described in here. The users can compose the files themselves or use pre-compiled gene sets from available dedicated websites like Molecular Signatures Database (MSigDB), etc. The GMT files for g:Profiler default sources (except for KEGG and Transfac as we are restricted by data source licenses) are downloadabale from the Data sources section in g:Profiler.
upload_GMT_file
enables to upload GMT file(s). The input gmtfile
is the filename of the GMT file together with the path to the file. The input can also be several GMT files compressed into a ZIP file. The file extension should be .gmt or .zip in case of multiple GMT files. The uploaded filename is used to define the source name in the enrichment results.
For example, using the BioCarta gene sets downloaded from the [MSigDB Collections] (http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C7)
download.file(url = "http://software.broadinstitute.org/gsea/resources/msigdb/7.0/c2.cp.biocarta.v7.0.symbols.gmt", destfile = "extdata/biocarta.gmt")
The result is a string that denotes the unique ID of the uploaded data source in the g:Profiler database. In this examaple, the ID is gp__TEXF_hZLM_d18.
After the upload, this ID can be used as a value for the parameter organism
in the gost
function. The input query
should consist of identifiers that are available in the GMT file. Note that all the genes in the GMT file define the domain size and therefore it is not sufficient to include only the selection of interesting terms to the file.
custom_gostres <- gost(query = c("MAPK3", "PIK3C2G", "HRAS", "PIK3R1", "MAP2K1",
"RAF1", "PLCG1", "GNAQ", "MAPK1", "PRKCB", "CRK", "BCAR1", "NFKB1"),
organism = "gp__TEXF_hZLM_d18")
#> Detected custom GMT source request
head(custom_gostres$result)
#> query significant p_value term_size query_size intersection_size
#> 1 query_1 TRUE 5.324995e-26 19 13 13
#> 2 query_1 TRUE 1.690996e-12 27 13 9
#> 3 query_1 TRUE 8.094988e-12 19 13 8
#> 4 query_1 TRUE 2.320289e-10 27 13 8
#> 5 query_1 TRUE 3.003286e-10 16 13 7
#> 6 query_1 TRUE 6.197109e-09 39 13 8
#> precision recall term_id source
#> 1 1.0000000 0.6842105 BIOCARTA_CXCR4_PATHWAY biocarta
#> 2 0.6923077 0.3333333 BIOCARTA_PYK2_PATHWAY biocarta
#> 3 0.6153846 0.4210526 BIOCARTA_CCR3_PATHWAY biocarta
#> 4 0.6153846 0.2962963 BIOCARTA_GH_PATHWAY biocarta
#> 5 0.5384615 0.4375000 BIOCARTA_CDMAC_PATHWAY biocarta
#> 6 0.6153846 0.2051282 BIOCARTA_FCER1_PATHWAY biocarta
#> term_name
#> 1 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_CXCR4_PATHWAY
#> 2 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_PYK2_PATHWAY
#> 3 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_CCR3_PATHWAY
#> 4 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_GH_PATHWAY
#> 5 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_CDMAC_PATHWAY
#> 6 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_FCER1_PATHWAY
#> effective_domain_size source_order parents
#> 1 1476 47 NULL
#> 2 1476 109 NULL
#> 3 1476 31 NULL
#> 4 1476 78 NULL
#> 5 1476 27 NULL
#> 6 1476 71 NULL
There is no need to repeatedly upload the same GMT file(s) every time before the enrichment analysis. This can only be uploaded once and then the ID can be used in any further enrichment analyses that are based on that custom source. The same ID can also be used in the web tool as a token under the Custom GMT options. For example, the same query in the web tool is available from https://biit.cs.ut.ee/gplink/l/jh3HdbUWQZ.
Creating a Generic Enrichment Map (GEM) file for EnrichmentMap
Generic Enrichment Map (GEM) is a file format that can be used as an input for Cytoscape EnrichmentMap application. In EnrichmentMap you can set the Analysis Type parameter as Generic/gProfiler and upload the required files: GEM file with enrichment results (input field Enrichments) and GMT file that defines the annotations (input field GMT).
For a single query, the GEM file can be generated and saved using the following commands:
gostres <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
evcodes = TRUE, multi_query = FALSE,
sources = c("GO", "REAC", "MIRNA", "CORUM", "HP", "HPA", "WP"))
gem <- gostres$result[,c("term_id", "term_name", "p_value", "intersection")]
colnames(gem) <- c("GO.ID", "Description", "p.Val", "Genes")
gem$FDR <- gem$p.Val
gem$Phenotype = "+1"
gem <- gem[,c("GO.ID", "Description", "p.Val", "FDR", "Phenotype", "Genes")]
head(gem)
#> GO.ID Description p.Val
#> 1 GO:0048013 ephrin receptor signaling pathway 8.590516e-26
#> 2 GO:0007411 axon guidance 2.744982e-18
#> 3 GO:0097485 neuron projection guidance 3.033447e-18
#> 4 GO:0007409 axonogenesis 5.608704e-15
#> 5 GO:0061564 axon development 1.885775e-14
#> 6 GO:0048667 cell morphogenesis involved in neuron differentiation 1.341072e-13
#> FDR Phenotype
#> 1 8.590516e-26 +1
#> 2 2.744982e-18 +1
#> 3 3.033447e-18 +1
#> 4 5.608704e-15 +1
#> 5 1.885775e-14 +1
#> 6 1.341072e-13 +1
#> Genes
#> 1 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 2 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 3 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 4 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 5 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
#> 6 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000183317,ENSG00000243364
Here you can replace the query
parameter with your own input. The parameter evcodes = TRUE
is necessary as it returns the column intersection with corresponding gene IDs that are annotated to the term.
Saving the file before uploading to Cytoscape:
Here the parameter file
should be the character string naming the file together with the path you want to save it to.
In addition to the GEM file, EnrichmentMap requires also the data source description GMT file as an input. For example, if you are using g:Profiler default data sources and your input query consists of human ENSG identifiers, then the required GMT file is available from https://biit.cs.ut.ee/gprofiler/static/gprofiler_full_hsapiens.ENSG.gmt. Note that this file does not include annotations from KEGG and Transfac as we are restricted by data source licenses that do not allow us to share these two data sources with our users. This means that the enrichment results in the GEM file cannot include results from these resources, otherwise you will get an error from the Cytoscape application. This can be assured by setting appropriate values to the sources
parameter in the gost()
function.
For other organisms, the GMT files are downloadable from the g:Profiler web page under the Data sources section, after setting a suitable value for the organism. If you are using a custom GMT file for you analysis, then this should be uploaded to EnrichmentMap.
In case you want to compare multiple queries in EnrichmentMap you could generate individual GEM files for each of the queries and upload these as separate Data sets. This EnrichmentMap option enables you to browse, edit and compare multiple networks simultaneously by color-coding different uploaded Data sets.
For example, these files can be generated with the following commands (note that the parameter is still set to multi_query = FALSE
):
# enrichment for two input gene lists
multi_gostres <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
"GO:0005005", "ENSG00000156103", "NLRP1"),
"chromY" = c("Y:1:10000000", "rs17396340",
"GO:0005005", "ENSG00000156103", "NLRP1")),
evcodes = TRUE, multi_query = FALSE,
sources = c("GO", "REAC", "MIRNA", "CORUM", "HP", "HPA", "WP"))
# format to GEM
gem <- multi_gostres$result[,c("query", "term_id", "term_name", "p_value", "intersection")]
colnames(gem) <- c("query", "GO.ID", "Description", "p.Val", "Genes")
gem$FDR <- gem$p.Val
gem$Phenotype = "+1"
# write separate files for queries
# install.packages("dplyr")
library(dplyr)
gem %>% group_by(query) %>%
group_walk(~
write.table(data.frame(.x[,c("GO.ID", "Description", "p.Val", "FDR", "Phenotype", "Genes")]),
file = paste0("gProfiler_", unique(.y$query), "_gem.txt"),
sep = "\t", quote = F, row.names = F))
Gene identifier conversion with gconvert
gconvert
enables to map between genes, proteins, microarray probes, common names, various database identifiers, etc, from numerous databases and for many species.
gconvert(query = c("REAC:R-HSA-3928664", "rs17396340", "NLRP1"), organism = "hsapiens",
target="ENSG", mthreshold = Inf, filter_na = TRUE)
#> input_number input target_number target name
#> 1 1 REAC:R-HSA-3928664 1.1 ENSG00000010810 FYN
#> 2 1 REAC:R-HSA-3928664 1.10 ENSG00000133216 EPHB2
#> 3 1 REAC:R-HSA-3928664 1.11 ENSG00000136238 RAC1
#> 4 1 REAC:R-HSA-3928664 1.12 ENSG00000137575 SDCBP
#> 5 1 REAC:R-HSA-3928664 1.13 ENSG00000149269 PAK1
#> 6 1 REAC:R-HSA-3928664 1.14 ENSG00000154928 EPHB1
#> 7 1 REAC:R-HSA-3928664 1.15 ENSG00000180370 PAK2
#> 8 1 REAC:R-HSA-3928664 1.16 ENSG00000182580 EPHB3
#> 9 1 REAC:R-HSA-3928664 1.17 ENSG00000196411 EPHB4
#> 10 1 REAC:R-HSA-3928664 1.2 ENSG00000071051 NCK2
#> 11 1 REAC:R-HSA-3928664 1.3 ENSG00000077264 PAK3
#> 12 1 REAC:R-HSA-3928664 1.4 ENSG00000090776 EFNB1
#> 13 1 REAC:R-HSA-3928664 1.5 ENSG00000101608 MYL12A
#> 14 1 REAC:R-HSA-3928664 1.6 ENSG00000102606 ARHGEF7
#> 15 1 REAC:R-HSA-3928664 1.7 ENSG00000108262 GIT1
#> 16 1 REAC:R-HSA-3928664 1.8 ENSG00000108947 EFNB3
#> 17 1 REAC:R-HSA-3928664 1.9 ENSG00000125266 EFNB2
#> 18 2 rs17396340 2.1 ENSG00000054523 KIF1B
#> 19 3 NLRP1 3.1 ENSG00000091592 NLRP1
#> description
#> 1 FYN proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:4037]
#> 2 EPH receptor B2 [Source:HGNC Symbol;Acc:HGNC:3393]
#> 3 Rac family small GTPase 1 [Source:HGNC Symbol;Acc:HGNC:9801]
#> 4 syndecan binding protein [Source:HGNC Symbol;Acc:HGNC:10662]
#> 5 p21 (RAC1) activated kinase 1 [Source:HGNC Symbol;Acc:HGNC:8590]
#> 6 EPH receptor B1 [Source:HGNC Symbol;Acc:HGNC:3392]
#> 7 p21 (RAC1) activated kinase 2 [Source:HGNC Symbol;Acc:HGNC:8591]
#> 8 EPH receptor B3 [Source:HGNC Symbol;Acc:HGNC:3394]
#> 9 EPH receptor B4 [Source:HGNC Symbol;Acc:HGNC:3395]
#> 10 NCK adaptor protein 2 [Source:HGNC Symbol;Acc:HGNC:7665]
#> 11 p21 (RAC1) activated kinase 3 [Source:HGNC Symbol;Acc:HGNC:8592]
#> 12 ephrin B1 [Source:HGNC Symbol;Acc:HGNC:3226]
#> 13 myosin light chain 12A [Source:HGNC Symbol;Acc:HGNC:16701]
#> 14 Rho guanine nucleotide exchange factor 7 [Source:HGNC Symbol;Acc:HGNC:15607]
#> 15 GIT ArfGAP 1 [Source:HGNC Symbol;Acc:HGNC:4272]
#> 16 ephrin B3 [Source:HGNC Symbol;Acc:HGNC:3228]
#> 17 ephrin B2 [Source:HGNC Symbol;Acc:HGNC:3227]
#> 18 kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
#> 19 NLR family pyrin domain containing 1 [Source:HGNC Symbol;Acc:HGNC:14374]
#> namespace
#> 1 REAC
#> 2 REAC
#> 3 REAC
#> 4 REAC
#> 5 REAC
#> 6 REAC
#> 7 REAC
#> 8 REAC
#> 9 REAC
#> 10 REAC
#> 11 REAC
#> 12 REAC
#> 13 REAC
#> 14 REAC
#> 15 REAC
#> 16 REAC
#> 17 REAC
#> 18
#> 19 ENTREZGENE,HGNC,UNIPROT_GN,WIKIGENE
Default target = ENSG
database is Ensembl ENSG, but gconvert
also supports other major naming conventions like Uniprot, RefSeq, Entrez, HUGO, HGNC and many more. In addition, a large variety of microarray platforms like Affymetrix, Illumina and Celera are available.
The parameter mthreshold
sets the maximum number of results per initial alias. Shows all results by default. The parameter filter_na = TRUE
will exclude the results without any corresponding targets.
The result is a data.frame
with columns:
- input_number - the order of the input in the input list
- input - input identifier
- target_number - the number that depicts the input number and the order of target for that input. One input can have several targets. For examaple, 1.2 stands for the second target of first input.
- target - corresponding identifier from the selected
target
namespace - name - name of the target gene
- description - description of the target gene
- namespace - namespaces where the initial input is available
SNP identifier conversion to gene name with gsnpense
gsnpense
converts a list of SNP rs-codes (e.g. rs11734132) to chromosomal coordinates, gene names and predicted variant effects. Mapping is only available for variants that overlap with at least one protein coding Ensembl gene.
gsnpense(query = c("rs11734132", "rs4305276", "rs17396340", "rs3184504"),
filter_na = TRUE)
#> rs_id chromosome start end strand ensgs gene_names
#> 1 rs4305276 2 240555596 240555596 + ENSG00000144504 ANKMY1
#> 2 rs17396340 1 10226118 10226118 + ENSG00000054523 KIF1B
#> 3 rs3184504 12 111446804 111446804 + ENSG00000111252 SH2B3
#> 4 rs3184504 12 111446804 111446804 + ENSG00000204842 ATXN2
#> variants.NMD_transcript_variant variants.intron_variant
#> 1 0 57
#> 2 0 8
#> 3 3 3
#> 4 3 3
#> variants.missense_variant variants.non_coding_transcript_variant
#> 1 0 18
#> 2 0 0
#> 3 12 0
#> 4 12 0
The parameter filter_na = TRUE
will exclude the results without any corresponding target genes.
gsnpense(query = c("rs11734132", "rs4305276", "rs17396340", "rs3184504"),
filter_na = FALSE)
#> rs_id chromosome start end strand ensgs gene_names
#> 1 rs11734132 <NA> NA NA <NA> <NA> <NA>
#> 2 rs4305276 2 240555596 240555596 + ENSG00000144504 ANKMY1
#> 3 rs17396340 1 10226118 10226118 + ENSG00000054523 KIF1B
#> 4 rs3184504 12 111446804 111446804 + ENSG00000111252 SH2B3
#> 5 rs3184504 12 111446804 111446804 + ENSG00000204842 ATXN2
#> variants.NMD_transcript_variant variants.intron_variant
#> 1 NA NA
#> 2 0 57
#> 3 0 8
#> 4 3 3
#> 5 3 3
#> variants.missense_variant variants.non_coding_transcript_variant
#> 1 NA NA
#> 2 0 18
#> 3 0 0
#> 4 12 0
#> 5 12 0
The result is a data.frame
with columns:
- rs_id - input SNP rs-code
- chromosome - SNP chromosome
- start - SNP start position
- end - SNP end position
- strand - SNP strand (+/-)
- ensgs - corresponding Ensembl gene IDs; can contain a
list
with multiple values - gene_names - corresponding gene names; can contain a
list
with multiple values - variants - a
data.frame
with corresponding variant effects
Accessing archived versions or the beta release with set_base_url
You can change the underlying tool version to beta with:
You can check the current version with:
Similarly, for the archived versions:
Note that gprofiler2 package is only compatible with versions e94_eg41_p11 and higher.
Supported organisms and identifier namespaces
gprofiler2 package supports all the same organisms, namespaces and data sources as the web tool. The list of organisms and corresponding data sources is available here.
The full list of namespaces that g:Profiler recognizes is available here.
Citation
If you use the R package gprofiler2 in published research, please cite:
Uku Raudvere, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, Jaak Vilo: g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Research 2019; doi:10.1093/nar/gkz369
Need help?
If you have questions or issues, please write to biit.support@ut.ee
References
1. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, et al. G: Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic acids research. 2019;47:W191–8.