gsEasy
has a function gset
for calculating p-values of enrichment for sets (of genes) in ranked/scored lists (of genes) by permutation (see ‘Gene Set Enrichment Analysis’ described by Subramanian et al, 2005). The arguments of gset
are named as in the paper:
N
: the total number of genes,S
: integer
vector giving the ranks of the genes in the test set amongst the N
, or giving the indices within the scores vector r
(see below) or a character
vector of the names of genes in the test set,r
: (optional) vector of length N
of correlation scores, e.g. gene expression correlation. If unspecified, it defaults to 1-(i-1)/N
for the i
th gene. If S
is given as the names of genes, r
must either be a character
vector of genes in rank order or named by genes
(necessarily containing all the genes in S
).p
: a numeric value used to exponentiate the enrichment scores given by r
, with higher values having the effect of increasing the weight on the highest scores/ranks (for more details, see Subramanian et al, 2005). The default value is 1
[i.e. not transformed].Say we had a set of 5 genes which appeared at the top five ranks out of 1000 (i.e. highly enriched at the high ranks!). We could then calculate an enrichment p-value using the command:
gset(S=1:5, N=1000)
## [1] 9.9999e-06
So the p-value is close to zero. However for random sets, the p-values are distributed uniformly:
replicate(n=10, expr=gset(S=sample.int(n=1000, size=5), N=1000))
## [1] 0.9303483 0.2039801 0.9004975 0.5472637 0.2736318 0.7014925 0.1791045
## [8] 0.9950249 0.6517413 0.9502488
Alternatively, you can pass the names of genes as S
with a sorted list of gene names as r
(in which case the scores default to the ranks in the list), or a numeric vector of scores named by genes as r
.
gset(S=c("gene 1", "gene 5", "gene 40"), r=paste("gene", 1:100))
## [1] 0.07587383
Multiple gene sets can thus be tested for enrichment with a single call to a high level function such as sapply
(or, if you have many sets to test and multiple cores available, mclapply
), for instance:
gene_sets <- c(list(1:5), replicate(n=10, simplify=FALSE, expr=sample.int(n=1000, size=5)))
names(gene_sets) <- c("enriched set", paste("unenriched set", 1:10))
gene_sets
## $`enriched set`
## [1] 1 2 3 4 5
##
## $`unenriched set 1`
## [1] 922 424 840 133 849
##
## $`unenriched set 2`
## [1] 110 297 737 702 284
##
## $`unenriched set 3`
## [1] 257 168 249 999 411
##
## $`unenriched set 4`
## [1] 23 663 500 876 31
##
## $`unenriched set 5`
## [1] 111 896 219 923 888
##
## $`unenriched set 6`
## [1] 19 143 590 107 588
##
## $`unenriched set 7`
## [1] 223 402 696 273 102
##
## $`unenriched set 8`
## [1] 187 242 200 458 531
##
## $`unenriched set 9`
## [1] 39 407 709 188 235
##
## $`unenriched set 10`
## [1] 93 238 820 675 368
sapply(gene_sets, function(set) gset(S=set, N=1000))
## enriched set unenriched set 1 unenriched set 2 unenriched set 3
## 0.0000099999 0.6567164179 0.3482587065 0.1194029851
## unenriched set 4 unenriched set 5 unenriched set 6 unenriched set 7
## 0.0778846154 0.0852359209 0.0914893617 0.2537313433
## unenriched set 8 unenriched set 9 unenriched set 10
## 0.4079601990 0.2985074627 0.4676616915
gsEasy
has a function get_ontological_gene_sets
for creating lists of gene sets corresponding to annotation with ontological terms such that ontological is-a relations are propagated. get_ontological_gene_sets
accepts an ontological_index
(see the R package ontologyIndex
for more details) argument and two character vectors, corresponding to genes and terms respectively, whereby the n-th element in each vector corresponds to one annotation pair. The result, a list of character vectors of gene names, can then be used as an argument of gset
.
library(ontologyIndex)
data(hpo)
df <- data.frame(
gene=c("gene 1", "gene 2"),
term=c("HP:0000598", "HP:0000118"),
name=hpo$name[c("HP:0000598", "HP:0000118")],
stringsAsFactors=FALSE,
row.names=NULL)
df
## gene term name
## 1 gene 1 HP:0000598 Abnormality of the ear
## 2 gene 2 HP:0000118 Phenotypic abnormality
get_ontological_gene_sets(hpo, gene=df$gene, term=df$term)
## $`HP:0000001`
## [1] "gene 1" "gene 2"
##
## $`HP:0000118`
## [1] "gene 1" "gene 2"
##
## $`HP:0000598`
## [1] "gene 1"
gsEasy
comes with a list
of GO annotations, GO_gene_sets
[based on annotations downloaded from geneontology.org on 07/08/2016], which can be loaded with data
. This comprises a list
of all gene sets (i.e. character
vectors of gene names) associated with each GO term, for GO terms being annotated with at most 500 genes.
data(GO_gene_sets)
GO_gene_sets[1:6]
## $`GO:0000002`
## [1] "AKT3" "LONP1" "MEF2A" "MGME1" "MPV17" "MRPL17"
## [7] "MRPL39" "OPA1" "PIF1" "SLC25A33" "SLC25A36" "SLC25A4"
## [13] "TYMP"
##
## $`GO:0000003`
## [1] "EIF4H" "IL12B" "LEP" "LEPR" "MMP23A" "RHOXF1" "SEPP1"
## [8] "STAT3" "TNP1" "VGF" "WDR43"
##
## $`GO:0000009`
## [1] "ALG12"
##
## $`GO:0000010`
## [1] "PDSS1" "PDSS2"
##
## $`GO:0000011`
## [1] "RBSN"
##
## $`GO:0000012`
## [1] "APLF" "APTX" "E9PQ18" "LIG4" "M0R2N6" "Q6ZNB5" "SIRT1"
## [8] "TDP1" "TNP1" "XRCC1"
It also has a function get_GO_gene_sets
which is a specialisation of get_ontological_gene_sets
for the Gene Ontology (GO) which can be called passing just a file path to the annotation file (official up-to-date version available at http://geneontology.org/).