gsEasy

Calculate p-values for enrichment of set

gsEasy has a function gset for calculating p-values of enrichment for sets (of genes) in ranked/scored lists (of genes) by permutation (see 'Gene Set Enrichment Analysis' described by Subramanian et al, 2005). The arguments of gset are named as in the paper:

N: the total number of genes,
S: integer vector giving the ranks of the genes in the test set amongst the N, or giving the indices within the scores vector r (see below) or a character vector of the names of genes in the test set,
r: (optional) vector of length N of correlation scores, e.g. gene expression correlation. If unspecified, it defaults to 1-(i-1)/N for the ith gene. If S is given as the names of genes, r must either be a character vector of genes in rank order or named by genes (necessarily containing all the genes in S).
p: a numeric value used to exponentiate the enrichment scores given by r, with higher values having the effect of increasing the weight on the highest scores/ranks (for more details, see Subramanian et al, 2005). The default value is 1 [i.e. not transformed].

Say we had a set of 5 genes which appeared at the top five ranks out of 1000 (i.e. highly enriched at the high ranks!). We could then calculate an enrichment p-value using the command:

gset(S=1:5, N=1000)

## [1] 9.9999e-06

So the p-value is close to zero. However for random sets, the p-values are distributed uniformly:

replicate(n=10, expr=gset(S=sample.int(n=1000, size=5), N=1000))

##  [1] 0.7213930 0.1111111 0.1058394 0.5870647 0.1142857 0.9850746 0.9701493
##  [8] 0.1840796 0.5074627 0.8905473

Alternatively, you can pass the names of genes as S with a sorted list of gene names as r (in which case the scores default to the ranks in the list), or a numeric vector of scores named by genes as r.

gset(S=c("gene 1", "gene 5", "gene 40"), r=paste("gene", 1:100))

## [1] 0.07792208

Multiple gene sets can thus be tested for enrichment with a single call to a high level function such as sapply (or, if you have many sets to test and multiple cores available, mclapply), for instance:

gene_sets <- c(list(1:5), replicate(n=10, simplify=FALSE, expr=sample.int(n=1000, size=5)))
names(gene_sets) <- c("enriched set", paste("unenriched set", 1:10))
gene_sets

## $`enriched set`
## [1] 1 2 3 4 5
## 
## $`unenriched set 1`
## [1]   6  53 974 143 936
## 
## $`unenriched set 2`
## [1] 438 237 252 458 293
## 
## $`unenriched set 3`
## [1] 624 680 778 230 727
## 
## $`unenriched set 4`
## [1] 997 254 714 307 513
## 
## $`unenriched set 5`
## [1] 104 786 263 857 632
## 
## $`unenriched set 6`
## [1] 149 970 961  73 265
## 
## $`unenriched set 7`
## [1] 435 138 982 387 722
## 
## $`unenriched set 8`
## [1] 667 441 506 186 913
## 
## $`unenriched set 9`
## [1] 791 911 638 468 274
## 
## $`unenriched set 10`
## [1] 758 626 817 698  45

sapply(gene_sets, function(set) gset(S=set, N=1000))

##      enriched set  unenriched set 1  unenriched set 2  unenriched set 3 
##      0.0000099999      0.0017199828      0.2388059701      0.9353233831 
##  unenriched set 4  unenriched set 5  unenriched set 6  unenriched set 7 
##      0.7562189055      0.5024875622      0.0247097529      0.4228855721 
##  unenriched set 8  unenriched set 9 unenriched set 10 
##      0.7462686567      0.9054726368      0.5124378109

Ontological annotations

gsEasy has a function get_ontological_gene_sets for creating lists of gene sets corresponding to annotation with ontological terms such that ontological is-a relations are propagated. get_ontological_gene_sets accepts an ontological_index (see the R package ontologyIndex for more details) argument and two character vectors, corresponding to genes and terms respectively, whereby the n-th element in each vector corresponds to one annotation pair. The result, a list of character vectors of gene names, can then be used as an argument of gset.

library(ontologyIndex)
data(hpo)
df <- data.frame(
    gene=c("gene 1", "gene 2"), 
    term=c("HP:0000598", "HP:0000118"), 
    name=hpo$name[c("HP:0000598", "HP:0000118")], 
    stringsAsFactors=FALSE,
    row.names=NULL)
df

##     gene       term                   name
## 1 gene 1 HP:0000598 Abnormality of the ear
## 2 gene 2 HP:0000118 Phenotypic abnormality

get_ontological_gene_sets(hpo, gene=df$gene, term=df$term)

## $`HP:0000001`
## [1] "gene 1" "gene 2"
## 
## $`HP:0000118`
## [1] "gene 1" "gene 2"
## 
## $`HP:0000598`
## [1] "gene 1"

Gene Ontology (GO) annotations

gsEasy comes with a list of GO annotations, GO_gene_sets [based on annotations downloaded from geneontology.org on 07/08/2016], which can be loaded with data. This comprises a list of all gene sets (i.e. character vectors of gene names) associated with each GO term, for GO terms being annotated with at most 500 genes.

data(GO_gene_sets)
GO_gene_sets[1:6]

## $`GO:0000002`
##  [1] "AKT3"     "LONP1"    "MEF2A"    "MGME1"    "MPV17"    "MRPL17"  
##  [7] "MRPL39"   "OPA1"     "PIF1"     "SLC25A33" "SLC25A36" "SLC25A4" 
## [13] "TYMP"    
## 
## $`GO:0000003`
##  [1] "EIF4H"  "IL12B"  "LEP"    "LEPR"   "MMP23A" "RHOXF1" "SEPP1"  "STAT3" 
##  [9] "TNP1"   "VGF"    "WDR43" 
## 
## $`GO:0000009`
## [1] "ALG12"
## 
## $`GO:0000010`
## [1] "PDSS1" "PDSS2"
## 
## $`GO:0000011`
## [1] "RBSN"
## 
## $`GO:0000012`
##  [1] "APLF"   "APTX"   "E9PQ18" "LIG4"   "M0R2N6" "Q6ZNB5" "SIRT1"  "TDP1"  
##  [9] "TNP1"   "XRCC1"

It also has a function get_GO_gene_sets which is a specialisation of get_ontological_gene_sets for the Gene Ontology (GO) which can be called passing just a file path to the annotation file (official up-to-date version available at http://geneontology.org/).

gsEasy

Daniel Greene

2021-05-26

Calculate p-values for enrichment of set

Ontological annotations

Gene Ontology (GO) annotations