Introduction to ontologySimilarity

Daniel Greene

2021-02-10

ontologySimilarity is part of the 'ontologyX' family of packages (see the 'Introduction to ontologyX' vignette supplied with the ontologyIndex package). It contains functions for calculating semantic similarity between ontological objects. The functions operate on several kinds of object, and each kind tends to be passed under the same parameter name throughout the package, so it is useful to look out for particular parameter names. To make full use of ontologySimilarity, users are encouraged to become familiar with the functions in ontologyIndex.

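Various kinds of similarity can be calculated, including pairwise similarity between sets of terms and the group similarity of a collection of term sets.

Some key functions, each used in the example below, are get_sim_grid (pairwise similarities between term sets, returned as a matrix), get_sim (group similarity) and get_sim_p (a p-value for the significance of a group similarity).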

Example

To use the package, first load ontologyIndex and an ontology_index object. Here we demonstrate using the Human Phenotype Ontology, hpo.

library(ontologyIndex)
library(ontologySimilarity)
data(hpo)
set.seed(1)

Next, we'll set the information content for the terms. This is typically based on some kind of 'population frequency', for example the frequency with which the term is used, explicitly or implicitly, to annotate objects in a database. Such frequency information is not always available, but the information content can still usefully be defined from the frequency with which the term is an ancestor of other terms in the ontology, as this still captures the structure of the ontology.

information_content <- descendants_IC(hpo)
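To make the idea of a structure-based information content concrete, here is a minimal base R sketch on a hypothetical five-term ontology. It assumes that information content is the negative log of the proportion of terms for which a given term is an ancestor, and that each term counts as its own ancestor. The names `ancestors` and `toy_ic` are illustrative only, and this is not necessarily how descendants_IC is implemented.

```r
# Toy ontology: each term maps to its ancestors, including itself
# (hypothetical data, laid out like a named list of ancestor vectors).
ancestors <- list(
  root = "root",
  A    = c("root", "A"),
  B    = c("root", "B"),
  A1   = c("root", "A", "A1"),
  A2   = c("root", "A", "A2")
)

# Count how often each term occurs as an ancestor across all terms
counts <- table(factor(unlist(ancestors), levels = names(ancestors)))

# Convert counts to frequencies, then to information content (assumed -log)
freq   <- as.numeric(counts) / length(ancestors)
toy_ic <- setNames(-log(freq), names(ancestors))

toy_ic
```

The root, being an ancestor of every term, gets an information content of zero, while leaf terms get the highest values.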

Now we'll generate some random sets of terms. We'll sample 5 random term sets of 8 terms each (which could, for example, represent the phenotypes of patients). Note that we call the minimal_set function from the ontologyIndex package on each sampled set to remove redundant terms (those that are ancestors of other terms in the same set). Ontological annotations are typically stored as such minimal sets. If you are unsure, however, it is best to call minimal_set on each term set to guarantee that the similarity expressions are faithfully evaluated (for speed, the package does not map to minimal sets by default).

term_sets <- replicate(simplify=FALSE, n=5, expr=minimal_set(hpo, sample(hpo$id, size=8)))
term_sets
## [[1]]
## [1] "HP:0001315" "HP:0011343" "HP:0007164" "HP:0030810" "HP:0030133"
## [6] "HP:0040082" "HP:0011802" "HP:0005875"
## 
## [[2]]
## [1] "HP:0100828" "HP:0012133" "HP:0001730" "HP:0011863" "HP:0002385"
## [6] "HP:0030361" "HP:0011569" "HP:0001476"
## 
## [[3]]
## [1] "HP:0012266" "HP:0001100" "HP:0012021" "HP:0011182" "HP:0009140"
## [6] "HP:0009790" "HP:0007656" "HP:0100261"
## 
## [[4]]
## [1] "HP:0003970" "HP:0010375" "HP:0040115" "HP:0006915" "HP:0030045"
## [6] "HP:0002898" "HP:0004920" "HP:0002621"
## 
## [[5]]
## [1] "HP:0003481" "HP:0002033" "HP:0010603" "HP:0006886" "HP:0009783"
## [6] "HP:0006415" "HP:0001634" "HP:0040167"
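The effect of reducing a term set to its minimal form can be sketched in base R on a hypothetical toy ontology. The helper `toy_minimal_set` below is illustrative and is not the ontologyIndex implementation. It assumes a term is redundant whenever it is a strict ancestor of another term in the same set.

```r
# Toy ontology: each term maps to its ancestors, including itself
# (hypothetical data for illustration).
ancestors <- list(
  root = "root",
  A    = c("root", "A"),
  B    = c("root", "B"),
  A1   = c("root", "A", "A1"),
  A2   = c("root", "A", "A2")
)

# Drop every term that is a strict ancestor of another term in the set
toy_minimal_set <- function(ancestors, terms) {
  strict <- lapply(terms, function(term) setdiff(ancestors[[term]], term))
  redundant <- unique(unlist(strict))
  setdiff(terms, redundant)
}

# "A" is implied by the more specific "A1", so it is removed
reduced <- toy_minimal_set(ancestors, c("A", "A1", "B"))
reduced
# [1] "A1" "B"
```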

Then one can calculate a similarity matrix, containing pairwise term-set similarities:

sim_mat <- get_sim_grid(ontology=hpo, term_sets=term_sets)
sim_mat
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.0000000 0.2446414 0.2077991 0.2002343 0.2654494
## [2,] 0.2446414 1.0000000 0.2079281 0.2604863 0.2292824
## [3,] 0.2077991 0.2079281 1.0000000 0.2385239 0.2144670
## [4,] 0.2002343 0.2604863 0.2385239 1.0000000 0.2450322
## [5,] 0.2654494 0.2292824 0.2144670 0.2450322 1.0000000
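As an illustration of how a term-set similarity can be built from term-level similarities, the base R sketch below uses Lin's similarity between terms and a best-match average to combine them. Both choices are assumptions made for this example, and the package documentation should be consulted for what get_sim_grid actually computes by default. The toy ontology, the information content values and the helpers `lin_sim` and `set_sim` are all hypothetical.

```r
# Toy ontology (ancestors include the term itself) and made-up information content
ancestors <- list(
  root = "root",
  A    = c("root", "A"),
  B    = c("root", "B"),
  A1   = c("root", "A", "A1"),
  A2   = c("root", "A", "A2")
)
ic <- c(root = 0, A = 0.5, B = 1.6, A1 = 1.6, A2 = 1.6)

# Lin's similarity: twice the IC of the most informative common ancestor,
# divided by the sum of the two terms' IC
lin_sim <- function(a, b) {
  common <- intersect(ancestors[[a]], ancestors[[b]])
  denom <- ic[[a]] + ic[[b]]
  if (denom == 0) 1 else 2 * max(ic[common]) / denom
}

# Best-match average: match each term to its best partner in the other set,
# average within each direction, then average the two directions
set_sim <- function(s1, s2) {
  m <- outer(s1, s2, Vectorize(lin_sim))
  mean(c(mean(apply(m, 1, max)), mean(apply(m, 2, max))))
}

toy_sets <- list(c("A1", "B"), "A2", c("A1", "A2"))
toy_grid <- sapply(toy_sets, function(x) sapply(toy_sets, function(y) set_sim(x, y)))
toy_grid
```

With this construction the matrix is symmetric and each set has similarity 1 with itself, the same pattern seen in sim_mat above.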

Group similarity of phenotypes 1-3, based on sim_mat:

get_sim(sim_mat, group=1:3)
## [1] 0.2201229
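The value above can be reproduced by hand as the mean of the pairwise similarities among term sets 1 to 3. The base R check below copies the relevant entries from the printed sim_mat. That get_sim averages pairwise similarities by default is inferred here from the agreement of the numbers, not taken from the package documentation.

```r
# Similarities among term sets 1-3, copied from the sim_mat printed above
pairwise <- matrix(c(
  1.0000000, 0.2446414, 0.2077991,
  0.2446414, 1.0000000, 0.2079281,
  0.2077991, 0.2079281, 1.0000000
), nrow = 3, byrow = TRUE)

# Average over the distinct pairs (the upper triangle, excluding the diagonal)
group_mean <- mean(pairwise[upper.tri(pairwise)])
round(group_mean, 7)
# [1] 0.2201229
```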

P-value for the significance of the similarity of phenotypes 1-3:

get_sim_p(sim_mat, group=1:3)
## [1] 0.8001998
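One way to think about such a p-value is as a permutation test: compare the observed group similarity with that of randomly drawn groups of the same size. The base R sketch below does this with the printed sim_mat values. The helper `group_sim`, the use of 1000 random groups and the "+1" correction in the estimator are assumptions made for illustration, although the value printed above does equal 801/1001. It is not the package's implementation, and its result need not match the output of get_sim_p.

```r
# Pairwise similarities copied from the sim_mat printed above
sim_mat <- matrix(c(
  1.0000000, 0.2446414, 0.2077991, 0.2002343, 0.2654494,
  0.2446414, 1.0000000, 0.2079281, 0.2604863, 0.2292824,
  0.2077991, 0.2079281, 1.0000000, 0.2385239, 0.2144670,
  0.2002343, 0.2604863, 0.2385239, 1.0000000, 0.2450322,
  0.2654494, 0.2292824, 0.2144670, 0.2450322, 1.0000000
), nrow = 5, byrow = TRUE)

# Group similarity taken as the mean pairwise similarity within the group
group_sim <- function(m, group) {
  sub <- m[group, group]
  mean(sub[upper.tri(sub)])
}

set.seed(1)
observed <- group_sim(sim_mat, 1:3)

# Null distribution: similarity of random groups of the same size.
# Indices are sorted so that identical groups give identical sums.
n_perm <- 1000
null_sims <- replicate(n_perm, group_sim(sim_mat, sort(sample(nrow(sim_mat), 3))))

# Proportion of random groups at least as similar as the observed group
p_value <- (sum(null_sims >= observed) + 1) / (n_perm + 1)
p_value
```

A large p-value, as here, indicates that the group is no more similar than randomly chosen groups, which is expected for randomly sampled term sets.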