Semantic similarity of non-GO ontologies
10.1 years ago

I need to run semantic similarity searches on the Human Phenotype Ontology (HPO).

I am aware of packages like GoSim, SemSim etc. These packages work well with GO. I am looking for a package that can take any .obo file and compute semantic similarities on top of it. Do you know of any packages that can do this?

Thanks in advance!

semantic-similarity ontology • 2.9k views
10.1 years ago

This is what the dnet package can do. See http://supfam.org/dnet/dDAGtermSim.html. In that example, however, you need to replace the ontology with the Human Phenotype Ontology. More information about how to load different ontologies can be found at http://supfam.org/dnet/docs.html.

If you provide details about your annotation data, I can probably give more specific help.


Thanks Hfang,

dnet is a great resource.

I tried the example you provided and have the following questions.

  1. Why are you treating HPO as separate ontologies? I would prefer to use it as a single ontology to run my queries. Is this possible?
  2. Can you help me interpret the results in the sim matrix? In the example, a value is given for a term compared with itself (example: HP:0000062 x HP:0000062 = 1.958076), but not for different terms (the value is shown as "." for HP:0000062 x HP:0010931).
  3. Can you help with an example of a DO similarity search? annotations=org.Hs.egDO is not working for me.

Thanks!


Hi Khader,

Below are detailed answers to your questions. I hope they are useful.

1. HPO has three namespaces (sub-ontologies). This situation is very similar to GO and its sub-ontologies (Biological Process, Molecular Function, Cellular Component). For this reason, you have to calculate semantic similarity for each sub-ontology, and then take their sum as your final semantic similarity. Alternatively, for HPO, usually only one sub-ontology (Phenotypic Abnormality) is useful; the other two sub-ontologies (Mode of Inheritance; Onset and clinical course) are not well-defined.

2. First, semantic similarity is a comparison that assesses the degree of relatedness between two entities. It can be computed between two terms, or between two genes annotated by terms. For this purpose, the information content (IC) of a term is defined as the negative base-10 logarithm of the frequency of genes annotated to that term. This definition uses the actual usage of a term (the frequency of genes annotated to it) to measure how specific and informative the term is. The function http://supfam.org/dnet/dDAGtermSim.html calculates semantic similarity between terms, which is then used by the function http://supfam.org/dnet/dDAGgeneSim.html to calculate semantic similarity between genes. Semantic similarity between terms is NOT their distance in the ontology hierarchy (which is organised as a DAG: directed acyclic graph). Its meaning depends on the method used. If you choose the method 'Resnik', the semantic similarity is the information content (IC) of the most informative common ancestor (MICA) of the two terms of interest. The MICA for HP:0000062 x HP:0000062 is HP:0000062 itself (whose IC is 1.958076). The MICA for HP:0000062 x HP:0010931 is the root of the ontology, and the IC at the root is always zero, so the similarity is zero, which a sparse matrix prints as ".".

3. The Disease Ontology (DO) has no sub-ontologies, so the situation is very simple. Here is how to do it using dnet.

3a) if you are interested in the semantic similarity between DO terms.

# 0) load the dnet package (it also attaches igraph, which provides V())
library(dnet)

# 1) load DO as igraph object (note: it is NOT part of the package built-in data, so you have to load via the function dRDataLoader)
ig.DO <- dRDataLoader(RData = "ig.DO")
g <- ig.DO

# 2) load human genes annotated by DO (note: it is NOT part of the package built-in data, so you have to load via the function dRDataLoader)
org.Hs.egDO <- dRDataLoader(RData = "org.Hs.egDO")

# 3) prepare for ontology and its annotation information
dag <- dDAGannotate(g, annotations=org.Hs.egDO, path.mode="all_paths", verbose=TRUE)

# 4) calculate pair-wise semantic similarity between 5 randomly chosen terms
terms <- sample(V(dag)$name, 5)
sim <- dDAGtermSim(g=dag, terms=terms, method="Schlicker", parallel=FALSE)
sim

3b) if you are interested in the semantic similarity between human genes (annotatable by DO terms).

# 1) load DO as igraph object (note: it is NOT part of the package built-in data, so you have to load via the function dRDataLoader)
ig.DO <- dRDataLoader(RData = "ig.DO")
g <- ig.DO

# 2) load human genes annotated by DO (note: it is NOT part of the package built-in data, so you have to load via the function dRDataLoader)
org.Hs.egDO <- dRDataLoader(RData = "org.Hs.egDO")

# 3) prepare for ontology and its annotation information
dag <- dDAGannotate(g, annotations=org.Hs.egDO, path.mode="all_paths", verbose=TRUE)

# 4) calculate pair-wise semantic similarity between 5 randomly chosen genes
allgenes <- unique(unlist(V(dag)$annotations))
genes <- sample(allgenes,5)
sim <- dDAGgeneSim(g=dag, genes=genes, method.gene="BM.average", method.term="Resnik", parallel=FALSE, verbose=TRUE)
sim
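
To make the Resnik explanation in point 2 concrete, here is a minimal base-R sketch on a toy ontology. It does not use dnet: the term names, gene names and the helper functions (ancestors, resnik) are invented for illustration, and IC is computed relative to the genes annotated at the root, which is an assumption about the normalisation rather than a statement of how dnet does it internally.

```r
# Toy DAG given as child -> parents (hypothetical terms, not real HPO/DO IDs)
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  A1   = "A",
  A2   = "A"
)

# Direct gene annotations per term (made-up gene names)
direct <- list(
  root = character(0),
  A    = "g1",
  B    = c("g2", "g3"),
  A1   = "g4",
  A2   = "g5"
)

# All ancestors of a term, including the term itself
ancestors <- function(term) {
  out <- term
  for (p in parents[[term]]) out <- union(out, ancestors(p))
  out
}

# Propagate annotations upwards: a term inherits the genes of its descendants
terms <- names(parents)
propagated <- setNames(lapply(terms, function(t) {
  desc <- terms[vapply(terms, function(x) t %in% ancestors(x), logical(1))]
  unique(unlist(direct[desc]))
}), terms)

# IC = -log10(frequency of genes annotated to the term)
n_total <- length(propagated[["root"]])
ic <- vapply(propagated, function(g) -log10(length(g) / n_total), numeric(1))

# Resnik similarity = IC of the most informative common ancestor (MICA)
resnik <- function(t1, t2) {
  common <- intersect(ancestors(t1), ancestors(t2))
  max(ic[common])
}

resnik("A1", "A1")  # MICA is A1 itself, so this is the IC of A1
resnik("A1", "A2")  # MICA is A, so this is the IC of A
resnik("A1", "B")   # MICA is the root, whose IC is zero
```

The last call mirrors the HP:0000062 x HP:0010931 case: two terms that only share the root get a Resnik similarity of zero, however far apart or close they are in the hierarchy.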

Hi Fang,

Thank you for taking the time to explain this so clearly. Your package is very well documented.

My requirement is a bit different. I have a set of DOIDs and HPOIDs - no gene association data; some of these phenotypes are mapped to non-coding regions.

P1 = HPOID:0100543, HPOID:0100543 
P2 = HPOID:0001250, HPOID:0001250

I need to get the similarity between sets of HPO or DO IDs using a function similar to the GO term similarity methods.

Example from GOSemSim:
> go1 = c("GO:0004022", "GO:0004024", "GO:0004174")
> go2 = c("GO:0009055", "GO:0005515")
> mgoSim(go1, go2, ont = "MF", measure = "Wang")
[1] 0.299

What I really want is a single number that gives a cumulative similarity score across all DO / HPO IDs, instead of a matrix of pair-wise similarities. Is there any option in dnet to get this?

Even an implementation that takes a set of IDs such as Input <- c("HPOID:0100543", "HPOID:0100543", "HPOID:0001250", "HPOID:0001250") and returns a similarity metric would be very useful.
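
To make the request concrete, below is a rough sketch of the kind of function I mean. It is plain R, not dnet code: it assumes a term-by-term similarity matrix has already been computed (for example by dDAGtermSim) and collapses it into one number with a best-match average, i.e. the mean of the row maxima and column maxima. The function name bma_score, the term labels and the matrix values are all made up for illustration, and I am not claiming this is exactly what the BM.average option does.

```r
# Hypothetical term-by-term similarity matrix (rows: set 1, columns: set 2)
sim <- matrix(
  c(0.9, 0.1,
    0.2, 0.7,
    0.0, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("T1", "T2", "T3"), c("T4", "T5"))
)

# Best-match average: for every term take its best match in the other set,
# then average over all terms from both sets
bma_score <- function(sim, set1, set2) {
  # as.matrix() also densifies a sparse matrix if the Matrix package is loaded
  sub <- as.matrix(sim[set1, set2, drop = FALSE])
  row_best <- apply(sub, 1, max)  # best match for each term in set1
  col_best <- apply(sub, 2, max)  # best match for each term in set2
  (sum(row_best) + sum(col_best)) / (length(row_best) + length(col_best))
}

# One number for the two sets of terms
bma_score(sim, c("T1", "T2", "T3"), c("T4", "T5"))
```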

Thank you again for providing a very useful package.
