Question

Is it meaningful to run an enrichment analysis for one organism using the ontology of a different organism?

1

4.7 years ago

tbg ▴ 120

Suppose I have a list of genes obtained from a mouse experiment. I now have to perform an enrichment analysis, and I can choose to run it with mouse GO terms, human GO terms, etc.

Is it OK to use human GO terms for mouse genes?

If so, why bother creating mouse-specific GO terms? I understand that mice may be used as a model for some human diseases, but then why are the two sets of GO terms different?

If I can apply human GO terms to mouse genes, should I also run enrichment analysis using zebrafish GO terms on mouse genes (or human genes!), provided those genes share a certain degree of similarity (such as homology)?

EDIT: I posted the same question on Bioinformatics Stack Exchange, but it has received no answers there.

enrichment geneontology • 2.1k views

ADD COMMENT • link updated 4.7 years ago by Papyrus ★ 3.1k • written 4.7 years ago by tbg ▴ 120

score 3 · Accepted Answer · 2020-11-12

3

4.7 years ago

Papyrus ★ 3.1k

Gene sets such as GO terms are built by integrating current scientific knowledge about gene function. Even when you use human GO terms for human genes, many of the functions that the human annotations "summarize" were first discovered, at least in part, in model organisms such as mouse. So even from the human point of view, they may reflect general knowledge that applies to many organisms.

In my opinion, this is a fuzzy characteristic of gene ontologies. All in all, a good strategy when you use gene ontologies in another organism is to assume that genes that are homologous/orthologous between the two organisms share the same functions. This will sometimes not be the case, but I believe that typical pathway enrichment analyses uncover "general trends" that may not be much affected by particular exceptions.
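To make the "general trends" point concrete, here is a minimal sketch of the one-sided hypergeometric test that is one common choice for over-representation analysis, written with the Python standard library only. The function name `hypergeom_pvalue` and all numbers are hypothetical and chosen purely for illustration; they do not come from any package or real dataset.

```python
from math import comb


def hypergeom_pvalue(universe_size, set_size, list_size, overlap):
    """One-sided P(X >= overlap) when list_size genes are drawn from a
    universe in which set_size genes belong to the gene set."""
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): a 20000-gene universe and 300 genes of interest.
# Full gene set: 200 genes, 12 of which appear in the list of interest.
p_full = hypergeom_pvalue(20000, 200, 300, 12)

# Same set after 10 genes without an ortholog were dropped, one of them a hit.
p_reduced = hypergeom_pvalue(20000, 190, 300, 11)

print(p_full, p_reduced)
```

In this toy case both p-values stay small, which is the sense in which losing a handful of genes need not change the overall picture. How well that holds for real data depends on how many genes are lost and which ones.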

Personally, I am fond of the msigdbr package, which provides gene sets from MSigDB (including GO, KEGG, Reactome...) for different organisms. It starts from the human gene sets, which contain human genes, and translates them to other species by selecting the orthologs as defined by the HUGO Gene Nomenclature Committee.

ADD COMMENT • link 4.7 years ago by Papyrus ★ 3.1k

0

The MSigDB gene sets use human genes. However, many are based on mouse and rat studies (check the "organism" field for the individual gene sets). These were then converted to human symbols by the MSigDB team, so they consider the pathways to be sufficiently similar across those species.

ADD REPLY • link 4.7 years ago by igor 13k

0

So basically they just did a symbol transformation from one organism to another without considering homology/orthology, simply assuming that it is OK to do so, right? Is this meaningful? I suppose it can be, since mouse is used as an animal model, but things that work out for mice do not always work out for humans too, hence it seems that using enrichment analysis is kind of...not reliable?

ADD REPLY • link 4.7 years ago by tbg ▴ 120

1

"using enrichment analysis is kind of...not reliable"

Some would argue that is the case even within the same species, but that would be an entirely separate topic.

ADD REPLY • link 4.7 years ago by igor 13k

0

Can I ask if you have any references on the topic?

ADD REPLY • link 4.7 years ago by tbg ▴ 120

0

I cannot recall any particular publication.

A lot of pathways (for example MSigDB C2) are based on a single study. Although I do not doubt that many are reliable, we also know that many are not reproducible. Additionally, even in a well-executed study, most genes are usually not validated, either in an independent cohort or with an alternate computational analysis.

ADD REPLY • link 4.7 years ago by igor 13k

1

I agree. I would say that no matter the final strategy (if any) for pathway enrichment analysis, it is better to treat the results as indicative trends rather than demonstrated truths about any pathway being affected.

ADD REPLY • link 4.7 years ago by Papyrus ★ 3.1k

0

OK, so basically whenever I need to perform an enrichment analysis, I should check that there is a sufficient degree of similarity between the genes of the different organisms. If the similarity is not high enough, I must assume that the enrichment will not provide meaningful results. Is that correct?

ADD REPLY • link 4.7 years ago by tbg ▴ 120

0

What I mean is that one possible strategy is to convert the human gene IDs into the other organism's IDs, retaining only those classified as homologous (using, for example, the tables from the HGNC's HCOP orthology tool, which incorporates information from many homology tools), and then run the enrichment on those converted pathways. I think this will work better with annotations such as GO terms, because these are more general and encompass knowledge from different organisms. Other, more human-specific annotations within MSigDB may work less well. This blog entry on the subject is interesting, and mentions how far-away organisms may behave worse. Nonetheless, as stated, I think it will depend on the type of gene sets you are using, how they were initially derived, and the knowledge they contain.
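The conversion step could look roughly like the sketch below (plain Python, standard library only). The helper `convert_gene_set` and the tiny human-to-mouse table are hypothetical stand-ins; in practice the table would be built from an orthology resource such as HCOP, and `HYPOTHETICAL_HUMAN_ONLY` is a made-up symbol representing a gene with no ortholog.

```python
def convert_gene_set(human_genes, ortholog_map):
    """Translate human symbols to another species, keeping only genes
    that have at least one ortholog in ortholog_map."""
    converted = set()
    dropped = []
    for gene in human_genes:
        orthologs = ortholog_map.get(gene, [])
        if orthologs:
            # One-to-many orthologs are all kept; this is a design choice.
            converted.update(orthologs)
        else:
            dropped.append(gene)
    return converted, dropped


# Toy human-to-mouse table (illustrative only, not a real HCOP export).
toy_orthologs = {
    "TP53": ["Trp53"],
    "BRCA1": ["Brca1"],
    "MYC": ["Myc"],
}

human_set = ["TP53", "BRCA1", "MYC", "HYPOTHETICAL_HUMAN_ONLY"]
mouse_set, dropped = convert_gene_set(human_set, toy_orthologs)

print(sorted(mouse_set))
print(dropped)
```

Keeping track of the dropped genes makes it easy to see how much of each gene set survives the conversion, which matters more the further apart the two organisms are.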

ADD REPLY • link 4.7 years ago by Papyrus ★ 3.1k

2

"blog entry on the subject is interesting, and mentions how far-away organisms may behave worse"

It seems one of their conclusions is that the C2 gene sets are substantially more reliable than the well-curated Hallmark ones, which is highly counter-intuitive.

ADD REPLY • link 4.7 years ago by igor 13k

2

You're right. I admit I skimmed through the post and had not noticed that it compared H sets to C2 sets, and it is indeed counter-intuitive. Maybe the top significant C2 sets are very big (many genes) and thus suffer less, significance-wise, from losing non-homologous genes than the top significant H sets, some of which may even fall below the 15-gene threshold and thus be removed?

It's hard to say whether they are actually removing sets with n<15 and n>500 as they mention at the beginning, but when performing enrichment with C2 sets I usually see a bias toward big sets (>100-200 genes) among the top significant results.
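The size-threshold effect speculated about above can be sketched as follows (plain Python, standard library only). The 15 and 500 cut-offs are the ones mentioned in the post; the helper names, gene names, and set sizes are hypothetical, and this does not claim to reproduce what the blog authors actually did.

```python
def filter_by_size(gene_sets, min_size=15, max_size=500):
    """Keep only gene sets whose size lies within [min_size, max_size]."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


def drop_unmapped(genes, unmapped):
    """Remove genes that have no ortholog in the target organism."""
    return [gene for gene in genes if gene not in unmapped]


# Hypothetical sets: a small one (16 genes) and a big one (200 genes).
small_set = [f"small_gene_{i}" for i in range(16)]
big_set = [f"big_gene_{i}" for i in range(200)]

# Assume 3 genes of the small set and 30 of the big set have no ortholog.
unmapped = set(small_set[:3]) | set(big_set[:30])

converted = {
    "SMALL_SET": drop_unmapped(small_set, unmapped),
    "BIG_SET": drop_unmapped(big_set, unmapped),
}
kept = filter_by_size(converted)

print(sorted(kept))
```

Here the small set drops to 13 genes and is filtered out, while the big set keeps 170 genes and survives, so a set can disappear from the results without any change in the underlying biology.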

ADD REPLY • link 4.7 years ago by Papyrus ★ 3.1k

0

OK, that is clear: stick to GO and verify the similarity between genes. Also, thanks for the link!

ADD REPLY • link 4.7 years ago by tbg ▴ 120