I'm trying to understand the logic behind the Gene Ontology annotations.
Let's take one gene, for example: ENSG00000198570. When passed into BioMart, it tells me there are three GO term accession IDs. Visualized within the GO tree, they look like this:
All three of them are offspring of biological_process; two of them are offspring of single-organism process.
Ultimately, I want to analyze quite a large set of genes and see whether they cluster into large and/or small groups with the same function. For example, if two genes are both reported as offspring of protein binding, I would immediately know that they are protein binding and biological_process themselves.
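To make the goal concrete, here is a minimal sketch of the grouping I have in mind. Everything in it is a placeholder: the parent map is a tiny hand-written fragment rather than the real ontology, and the gene names and "hypothetical term" entries are made up. Each gene's direct terms are expanded to include all their ancestors, and the result is inverted so that every term lists the genes annotated at or below it.

```python
from collections import defaultdict

# Toy child -> parents map (assumption: hand-written fragment, not the real GO).
PARENTS = {
    "biological_process": (),
    "molecular_function": (),
    "single-organism process": ("biological_process",),
    "binding": ("molecular_function",),
    "protein binding": ("binding",),
    "hypothetical term X": ("protein binding",),
    "hypothetical term Y": ("protein binding",),
}

# Hypothetical direct annotations, e.g. as they might come out of BioMart.
DIRECT = {
    "GENE_A": {"hypothetical term X"},
    "GENE_B": {"hypothetical term Y"},
    "GENE_C": {"single-organism process"},
}

def with_ancestors(terms):
    """Return the given terms plus every term reachable by following parents."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, ()))
    return seen

def genes_by_term(direct):
    """Invert gene -> terms into term -> genes, after propagating to ancestors."""
    groups = defaultdict(set)
    for gene, terms in direct.items():
        for term in with_ancestors(terms):
            groups[term].add(gene)
    return dict(groups)

groups = genes_by_term(DIRECT)
```

With this toy data, GENE_A and GENE_B end up together under protein binding even though their direct terms differ, which is the kind of grouping I am after.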
Right now the only option seems to be to traverse the GeneOntology XML, bottom to top, for each GO term, but it's stupidly inefficient. Maybe there's something obvious I'm missing or there's a piece of software out there that can do just what I need?
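One way the bottom-to-top walk might be made cheaper, sketched under assumptions: load the parent links into memory once and cache each term's ancestor set, so that shared upper parts of the graph are only walked a single time. The parent map below is a hand-written toy fragment (the "hypothetical term" entries are made up), and only one kind of parent link is modelled; a real version would fill the map from the ontology file.

```python
from functools import lru_cache

# Toy child -> parents map (assumption: placeholder data, not parsed from GO).
PARENTS = {
    "biological_process": (),
    "molecular_function": (),
    "single-organism process": ("biological_process",),
    "binding": ("molecular_function",),
    "protein binding": ("binding",),
    "hypothetical term A": ("single-organism process",),
    "hypothetical term B": ("protein binding",),
}

@lru_cache(maxsize=None)
def ancestors(term):
    """Return all ancestors of a term; the cache means each term is expanded once."""
    result = set()
    for parent in PARENTS[term]:
        result.add(parent)
        result |= ancestors(parent)
    return frozenset(result)

b_ancestors = ancestors("hypothetical term B")
```

After the first few lookups most of the upper levels are already cached, so later terms only pay for the part of their path that has not been seen yet.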
I hope what I'm saying is making sense to you.
You're right; including parent vertices would bulk up the database considerably... The source of my confusion was probably that for some genes in my dataset it reported only general terms like "growth" or "protein binding" rather than a more specific function, so at first I thought it would report parent vertices as well.
Thanks for the links!
I have edited the original question a bit in the wake of your answer... Just in case someone else comes along and proposes a rival solution.
I suspect the more high-level categories reported for some genes can be explained as follows: assigning a more specific category to a gene is harder and demands more evidence, so less-studied genes are classified with lower specificity.