In an R session, I have a data frame with all the Escherichia coli genes and their associated GO terms. Each gene is annotated with one GO term only, representing the deepest annotation level. I then have a character vector of specific GO terms that our collaborators are interested in for their work.
I would like to extract all the genes from the first data frame that are associated with the GO terms in the character vector.
When I say "associated" I mean either carrying a GO term that is found in the vector, or a child of that term. An example: one of the GO terms in the vector is "cell death", but a gene is likely to be annotated with something much more specific that is a child term of "cell death".
I have `GO.db` installed, but I'm not at all proficient with it, since this is the first time I've done this. How do I properly carry out this task?
Currently, my strategy would be the following:
- With each GO term in the character vector, extract all its child terms using the `GO.db` package. `unlist()` the results into a single character vector containing all initial GO terms and their children.
- Extract all genes from the data frame whose associated GO term matches any of the found GO terms / child GO terms.
Would this be the most strategic approach? There are ~30 GO terms, and for each one I have to extract all its child terms. Sounds like it's gonna be a huge output list.
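To make the two steps above concrete, here is a minimal sketch of what I have in mind. It rests on assumptions: the data frame has a column called `go_id` (a made-up name), and the terms of interest are GO IDs rather than term names. The helper names `expand_go_terms()` and `genes_for_terms()` are hypothetical too. As far as I understand, `GO.db` has `GO*CHILDREN` maps holding only the direct children and `GO*OFFSPRING` maps holding all descendants, so the offspring maps look like the ones needed here.

```r
# Step 1: expand the query terms with all of their descendants.
# `offspring` is a named list: GO ID -> character vector of descendant GO IDs
# (NA for terms that have no descendants, as in the GO.db offspring maps).
expand_go_terms <- function(go_ids, offspring) {
  known <- intersect(go_ids, names(offspring))
  desc <- unlist(offspring[known], use.names = FALSE)
  unique(c(go_ids, desc[!is.na(desc)]))
}

# Step 2: keep the genes annotated with any of the expanded terms.
# `go_col` is the assumed name of the GO ID column in the data frame.
genes_for_terms <- function(genes, go_ids, offspring, go_col = "go_id") {
  wanted <- expand_go_terms(go_ids, offspring)
  genes[genes[[go_col]] %in% wanted, , drop = FALSE]
}

# Building the offspring list from GO.db (skipped if the package is missing).
if (requireNamespace("GO.db", quietly = TRUE)) {
  library(GO.db)
  # One map per ontology: biological process, molecular function, cellular component
  offspring <- c(as.list(GOBPOFFSPRING),
                 as.list(GOMFOFFSPRING),
                 as.list(GOCCOFFSPRING))
  # Hypothetical object names for the data frame and the vector of GO IDs:
  # hits <- genes_for_terms(ecoli_genes, terms_of_interest, offspring)
}
```

If the vector holds term names such as "cell death" instead of IDs, they would first need mapping to GO IDs, presumably via `AnnotationDbi::select()` on `GO.db` with `keytype = "TERM"`. On the size worry: even if the expanded vector reaches a few thousand IDs, a single `%in%` lookup against it should still be cheap.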