This is a long post; I am trying to find out why R's GO.db
gives different results than Python's go_db
https://github.com/endrebak/go_db.
go_db
seems to give different results than the R package GO.db.
go_db
computes ancestors by looking upwards in the tree; so that C is an ancestor of A if A is a C, A is part of C or C has part A or any such relation exists transitively.
Since go_db
is a dinky script I threw together quickly and GO.db is the seemingly canonical Gene Ontology package I am guessing I am in the wrong, but if anyone can explain why, I would appreciate it. It is probably my usage of GO.db that is incorrect, I'd wager.
To run the scripts below you need to install wide_diapeR
and go_db
with
pip install -U go_db wide_diaper
and have the R package GO.db installed
from __future__ import print_function
from widediaper import R
from go_db import get_offspring
r = R()
r.load_library("GO.db")
r("cc_offspring = as.data.frame(GOCCOFFSPRING)")
r_cc_offspring = r.get("cc_offspring")
r_cc_offspring.columns = ["Offspring", "Parent"]
py_cc_offspring = get_offspring("CC")
py_go_0001533 = py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"]
r_go_0001533 = r_cc_offspring.loc[r_cc_offspring.Offspring == "GO:0001533"]
# Parents in GO.db, but not go_db
set(r_go_0001533.Parent) - set(py_go_0001533.Parent)
# {'GO:0005622',
# 'GO:0005856',
# 'GO:0043226',
# 'GO:0043228',
# 'GO:0043229',
# 'GO:0043232',
# 'GO:0044424'}
# Parents in go_db, but not GO.db
set(py_go_0001533.Parent) - set(r_go_0001533.Parent)
# {'GO:0005886',
# 'GO:0012505',
# 'GO:0016020',
# 'GO:0030313',
# 'GO:0031975',
# 'GO:0071944',
# 'GO:0097653'}
As you can see above, there are great differences in what nodes the two packages consider to be ancestors of "GO:0001533".
Let us look at the go.obo file and see if we can make sense of this. The script below prints out all the ancestors of a node (including the node itself):
# download obo file from http://purl.obolibrary.org/obo/go.obo
obo_file = '/Users/labsenter/.go_db/go.obo' # Use the path to whereever you are storing your .obo file
obo_records = "".join(open(obo_file).readlines()).split("\n\n[Term]")[1:-1]
py_ancestors = py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"].Parent
py_ancestors.append("GO:0001533")
py_ancestor_search_terms = ["id: " + ancestor for ancestor in py_ancestors]
r_ancestors = list(py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"].Parent)
r_ancestors.append("GO:0001533")
r_ancestor_search_terms = ["id: " + ancestor for ancestor in r_ancestors]
def _print_records(search_terms):
for record in obo_records:
for search_term in search_terms:
if search_term in record:
for line in record.split("\n"):
for term in ["id:", "is_a:", "relationship:", "intersection_of:"]:
if term in line:
print(line)
print()
print("### Py ancestors\n")
_print_records(py_ancestor_search_terms)
print("### R ancestors\n")
_print_records(r_ancestor_search_terms)
The script above gives the following results:
### Py ancestors
id: GO:0001533
is_a: GO:0005886 ! plasma membrane
id: GO:0005575
alt_id: GO:0008372
id: GO:0005623
is_a: GO:0005575 ! cellular_component
id: GO:0005886
alt_id: GO:0005904
is_a: GO:0016020 ! membrane
is_a: GO:0044464 ! cell part
relationship: part_of GO:0071944 ! cell periphery
id: GO:0012505
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005773 ! vacuole
relationship: has_part GO:0005886 ! plasma membrane
id: GO:0016020
is_a: GO:0005575 ! cellular_component
id: GO:0030313
is_a: GO:0031975 ! envelope
relationship: has_part GO:0005886 ! plasma membrane
id: GO:0031975
is_a: GO:0044464 ! cell part
id: GO:0044464
is_a: GO:0005575 ! cellular_component
intersection_of: GO:0005575 ! cellular_component
intersection_of: part_of GO:0005623 ! cell
relationship: part_of GO:0005623 ! cell
id: GO:0071944
is_a: GO:0044464 ! cell part
id: GO:0097653
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005622 ! intracellular
relationship: has_part GO:0005886 ! plasma membrane
### R ancestors
id: GO:0001533
is_a: GO:0005886 ! plasma membrane
id: GO:0005575
alt_id: GO:0008372
id: GO:0005623
is_a: GO:0005575 ! cellular_component
id: GO:0005886
alt_id: GO:0005904
is_a: GO:0016020 ! membrane
is_a: GO:0044464 ! cell part
relationship: part_of GO:0071944 ! cell periphery
id: GO:0012505
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005773 ! vacuole
relationship: has_part GO:0005886 ! plasma membrane
id: GO:0016020
is_a: GO:0005575 ! cellular_component
id: GO:0030313
is_a: GO:0031975 ! envelope
relationship: has_part GO:0005886 ! plasma membrane
id: GO:0031975
is_a: GO:0044464 ! cell part
id: GO:0044464
is_a: GO:0005575 ! cellular_component
intersection_of: GO:0005575 ! cellular_component
intersection_of: part_of GO:0005623 ! cell
relationship: part_of GO:0005623 ! cell
id: GO:0071944
is_a: GO:0044464 ! cell part
id: GO:0097653
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005622 ! intracellular
relationship: has_part GO:0005886 ! plasma membrane
Since you can see that GO:0001533
is a GO:0005886
which again is a GO:0016020
, GO:0016020
should be considered an ancestor of GO:0001533
. Unfortunately, that is not the case in GO.db
. If anyone knows why, please explain!