I'm trying to get taxonomic lineage from UniProt with the following SPARQL query (based on this and this answers):
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix taxon: <http://purl.uniprot.org/taxonomy/>
    prefix : <http://purl.uniprot.org/core/>

    select ?ancestor ?name ?rank ?part_of_lineage
    where {
      taxon:9597 rdfs:subClassOf ?ancestor .
      ?ancestor :scientificName ?name ;
                :partOfLineage ?part_of_lineage ;
                :rank ?rank .
    } order by ?rank
This query yields 14 entries:
    ancestor      name              rank           part_of_lineage
    taxon:40674   Mammalia          :Class         true
    taxon:9604    Hominidae         :Family        true
    taxon:9596    Pan               :Genus         true
    taxon:314293  Simiiformes       :Infraorder    false
    taxon:33208   Metazoa           :Kingdom       true
    taxon:9443    Primates          :Order         true
    taxon:9526    Catarrhini        :Parvorder     true
    taxon:7711    Chordata          :Phylum        true
    taxon:207598  Homininae         :Subfamily     false
    taxon:376913  Haplorrhini       :Suborder      true
    taxon:89593   Craniata          :Subphylum     true
    taxon:314295  Hominoidea        :Superfamily   false
    taxon:2759    Eukaryota         :Superkingdom  true
    taxon:314146  Euarchontoglires  :Superorder    true
You can try it with YASGUI.
Questions
1. Note that, unlike in the referenced answer, I used `rdfs:subClassOf` without `+`, because if I use `rdfs:subClassOf+`, I get this error message from UniProt:

       Exception: virtuoso.jdbc4.VirtuosoException: TN...: Exceeded 1000000000 bytes in transitive temp memory. use t_distinct, t_max or more T_MAX_memory options to limit the search or increase the pool

   Is it a bug in their storage backend, or am I misusing `rdfs:subClassOf+`?
2. As far as I understand, the `rdfs:subClassOf` relationship is _semantically_ transitive, but it should connect only directly related entities. So if you want the direct ancestor, you use it once; if you want all ancestors, you use the "property paths" feature with `rdfs:subClassOf+`. But as far as I can see from the results above and from this query:

       describe <http://purl.uniprot.org/taxonomy/9597> from <http://sparql.uniprot.org/taxonomy>

   each node in the UniProt taxonomy graph is a subclass of many other nodes. Why is that, and how can I get just the direct parent of a given taxon in this situation?
3. With so many ancestors, is there a way to order them (using SPARQL, without postprocessing the results) _by taxonomic rank_ (not lexicographically, as in the query above)? This would solve the previous question.
4. If you open _Pan paniscus_ (9597) from the example above on UniProt, you will see that its lineage is much longer, but some nodes in it are grey. How is the lineage on the UniProt website related to the results of the query?
5. If you check the NCBI Taxonomy, the _abbreviated_ lineage also has 14 nodes:

       Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Pan

   But not all of them coincide! So what do I get in those results? Some random subset of the lineage?
6. Finally, what does the `:partOfLineage` property mean? The documentation says: "True for taxa that can appear as part of an organism's lineage". But I don't understand what that means. Aren't all nodes part of some lineage?
P.S. I have read the UniProt Taxonomy and Taxonomic lineage documentation, but they don't answer my questions.
UPDATE: Regarding my claim in question 5.
Here is the lineage from UniProt (9597):
- ✔︎ Eukaryota
- ✘ Opisthokonta
- ✔︎ Metazoa
- ✘ Eumetazoa
- ✘ Bilateria
- ✘ Deuterostomia
- ✔︎ Chordata
- ✔︎ Craniata
- ✘ Vertebrata
- ✘ Gnathostomata
- ✘ Teleostomi
- ✘ Euteleostomi
- ✘ Sarcopterygii
- ✘ Dipnotetrapodomorpha
- ✘ Tetrapoda
- ✘ Amniota
- ✔︎ Mammalia
- ✘ Theria
- ✘ Eutheria
- ✘ Boreoeutheria
- ✔︎ Euarchontoglires
- ✔︎ Primates
- ✔︎ Haplorrhini
- ✔︎ Simiiformes
- ✔︎ Catarrhini
- ✔︎ Hominoidea
- ✔︎ Hominidae
- ✔︎ Homininae
- ✔︎ Pan
Those in bold are the ones with `:partOfLineage true`. The checkmarks/crosses on the left mean that the taxon is present/absent in the query result. Note that the result contains both types of nodes (not only those from the abbreviated lineage).
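To make the mismatch from question 5 concrete, here is a minimal Python sketch that compares the 14 names returned by the query with the 14 names of the NCBI abbreviated lineage. Both lists are copied by hand from this question; the function name `compare_lineages` is just an illustrative helper, not part of any UniProt or NCBI tooling.

```python
# Names copied by hand from the query result table in this question.
query_result = {
    "Mammalia", "Hominidae", "Pan", "Simiiformes", "Metazoa", "Primates",
    "Catarrhini", "Chordata", "Homininae", "Haplorrhini", "Craniata",
    "Hominoidea", "Eukaryota", "Euarchontoglires",
}

# NCBI abbreviated lineage, in the order quoted in this question.
ncbi_abbreviated = [
    "Eukaryota", "Metazoa", "Chordata", "Craniata", "Vertebrata",
    "Euteleostomi", "Mammalia", "Eutheria", "Euarchontoglires", "Primates",
    "Haplorrhini", "Catarrhini", "Hominidae", "Pan",
]


def compare_lineages(query_names, reference_lineage):
    """Split names into shared, query-only and reference-only groups.

    Shared and reference-only names keep the order of the reference lineage;
    query-only names are sorted alphabetically because a set has no order.
    """
    query_names = set(query_names)
    reference = list(reference_lineage)
    shared = [name for name in reference if name in query_names]
    only_reference = [name for name in reference if name not in query_names]
    only_query = sorted(query_names - set(reference))
    return shared, only_query, only_reference


shared, only_query, only_reference = compare_lineages(query_result, ncbi_abbreviated)
print("shared:", shared)
print("only in query result:", only_query)
print("only in NCBI abbreviated lineage:", only_reference)
```

The three names that appear only in the query result are exactly the rows with `part_of_lineage` equal to `false` in the table above.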
Thank you for the answers! I also sent a couple of emails to help@uniprot.org, but I wasn't sure which way of contacting you is more effective. I have some follow-up questions:
- I thought that the original reason might be related to some technical limitations (like delayed SPARQL 1.1 support), but does it still make sense now? Or does Virtuoso have some general issues with evaluating graph traversals?
- The `skos:narrowerTransitive` relation does solve my main question, so I can use it to get the direct parent of a taxon. And if I use it with `+`, it gives the same results as the query with `rdfs:subClassOf`.
- Would it make sense to establish relations between ranks (like `:Species rdfs:subClassOf :Genus` or something similar)? I know that the rank hierarchy is not very well defined in general, but in UniProt you have a limited set of ranks (from NCBI, I guess/hope).
- I see it now. I suspected something like this. Probably a clarification in the ontology documentation would be helpful: "True for taxa that can appear as part of an organism's abbreviated lineage".

I added the difference to the question so as not to take up too much space here.
Virtuoso, Oracle Semnet and DB2's SPARQL support all have a similar issue with traversals that have large fanouts. `rdfs:subClassOf` has a very deep and wide fanout in the UniProt database, which causes trouble for these types of engines.
We could do something with that, but making it correct will be tricky, as the semantics of a direct `rdfs:subClassOf` would be wrong: not every `:Species` instance is an instance of a `:Genus`. We will need to think about how to do this correctly.
Opening a ticket.
Changed my answer
Thanks a lot! I totally forgot about absent ranks (because in the NCBI data it's stored as an explicit value, `no rank`). And I see that the rank hierarchy is more complicated than I thought.
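On ordering ancestors by taxonomic rank within SPARQL itself: one possible workaround, as long as the ontology has no relations between ranks, is to supply the order from the client side with a SPARQL 1.1 `VALUES` block and sort on the attached position. Below is a minimal Python sketch that builds such a query. The helper `build_lineage_query`, the variable `?position`, and the `RANK_ORDER` list are my own assumptions: the list covers only the ranks that appear in the result table of the question, ordered as in the UniProt lineage shown there. I have not checked how the UniProt endpoint performs on this query.

```python
# Assumed rank order, broadest first. It contains only the ranks seen in the
# query result of this question; it is not an official UniProt or NCBI list.
RANK_ORDER = [
    "Superkingdom", "Kingdom", "Phylum", "Subphylum", "Class", "Superorder",
    "Order", "Suborder", "Infraorder", "Parvorder", "Superfamily", "Family",
    "Subfamily", "Genus",
]


def build_lineage_query(taxon_id, rank_order=RANK_ORDER):
    """Return a SPARQL query that sorts the ancestors of a taxon by rank.

    Each rank is paired with its position in a VALUES block, so the join
    attaches ?position to every ancestor and ORDER BY can sort on it.
    """
    taxon_id = int(taxon_id)  # accept only a numeric taxonomy identifier
    values_rows = "\n".join(
        f"    (:{rank} {position})"
        for position, rank in enumerate(rank_order, start=1)
    )
    return f"""prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix taxon: <http://purl.uniprot.org/taxonomy/>
prefix : <http://purl.uniprot.org/core/>

select ?ancestor ?name ?rank
where {{
  values (?rank ?position) {{
{values_rows}
  }}
  taxon:{taxon_id} rdfs:subClassOf ?ancestor .
  ?ancestor :scientificName ?name ;
            :rank ?rank .
}} order by ?position
"""


query = build_lineage_query(9597)
print(query)
```

Because `VALUES` acts as a join, ancestors whose rank is missing from the list (or that have no rank at all) would drop out of the result, so the list would need to be extended for other taxa.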