How to get taxonomic lineage from UniProt with SPARQL
8.1 years ago
laughedelic ▴ 20

I'm trying to get taxonomic lineage from UniProt with the following SPARQL query (based on this and this answers):

prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
prefix taxon: <http://purl.uniprot.org/taxonomy/>
prefix :      <http://purl.uniprot.org/core/>

select ?ancestor ?name ?rank ?part_of_lineage
where {

  taxon:9597 rdfs:subClassOf ?ancestor .

  ?ancestor :scientificName ?name ;
            :partOfLineage ?part_of_lineage ;
            :rank ?rank .

} order by ?rank

This query yields 14 entries:

ancestor      name              rank           part_of_lineage

taxon:40674   Mammalia          :Class         true
taxon:9604    Hominidae         :Family        true
taxon:9596    Pan               :Genus         true
taxon:314293  Simiiformes       :Infraorder    false
taxon:33208   Metazoa           :Kingdom       true
taxon:9443    Primates          :Order         true
taxon:9526    Catarrhini        :Parvorder     true
taxon:7711    Chordata          :Phylum        true
taxon:207598  Homininae         :Subfamily     false
taxon:376913  Haplorrhini       :Suborder      true
taxon:89593   Craniata          :Subphylum     true
taxon:314295  Hominoidea        :Superfamily   false
taxon:2759    Eukaryota         :Superkingdom  true
taxon:314146  Euarchontoglires  :Superorder    true

You can try it with YASGUI.

Questions

  1. Note that, unlike the referenced answer, I used rdfs:subClassOf without +, because with rdfs:subClassOf+ I get this error message from UniProt:

    Exception:virtuoso.jdbc4.VirtuosoException: TN...: Exceeded 1000000000 bytes in transitive temp memory. use t_distinct, t_max or more T_MAX_memory options to limit the search or increase the pool

    Is it a bug in their storage backend, or am I misusing rdfs:subClassOf+?

  2. As far as I understand, the rdfs:subClassOf relationship is _semantically_ transitive, but it should connect only directly related entities. So if you want the direct ancestor, you use it once; if you want all ancestors, you use the "property paths" feature with rdfs:subClassOf+.

    But as far as I see from the results above and this query:

    describe <http://purl.uniprot.org/taxonomy/9597>
    from <http://sparql.uniprot.org/taxonomy>
    

    each node in the UniProt taxonomy graph is a subclass of many other nodes. Why is it so and how can I get just the direct parent of a given taxon in this situation?

  3. Having many ancestors, is there a way to order them (using SPARQL, without postprocessing results) _by taxonomic rank_ (not lexicographically as in the above query)? This would solve the previous question.

  4. If you open _Pan paniscus_ 9597 from the example above on UniProt, you will see that its lineage is much longer, but some nodes in it are grey. How is this lineage on the UniProt website related to the results of the query?

  5. If you check the NCBI Taxonomy, the _abbreviated_ lineage is also 14 nodes:

    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Pan
    

    But not all of them coincide! So what do I get in those results? Some random subset of the lineage?

  6. Finally, what does the :partOfLineage property mean? Documentation says:

    True for taxa that can appear as part of an organism's lineage

    But I don't understand what it means. Aren't all nodes parts of some lineage?

P.S. I read the UniProt Taxonomy and Taxonomic lineage documentation, but it doesn't answer my questions.

UPDATE Regarding my claim in question 5.

Here is the lineage from UniProt (9597):

  • ✔︎ Eukaryota
  • ✘ Opisthokonta
  • ✔︎ Metazoa
  • ✘ Eumetazoa
  • ✘ Bilateria
  • ✘ Deuterostomia
  • ✔︎ Chordata
  • ✔︎ Craniata
  • Vertebrata
  • ✘ Gnathostomata
  • ✘ Teleostomi
  • Euteleostomi
  • ✘ Sarcopterygii
  • ✘ Dipnotetrapodomorpha
  • ✘ Tetrapoda
  • ✘ Amniota
  • ✔︎ Mammalia
  • ✘ Theria
  • Eutheria
  • ✘ Boreoeutheria
  • ✔︎ Euarchontoglires
  • ✔︎ Primates
  • ✔︎ Haplorrhini
  • ✔︎ Simiiformes
  • ✔︎ Catarrhini
  • ✔︎ Hominoidea
  • ✔︎ Hominidae
  • ✔︎ Homininae
  • ✔︎ Pan

Those in bold are the ones with :partOfLineage true. The checkmarks/crosses on the left mean that the taxon is present/absent in the query result. Note that the result contains both types of nodes (not only those from the abbreviated lineage).
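
To list only the nodes flagged with :partOfLineage true, the original query can be narrowed as in the sketch below. It assumes the flag is stored as an xsd:boolean literal, so that the bare `true` matches, and it drops the :rank requirement so that taxa without a rank are not filtered out. It has not been checked against the live endpoint.

```sparql
prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
prefix taxon: <http://purl.uniprot.org/taxonomy/>
prefix :      <http://purl.uniprot.org/core/>

select ?ancestor ?name
where {
  taxon:9597 rdfs:subClassOf ?ancestor .

  # Assumption: :partOfLineage is an xsd:boolean, so `true` matches it directly.
  ?ancestor :scientificName ?name ;
            :partOfLineage true .
}
```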

UniProt SPARQL Taxonomy RDF • 3.6k views
ADD COMMENT
8.1 years ago
me ▴ 760

This is a very good set of questions, and as lead developer for the sparql.uniprot.org endpoint I will try to answer them all. However, in the future, do try asking at help@uniprot.org, as we do not always check Biostars!

  1. Transitive queries are problematic in all the SPARQL back-ends that are built upon relational-style engines. Virtuoso, which runs sparql.uniprot.org, sometimes works with transitive rdfs:subClassOf queries and sometimes does not, depending on the exact query. This is why we materialize this relation for rdfs:subClassOf, and why '+' is no longer needed. However, I think I made an error in this materialization (I might miss one step at the top). I need to look into this.

  2. You can use the inverse relation from the skos side:

    PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
    PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
    PREFIX up:<http://purl.uniprot.org/core/>
    # Direct children of Bacteria (taxon:2)
    SELECT ?direct_child_of_bacteria
    WHERE {
      taxon:2 rdf:type up:Taxon ;
              skos:narrowerTransitive ?direct_child_of_bacteria .
    }

  3. No, as we do not put that rank relation in our RDF files, so there is no way to extract it.

  4. This is directly related to number 6. :partOfLineage marks a node that Swiss-Prot curators consider informative enough to want to see in the Swiss-Prot flatfile serialization. It is only present in the RDF so that we can roundtrip between these two formats.

  5. The difference is that not every taxon node in the NCBI taxonomy has a rank, which is why you are seeing fewer results in the SPARQL result view than on the taxon record pages. You can see that in this query, where the rank is optional:

prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
prefix taxon: <http://purl.uniprot.org/taxonomy/>
prefix :      <http://purl.uniprot.org/core/>

select ?ancestor ?name ?rank ?part_of_lineage
where {

  taxon:9597 rdfs:subClassOf ?ancestor .

  ?ancestor :scientificName ?name ;
            :partOfLineage ?part_of_lineage .
  OPTIONAL {
    ?ancestor  :rank ?rank .
  }
}
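
On point 3: since the RDF does not relate ranks to one another, one possible workaround is to supply the ordering inside the query itself. The sketch below uses a hand-written VALUES table covering only the 14 ranks that appear in the question's result. The ?position numbers are an assumption of this example, not data from UniProt, and taxa without a rank (or with a rank missing from the table) drop out of the result. It has not been run against the live endpoint.

```sparql
prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
prefix taxon: <http://purl.uniprot.org/taxonomy/>
prefix :      <http://purl.uniprot.org/core/>

select ?ancestor ?name ?rank
where {
  # Hypothetical rank-to-position table, from the broadest rank to the narrowest.
  values (?rank ?position) {
    (:Superkingdom 1) (:Kingdom 2)    (:Phylum 3)       (:Subphylum 4)
    (:Class 5)        (:Superorder 6) (:Order 7)        (:Suborder 8)
    (:Infraorder 9)   (:Parvorder 10) (:Superfamily 11) (:Family 12)
    (:Subfamily 13)   (:Genus 14)
  }

  taxon:9597 rdfs:subClassOf ?ancestor .

  ?ancestor :scientificName ?name ;
            :rank ?rank .
}
order by ?position
```

Keeping the table inside the query avoids any postprocessing, at the cost of having to extend it by hand for other ranks.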
ADD COMMENT

Thank you for the answers! I also sent a couple of emails to help@uniprot.org, but I just wasn't sure which way to contact you is more effective. I have some subsequent questions:

  1. I thought that the original reason may be related to some technical limitations (like delayed SPARQL 1.1 support), but does it still make sense now? Or does Virtuoso have some general issues with evaluating graph traversals?

  2. The skos:narrowerTransitive relation does indeed solve my main question, so I can use it to get the direct parent of a taxon. And if I use it with +, it gives the same results as the query with rdfs:subClassOf.

  3. Would it make sense to establish relations between ranks (like :Species rdfs:subClassOf :Genus or something similar)? I know that the rank hierarchy is not very well defined in general, but in UniProt you have a limited set of ranks (from NCBI, I guess/hope).

  4. I see it now. I suspected something like this. Probably a clarification to the ontology documentation would be helpful: "True for taxa that can appear as part of an organism's abbreviated lineage".

  5. I added the difference to the question so as not to occupy too much space here.
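
The direct-parent lookup mentioned in point 2 could look roughly like the sketch below. It assumes, as the answer describes, that skos:narrowerTransitive links a taxon only to its direct children in the UniProt data, so reading the triple from the child's side yields the parent. The behaviour of the live endpoint may differ.

```sparql
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX up:    <http://purl.uniprot.org/core/>

SELECT ?parent ?name
WHERE {
  # The parent is the node that lists taxon:9597 as one of its narrower taxa.
  ?parent skos:narrowerTransitive taxon:9597 ;
          up:scientificName ?name .
}
```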

ADD REPLY
  1. Virtuoso, Oracle Semnet and DB2 SPARQL support all have a similar issue with traversals with large fanouts. rdfs:subClassOf has a very deep and wide fanout in the UniProt database, which causes trouble for these types of engines.

  2. We could do something with that, but making it correct will be tricky, as the semantics of direct rdfs:subClassOf would be wrong: not every :Species instance is an instance of a :Genus. We will need to think about how to do this correctly.

  3. Opening a ticket

  4. Changed my answer

ADD REPLY

Thanks a lot! I totally forgot about absent ranks (because in the NCBI data this is stored as the explicit value "no rank"). And I see that the rank hierarchy is more complicated than I thought.

ADD REPLY
