How to retrieve EC numbers and KOs for proteins of several taxons?
6.5 years ago
cleb ▴ 70

This is cross-posted from here.

I would like to use UniProt's SPARQL endpoint to retrieve all proteins that

  1. are reviewed (required)
  2. are associated with taxonomy ID 562 or 3702 (required)
  3. have a KO associated with them (optional)
  4. have "evidence for the existence of a protein" at either the protein or the transcript level (required)
  5. have an EC number associated with them (required)

I have so far (points 1 and 2):

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein ?taxon ?name
WHERE
{        
        ?taxon a up:Taxon .
        ?taxon up:scientificName ?name .
        VALUES ?taxonlist { taxon:562 taxon:3702 }
        ?taxon rdfs:subClassOf ?taxonlist .

        ?protein a up:Protein .
        ?protein up:organism ?taxon . 
        ?protein up:reviewed true .  # have to be reviewed        

}

This, however, does not return anything for 3702. How can this be fixed, and how can I incorporate points 3-5?

Additionally, is there now a way to connect UniProt's SPARQL endpoint with Rhea's SPARQL endpoint to retrieve all associated reactions and their stoichiometries (with ChEBI IDs) for the proteins selected above? Example 19 seems to suggest that this connection is possible, but I am not quite sure how to accomplish it.

sparql uniprot semantic-web • 2.0k views
6.5 years ago
me ▴ 760

1) This is already handled correctly in the query with

 ?protein up:reviewed true .

2) The query in the question does not return anything for taxon:3702 because there are no rdfs:subClassOf descendants for Arabidopsis thaliana; it is a leaf node. This means the entries are linked directly to that taxon rather than to one of its descendants. This is fixed by changing the query slightly to deal with both the descendant case and the direct case (the two sides of the UNION below). In the direct case, ?taxon is bound explicitly; otherwise it would stay unconstrained and match every taxon.

    VALUES ?taxonlist { taxon:3702 taxon:562 }
    {
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon .
    } UNION {
        ?protein up:organism ?taxonlist .
        BIND(?taxonlist AS ?taxon)
    }
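
For more than two organisms, the VALUES line can be generated instead of typed by hand. The following is only a minimal Python sketch: `taxon_values_clause` is a hypothetical helper name, and it assumes the query declares the `taxon:` prefix exactly as in the question.

```python
def taxon_values_clause(taxon_ids, variable="taxonlist"):
    """Build a SPARQL VALUES clause for a list of NCBI taxonomy IDs.

    Hypothetical helper: assumes the surrounding query declares
    PREFIX taxon:<http://purl.uniprot.org/taxonomy/>.
    """
    terms = []
    for taxon_id in taxon_ids:
        # int() rejects anything that is not a plain number, so no
        # unexpected text ends up spliced into the query.
        terms.append(f"taxon:{int(taxon_id)}")
    if not terms:
        raise ValueError("at least one taxonomy ID is required")
    return f"VALUES ?{variable} {{ {' '.join(terms)} }}"


print(taxon_values_clause([3702, 562]))
# VALUES ?taxonlist { taxon:3702 taxon:562 }
```

The returned line can be dropped in place of the hand-written VALUES line above; the UNION that follows it stays unchanged.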

3) KOs are stored in the cross-reference section, which is linked via rdfs:seeAlso. Because an entry can have more than one KO, we group them with a subquery.

OPTIONAL {
    SELECT ?protein (GROUP_CONCAT(?ko; SEPARATOR=", ") AS ?kos)
    WHERE{
      ?protein rdfs:seeAlso ?ko .
      ?ko up:database <http://purl.uniprot.org/database/KO>
    } GROUP BY ?protein
}

4) To require evidence for existence at the protein or transcript level, we add

{
     ?protein up:existence up:Evidence_at_Protein_Level_Existence .
} UNION {
    ?protein up:existence up:Evidence_at_Transcript_Level_Existence .
}

5) To make sure the entry is annotated as an enzyme, we use the same subquery idea as for the KO links, but this time without OPTIONAL. The GROUP_CONCAT collapses the potentially many ECs into a single value. The long line with up:enzyme covers the different ways UniProt links an ?ec to an entry: directly, or via a component or domain.

SELECT ?protein (GROUP_CONCAT(?ec; SEPARATOR=", ") AS ?ecs)
WHERE{
  ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec
} GROUP BY ?protein
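
Since ?ec is an IRI, ?ecs comes back as a single string of comma-separated IRIs. A small sketch of how that string might be turned back into bare EC numbers on the client side is shown here. It assumes EC IRIs of the form `http://purl.uniprot.org/enzyme/<number>` and the ", " separator used in the query; `split_concat` is a hypothetical name.

```python
def split_concat(value, separator=", "):
    """Split a GROUP_CONCAT string and keep the last path segment of each IRI."""
    if not value:
        return []
    local_names = []
    for iri in value.split(separator):
        # "http://purl.uniprot.org/enzyme/1.1.1.1" -> "1.1.1.1"
        local_names.append(iri.rstrip("/").rsplit("/", 1)[-1])
    return local_names


ecs = "http://purl.uniprot.org/enzyme/2.7.11.1, http://purl.uniprot.org/enzyme/3.1.3.16"
print(split_concat(ecs))
# ['2.7.11.1', '3.1.3.16']
```

The same function should also work for ?kos, provided the KO IRIs end in the KO identifier; an empty or missing value (the OPTIONAL case) gives an empty list.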

Combining everything into one query gives

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT 
    ?protein 
    ?taxon 
    ?name
    ?kos
    ?ecs
WHERE
{   
    ?protein a up:Protein .
    ?protein up:reviewed true .  # have to be reviewed        
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    VALUES ?taxonlist { taxon:3702  taxon:562 }
    {
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon .
    } UNION {
        ?protein up:organism ?taxonlist .
        BIND(?taxonlist AS ?taxon)  # direct case: without this, ?taxon would match every taxon
    }

    {
        ?protein up:existence up:Evidence_at_Protein_Level_Existence .
    } UNION {
        ?protein up:existence up:Evidence_at_Transcript_Level_Existence .
    }
    {
       SELECT ?protein (GROUP_CONCAT(?ec; SEPARATOR=", ") AS ?ecs)
       WHERE{
           ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec
       } GROUP BY ?protein
    }
    OPTIONAL {
        SELECT ?protein (GROUP_CONCAT(?ko; SEPARATOR=", ") AS ?kos)
        WHERE{
            ?protein rdfs:seeAlso ?ko .
            ?ko up:database <http://purl.uniprot.org/database/KO>
        } GROUP BY ?protein
    } 
}

The query can be tested at sparql.uniprot.org.
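
To run the query from a script rather than the web form, something along these lines might work. It is a sketch using only the Python standard library, and it assumes the endpoint lives at `https://sparql.uniprot.org/sparql` and returns the standard SPARQL JSON results format; the function names are hypothetical, and the sample response is made up purely to show the parsing step offline.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.uniprot.org/sparql"  # assumed endpoint URL


def build_request(query, endpoint=ENDPOINT):
    """Prepare a POST request following the SPARQL 1.1 Protocol."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def rows_from_json(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = []
    for binding in payload["results"]["bindings"]:
        # Unbound OPTIONAL variables (e.g. ?kos) are simply absent from a row.
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the parsed rows (needs network access)."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        return rows_from_json(json.load(response))


# Offline demonstration with a made-up response in the standard format.
sample = {
    "head": {"vars": ["protein", "ecs", "kos"]},
    "results": {
        "bindings": [
            {
                "protein": {"type": "uri", "value": "http://example.org/protein/A"},
                "ecs": {"type": "literal", "value": "http://example.org/ec/1"},
            }
        ]
    },
}
print(rows_from_json(sample))
```

Calling `run_query(...)` with the full query text above would then return one dict per protein; rows without a KO simply lack the `kos` key.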


We are running out of space for the Rhea part, so we will make a separate Q&A for it.


I now opened a new question here. Thanks for helping out!


Did you have a chance to look at the second question (no pressure, just very curious :) )? If so, is this connection possible? Alternatively, one could maybe also try to get all reactions (substrates, products and stoichiometric factors) for all the EC numbers. Thanks!


My chemistry is a bit limited, so I need my colleague to help with the stoichiometric factors, and it's production week, so time is hard to find.


Thanks for the reply. I opened a more specific question for this here. I guess one can infer directly from the scheme how to access the stoichiometries but my attempts all failed.


I added a new post here; it would be great if you could take a look, thanks!
