My question: How can I retrieve all entries of the Escherichia coli K-12 proteome using the UniProt SPARQL endpoint?
Endpoint: http://sparql.uniprot.org/sparql
Context: I want to get the entries of the UniProt ECOLI (Escherichia coli K-12) complete proteome. I expect to find only UniProtKB/Swiss-Prot proteins (reviewed entries).
Here is what I did:
First, I retrieved the proteins with the keyword 'complete proteome' (keywords:181) and taxon:83333 (Escherichia coli K-12), counted by review status:
PREFIX keywords: <http://purl.uniprot.org/keywords/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?reviewed (COUNT(DISTINCT ?protein) AS ?proteinCount)
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:83333 ;
           up:classifiedWith|(up:classifiedWith/rdfs:subClassOf) ?kw ;
           up:reviewed ?reviewed .
  ?kw a up:Concept .
  VALUES (?kw) { (keywords:181) }
}
GROUP BY ?reviewed
Result:
|reviewed |proteinCount|
|"true"^^xsd:boolean |"4313"^^xsd:int|
|"false"^^xsd:boolean |"2"^^xsd:int|
This result is unexpected: there are 2 TrEMBL entries (up:reviewed false).
In fact, an organism may have several proteomes. With E. coli, I should have anticipated that! The definition of proteomes is well documented on the UniProt website (reference_proteome). And indeed, there are 2 proteomes for Escherichia coli K-12:
- UP000000318 (Escherichia coli ATCC 27325)
- UP000000625 (Escherichia coli MG1655) - Reference proteome
Instead of keywords:181 (complete proteome), I should have used keywords:1185 (reference proteome) (KW-1185):
PREFIX keywords: <http://purl.uniprot.org/keywords/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?reviewed (COUNT(DISTINCT ?protein) AS ?proteinCount)
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:83333 ;
           up:classifiedWith|(up:classifiedWith/rdfs:subClassOf) ?kw ;
           up:reviewed ?reviewed .
  ?kw a up:Concept .
  VALUES (?kw) { (keywords:1185) }
}
GROUP BY ?reviewed
Result:
|reviewed |proteinCount|
|"true"^^xsd:boolean |"4313"^^xsd:int|
bingo!
Now let's display the proteome data for taxon:83333, counting proteins per proteome and review status:
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX proteome: <http://purl.uniprot.org/proteome/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?proteome ?reviewed (COUNT(DISTINCT ?protein) AS ?proteinCount)
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:83333 ;
           up:proteome ?proteome ;
           up:reviewed ?reviewed .
}
GROUP BY ?reviewed ?proteome
|proteome |reviewed |proteinCount|
|http://purl.uniprot.org/proteomes/UP000000318#Chromosome |"true"^^xsd:boolean |"4255"^^xsd:int|
|http://purl.uniprot.org/proteomes/UP000000318#Chromosome |"false"^^xsd:boolean |"2"^^xsd:int|
|http://purl.uniprot.org/proteomes/UP000000625#Chromosome |"true"^^xsd:boolean |"4313"^^xsd:int|
And now, get the list of proteins for UP000000625#Chromosome
SPARQL query:
PREFIX proteome:<http://purl.uniprot.org/proteome/>
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?proteome ?protein
WHERE
{
?protein a up:Protein ;
up:reviewed true ;
up:proteome ?proteome .
VALUES (?proteome) {(proteome:UP000000625#Chromosome)}
}
Unfortunately, I get an error message:
Encountered " "}" "} "" at line 13, column 1. Was expecting one of: ")" ... "true" ... "false" ... "UNDEF" ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Replacing UP000000625#Chromosome with UP000000625 gives no results.
Question: Does anyone know how to retrieve the proteins of a given UniProt proteome (UP000000625 in my case)?
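One lead that may be worth trying, offered as an untested sketch. In SPARQL, an unquoted '#' starts a comment, so in the failing query everything after UP000000625 on the VALUES line is dropped, which would explain the parse error. The proteome values in the table above also live under http://purl.uniprot.org/proteomes/ (plural), whereas the declared prefix is http://purl.uniprot.org/proteome/, which could explain the empty result. The query below avoids both issues by writing the full IRI in angle brackets. The IRI is copied from the result table above and is assumed to still be what the endpoint returns.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?proteome ?protein
WHERE
{
  # Full IRI in angle brackets: '#' is part of the IRI here,
  # so it no longer starts a comment.
  # Assumption: this IRI is copied from the result table above.
  VALUES ?proteome { <http://purl.uniprot.org/proteomes/UP000000625#Chromosome> }
  ?protein a up:Protein ;
           up:reviewed true ;
           up:proteome ?proteome .
}
```

If the proteome has other components besides the chromosome, matching on the IRI prefix with something like FILTER(STRSTARTS(STR(?proteome), "http://purl.uniprot.org/proteomes/UP000000625")) might be a more general alternative to VALUES.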