Question

Mapping Protein IDs To Protein Cluster IDs

12.8 years ago

Robert Jenkins ▴ 120

Is there an ID mapping option (service or FTP) that I can use to map the protein IDs (PIDs) or accessions of a given set of genomes to their corresponding protein clusters? I tried the UniProt ID mapping interface, which offers conversion of accessions to blastclustDB. Surprisingly, however, the reference genome I am working on does not have UniProt accessions, even two years after its release at NCBI. Since Protein Clusters is an NCBI Entrez service, I assume there should be a link from protein IDs to protein clusters, but I have not been able to locate it.

Examples of the protein ID types that could be used to assign protein clusters: YP_003251185.1 or GI:261417503

identifiers conversion • 4.4k views

ADD COMMENT • link updated 12.7 years ago by Hamish ★ 3.3k • written 12.8 years ago by Robert Jenkins ▴ 120

By 'blastclustDB' do you mean Entrez Protein Clusters (ProtClustDB)? http://www.ncbi.nlm.nih.gov/proteinclusters

ADD REPLY • link 12.8 years ago by Hamish ★ 3.3k

If you do mean Entrez protein clusters - that's an experimental NCBI service, not updated since 2010. I would not recommend using it.

ADD REPLY • link 12.8 years ago by Neilfws 49k

Also, you do not need UniProt accessions to use the UniProt ID mapping service. It accepts multiple kinds of identifier, including GIs.

ADD REPLY • link 12.8 years ago by Neilfws 49k

@Hamish, yes, it is Protein Clusters at NCBI, which the UniProt mapping service refers to as Blastclust.

ADD REPLY • link 12.8 years ago by Robert Jenkins ▴ 120

@neilfws: UniProt ID mapping does not let me select GI to BlastclustDB conversion; if I choose it, it automatically changes to UniProtKB AC/ID.

ADD REPLY • link 12.8 years ago by Robert Jenkins ▴ 120

@Robert, checking at UniProt I cannot find any mention of "blastclustdb". However, looking for ProtClustDB finds the dbxref entry, along with the News announcement detailing the addition of ProtClustDB (http://www.uniprot.org/news/2010/03/02/release) and the Identifier mapping service documentation detailing the names for use with the ID Mapping web service (http://www.uniprot.org/faq/28#id_mapping_examples). So, in the interests of tracking this reference down so UniProt can correct it, where are you seeing "blastclustDB"?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 12.8 years ago by Hamish ★ 3.3k

ProtClustDB only contains clustering data for selected RefSeq proteins, so it is entirely possible that your proteins are not present in the database. Please edit your question to provide sample protein_ids and the identifier(s) for the reference genome so we can verify that is the case and suggest an appropriate tactic to map the proteins.

ADD REPLY • link 12.8 years ago by Hamish ★ 3.3k

@Hamish, sorry, that was a typo. It is ProtClustDB indeed, and I'm adding the example protein IDs as an additional edit to the original question. Thanks.

ADD REPLY • link 12.8 years ago by Robert Jenkins ▴ 120

score 0 · Answer 1 · 2012-02-28

Looking at your example YP_003251185.1, it is a provisional RefSeq and is in effect a direct clone of ACX76703 from the INSDC databases (DDBJ, EMBL-Bank & GenBank). Since UniProtKB uses EMBL-Bank as a primary data source (in the form of UniProtKB/TrEMBL; TrEMBL => translated EMBL-Bank), using the INSDC 'protein_id' when searching UniProtKB is more robust when there are possible synchronization issues. In this case a search in UniProtKB with ACX76703 finds C9RXR9. Checking the cross-references, this entry has the expected cross-reference to RefSeq YP_003251185, and does not contain a cross-reference to ProtClustDB.

Going back to the RefSeq entry, it gives the NCBI Taxonomy ID of the source organism: Geobacillus sp. Y412MC61, taxon:544556. UniProt uses the same identifiers in the UniProt Taxonomy (NEWT); the main difference is that UniProt sometimes chooses to use a different authority and thus a different species name. Finding the organism in UniProt is therefore a search for the taxonomy ID in their Taxonomy, which gives NCBI_TaxID=544556. As expected, the nomenclature used is slightly different: Geobacillus sp. (strain Y412MC61). The UniProt Taxonomy entry also tells us that UniProtKB contains a complete proteome for this organism.

Since the protein sequences are available in UniProtKB, they will be clustered as part of the UniProt Reference Clusters (UniRef) databases.

Checking the "Related information" section of the right-hand side-bar for the RefSeq entry, there is a link to "Protein Clusters", which gives the corresponding entry in ProtClustDB: CLSK712430. Checking the E-utilities documentation, Protein Clusters is available for searches. So to map from the RefSeq entry you can use ELink to get the identifiers (UIDs) of the related entries. For example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=proteinclusters&id=261417503

This gives the UID of the entry in ProtClustDB. Unfortunately, since EFetch does not support ProtClustDB, it is not possible to fetch the actual data, but ELink can be used again to get the UIDs of the member proteins of the cluster:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=proteinclusters&db=protein&id=712430
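
The two ELink lookups above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested client: the function names `build_elink_url`, `parse_linked_ids`, and `elink` are hypothetical, the URL uses https rather than http as in the examples above, the parser assumes the usual ELink XML layout (LinkSetDb elements containing DbTo and Link/Id), and it assumes the proteinclusters database is still served by E-utilities, which may no longer be the case given that the data has not been updated since 2010.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: same ELink endpoint as in the URLs above, but over https.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(dbfrom, db, uid):
    """Build an ELink URL with the dbfrom, db and id parameters used above."""
    query = urllib.parse.urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_ELINK}?{query}"


def parse_linked_ids(xml_data, db):
    """Return the UIDs found in LinkSetDb blocks whose DbTo equals db.

    Accepts str or bytes. Assumes the usual ELink XML layout.
    """
    root = ET.fromstring(xml_data)
    ids = []
    for linkset_db in root.iter("LinkSetDb"):
        if linkset_db.findtext("DbTo") == db:
            ids.extend(link.findtext("Id") for link in linkset_db.findall("Link"))
    return ids


def elink(dbfrom, db, uid):
    """Query ELink over the network and return the linked UIDs."""
    url = build_elink_url(dbfrom, db, uid)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_linked_ids(response.read(), db)
```

Under those assumptions, `elink("protein", "proteinclusters", "261417503")` would return the cluster UIDs for the example GI, and `elink("proteinclusters", "protein", uid)` would return the member protein UIDs for each of those clusters.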

Alternatively, you can have a look at the ProtClustDB data on NCBI's FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/). This contains information about each cluster, including the nicer cluster identifiers used on the web interface.

However, as Neil has mentioned, this data has not been updated since 2010, and it was an experimental project looking into clustering methods. This is the likely reason why the UniProtKB entries are missing the expected ProtClustDB cross-reference, since in most cases UniProt depends on the database maintainer providing the cross-reference data to be included in the entries.