Question

Exhaustive up-to-date protein database?? nr?

2

10.5 years ago

purkantiramya ▴ 20

Hello all,

I am in a fix. For my work, I need an exhaustive and up-to-date database of
all protein sequences, especially one covering ALL eukaryotes sequenced to date.
I therefore downloaded the latest version of the nr database from NCBI's FTP site.
However, I find that it does not contain all the putative protein
sequences listed in the individual genome databases.

For discussion purposes, let us consider Cyanophora paradoxa (taxonomy
ID: 2762). According to the website
http://cyanophora.rutgers.edu/cyanophora/blast.php, it has 32,167
protein-coding sequences. However, there are only 731 GI identifiers
corresponding to this species in the latest nr database (25May2014 version),
even though the Cyanophora paradoxa genome was published in 2012
(Price DC et al, 2012).
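For anyone wanting to reproduce this kind of count, here is a minimal Python sketch. It assumes the FASTA dump of nr tags each defline with the organism in square brackets and joins the deflines of identical sequences with a Ctrl-A character; the function name `count_organism_deflines` and the identifiers in the sample are made up for illustration.

```python
def count_organism_deflines(lines, organism):
    """Count nr records, and individual deflines, tagged with [organism]."""
    tag = "[" + organism + "]"
    records = 0
    deflines = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Assumption: merged deflines are separated by Ctrl-A (\x01).
        hits = sum(1 for d in line[1:].split("\x01") if tag in d)
        if hits:
            records += 1
            deflines += hits
    return records, deflines


# Tiny invented sample; the identifiers are placeholders, not real entries.
sample = [
    ">gi|1|ref|XP_1| protein x [Cyanophora paradoxa]\x01gi|2|gb|AAA| protein x [Cyanophora paradoxa]\n",
    "MKT\n",
    ">gi|3|ref|XP_3| kinase [Homo sapiens]\n",
    "MAA\n",
]
print(count_organism_deflines(sample, "Cyanophora paradoxa"))
```

On a real download the same function can be fed an open file handle, for example one returned by `gzip.open(path, "rt")`, since it only iterates over lines.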

Thus, the only option I can see is to download the protein sequences from the individual genome projects, merge identical sequences into a single entry, and then append the result to the nr database to get an exhaustive set of proteins. However, before I begin this mammoth task, I thought I would ask whether there is a simpler solution to this problem.
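The merging step described above could look roughly like the following. This is only a sketch using the Python standard library: the helper names `read_fasta` and `merge_identical` are hypothetical, the example records are invented, and treating sequences as identical after upper-casing and dropping a trailing `*` is an assumption about how strict the comparison should be.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_identical(records):
    """Map each distinct sequence to the list of headers that carry it."""
    merged = {}
    for header, seq in records:
        # Assumption: ignore case and a trailing stop symbol when comparing.
        key = seq.upper().rstrip("*")
        merged.setdefault(key, []).append(header)
    return merged


# Invented records standing in for two genome-project downloads.
fasta = """>protA source1
MKTAY
IAK
>protB source2
mktayiak*
>protC source2
MSSQ
""".splitlines()

merged = merge_identical(read_fasta(fasta))
for seq, headers in merged.items():
    print(">" + " | ".join(headers))
    print(seq)
```

Keeping every original header alongside the merged sequence means the source of each entry is not lost, which matters if the combined set is later searched with BLAST.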

Any help or suggestions would be very welcome.

Thanks a lot for your time,
Ramya

protein database nr database • 4.1k views

ADD COMMENT • link updated 10.5 years ago by hpmcwill ★ 1.2k • written 10.5 years ago by purkantiramya ▴ 20

0

I would suggest UniProt and TrEMBL.

BTW, nice to meet you here Ramya :)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Bharat Iyengar ▴ 330

0

Thanks for the suggestion, Bharat. I agree that UniProt + TrEMBL (roughly equivalent to UniParc) is the closest I can get. However, even that seems to list only 'complete proteomes' and not proteomes for draft genomes. For any other species I will still have to go to the individual genome pages.

ADD REPLY • link 10.5 years ago by purkantiramya ▴ 20

0

You may find "What Are The Proteomics Data Repositories?" useful.

ADD REPLY • link 10.5 years ago by Bharat Iyengar ▴ 330

Ram · Accepted Answer · 2014-05-27

4

10.5 years ago

hpmcwill ★ 1.2k

Well, the current NCBI nr contains 40,337,612 sequences originating from GenBank CDS translations (excluding those from environmental samples and WGS projects), UniProtKB/SwissProt, PDB and PRF.

UniProt's UniParc database has more coverage, currently containing 63,875,797 unique protein sequences, including the CDS translations from the INSDC databases (DDBJ, EMBL-Bank and GenBank) which are excluded from NCBI nr, and additional sequences from various other sources.

You could also have a look at SIMAP, which contains additional protein sequences from metagenomics experiments.

However, looking at the paper (Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants) and searching the nucleotide databases, it appears that the only submission has been to the Sequence Read Archive.

The corresponding assembly and associated feature annotations do not appear to have been submitted yet, presumably because this is currently a draft genome. Since the protein sequences in NCBI nr and UniProtKB are derived by translating INSDC CDS features, the proteins from this genome do not appear in either.

In general you should be able to start from a non-identical sequence archive such as UniParc and add unique protein sequences from other sources, assuming you can obtain them with appropriate annotations for your purposes.
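As a rough illustration of that last step, the sketch below keeps only those records from an extra source whose sequence is not already in the base set. It is not an official UniParc workflow: the function names are hypothetical, the records are invented, and comparing MD5 digests of upper-cased sequences (rather than the sequences themselves) is simply an assumption made to keep memory use down when the base set holds tens of millions of entries. Reading the FASTA files is left out; any parser that yields (header, sequence) pairs would do.

```python
import hashlib


def seq_digest(seq):
    """Return an MD5 hex digest of the normalised sequence."""
    # Assumption: case and a trailing stop symbol are not significant.
    return hashlib.md5(seq.upper().rstrip("*").encode("ascii")).hexdigest()


def novel_records(base_seqs, extra_records):
    """Yield (header, seq) pairs from extra_records absent from base_seqs."""
    seen = {seq_digest(s) for s in base_seqs}
    for header, seq in extra_records:
        digest = seq_digest(seq)
        if digest not in seen:
            seen.add(digest)  # also drops duplicates within the extra source
            yield header, seq


# Invented data: one base sequence and three candidate additions.
base = ["MKTAYIAK"]
extra = [("cand1", "mktayiak"), ("cand2", "MSSQ"), ("cand3", "MSSQ*")]
for header, seq in novel_records(base, extra):
    print(">" + header)
    print(seq)
```

Only the records that survive this filter would then need to be appended to the base archive, together with whatever annotation is required downstream.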

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by hpmcwill ★ 1.2k

0

Thanks a lot @hpmcwill. It's definitely a nudge in the right direction. I browsed through the UniParc dataset and it does give me "complete proteomes" for species whose genomes have been completely sequenced. What I am still missing is proteomes for draft genomes. Specifically, in the case of Cyanophora paradoxa, I could find its proteome (32,167 sequences) at the link 'http://cyanophora.rutgers.edu/cyanophora/Cyanophora_paradoxa_MAKER_gene_predictions-022111-aa.fasta'. However, I was wondering whether there is a database where such 'incomplete proteomes' are listed together, so that I do not have to visit each of the individual genome pages. For now I shall follow your suggestion, take the UniParc database as my base and build upon it. Thank you very much.

ADD REPLY • link 10.5 years ago by purkantiramya ▴ 20

0

Hello,

There is no such "draft proteome" flag in UniProtKB, only "complete proteome" and "reference proteome", as defined here.

However, @hpmcwill has a good point: the 32,167 protein sequences you are referring to do not seem (to me) to have been published in any of the general databases (INSDC, RefSeq, Ensembl...), which also means they won't be in either UniParc or UniProtKB.

Here is the list of databases UniParc gets its data from: http://www.uniprot.org/help/uniparc (see the data source section).

So I'm afraid that, until they are submitted, the only place you can find those sequences is where you already found them.

Sorry for not helping more

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Ben • 0

0

Dear Ben, thanks for the comment. As everyone here seems to agree, for a specific individual genome there is no option but to go to its own project page. It seems that the protein annotations of many newly sequenced genomes have not been submitted to the central repositories.

ADD REPLY • link 10.5 years ago by purkantiramya ▴ 20

0

You may want to contact the providers of the missing individual genomes and ask when they plan to submit their annotated sequence data to the major resources. It may be that this step, which gets the data to as many users as possible, has been forgotten in the day-to-day work of doing research on the data, and they just need to be reminded.

ADD REPLY • link 10.5 years ago by hpmcwill ★ 1.2k