Exhaustive up-to-date protein database?? nr?
10.5 years ago

Hello all,

I am in a fix. For my work, I need an exhaustive and up-to-date database of
all protein sequences, especially covering ALL eukaryotes sequenced to date.
So I downloaded the latest version of the nr database from NCBI's FTP site.
However, I find that it does not contain all the putative protein
sequences listed in the individual genome databases.

For discussion purposes, let us consider Cyanophora paradoxa (taxonomy
ID: 2762). According to the website
http://cyanophora.rutgers.edu/cyanophora/blast.php, it has 32,167 protein
coding sequences. However, there are only 731 GI numbers corresponding to this
species in the latest nr database (25 May 2014 version). The complete genome
of Cyanophora paradoxa was published in 2012 (Price DC et al., 2012).
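
A count like the one above can be roughly cross-checked against a FASTA dump of the database. The sketch below is only an approximation and rests on two assumptions: that organism names appear in square brackets in the deflines, and that counting records is an acceptable stand-in for counting GI numbers (a record whose defline merges several entries is counted once). The function name and the example records are made up for illustration.

```python
import io

def count_records_for_organism(handle, organism):
    """Count FASTA records whose defline mentions `organism` in square brackets."""
    tag = "[" + organism.lower() + "]"
    count = 0
    for line in handle:
        # Only deflines start with '>'; sequence lines are skipped.
        if line.startswith(">") and tag in line.lower():
            count += 1
    return count

# Tiny made-up example; in practice `handle` would be an open FASTA file.
example = io.StringIO(
    ">seq1 hypothetical protein [Cyanophora paradoxa]\nMKT\n"
    ">seq2 actin [Homo sapiens]\nMDD\n"
    ">seq3 rbcL [Cyanophora paradoxa]\nMSP\n"
)
print(count_records_for_organism(example, "Cyanophora paradoxa"))
```

Reading line by line keeps memory use flat, which matters for a file the size of nr.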

Thus, the only option I can see is to download the protein sequences from the individual genome projects, merge identical sequences into a single entry, and append the result to the nr database to get an exhaustive set of proteins. Before I begin this mammoth task, however, I wanted to ask whether there is a simpler solution to this problem.
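
For concreteness, the "merge identical entries" step could look roughly like the sketch below. It is a minimal illustration, not a tested pipeline: it assumes plain FASTA input, treats two sequences as identical only if they match exactly after upper-casing and stripping a trailing `*`, and holds all sequences in memory. The function names and the ` | ` header separator are arbitrary choices.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def collapse_identical(handles):
    """Map each distinct sequence to the list of headers that carry it."""
    merged = {}  # dicts keep insertion order in Python 3.7+
    for handle in handles:
        for header, seq in read_fasta(handle):
            key = seq.upper().rstrip("*")  # ignore case and a trailing stop symbol
            merged.setdefault(key, []).append(header)
    return merged

def write_fasta(merged, out, sep=" | ", width=60):
    """Write one record per distinct sequence, joining all of its headers."""
    for seq, headers in merged.items():
        out.write(">" + sep.join(headers) + "\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")

# Made-up example: two sources that share one sequence.
source_a = io.StringIO(">p1 from A\nMKT\nLLV\n>p2 from A\nMSS*\n")
source_b = io.StringIO(">q1 from B\nmktllv\n>q2 from B\nMAA\n")
result = io.StringIO()
write_fasta(collapse_identical([source_a, source_b]), result)
print(result.getvalue())
```

For inputs on the scale of nr, keeping every sequence in a dictionary may not fit in memory, so a disk-backed store or a dedicated clustering tool would likely be needed.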

Any help or suggestions are very welcome.

Thanks a lot for your time,
Ramya

protein database nr database • 4.1k views

I would suggest UniProt and TrEMBL.

BTW, nice to meet you here Ramya :)


Thanks for the suggestion, Bharat. I agree that UniProt + TrEMBL (roughly UniParc) is the closest I can get. However, even that seems to list only 'complete proteomes' and not proteomes for draft genomes. For any other species I will have to go to the individual genome pages.

10.5 years ago
hpmcwill ★ 1.2k

Well the current NCBI nr contains 40,337,612 sequences originating from: GenBank CDS translations (excluding those from environmental samples and WGS projects), UniProtKB/SwissProt, PDB and PRF.

UniProt's UniParc database has more coverage, currently containing 63,875,797 unique protein sequences, including the CDS translations from the INSDC databases (DDBJ, EMBL-Bank and GenBank) which are excluded from NCBI nr, and additional sequences from various other sources.

You could also try having a look at SIMAP, which contains additional protein sequences from metagenomics experiments.

However, looking at the paper (Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants) and searching the nucleotide databases, it appears that the only submission has been to the Sequence Read Archive:

The corresponding assembly and associated feature annotations do not appear to have been submitted as yet, presumably due to this currently being a draft genome. Thus the protein sequences which are derived through translation of INSDC CDS features do not appear in NCBI nr or UniProtKB.

In general you should be able to start from a non-identical sequence archive such as UniParc and add unique protein sequences from other sources, assuming you can obtain them with appropriate annotations for your purposes.
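
As a rough illustration of that last step, the sketch below copies a base FASTA set through unchanged and then appends only those records from another source whose sequence is not already present. It is a minimal sketch under stated assumptions: plain FASTA input, exact matching after upper-casing and stripping a trailing `*`, and SHA-256 digests (via Python's `hashlib`) used purely to keep the in-memory index small. This is not how UniParc itself determines identity, and the function names are hypothetical.

```python
import hashlib
import io

def fasta_records(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def seq_digest(seq):
    """Fixed-size fingerprint of a sequence, ignoring case and a trailing '*'."""
    return hashlib.sha256(seq.upper().rstrip("*").encode("utf-8")).digest()

def append_novel(base_handle, extra_handle, out):
    """Write all base records, then extra records with unseen sequences.

    Returns the number of records added from `extra_handle`.
    """
    seen = set()
    for header, seq in fasta_records(base_handle):
        seen.add(seq_digest(seq))
        out.write(f">{header}\n{seq}\n")
    added = 0
    for header, seq in fasta_records(extra_handle):
        digest = seq_digest(seq)
        if digest not in seen:
            seen.add(digest)
            out.write(f">{header}\n{seq}\n")
            added += 1
    return added

# Made-up example: only one of the three extra records is new.
base = io.StringIO(">u1\nMKT\n>u2\nMAA\n")
extra = io.StringIO(">x1\nmkt\n>x2\nMGG\n>x3\nMGG*\n")
combined = io.StringIO()
print(append_novel(base, extra, combined), "record(s) added")
```

Storing 32-byte digests instead of full sequences keeps the index compact, at the cost of losing the ability to merge headers for sequences that were already in the base set.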


Thanks a lot @hpmcwill. It's definitely a nudge in the right direction. I browsed through the UniParc dataset and it does give me "complete proteomes" for species whose genomes have been completely sequenced. What I am still missing is proteomes for draft genomes. Specifically, in the case of Cyanophora paradoxa, I could find its proteome (32,167 sequences) at the link 'http://cyanophora.rutgers.edu/cyanophora/Cyanophora_paradoxa_MAKER_gene_predictions-022111-aa.fasta'. However, I was wondering whether there is a database where such 'incomplete proteomes' are listed together, so that I do not have to visit each individual genome page. For now I shall follow your suggestion, take the UniParc database as my base, and build upon it. Thank you very much.


Hello,

There is no such "draft proteome" flag in UniProtKB, only "complete proteome" and "reference proteome", defined here.

However, @hpmcwill has a good point here: the 32,167 protein sequences you are referring to seem (to me) not to have been published in any of the general databases (INSDC, RefSeq, Ensembl...), which also means they won't be in either UniParc or UniProtKB.

Here is the list of databases UniParc gets its data from: http://www.uniprot.org/help/uniparc (see the data sources section).

So I'm afraid the only place you can find those sequences is where you found them, until they are submitted.

Sorry for not helping more


Dear Ben, thanks for the comment. As everyone here seems to agree, for specific individual genomes there is no way around going to the project-specific pages. It seems that the protein annotations of many newly sequenced genomes are not submitted to the central repositories.


You may want to contact the providers of the missing individual genomes and ask when they plan to submit their annotated sequence data to the major resources. It may be that this step, getting the data to as many users as possible, has been forgotten in the day-to-day work of doing research on the data, and they just need to be reminded.
