Are old versions of NCBI's nr stored somewhere?
10.7 years ago
5heikki 11k

Hello,

I'd like to study how NCBI's non-redundant protein database (nr) has developed over the years. However, I have yet to find a way to download anything but the latest release from the NCBI FTP site. Are the old versions lost for good from the public domain?

ncbi blast nr • 8.0k views

I think I could live with protein subsets of GenBank releases, but I haven't figured out where to download those either.


You can try asking the folks at NCBI if they have archived versions they could give you access to... see http://www.ncbi.nlm.nih.gov/About/glance/contact_info.html for details of how to contact them.

10.7 years ago
hpmcwill ★ 1.2k

As far as I am aware NCBI do not provide archived versions of the 'nr' database, although they might be available upon request.

However, most of the sequences in 'nr' come from the protein translations in GenBank, and UniProt provides archived releases of UniProtKB (which includes translations from EMBL-Bank), so the UniProt releases would probably cover what you need. See ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/.

Alternatively, UniProt's UniParc database is equivalent to NCBI's 'nr' database and provides additional date information, which would allow you to create subsets reflecting the database at a particular date. For the XML version of the UniParc database, which contains the additional information, see ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/

Please note: the NCBI 'nr' database and the UniParc database are sets of non-identical sequences (i.e. the databases contain one entry for each unique sequence, with metadata providing details of all the source entries containing that sequence). Non-redundant sequence databases such as UniRef, or those generated with CD-HIT, are different: they merge subsequences, such as those from sequencing fragments, into either the longest or a representative sequence. To generate your own 'nr'-like database(s), use the 'nrdb' program (http://blast.advbiocomp.com/pub/nrdb/) on your collection of sequences.

8.8 years ago
lukaskoz ▴ 30

Yes, NCBI should store old versions of nr, e.g. one every month; they are crucial for any bioinformatics work. In the meantime you can use my copy (far from perfect, but better than nothing):

ftp://genesilico.pl/lukaskoz/biological_databases/


This is a blessing! You are my hero.

10.7 years ago
Neilfws 49k

I don't believe that NCBI archives old database versions.

The best I can suggest is that you start from GenPept, extract the sequence and the submission date, then bin sequences by a suitable date interval and derive your own non-redundant set using e.g. CD-HIT. It would be a lot of work.
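
A rough sketch of the extract-and-bin step, assuming the input is a GenPept flat file read from stdin. The function name `genpept_to_yearly_fasta` is hypothetical. Note the main caveat: the date on the LOCUS line is the date of the last modification of the record, not necessarily the original submission date, so this is only an approximation of "when the sequence appeared".

```shell
# Split GenPept flat-file records (stdin) into one FASTA file per year,
# using the date at the end of each LOCUS line (DD-MON-YYYY).
# genpept_to_yearly_fasta is a hypothetical helper name.
genpept_to_yearly_fasta() {
  local outdir="$1"              # directory that receives YEAR.fasta files
  mkdir -p "$outdir"
  awk -v outdir="$outdir" '
    /^LOCUS/  { id = $2; date = $NF; seq = ""; in_seq = 0; next }
    /^ORIGIN/ { in_seq = 1; next }                 # sequence block starts
    /^\/\//   {                                    # end of record
      year = substr(date, length(date) - 3)        # DD-MON-YYYY -> YYYY
      if (id != "" && year ~ /^[0-9][0-9][0-9][0-9]$/) {
        file = outdir "/" year ".fasta"
        print ">" id "\n" toupper(seq) >> file
      }
      id = ""; in_seq = 0; next
    }
    in_seq    { gsub(/[^A-Za-z]/, ""); seq = seq $0 }   # drop digits, spaces
  '
}
```

For example, `genpept_to_yearly_fasta by_year < proteins.gp` (placeholder names) would append each record to `by_year/YYYY.fasta`. The per-year files would still need the non-redundancy step, e.g. CD-HIT as suggested above.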


Yeah, this is a reasonable approach. The script below (requires EDirect in your PATH) fetches the sequences added in a given year. For cumulative databases one obviously needs to fetch the sequences from the preceding years too. Anyway, I'm kind of shocked at how difficult this whole task turned out to be. One would think that the whole point of e.g. GenBank releases was that you could go back to older releases to e.g. verify the results of some study.

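#!/bin/bash
# Fetch the protein sequences published in each year into YEAR.fasta (requires EDirect in PATH).
for i in {1990..2014}
do
  # grep . drops empty lines from the efetch output
  esearch -db protein -query "${i}[Publication Date]" | efetch -format fasta | grep . > "$i.fasta"
done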

I have to point out, though, that the EDirect utilities are pretty horrible with large downloads, as is usual for all non-FTP traffic between any location and NCBI, so the above script is all but guaranteed to fail to download all the proteins of the later years.
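
For the cumulative case, the yearly files written by the script above can be chained locally instead of being fetched again. This is a minimal sketch, assuming the `YEAR.fasta` files sit in one directory; the function name `build_cumulative` and the `upto_YEAR.fasta` output names are made up for illustration.

```shell
# Build cumulative snapshots from per-year FASTA files:
# upto_Y.fasta = upto_(Y-1).fasta + Y.fasta. Missing years add nothing.
# build_cumulative and the upto_ prefix are hypothetical names.
build_cumulative() {
  local dir="$1" year="$2" last="$3" prev="" out
  while [ "$year" -le "$last" ]; do
    out="$dir/upto_$year.fasta"
    : > "$out"                                   # start from an empty file
    if [ -n "$prev" ]; then cat "$prev" >> "$out"; fi
    if [ -f "$dir/$year.fasta" ]; then cat "$dir/$year.fasta" >> "$out"; fi
    prev="$out"
    year=$((year + 1))
  done
}
```

Calling `build_cumulative . 1990 2014` in the download directory would leave one `upto_YEAR.fasta` per year. These snapshots are plain concatenations, so identical sequences are not merged; they would still have to go through nrdb or CD-HIT to resemble nr.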

8.1 years ago
natasha.sernova ★ 4.0k

See my answer to this post; you will find a link to the old NCBI versions inside:

where can I get environmental bacteria genome in fasta format (as many as possible)?

This one is just for bacteria:

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/

This one is for the others:

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq

8.1 years ago
blanca ▴ 10

In case someone else needs the old nr version (with gi numbers), I have found it here:

http://www.matrixscience.com/help/seq_db_setup_nr_gi.html
