How to download the complete Nucleotide collection (nr/nt) database?
5.5 years ago

Dear all,

I need to perform a large BLAST search and I am using blastn remotely from the terminal. However, this takes far too long to return an answer, so I have been thinking of creating a local database to speed up the analysis. How can I download the whole nr/nt repository? I see there is one here for RefSeq. Would this be good? Would it already be indexed, or should I create the index with makeblastdb?
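For reference, this is roughly the kind of remote call I am running at the moment (input.fasta is just a placeholder for my query file):

blastn -query input.fasta -db nt -remote -out results.txt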

Thank you

Tags: blast • nr • nt

Thank you, but which files should I get for nr/nt? I understand I should use ./update_blastdb.pl --decompress ..., but with which other parameters? From the manual I can see ./update_blastdb.pl --decompress swissprot, but I am not interested in proteins. Since the command to build the database is makeblastdb -in {input} -dbtype nucl, I tried:

$ perl ~/src/blast/bin/update_blastdb.pl --decompress nucl
Connected to NCBI
nucl not found, skipping.

So what would be the correct syntax?

5.5 years ago
Fabio Marroni ★ 3.0k

You can use

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

These are FASTA files; they are not indexed. You should use the makeblastdb command to index them.
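For example, a minimal sketch of the indexing step, assuming you have downloaded nt.gz into the current directory (the output name is arbitrary):

gunzip nt.gz
makeblastdb -in nt -dbtype nucl -title nt -out nt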

You might also want to browse ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA and check what other databases are available.

Hi, I downloaded nr.gz, but why is there a single file here, whereas there are several nr.* files at ftp://ftp.ncbi.nih.gov/blast/db/?

Thanks. Why two databases? Shouldn't it be a single one, nt/nr?

Because nt contains nucleotide sequences and nr contains protein sequences. Depending on the kind of searches you want to do, you will need to choose one.

Get the pre-formatted database files from ftp://ftp.ncbi.nih.gov/blast/db/. There is no point in trying to get the FASTA files and build your own. You need to download all files with nt or nr in the name, put them in one directory, and uncompress them; that is all that should be needed.
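As a rough sketch for nt (the exact volume names change over time, so check the FTP listing first; /path/to/blastdb is just a placeholder directory):

cd /path/to/blastdb
wget "ftp://ftp.ncbi.nih.gov/blast/db/nt.*.tar.gz"
for f in nt.*.tar.gz; do tar -xzf "$f"; done
export BLASTDB=/path/to/blastdb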

Note: You will need tens of GB of RAM to do local searches against nt or nr.

Thank you, this is clearer. And if I wanted to use update_blastdb.pl, what would be the right syntax? Would it be better than downloading manually?

perl update_blastdb.pl --decompress nt
perl update_blastdb.pl --decompress nr

Using this method will download all of the chunks automatically, without having to fetch multiple tar files by hand. Make sure you have enough space available locally.
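Once the download finishes, you can point blastn at the database by name, for example (just a sketch; input.fasta is a placeholder query, and this assumes the database files sit in the current directory or in $BLASTDB):

blastn -query input.fasta -db nt -out results.txt -outfmt 6 -num_threads 8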

Thanks! It worked fine.

Yes, I think that genomax's suggestion is good. It makes no sense to download the FASTA and build the database yourself when you can download the pre-formatted one!

Thank you, I used update_blastdb.pl and managed to create the local database. Alas, the search speed is not much better than the remote one. It is not so much a problem of RAM as of processor speed, I'd say. Perhaps I can speed things up on a cluster... anyway, the pipeline works.

You could have a look at Diamond.
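Note that DIAMOND handles protein and translated searches (a drop-in for blastp/blastx rather than blastn). As a hedged sketch of the usual workflow, with nr.faa and reads.fna as placeholder files:

diamond makedb --in nr.faa -d nr
diamond blastx -d nr -q reads.fna -o matches.tsv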

Interesting! I'll look into it, thanks.

"Alas, the search speed is not much better than the remote one."

If you have access to a local cluster and use multiple threads/cores, with the entire database index read into memory, searches should be fast. How much RAM did you allocate to the job and how many cores did you use?

The desktop PC I am using has 64 GB of RAM and 16 threads. Can I assign RAM/threads directly on the blastn command? Otherwise, I will switch to the cluster and allocate resources using qsub.

Did you look at the inline help for the blastn command? If you did not specify -num_threads, then you likely used just one core. 64 GB may not be enough for nt/nr searches. I would move to the cluster.

Yep, it is -num_threads <integer>. If the RAM is not enough, then the cluster it is. Thanks.
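Assuming a PBS/Torque-style scheduler (the qsub resource syntax differs between schedulers, so this is only a sketch; run_blast.sh is a placeholder script that calls blastn with -num_threads 16), the submission would look something like:

qsub -l nodes=1:ppn=16,mem=128gb run_blast.sh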

I use GNU Parallel to get things done more quickly; in my case, annotating a transcriptome can take a while! I found this tutorial helpful when I first started: https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel
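For reference, a minimal sketch of that pattern applied to blastn (queries.fasta is a placeholder and the block size is only an example; GNU Parallel splits the FASTA at record boundaries and feeds each chunk to a separate blastn process on stdin):

cat queries.fasta | parallel --pipe --recstart '>' --block 100k 'blastn -db nt -query - -outfmt 6' > results.tsv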
