Problems With Blast And Nr Database
14.0 years ago

I'm familiar with the BLAST family of software: I've used both the old interface (blastall, formatdb, et al) and the new interface (blastx, makeblastdb, et al). However, I've always used it with in-house databases. I've never tried downloading and using NCBI's non-redundant database...which is what I'm trying to do now.

Turns out someone in our lab recently downloaded the nr and nt databases using the update_blastdb.pl script, so that saves me that trouble. However, I am having issues when I try to run BLAST against the database.

I created a FASTA file with a single query sequence in it, maybe several hundred bp long. When I run a simple command like either of the two below, it keeps running with no end in sight (and consumes a lot of RAM too).

$ blastall -p blastx -i test.fasta -d /data/blast/db/nr -m 7
^C
$ blastall -p blastn -i test.fasta -d /data/blast/db/nt -m 7
^C

So I thought, 'OK, maybe I'm supposed to point it at the alias file', and tried the following commands, which ended immediately in an error.

$ blastall -p blastx -i test.fasta -d /data/blast/db/nr.pal -m 7
[blastall] FATAL ERROR: AT1G51370.2: Database /data/blast/db/nr.pal was not found or does not exist
$ blastall -p blastn -i test.fasta -d /data/blast/db/nt.pal -m 7
[blastall] FATAL ERROR: AT1G51370.2: Database /data/blast/db/nt.pal was not found or does not exist

I've run fastacmd to make sure the databases are working correctly and I don't see any problems.

fastacmd -d /data/blast/db/nr -I
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
excluding environmental samples from WGS projects 
           10,688,764 sequences; 3,647,636,407 total letters

File names:
/data/blast/db/nr.00
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 36,805 res
/data/blast/db/nr.01
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 35,213 res
/data/blast/db/nr.02
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 33,423 res
/data/blast/db/nr.03
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 33,423 res

$ fastacmd -d /data/blast/db/nt -I
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) 
           11,257,610 sequences; 30,637,862,539 total letters

File names:
/data/blast/db/nt.00
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 7,215,267 bp
/data/blast/db/nt.01
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 9,105,828 bp
/data/blast/db/nt.02
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 7,074,893 bp
/data/blast/db/nt.03
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 6,365,727 bp
/data/blast/db/nt.04
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 27,905,053 bp
/data/blast/db/nt.05
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 13,033,779 bp
/data/blast/db/nt.06
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 8,545,929 bp
/data/blast/db/nt.07
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 10,467,782 bp
/data/blast/db/nt.08
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 10,341,314 bp

Any ideas what the issue might be?

blast database • 12k views

Your first commands should be correct. How long did you let them run? Searching the nr/nt databases can take a long time, so you should probably try a smaller database first as a proof of concept.
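
A minimal sketch of such a proof of concept, assuming BLAST+ is installed and that you have some protein FASTA file to cut down. The file names proteins.faa, subset.faa and subset_db, and the fasta_head helper, are made up for illustration; only test.fasta comes from the question.

```shell
# Keep only the first N records of a FASTA stream (reads stdin, writes stdout).
fasta_head() {
    awk -v n="$1" '/^>/ { seen++ } seen > n { exit } { print }'
}

# Build a small protein database and search it, but only if the tools and
# input files are actually present (proteins.faa is a hypothetical input).
if command -v makeblastdb >/dev/null 2>&1 && [ -f proteins.faa ] && [ -f test.fasta ]; then
    fasta_head 1000 < proteins.faa > subset.faa
    makeblastdb -in subset.faa -dbtype prot -out subset_db
    # -outfmt 5 is XML output, the BLAST+ counterpart of blastall's -m 7
    blastx -query test.fasta -db subset_db -outfmt 5 -out subset_hits.xml
fi
```

If this finishes quickly and the XML looks sane, the setup is fine and the nr/nt runs are probably just slow rather than broken.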


How did your co-worker get update_blastdb.pl working? We are having difficulties with it (see C: What Is The Best Way To Download Genbank Locally?).

14.0 years ago

As far as I can tell, NCBI's databases are now too large to search comfortably on a single core. In my recent tests, a BLASTP search of a ~300 aa sequence against the NR database took about 10 minutes on a machine with 16 cores and 72 GB of RAM. That is only one data point, but it should give you an idea of what BLAST needs with the current databases.

The other issue is that the old C-based NCBI toolkit is significantly slower than the new one, which is written in C++ and referred to as the BLAST+ applications. Make sure you're using the most recent version of BLAST+ (older releases had some stability problems).
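
A quick way to see which flavour you actually have on your PATH, as a rough sketch (the blast_flavor helper name is made up here; blastx and blastall are the executables mentioned in the question):

```shell
# Report which BLAST command-line flavour is available on PATH.
blast_flavor() {
    if command -v blastx >/dev/null 2>&1; then
        echo "blast+"
    elif command -v blastall >/dev/null 2>&1; then
        echo "legacy"
    else
        echo "none"
    fi
}

case "$(blast_flavor)" in
    blast+) blastx -version ;;   # prints the installed BLAST+ version
    legacy) echo "Only legacy blastall found; consider installing BLAST+." ;;
    none)   echo "No BLAST executables found on PATH." ;;
esac
```

If both are installed, this reports BLAST+, since blastx is checked first.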

Newest software, lots of RAM and parallelization are probably the cure for your problems.
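
For the blastx search in the question, that could look roughly like the sketch below. It assumes BLAST+ is installed and that the database sits at /data/blast/db/nr as in the question; the cpu_count helper and the output file name test_vs_nr.xml are made up for illustration. -outfmt 5 is the BLAST+ counterpart of blastall's -m 7 (XML), and -num_threads plays the role of blastall's -a option.

```shell
# Number of online CPU cores, falling back to 1 if it cannot be determined.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer: use 1
    esac
    echo "$n"
}

threads=$(cpu_count)

# Only run the search if BLAST+ and the query file are actually there.
if command -v blastx >/dev/null 2>&1 && [ -f test.fasta ]; then
    blastx -query test.fasta -db /data/blast/db/nr \
        -outfmt 5 -num_threads "$threads" -out test_vs_nr.xml
fi
```

On a shared machine you may prefer a fixed thread count over taking every core.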


Thanks. I was just surprised a single sequence would take this long based on my previous experience--of course I never worked with databases as big as NR.
