Why is local BLAST so slow? (yet another one of these questions)
11 months ago
mykle hoban ▴ 40

I'm trying to figure out why my local BLAST queries are so slow. I'm running them on a machine with 32 cores and ~100GB of memory (which I know could be more but it seems like it should be sufficient).

Running a query consisting of a single (one!) 169bp sequence against the nucleotide database takes over two hours! Needless to say bigger queries only take longer. These are metabarcoding searches so a "real" query is typically 1500-4000 short sequences in the 100-300bp range.

I have tried the test query using both one and multiple threads and ~125 minutes is about the best I've been able to do.

What am I missing? Is 100GB just not enough here? Is this just a limit of how local BLAST searches work?

blast ncbi sequence • 1.8k views

What command are you using? What version of blast? What nucleotide database?


You left out the most critical piece of info: what are you blasting against, and what is the size of that database? Are you using the -num_threads option to request all 32 cores?

I think 169 bp may be considered "short" so adding -task blastn-short may be worth checking.
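A minimal sketch of how the options mentioned above fit together on one command line. It only builds and prints the command, it does not run BLAST. The file names (query.fasta, hits.tsv) and the database name nt are placeholders, and the tabular output format is just one common choice, not something the original poster stated.

```python
import shlex

def build_blastn_command(query, db, task="megablast", threads=4, out="hits.tsv"):
    """Return the blastn argument list for one search (nothing is executed)."""
    return [
        "blastn",
        "-query", query,               # FASTA file with the query sequences
        "-db", db,                     # database name or path prefix
        "-task", task,                 # megablast, blastn, blastn-short, ...
        "-num_threads", str(threads),  # number of CPU threads to request
        "-outfmt", "6",                # tabular output
        "-out", out,
    ]

# Placeholder names: adjust to your own query file and database location.
cmd = build_blastn_command("query.fasta", "nt", task="megablast", threads=8)
print(shlex.join(cmd))
```

Printing the command first makes it easy to post the exact invocation when asking for help.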

11 months ago

Is your BLAST process really using all 32 cores? top can help you check this.

It is plausible that your hard disk (I/O) is too slow and thus creating a bottleneck.
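One rough way to test the disk hypothesis is to time a plain sequential read of one of the database files. The sketch below is an illustration, not a proper I/O benchmark: it assumes you pass the path of a large database file on the command line, and a file that is already in the operating system's cache will report a much higher figure than a cold read from disk.

```python
import sys
import time

def read_throughput_mb_s(path, chunk_size=8 * 1024 * 1024, max_bytes=1024 ** 3):
    """Read up to max_bytes from path sequentially and return MB per second."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while total < max_bytes:
            chunk = handle.read(min(chunk_size, max_bytes - total))
            if not chunk:  # end of file reached
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / 1e6) / elapsed if elapsed > 0 else float("inf")

# Only runs when file paths are given, e.g. one or more database volume files.
if __name__ == "__main__" and len(sys.argv) > 1:
    for path in sys.argv[1:]:
        print(f"{path}: {read_throughput_mb_s(path):.0f} MB/s")
```

If the cold-read figure is low (network storage or an old spinning disk can be far slower than a local SSD), a database of several hundred GB will take a long time just to stream once.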


htop shows all requested cores in use, although they are only fully engaged right at the start (when blastn is first executed). Afterward they all show some activity but it appears to be quite low.

You may be right that there are some disk I/O issues.


So if you're stuck with the disk/filesystem you have, probably the best performance can come from reducing the number of cores you're using. There are two reasons for this:

  1. There is an overhead cost of different BLAST threads communicating amongst each other
  2. The different BLAST threads trying to get data simultaneously can lead the filesystem to have to jump from one part of the db to another. This can decrease the overall throughput compared to if it were being read continuously.
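Whether fewer threads actually helps on a given machine is easy to measure. The sketch below times the same search at several -num_threads values. The query and database names are placeholders, and the `run` parameter exists only so the timing loop can be tried without BLAST installed. One caveat: the first run may warm the file cache, so later runs can look faster for reasons unrelated to thread count; repeating the series or randomizing the order helps.

```python
import subprocess
import time

def time_thread_counts(make_command, thread_counts, run=subprocess.run):
    """Run one search per thread count and return {threads: seconds}."""
    timings = {}
    for threads in thread_counts:
        start = time.perf_counter()
        run(make_command(threads), check=True)
        timings[threads] = time.perf_counter() - start
    return timings

def blastn_command(threads):
    """Build the blastn call; query.fasta and nt are placeholder names."""
    return ["blastn", "-query", "query.fasta", "-db", "nt",
            "-num_threads", str(threads),
            "-outfmt", "6", "-out", f"hits_{threads}.tsv"]

def fastest(timings):
    """Return the thread count with the shortest wall-clock time."""
    return min(timings, key=timings.get)
```

Usage would look like `timings = time_thread_counts(blastn_command, [1, 2, 4, 8, 16, 32])` followed by `print(fastest(timings), timings)`.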

(As Mensur highlights, there can additionally be RAM issues. The word size, i.e., the -task that you use, should depend on the similarity you expect of hit sequences: megablast will be faster than blastn, which will be faster than blastn-short. Using the faster algorithms reduces your power to detect more divergent sequences.)
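The speed versus sensitivity trade-off can be illustrated with a back-of-the-envelope calculation. The word sizes below are the documented defaults as far as I know (28 for megablast, 11 for blastn, 7 for blastn-short); check `blastn -help` for your version. The model is a simplification: it treats mismatches as independent and looks at a single window, whereas a real alignment offers many overlapping windows, so the true chance of finding at least one seed is higher.

```python
# Default word sizes for the blastn tasks discussed in this thread
# (assumed from the BLAST documentation; verify with `blastn -help`).
WORD_SIZE = {"megablast": 28, "blastn": 11, "blastn-short": 7}

def seed_survival(identity, word_size):
    """Chance that word_size consecutive bases all match, assuming each
    base matches independently with probability `identity`."""
    return identity ** word_size

for task, w in WORD_SIZE.items():
    p = seed_survival(0.90, w)
    print(f"{task:13s} word_size={w:2d}  P(window intact at 90% identity)={p:.3f}")
```

At 90% identity a given window stays intact with probability of about 0.05 for megablast, 0.31 for blastn, and 0.48 for blastn-short, which is why the larger word size is faster (fewer seeds to extend) but misses more divergent hits.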

11 months ago
Mensur Dlakic ★ 28k

The two most likely problems were already outlined by SequenceServer, and it is very likely that 100 GB is not enough. That means disk swapping, which will slow things down. On a reasonably fast computer with 32 cores and adequate memory (say, 512 GB), this search should take no more than 5 or so minutes.
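A quick way to check the memory argument is to compare the on-disk size of the database with the machine's RAM. This is a sketch under a few assumptions: the database files all share one path prefix (for example a hypothetical /data/blastdb/nt), the 80% headroom factor is an arbitrary choice, and the RAM lookup relies on sysconf names that are available on Linux.

```python
import glob
import os
import sys

def database_size_bytes(db_prefix):
    """Sum the sizes of all files named '<db_prefix>.<something>'."""
    pattern = glob.escape(db_prefix) + ".*"
    return sum(os.path.getsize(path) for path in glob.glob(pattern))

def physical_memory_bytes():
    """Total physical RAM via sysconf (Linux; other systems may differ)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

def fits_in_memory(db_bytes, ram_bytes, headroom=0.8):
    """True if the database fits in the given fraction of RAM."""
    return db_bytes <= ram_bytes * headroom

# Only runs when a database prefix is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    db, ram = database_size_bytes(sys.argv[1]), physical_memory_bytes()
    print(f"database: {db / 1e9:.1f} GB, RAM: {ram / 1e9:.1f} GB, "
          f"fits: {fits_in_memory(db, ram)}")
```

If the database is several times larger than RAM, every search has to re-read most of it from disk, and disk speed then dominates the run time regardless of core count.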

If you are interested only in close matches, using -task megablast will also speed things up. There is no need to use -task blastn-short, as 169 bp is long enough.


This would make sense to me, but as far as I can tell (according to htop), the blastn process is only using around 3 GB of physical memory (though the virtual memory is much higher than that).
