Hi,
I find myself once again having to run blast+ programs to blast large numbers of sequences (100,000+) against SwissProt, RefSeq, nr, etc.
blast+ can use multiple cores, but the way this is implemented, cores that finish their part of the calculation early have to wait for the longest-running ones before getting new work to do. On 8-16 cores, this leaves the processors idle more than 2/3 of the time.
I am thinking of implementing something to solve this: a master program would call the blast program on single sequences, launch new blasts as processors become free, and keep track of the order in which the results should be combined. Since the database is cached the first time it is used, I think the overhead of launching a new blast command for each sequence should be minimal. Dealing with some of the output formats may end up being somewhat painful (format 0, for example), but for tabular formats (e.g. 6), this should be pretty trivial.
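For what it's worth, here is a rough sketch of the per-sequence dispatch idea using only standard tools (the database name, chunk file names, and job count are placeholders, not a finished solution):

# 1. Write each FASTA record to its own zero-padded chunk file.
awk '/^>/ { if (out) close(out); out = sprintf("chunk_%06d.fa", ++n) } { print > out }' input_file
# 2. Run one blastp per chunk, at most 8 at a time, tabular output per chunk.
ls chunk_*.fa | xargs -P 8 -I{} sh -c 'blastp -db swissprot -query {} -outfmt 6 > {}.out'
# 3. Concatenate in input order (the glob sorts the zero-padded names).
cat chunk_*.fa.out > combined.tsv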
The only other alternative I have at the moment is to split the sequence file into n smaller files and run n blasts. I would like a solution where all I have to do is launch a single command, for example:
parallel_blast.py blastp [options] input_file
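For reference, this is roughly what the manual split fallback would look like (again just a sketch; n=8, the database name, and the file names are made up, and the concatenated output is grouped by part rather than strictly in input order):

# Round-robin FASTA records into n=8 part files, then run one blastp per part.
awk -v n=8 '/^>/ { f = sprintf("part_%d.fa", i++ % n) } { print >> f }' input_file
for p in part_*.fa; do blastp -db swissprot -query "$p" -outfmt 6 > "$p.out" & done
wait
cat part_*.fa.out > combined.tsv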
Before I do that, however, I would like to know if anybody is aware of an existing solution to this problem. I browsed the other blast posts on the forum and am already following most of the suggestions made there. This is not a very difficult problem, but if somebody has already produced a quality solution, I'd rather use it than hack together something ugly.
How about this: GNU Parallel - parallelize serial command line programs without changing them. IMO the best tutorial on Biostar at the moment. It also shows how to split records.
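The approach from that tutorial boils down to something like this (the block size and database name are placeholders to adjust for your setup):

# --pipe splits stdin into ~100 kB blocks, --recstart '>' keeps FASTA
# records whole, and -k keeps the output in input order.
cat input_file | parallel -k --pipe --recstart '>' --block 100k blastp -db swissprot -query - -outfmt 6 > combined.tsv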
Thank you, Michael. I didn't remember there being a blast example in there, but I will certainly try it. Want to add this comment as an answer? If it works well for me, I'll be able to mark it as the correct answer.