Hi,
I find myself once again having to run blast+ programs to blast large numbers of sequences (100,000+) against SwissProt, RefSeq, nr, etc.
blast+ can use multiple cores, but the way this is implemented, cores that finish their part of the calculation early have to wait for the longest-running ones before getting new work to do. On 8-16 cores, this leaves the processors idle more than 2/3 of the time.
I am thinking of implementing something to solve this: a master program would call the blast program on single sequences, launch new blasts as processors become free, and keep track of the order in which the results should be combined. Since the database is cached the first time it is used, I think the overhead of launching a new blast command for each sequence should be minimal. Dealing with some of the output formats may end up being somewhat painful (format 0, for example), but for tabular formats (e.g. 6), this should be pretty trivial.
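For what it's worth, here is a rough sketch of the per-sequence dispatch idea using only standard tools (the database name, chunk file names, and job count are placeholders, not a finished solution):

# 1. Write each FASTA record to its own zero-padded chunk file.
awk '/^>/ { if (out) close(out); out = sprintf("chunk_%06d.fa", ++n) } { print > out }' input_file
# 2. Run one blastp per chunk, at most 8 at a time, tabular output per chunk.
ls chunk_*.fa | xargs -P 8 -I{} sh -c 'blastp -db swissprot -query {} -outfmt 6 > {}.out'
# 3. Concatenate in input order (the glob sorts the zero-padded names).
cat chunk_*.fa.out > combined.tsv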
The only other alternative I have at the moment is to split the sequence file into n smaller files and run n blasts. I would like a solution where all I have to do is launch a single command, for example:
parallel_blast.py blastp [options] input_file
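For reference, this is roughly what the manual split fallback would look like (again just a sketch; n=8, the database name, and the file names are made up, and the concatenated output is grouped by part rather than strictly in input order):

# Round-robin FASTA records into n=8 part files, then run one blastp per part.
awk -v n=8 '/^>/ { f = sprintf("part_%d.fa", i++ % n) } { print >> f }' input_file
for p in part_*.fa; do blastp -db swissprot -query "$p" -outfmt 6 > "$p.out" & done
wait
cat part_*.fa.out > combined.tsv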
Before I do that, however, I would like to know if anybody is aware of an existing solution to this problem. I browsed the other blast posts on the forum and am already following most of the suggestions made there. This is not a very difficult problem, but if somebody has already produced a quality solution, I'd rather use it than hack together something ugly.
How about this: GNU Parallel - parallelize serial command line programs without changing them. IMO the best tutorial on Biostar at the moment. It also shows how to split records.
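The approach from that tutorial boils down to something like this (the block size and database name are placeholders to adjust for your setup):

# --pipe splits stdin into ~100 kB blocks, --recstart '>' keeps FASTA
# records whole, and -k keeps the output in input order.
cat input_file | parallel -k --pipe --recstart '>' --block 100k blastp -db swissprot -query - -outfmt 6 > combined.tsv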
Thank you, Michael. I didn't remember there being a blast example in there, but I will certainly try it. Want to add this comment as an answer? If it works well for me, I'll be able to mark it as the correct answer.