Question

Script To Submit And Manage Blast Jobs On A Cluster?

4

13.5 years ago

Shellfishgene ▴ 310

Hi,

I use a local cluster (managed through LSF) to BLAST large sequence sets. I wrote a basic script to split a FASTA file and submit each part as a separate BLAST job. Does anyone know of a more advanced script that also does things such as merging the output or restarting failed jobs?

I could extend mine to do that, but I have this feeling that I would be reinventing the wheel.

And btw we have mpiBLAST installed, but I think it's not actually any faster than splitting the input file.

blast clustering • 6.9k views

ADD COMMENT • link updated 13.5 years ago by Ahdf-Lell-Kocks ★ 1.6k • written 13.5 years ago by Shellfishgene ▴ 310

Ram · Answer 1 · 2012-01-31

2

13.5 years ago

Schrodinger'S Cat ▴ 210

Not sure I would merge the results; I would rather parse each XML file separately, which is much faster. See this for XML parsing using XMLStarlet and XSLT:

If you must merge, you can simply use cat.
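For anyone who would rather do the merge step from inside a Python driver script, here is a minimal sketch of the same idea as cat. The names `merge_outputs` and `chunk_index`, and the `chunk_<N>` file naming, are made up for illustration. Plain concatenation is safe for line-oriented formats such as tabular output; for XML the merged file holds several XML documents back to back, so check that your parser accepts that.

```python
import re
import shutil
from pathlib import Path


def chunk_index(path):
    """Return the first integer in the file name, used for numeric sorting."""
    match = re.search(r"(\d+)", path.name)
    return int(match.group(1)) if match else -1


def merge_outputs(result_dir, pattern, merged_path):
    """Concatenate per-chunk result files into one file, like `cat`.

    Files are ordered by the number in their name (hypothetical
    chunk_<N> naming), so chunk_2 comes before chunk_10.
    """
    merged_path = Path(merged_path)
    # Collect the parts first, skipping the merged file itself in case
    # it lives in the same directory and matches the pattern.
    parts = sorted(
        (p for p in Path(result_dir).glob(pattern)
         if p.resolve() != merged_path.resolve()),
        key=chunk_index,
    )
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as handle:
                shutil.copyfileobj(handle, merged)
    return parts
```

Something like `merge_outputs("results", "chunk_*.tsv", "results/all_hits.tsv")` would then write one combined file and return the list of parts it used.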

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.5 years ago by Schrodinger'S Cat ▴ 210

score 1 · Answer 2 · 2012-01-31

1

13.5 years ago

Manu Prestat 4.1k

Paracel does what you want, but is (really) not free. I wrote a Python script similar to yours that divides the input (I agree with you on this point) and uses MPI, which is necessary for dealing with several nodes. I can share it if you need.
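For readers who want a starting point for the "divide the input" step, a minimal sketch follows. It is not the script mentioned above; the function names and the `chunk_<N>.fasta` naming are assumptions for illustration, and it only uses the standard library. Records are dealt out round-robin, which balances record counts but not sequence lengths.

```python
from contextlib import ExitStack
from pathlib import Path


def read_fasta(path):
    """Yield (header_line, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, out_dir, n_chunks):
    """Distribute records round-robin over n_chunks files and return their paths.

    If there are fewer records than chunks, some files stay empty.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"chunk_{i}.fasta" for i in range(n_chunks)]
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, "w")) for p in paths]
        for i, (header, lines) in enumerate(read_fasta(path)):
            handle = handles[i % n_chunks]
            handle.write(header)
            handle.writelines(lines)
    return paths
```

For example, `split_fasta("queries.fasta", "chunks", 50)` would write `chunks/chunk_0.fasta` through `chunks/chunk_49.fasta`.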

ADD COMMENT • link 13.5 years ago by Manu Prestat 4.1k

0

Paracel seems like the thing I was looking for, but I doubt we'll want to pay for it. It would be great if you could share your script. Mine is really simple, it just divides and submits jobs to LSF. I'm not sure how yours involves MPI?

ADD REPLY • link 13.5 years ago by Shellfishgene ▴ 310

0

OK, contact me and I will help you.

ADD REPLY • link 13.5 years ago by Manu Prestat 4.1k

Ram · Answer 3 · 2012-02-08

An option is to use eHive, which is free and open source:
http://www.biomedcentral.com/1471-2105/11/240

The processing of the jobs can range from a very simple list of commands to very complex pipelines, like the ones used in Ensembl and other projects out there. A simple example of piping a command line into a queueing system, with fault tolerance and resource management (number of CPUs, memory, etc.), all in one script, is here:

ensembl-hive/scripts/cmd_hive.pl

Also have a look at InputFile_SystemCmd:

init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::InputFile_SystemCmd_conf -ensembl_cvs_root_dir $HOME $dbdetails -inputfile very_long_list_of_blast_jobs.txt
beekeeper.pl -url $dburl -loop

There are a few Perl dependencies to get it working. The backend can then be a no-frills SQLite database, which will work fine for tens to a few hundred concurrent jobs, or a MySQL backend, which usually works well for hundreds to close to a thousand concurrent jobs.

LSF support comes out of the box in eHive. There is also support for some other queueing systems, like SGE. The same script that you use on your farm can first be tested on your workstation without a queueing system, just by using the '-local' option.

score 0 · Answer 4 · 2012-01-31

0

13.5 years ago

Gorysko ▴ 100

If I'm correct, blast2 has the "-a" option, where you can indicate how many processors you want to use.

ADD COMMENT • link 13.5 years ago by Gorysko ▴ 100

2

You're correct, but the speed-up is far from linear with the number of CPUs/cores. Partitioning the data and launching separate jobs is much more efficient.
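To make the "separate jobs" approach concrete, here is a rough sketch of building one LSF submission per chunk, which also covers the "restart failed jobs" wish from the question. Everything named here is an assumption for illustration: the function `build_bsub_commands`, the `.done` marker convention, and the file layout. It uses the standard `bsub` options `-J` (job name) and `-o` (log file) and the BLAST+ options `-query`, `-db`, `-outfmt 6` and `-out`. It also assumes your LSF installation runs the job command through a shell so that `&&` works, which is worth verifying locally.

```python
import shlex
from pathlib import Path


def build_bsub_commands(chunk_paths, db, out_dir, program="blastn"):
    """Return one bsub argument list per chunk that has not finished yet.

    A chunk counts as finished when a hypothetical <chunk>.done marker
    exists in out_dir; the job creates it only if BLAST exits cleanly.
    Calling this again after a run therefore yields only the jobs that
    failed or never ran.
    """
    out_dir = Path(out_dir)
    commands = []
    for chunk in map(Path, chunk_paths):
        result = out_dir / f"{chunk.stem}.tsv"
        done = out_dir / f"{chunk.stem}.done"
        log = out_dir / f"{chunk.stem}.log"
        if done.exists():
            continue  # already completed in an earlier round
        job = (
            f"{program} -query {shlex.quote(str(chunk))} "
            f"-db {shlex.quote(str(db))} -outfmt 6 "
            f"-out {shlex.quote(str(result))} "
            f"&& touch {shlex.quote(str(done))}"
        )
        commands.append(["bsub", "-J", f"blast_{chunk.stem}", "-o", str(log), job])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to submit it. A marker file is used instead of checking for a non-empty result file because a chunk with no hits legitimately produces empty output.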

ADD REPLY • link 13.5 years ago by Manu Prestat 4.1k

0

Also, if I used -num_threads 8 I would need a full node on the cluster for each job, and it seems such jobs then spend more time in the queue.

ADD REPLY • link 13.5 years ago by Shellfishgene ▴ 310

0

I am not too familiar with LSF, but in some batch-job management systems, jobs requiring more CPU cores get lower priority, so dividing the query into many instances without using the multithreading flag is faster.

ADD REPLY • link 13.5 years ago by Schrodinger'S Cat ▴ 210

Ram · Answer 5 · 2012-02-01

0

13.5 years ago

Yannick Wurm ★ 2.5k

Do you absolutely need XML output? With XML output the local alignments are necessarily calculated (this was the case in legacy BLAST; I don't know if this has changed with BLAST+), and calculating them is slow.

So you may be able to dramatically accelerate things by using table output. See also A: Is Blast+ Running As Fast As It Could ?
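If table output is enough, parsing it downstream is also straightforward. A minimal sketch, assuming BLAST+ was run with `-outfmt 6` and the default 12 columns (legacy blastall `-m 8` uses the same layout); the function names `parse_tabular` and `best_hits` are made up for illustration:

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_tabular(handle):
    """Yield one dict per hit from an open tabular BLAST result file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        yield hit


def best_hits(hits):
    """Keep the highest-bitscore hit for each query sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

Usage would look like `with open("chunk_0.tsv") as fh: top = best_hits(parse_tabular(fh))`, where the file name is only an example.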

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.5 years ago by Yannick Wurm ★ 2.5k

0

I think local alignments are computed anyway; if not, how does the scoring function work?