Question

BLAST a very large amount of data

0

8.7 years ago

Pragy30 ▴ 10

Hi, I am a newbie with what may be a stupid question. I have ~50,000 sequences that I need to BLAST against the nr database. Does BLAST 2.2.3 come with a way to resume a job from where it left off? I am running it on a cluster that limits the hours per job. If there is a better way to do this, please suggest it. (I'm trying to keep my sequences together for BLAST rather than break the run into several small tasks.)

blast • 4.2k views

ADD COMMENT • link updated 8.7 years ago by seta ★ 1.9k • written 8.7 years ago by Pragy30 ▴ 10

2

Why do you want to run them all in one go? Splitting the job lets you run several pieces at the same time, which increases speed, and it would also solve your time-limit problem.

ADD REPLY • link 8.7 years ago by Sam ★ 4.8k

0

I need the output in XML format, which is used further down the line, and I am not sure whether splitting the sequences now and concatenating the XML files later would work.

ADD REPLY • link 8.7 years ago by Pragy30 ▴ 10

1

I think it should be possible to concatenate the XML files so that they look as if they were generated in a single run. Whether this is necessary depends on what you are going to do with them. I think the BioPerl parser will work with a concatenated XML file, and the Tripal importer works with multiple XML files.
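To make "look like a single run" concrete, here is a minimal Python sketch of one way to merge per-chunk reports. It assumes the standard BLAST XML layout (`-outfmt 5`), where each query is an `<Iteration>` element inside `<BlastOutput_iterations>`. The function name `merge_blast_xml` is made up for illustration, and whether a given downstream parser accepts the merged file is something to check on a small test case first.

```python
import xml.etree.ElementTree as ET

def merge_blast_xml(xml_texts):
    """Append the <Iteration> elements of later reports to the first report."""
    merged = ET.fromstring(xml_texts[0])
    target = merged.find("BlastOutput_iterations")
    for text in xml_texts[1:]:
        for iteration in ET.fromstring(text).iter("Iteration"):
            target.append(iteration)
    # Renumber the iterations 1..N, as they would be numbered in one run.
    for number, iteration in enumerate(target.findall("Iteration"), start=1):
        node = iteration.find("Iteration_iter-num")
        if node is not None:
            node.text = str(number)
    return merged
```

The result can be written out with `ET.ElementTree(merged).write("merged.xml")`. Note that ElementTree does not preserve the original `<!DOCTYPE ...>` line, so a strict parser may need it added back.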

ADD REPLY • link 8.7 years ago by Michael 55k

score 2 · Answer 1 · 2016-03-14

Hi, this is not a stupid question. Everyone with a transcriptome assembly will eventually want to try the same thing, struggle to estimate the CPU time needed for a cluster job, and then watch the job get killed during the last 10%. The problem is that BLAST has no easy way to pick up a job once it has been killed, and on many clusters that is exactly what happens to jobs that overstay their time limit. In fact, one cannot even be sure that the output written up to that point is valid (XML and ASN output are the most likely to be broken; tabular and the default text format the least).

One way is to estimate the required runtime carefully, e.g. by running 1,000 sequences, extrapolating, and adding a generous overhead.
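As a rough illustration of that extrapolation, the arithmetic might look like this in Python. The pilot timing (2 hours for 1,000 queries) and the 50% overhead are made-up numbers, and the linear scaling is itself an assumption: query lengths and hit counts vary, so a random sample of the input makes a better pilot than the first 1,000 records.

```python
import math

def estimate_walltime_hours(pilot_seconds, pilot_queries, total_queries, overhead=0.5):
    """Scale a pilot run linearly to the full input and add a safety margin."""
    scaled_seconds = pilot_seconds * total_queries / pilot_queries
    padded_seconds = scaled_seconds * (1.0 + overhead)
    # Round up to whole hours, since schedulers usually take coarse limits.
    return math.ceil(padded_seconds / 3600)

# Hypothetical pilot: 1,000 queries took 2 hours; the full set has 50,000.
print(estimate_walltime_hours(2 * 3600, 1000, 50000))
```

If the estimate exceeds the cluster's per-job limit, that is a strong hint to split the input rather than request one long job.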

Another is to write to standard output and, if the job is killed, check the last processed query sequence, remove all processed queries from the input file with a script, run again, and concatenate the outputs. BioPerl should have no problem reading that, and if it does, inserting some text blocks should repair it.
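A minimal Python sketch of such a script, under a few assumptions: the output is tabular (`-outfmt 6`) with the query ID in the first column, that ID is the first whitespace-delimited token of the FASTA header, and BLAST processes queries in input order. The function names are made up for illustration. Because the last query seen may have been only partly written, the sketch restarts from that query rather than after it.

```python
def last_query_id(tabular_text):
    """Return the query ID on the last complete line of tabular BLAST output."""
    lines = tabular_text.split("\n")
    # A killed job can leave a truncated final line, so only lines that were
    # terminated by a newline (everything but the last piece) are trusted.
    complete = [ln for ln in lines[:-1] if ln.strip() and not ln.startswith("#")]
    return complete[-1].split("\t")[0] if complete else None

def remaining_fasta(fasta_text, last_id):
    """Keep the record named last_id and every record after it."""
    kept, keeping = [], False
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            fields = line[1:].split()
            if fields and fields[0] == last_id:
                keeping = True
        if keeping:
            kept.append(line)
    # If the ID was never found, fall back to re-running everything.
    return "".join(kept) if keeping else fasta_text
```

Before concatenating the old and new outputs, the old lines for the restarted query (and any truncated final line) should be dropped so that its hits are not counted twice.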

Possibly the best solution is to use GNU Parallel (see "GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them"). It can deal with clusters and will also split the data for you.
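GNU Parallel can do the splitting on the fly (its `--pipe` mode with `--recstart '>'` cuts the input at FASTA record boundaries). For anyone who prefers to write chunk files explicitly, for example to submit one cluster job per chunk, the same idea can be sketched in Python. The function names and the chunk size are illustrative choices, not part of any tool.

```python
def read_fasta_records(text):
    """Split FASTA text into records, each starting with its '>' header line."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        if line.strip():  # skip blank lines between records
            current.append(line)
    if current:
        records.append("".join(current))
    return records

def chunk_records(records, chunk_size):
    """Group whole records into chunks of at most chunk_size records."""
    return ["".join(records[i:i + chunk_size])
            for i in range(0, len(records), chunk_size)]
```

Each chunk can then be written to its own file and searched as a separate job; a killed job costs only that chunk, and the per-chunk outputs are concatenated at the end.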

score 0 · Answer 2 · 2016-03-14

Hi friend,

Nowadays, 50,000 sequences is not really a large job, so don't worry about it. As Sam mentioned, splitting the FASTA file would increase the speed. However, please consider the settings below:

adjust -max_target_seqs 1
more strict e-value -evalue 1e-10
put -num_alignments 1 (applies to the text formats; BLAST+ does not accept it together with -max_target_seqs)
tabular format -outfmt 6
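Put together, these settings might be assembled like this in Python. This is only a sketch: the file names are hypothetical, the flags are spelled as BLAST+ expects them (`-max_target_seqs`, with a trailing "s"), and `-num_alignments` is left out because BLAST+ rejects it alongside `-max_target_seqs` and it does not apply to tabular output.

```python
import shlex

def blast_command(query, db, out, program="blastp", evalue="1e-10",
                  max_target_seqs=1, outfmt=6, threads=1):
    """Build the argument list for one BLAST+ search."""
    return [
        program,
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-outfmt", str(outfmt),
        "-num_threads", str(threads),
    ]

# Hypothetical file names for one chunk of the input.
cmd = blast_command("chunk_001.fasta", "nr", "chunk_001.tsv")
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd, check=True)` inside a job script. For the XML output asked about in the question, `outfmt=5` would be used instead.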

If you have a nucleotide sequence file, try translating it and using blastp instead of blastx; blastp is much faster than blastx.

Hope this helps.