Question

Blast duration time (+500 000 seqs)

0

8.3 years ago

eme1309 ▴ 10

Hi!

I am studying bioinformatics and am currently working on a validation project. We are using the command-line version of BLAST. We created a database from the 500 000 sequences we received (using makeblastdb) and wanted to run a BLAST search using 500 000 sequences as queries. Well. I've started that yesterday at 2 pm, and... It's still running! I was wondering, maybe we did something wrong? Is it normal for it to run so long? Could I do something (apart from setting the e-value and changing the number of threads) to speed things up? I've checked, and my processor is busy, so something is being done. The problem is I have no way of checking how much time is left.

Thanks in advance

blast blastcmd running time • 8.9k views

ADD COMMENT • link 8.3 years ago by eme1309 ▴ 10

0

Thank you for your quick answer... I am using a Lenovo T430 (i5-3320M, 2.60 GHz, 8 GB RAM). Currently, the NCBI process is using about 19 000 kB of memory. These are protein sequences and I am running blastp; all have about 200-300 amino acids... We've run a few tests and, based on them, it seems like it should take about 48 days to run...

As I said, this is a validation project. The goal is to verify whether, in the course of evolution, there were episodes of reversion (from MAGDA to ADGAM, for example). We are bound to use makeblastdb, so we used the initial sequences to create a database and wanted to blast the reversed ones against this database. As we have two weeks to do it, I guess we should look for another way.
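
For anyone following along, the reversal step (MAGDA to ADGAM) could be sketched roughly like this in Python. This is only a minimal illustration, not the script we actually used: the function names, the `_rev` suffix and the toy records are all made up for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_records(lines, suffix="_rev"):
    """Yield FASTA lines in which every sequence is reversed.

    The (hypothetical) suffix is appended to the sequence ID so that
    reversed queries can be told apart from the originals in the hits.
    """
    for header, seq in read_fasta(lines):
        seq_id, _, description = header.partition(" ")
        new_header = f">{seq_id}{suffix}"
        if description:
            new_header += f" {description}"
        yield new_header
        yield seq[::-1]


# Toy input: the first sequence is split over two lines on purpose.
example = [">sp1 toy protein", "MAG", "DA", ">sp2", "ADGAM"]
reversed_lines = list(reverse_records(example))
print("\n".join(reversed_lines))
```

The output lines can be written to a file and used as the query against the database built from the unreversed sequences.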

Thanks again for your help, we'll definitely check out what you've sent and try a different approach.

ADD REPLY • link 8.3 years ago by eme1309 ▴ 10

0

This sounds like a small desktop computer. How many cores do you have, and are you using them all? I am not sure BLAST is the best tool for detecting small reversed patterns in sequences (without mismatches, by the way?). If the minimum length of such a pattern is 5, as for MAGDA, and you don't look for mismatches, then you could increase the word size to 5 (I think the default is 3), which would make BLAST run faster.

I would look for a tool made specifically for this task. Otherwise, you are not strictly bound to BLAST, because you can always dump the sequences from the BLAST database into a FASTA file. Maybe DIAMOND even accepts BLAST databases.
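
To make the word size and core count suggestions concrete, here is a rough Python sketch that only assembles a blastp command line. The file and database names are placeholders, and the values shown are assumptions for illustration; `-word_size`, `-evalue`, `-num_threads` and `-outfmt` are standard BLAST+ options, but check `blastp -help` for the limits in your version.

```python
import os
import shlex


def build_blastp_command(query, db, out, word_size=5, evalue=10.0, threads=None):
    """Return a blastp argument list; nothing is executed here.

    If threads is None, use the number of logical CPUs reported by the
    operating system (this may count hyper-threads, not physical cores).
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-outfmt", "6",  # tabular output, easy to post-process
    ]


# Placeholder names; replace with your own query file and database.
cmd = build_blastp_command("reversed.fasta", "proteins_db", "hits.tsv", threads=2)
print(shlex.join(cmd))
print("logical CPUs seen by Python:", os.cpu_count())
```

The list can be passed to `subprocess.run(cmd, check=True)` once the names point at real files.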

ADD REPLY • link 8.3 years ago by Michael 56k

0

I have 2 cores and have no idea if I am using them all. I guess not, sorry, I don't know how to check that.

I am researching such tools at the moment. The problem is that, based on our instructions, we have to create a database and use the BLAST program. About DIAMOND, I've checked, but it is a replacement tool for blastx, and our problem is more of a protein-protein one, no?

ADD REPLY • link 8.3 years ago by eme1309 ▴ 10

1

DIAMOND can do protein-protein alignment too, and it's much faster than BLAST. You should give it a try.

ADD REPLY • link 8.3 years ago by buchfink ▴ 250

0

I am currently trying to install DIAMOND, thank you ;)

ADD REPLY • link 8.3 years ago by eme1309 ▴ 10

0

8.3 years ago

Michael 56k

Well, if this is some sort of assignment, then we are missing something. It is not possible to complete the 500k^2 BLAST searches on such a small computer in 14 days. I would be interested in what your supervisor says about this; maybe point them to this thread?

Possibly you can reduce the number of comparisons to optimize your approach. It is unlikely that all the 500k proteins are orthologs. If you are looking for an evolutionary event, one would normally restrict the search to orthologous groups, because otherwise the finding would be meaningless (e.g. finding a short inverted pattern from DnaA in a ribosomal protein). Anyway, in the full comparison, any small inverted pattern would have ridiculously large e-values (>1).

If you have detected orthologous groups already, you could exploit this information to reduce your search space. Say each orthologous group contains 1000 genes on average; then you would only have to do 500 searches of 1000 vs. 1000 sequences, which is much more manageable.
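
A quick back-of-the-envelope check of that reduction, as a small Python sketch. It only counts query-subject pairs under the assumption of equally sized groups; real BLAST run time does not scale exactly with the number of pairs, so treat the ratio as a rough guide.

```python
def pairwise_comparisons(n_queries, n_subjects):
    """Number of query-subject pairs in an all-vs-all search."""
    return n_queries * n_subjects


total = 500_000      # sequences in the full data set
group_size = 1_000   # assumed average size of an orthologous group
n_groups = total // group_size

all_vs_all = pairwise_comparisons(total, total)
grouped = n_groups * pairwise_comparisons(group_size, group_size)

print("all vs. all pairs:", all_vs_all)
print("within-group pairs:", grouped)
print("reduction factor:", all_vs_all // grouped)
```

With these assumed numbers the search space shrinks by a factor equal to the number of groups.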

ADD COMMENT • link 8.3 years ago by Michael 56k

1

Hey!

We've presented the results of the DIAMOND search, which we then cross-checked with BLAST (we selected the sequences DIAMOND produced a hit on and ran them through BLAST). Our supervisor said he didn't mean for us to compare those proteins against themselves, unreversed; we could use another database, such as Swiss-Prot. Also, he recommended using USEARCH, but was relatively content with the fact that we tried and did something anyway.

Again, thanks a lot for your help. We've learned a lot. Have a nice day!

ADD REPLY • link 8.3 years ago by eme1309 ▴ 10

0

Ooh, I didn't think about grouping by orthologs! This could actually work, with a more manageable search time. I'll try that now.

I've contacted our supervisor, but I'm not sure if he'll answer. Anyway, thanks a lot, this seems like a good way forward.

ADD REPLY • link 8.3 years ago by eme1309 ▴ 10

score 2 · Accepted Answer · 2017-05-28

Well. I've started that yesterday at 2 pm, and... It's still running! I was wondering, maybe we did something wrong?

I have no information about your setup, the number of processors, or the nature of the sequences; you should give us these details. Still, I have to disappoint you: BLAST runs of this size typically take weeks or even months. There are several threads here on estimating the running time, and they normally go like this: take a subset (say 1001) of your sequences and run them with the same parameters against the same database. That gives you a rough estimate of how long the full run is going to take. If you also try a run using only a single short sequence, you get the startup time required to load the database. Use the system command time to measure run time. Assuming that run time scales linearly with the number of queries and that all queries take an equal amount of time (which is not necessarily so), you can calculate a rough estimate like:

rough estimate := time(1) + (time(1001) - time(1)) * N/1000 ~ time(1000) * N/1000

where time(k) is the measured run time for k queries and N is the total number of queries.
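
As a small worked example, the estimate can be written as a Python function. The timings below are invented purely for illustration; plug in whatever `time` reports for your own single-query and subset runs.

```python
def estimate_runtime_seconds(t_single, t_subset, subset_size, n_queries):
    """Linear extrapolation of total run time.

    t_single    -- measured time for one query (mostly database startup)
    t_subset    -- measured time for subset_size queries
    subset_size -- number of queries in the subset run (e.g. 1001)
    n_queries   -- total number of queries N
    """
    if subset_size < 2:
        raise ValueError("subset_size must be at least 2")
    per_query = (t_subset - t_single) / (subset_size - 1)
    return t_single + per_query * n_queries


# Made-up measurements: 5 s for one query, 8305 s for 1001 queries.
seconds = estimate_runtime_seconds(5.0, 8305.0, 1001, 500_000)
print(f"{seconds:.0f} s, i.e. about {seconds / 86400:.1f} days")
```

With a subset of 1001 queries the per-query term is (time(1001) - time(1)) / 1000, which matches the formula above. These invented timings work out to roughly 48 days for 500 000 queries.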

Unsurprisingly, there are also many threads here about accelerating BLAST runs by using GNU parallel vs. BLAST+ multiple threads (Istvan made this comparison in "BLAST: Is there a difference between splitting queries and using more threads?"), and threads about faster alternatives to BLAST, such as "Faster BLAST alternative" and "KLAST, a Blast-like tool for fast sequence similarity searches: free academic version".
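
The query-splitting idea could look roughly like the following Python sketch. It is not taken from any of the threads mentioned; the function name and the toy records are hypothetical. Each resulting chunk would be written to its own FASTA file and searched by a separate blastp process, for example one launched per file by GNU parallel.

```python
def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists.

    Round-robin keeps chunk sizes within one record of each other,
    so the separate BLAST jobs get similar amounts of work.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Toy (header, sequence) records standing in for a real query file.
records = [(f"seq{i}", "MAGDA") for i in range(5)]
chunks = split_records(records, 2)
for n, chunk in enumerate(chunks):
    print(f"chunk {n}: {[header for header, _ in chunk]}")
```

Whether splitting beats `-num_threads` depends on the machine and the data, so it is worth timing both on a small subset first.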