Question

Does splitting the query FASTA file for DIAMOND blastp make the process faster?

6 months ago

kmat • 0

I am working on annotating amino acid sequences obtained from metagenomic/transcriptomic assemblies. I am using these sequences as queries and running Diamond Blastp against KEGG genes. However, even after 5 days with 96 threads, the process hasn't concluded.

Firstly, is it common for DIAMOND blastp to take this long? The query set consists of 3.6 million amino acid sequences ranging from 20 to 13,813 residues in length (average 212), searched against a database of about 35 million sequences.

Would it be appropriate to split the query FASTA file and run DIAMOND blastp on the parts as separate jobs? Would this make the search finish sooner?

I appreciate any insights or recommendations you may have.

Thank you sincerely.

DIAMOND • 446 views

score 0 · Answer 1 · 2024-05-22

There are several pieces of information we are missing here. Did you specify that all 96 threads be used? If so, are all of them being used when you look at the system load? What type of disk do you have: a regular hard disk or a solid state disk? Is it internal or external?

The last two questions are about the speed of reading and writing, which may be the limiting factor here. If you have a solid state disk and, let's say, only half the threads are engaged, splitting your query file in two might speed things up. I caution you against splitting into too many parts. Even with the fastest disk, the speed of reading and writing will become a bottleneck, especially with 5+ read and write operations running simultaneously.

It also matters whether you are writing full alignments or just scores in tabular format, the latter being much faster. The number of top hits recorded per query matters too: it takes longer to calculate and write 500 alignments or 500 tabular scores than only 5 of them.
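As a rough illustration of the two output settings above, here is a small Python sketch that assembles a DIAMOND command line with tabular output and a cap on hits per query. The file names (`queries.faa`, `kegg_genes.dmnd`, `hits.tsv`) and the helper name are made up for the example, and the command is only built, not executed. Check the options against the manual for your DIAMOND version before relying on them.

```python
def diamond_blastp_cmd(query, db, out, threads=96, max_hits=5):
    """Build (but do not run) a diamond blastp command as an argument list.

    --outfmt 6 requests BLAST-style tabular output instead of full alignments.
    --max-target-seqs limits how many hits are reported per query.
    """
    return [
        "diamond", "blastp",
        "--query", query,
        "--db", db,
        "--out", out,
        "--outfmt", "6",
        "--max-target-seqs", str(max_hits),
        "--threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = diamond_blastp_cmd("queries.faa", "kegg_genes.dmnd", "hits.tsv")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.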

Here is something you should keep in mind. Let's say that the average search time is 1 second per query. A sequence that is 20 residues long might take less, but a sequence of more than 10,000 residues will take much longer. Under that assumption, 3.6 million queries need 3.6 million seconds, or 1000 hours (about 42 days), which means that your search will conclude at the end of June. Even at half a second per sequence, the total is 500 hours (about 21 days), so with 5 days already done it will still take roughly another two weeks.
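The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The per-query times are the assumed values from this answer, not measurements, and the function name is just for illustration.

```python
def estimated_hours(n_queries, seconds_per_query):
    """Total wall-clock hours if every query takes the same average time."""
    return n_queries * seconds_per_query / 3600


n_queries = 3_600_000  # number of query sequences stated in the question

hours_at_1s = estimated_hours(n_queries, 1.0)    # assumed 1 s per query
hours_at_half = estimated_hours(n_queries, 0.5)  # assumed 0.5 s per query

print(f"1.0 s/query: {hours_at_1s:.0f} h ({hours_at_1s / 24:.1f} days)")
print(f"0.5 s/query: {hours_at_half:.0f} h ({hours_at_half / 24:.1f} days)")
```

Timing a small random subset of the queries would give a measured value to plug in instead of the assumed one.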

Several suggestions: 1) create a non-redundant set of your queries at 90% identity, which might cut the size almost in half; 2) do the same for your target database; 3) get access to a cluster where you can run this search on multiple nodes, in which case you can split your query file into 10 parts, because each part will run on its own threads and disks; 4) ask that only a minimal informative number of scores/alignments be written to the output (if you are interested only in the top 5 hits, then ask only for the top 5).
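For suggestion 3, splitting the query file is straightforward. Below is a minimal sketch that parses FASTA records and deals them out round-robin into a chosen number of parts, so that long and short sequences are spread across the parts rather than clustered in one. The function names and the tiny in-memory example are hypothetical. For a real file you would iterate over an open file handle and write each part to its own file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_parts):
    """Deal records out in turn so each part gets a similar number of them."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts


# Tiny made-up example; real input would be an open FASTA file handle.
example = """>q1
MKT
AIL
>q2
MSS
>q3
MAAV
>q4
MK
>q5
MLLP
""".splitlines()

parts = split_round_robin(read_fasta(example), 2)
for n, part in enumerate(parts, start=1):
    print(f"part {n}: {[h for h, _ in part]}")
```

Round-robin balances the number of records per part, not the number of residues. If a few very long sequences dominate the run time, balancing by total residues may be the better choice.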

If you can't do any of the above, I suggest you find something else to occupy your attention for the next month or so.