Does splitting the query FASTA file make DIAMOND blastp finish faster?
6 months ago
kmat • 0

I am working on annotating amino acid sequences obtained from metagenomic/transcriptomic assemblies. I am using these sequences as queries and running Diamond Blastp against KEGG genes. However, even after 5 days with 96 threads, the process hasn't concluded.

First, is it common for DIAMOND blastp to take this long? The queries are 3.6 million amino acid sequences ranging in length from 20 to 13,813 residues (average 212), searched against a database of about 35 million sequences.

Would it be appropriate to split the query FASTA file and run DIAMOND blastp on each part as a separate job? Could that make the overall search finish earlier?

I appreciate any insights or recommendations you may have.

Thank you sincerely.

6 months ago
Mensur Dlakic ★ 28k

There are several pieces of information we are missing here. Did you specify that all 96 threads be used? If so, are all of them being used when you look at the system load? What type of disk do you have: a regular hard disk or a solid state disk? Is it internal or external?
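One quick way to check, assuming you can log into the node while the job is running (a sketch, not taken from the thread):

top -b -n 1 | grep diamond   # in top's default mode, %CPU near 9600 means all 96 threads are busy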

The last two questions are about read/write speed, which may be the limiting factor here. If you have a solid-state disk and, say, only half the threads are engaged, splitting your query file in two might speed things up. I caution you against splitting into too many parts: even with the fastest disk, running 5+ read and write operations simultaneously will turn disk I/O into the choke point.

It also matters whether you are writing full alignments or just scores in tabular format; the latter is much faster. The number of top hits recorded matters too: it takes longer to calculate and write 500 alignments or 500 tabular scores than only 5 of them.
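To illustrate (a sketch with hypothetical file names, not the poster's actual command), the first invocation below writes only tabular scores for the single best hit per query, while the second writes full pairwise alignments for up to 500 hits per query and will be much slower:

# fast: tabular output, one best hit per query
diamond blastp -d kegg_genes.dmnd -q queries.faa -p 96 --outfmt 6 --max-target-seqs 1 -o hits.tsv

# much slower: full pairwise alignments, up to 500 hits per query
diamond blastp -d kegg_genes.dmnd -q queries.faa -p 96 --outfmt 0 --max-target-seqs 500 -o hits.txt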

Here is something you should keep in mind. Say an average search takes 1 second per query: a 20-residue sequence might take less, but a sequence of >10,000 residues will take much longer. Under that assumption, 3.6 million queries × 1 s ≈ 3.6 million seconds ≈ 1,000 hours, roughly 42 days, which means your search would conclude at the end of June. Even at half a second per query, it would still take another two weeks.

Several suggestions (a sketch of the first one is shown below):

1) Create a non-redundant version of your queries at 90% identity, which might cut their number almost in half.
2) Do the same for your target database.
3) Get access to a cluster where you can run this search on multiple nodes; there you can split your query file into 10 parts, because each part will use different threads and different disks.
4) Ask for only the minimal informative number of scores/alignments in the output (if you are interested only in the top 5 hits, report only the top 5).
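A minimal sketch of suggestion 1), using CD-HIT (which is not mentioned above) and hypothetical file names; the same command applied to the target FASTA would cover suggestion 2):

# cluster queries at 90% identity (word size 5, 16 threads, ~32 GB RAM limit)
cd-hit -i queries.faa -o queries_nr90.faa -c 0.9 -n 5 -T 16 -M 32000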

If you can't do any of the above, I suggest you find something else to occupy your attention for the next month or so.


Thank you for your quick response.

Sorry for the lack of detail. Here are the DIAMOND blastp options we specified:

--threads 96 --query-cover 70 --min-score 100 --max-target-seqs 1 --outfmt 6

Furthermore, the job was submitted using Slurm with sbatch -c 96.
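For completeness, a minimal sketch of that submission (the database, query, and output file names are placeholders, not our actual paths):

#!/bin/bash
#SBATCH -c 96
# single DIAMOND process using all 96 allocated CPUs
diamond blastp \
  --db kegg_genes.dmnd --query queries.faa --out hits.tsv \
  --threads 96 --query-cover 70 --min-score 100 \
  --max-target-seqs 1 --outfmt 6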


I used seqkit split to divide the query FASTA file into 10 parts and then ran diamond blastp on each part with 44 threads. Everything finished in about 2 hours. I'm not entirely sure why it was so much faster, but I wanted to share this. (Perhaps there was some issue in handling such a large number of queries at once?) Thank you.
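Roughly what this looked like (a sketch; the file names and the per-part submission via sbatch --wrap stand in for my actual setup):

# split the queries into 10 roughly equal parts (written to queries.faa.split/)
seqkit split -p 10 queries.faa

# submit one DIAMOND job per part
for part in queries.faa.split/*.faa; do
    sbatch -c 44 --wrap "diamond blastp --db kegg_genes.dmnd --query $part --threads 44 --query-cover 70 --min-score 100 --max-target-seqs 1 --outfmt 6 --out $part.tsv"
done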
