Hi all,
I'm facing very low CPU usage with blastp 2.14.0 on a virtual machine with 184 GB of RAM and 32 cores. The BLAST searches I'm running seem to take an unusually long time. This is my typical command line:
blastp -task blastp -db nr -query query_sequence.fasta -num_threads 32 -max_target_seqs 200000 -outfmt 15 -out blast_output.json
I noticed that when the search is launched, multiple blastp threads populate all 32 cores as expected, but this lasts for just a few minutes. After that, only 1 thread survives, and it sits there at very low CPU load for many hours (just 3-5% average CPU usage on a single core). This thread uses up to 85% of the RAM.
Is it normal that the CPU load is so low for hours?
Actually, I very often cluster the search results at 90-95% identity, depending on how many hits I get. It indeed makes total sense to move to a clustered nr database instead, or at least to switch to it when the entire sequence space of a query sequence can't be explored without raising -max_target_seqs too much (which is basically what slows down the process). I was waiting for the mmseqs-clustered version NCBI offers through the BLAST web interface (should be the one you're referring to), but for some reason it's still flagged as "experimental" and it's not available for download. The only drawback I see with using a clustered nr is the risk of missing some PDB codes, but this can be easily worked around by running a parallel search on the much smaller pdbaa database.
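For reference, this is roughly how I would set up the two parallel searches. It is only a sketch: search_both is a hypothetical helper name, the clustered database name (e.g. uniref90) assumes a database formatted locally with makeblastdb, and the -max_target_seqs value and the thread count for pdbaa are placeholders that would need tuning. The BLASTP variable is my own addition so the commands can be previewed with BLASTP=echo before launching anything.

```shell
# Hypothetical helper: run the same query against a clustered database
# and against pdbaa at the same time, then wait for both to finish.
search_both() {
    query=$1
    clustered_db=$2            # assumption: locally formatted, e.g. uniref90
    threads=${3:-32}           # threads for the large clustered search
    blastp_bin=${BLASTP:-blastp}   # set BLASTP=echo to preview the commands

    # Search against the clustered database (placeholder hit limit).
    "$blastp_bin" -task blastp -db "$clustered_db" -query "$query" \
        -num_threads "$threads" -max_target_seqs 5000 \
        -outfmt 15 -out clustered_hits.json &

    # Search against the much smaller pdbaa so no PDB entries are missed.
    "$blastp_bin" -task blastp -db pdbaa -query "$query" \
        -num_threads 4 -max_target_seqs 5000 \
        -outfmt 15 -out pdbaa_hits.json &

    wait   # block until both background searches are done
}

# Usage: search_both query_sequence.fasta uniref90
```

The two output files are kept separate on purpose, so the PDB hits can be merged into the clustered results afterwards in whatever way suits the downstream analysis.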
I will try uniref90 out. Many thanks for the hint!